Self-Host Open WebUI and LibreChat on GPU Cloud: Production Guide (2026)

ChatGPT Business costs $25 per seat per month (formerly ChatGPT Team at $30). At 25 people that's $7,500 per year, and every message your team sends goes through OpenAI's servers. This post walks through deploying Open WebUI or LibreChat as a private team chat interface, backed by a vLLM inference server on GPU cloud, with SSO, RAG, and a cost comparison against ChatGPT Business pricing.

Open WebUI vs LibreChat: When to Pick Each

Both are Docker-first, open-source chat frontends that work with any OpenAI-compatible API. The right choice depends on how your team uses LLMs.

Open WebUI (100K+ GitHub stars) started as the official Ollama web UI and has grown into a full chat platform. Native Ollama support means zero config if you're already running Ollama. For teams that want one vLLM endpoint and a clean interface, it's the fastest path from zero to working product. The built-in RAG pipeline, model permissions, and user management cover most small-to-midsize team needs without configuration overhead.

LibreChat (35K+ stars) is built around multi-provider flexibility. A single librechat.yaml file defines all your endpoints: your vLLM backend, an Anthropic Claude API key for fallback, an Azure OpenAI deployment for compliance, all visible in one model dropdown. Teams that need to mix self-hosted inference with cloud APIs, or that want stronger audit logging and plugin extensibility, should lean toward LibreChat.

Feature	Open WebUI	LibreChat
Model support	OpenAI-compat, Ollama, native	OpenAI-compat, multi-provider YAML
Multi-user auth	Yes	Yes
SSO (OIDC/SAML)	OIDC via env vars	OIDC, social login
RAG pipeline	Built-in (docs upload)	Via RAG_OPENAI_BASEURL
Agents and tools	Yes (built-in tool use)	Yes (plugins)
Code interpreter	In-container Python runner	E2B, Daytona (external)
Audit logging	stdout logs	stdout + plugin hooks
Plugin ecosystem	Community tools	LibreChat plugins
Docker-first setup	Yes	Yes (Compose)
GitHub stars	100K+	35K+

Bottom line: Use Open WebUI if you have a single vLLM or Ollama backend and want a fast setup. Use LibreChat if you're routing to multiple providers or need granular per-user provider access control.

Architecture

The stack has three layers: a lightweight frontend container, a GPU-intensive LLM backend, and optional supporting services for RAG.

[User Browser]
    ↓ HTTPS (nginx or Cloudflare)
[Open WebUI / LibreChat container]   ← CPU-only, ~2 vCPU, 4GB RAM
    ↓ OpenAI-compatible REST API (port 8000)
[vLLM server]                         ← GPU-intensive, H100 or L40S
    ↓ (optional)
[TEI Embedding Server]               ← lightweight, 1-2GB VRAM
[Qdrant / Milvus / Weaviate]         ← vector DB for RAG

Component	Compute	Port	Notes
Open WebUI or LibreChat	CPU only	3000 or 3080	Stateless; data in volume or Postgres
vLLM	GPU (H100/L40S)	8000	Never expose publicly
TEI embedding server	GPU (shared OK)	8080	Optional, for RAG
Vector DB (Qdrant)	CPU or GPU	6333	Optional, for RAG

For a deep dive on the vLLM backend setup, see Build a Self-Hosted OpenAI-Compatible API with vLLM.

Hardware Sizing

The frontend container is negligible: 2 CPU cores and 4GB RAM cover thousands of idle sessions. The GPU is everything.

For GPU sizing across model sizes, see GPU memory requirements for LLMs for the full VRAM calculator. The table below covers the models most teams actually deploy as team chat backends:

Model	VRAM (approx)	Recommended GPU	Max concurrent users
Llama 3.1 8B (FP16)	~16GB	L40S	30-40
Llama 3.3 70B (AWQ)	~38GB	H100 80GB	20-25
Llama 4 Scout 109B (INT4)	~55GB	H100 SXM5 80GB	10-15

"Concurrent" here means simultaneous streaming chats, not registered users. A 100-person team where 20 are actively chatting at once is a 20-concurrent-user workload.

For teams running Llama 3.3 70B or Llama 4 Scout, H100 rental on Spheron is the standard starting point. For smaller models at 8B-14B parameter scale, an L40S rental cuts the per-hour cost roughly in half.

Step 1: Provision Your GPU Instance

Log into app.spheron.ai
Select your GPU tier: H100 SXM5 80GB for 70B+ models, L40S PCIe for 8B-14B models
Choose Ubuntu 22.04 with CUDA 12.4 pre-installed
Set your SSH key and start the instance
SSH in and verify:

bash

nvidia-smi

Install Docker if not already present:

bash

curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER && newgrp docker

Spot vs on-demand: For a chat serving workload running 24/7 or during business hours, prefer on-demand instances. Spot instances can be reclaimed, which terminates all active chat sessions. Reserve spot for batch inference or offline jobs.

Current pricing (as of 17 May 2026):

GPU	On-demand (per GPU/hr)	Spot (per GPU/hr)
H100 SXM5 80GB	from $3.90	from $1.63
L40S PCIe 48GB	from $0.75	from $1.03

Pricing fluctuates based on GPU availability. The prices above are based on 17 May 2026 and may have changed. L40S spot is not currently discounted below on-demand. Check current GPU pricing → for live rates.

Step 2: Start a vLLM Backend

bash

docker run -d \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:v0.6.4 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --served-model-name llama-70b

vLLM exposes its API on host port 8000. The Open WebUI container (Step 3) connects to it using host.docker.internal:8000 from within its own network namespace. On Linux, the --add-host=host.docker.internal:host-gateway flag must be passed to the Open WebUI container (not vLLM) so the frontend can resolve the host address. On macOS and Windows Docker Desktop, host.docker.internal resolves automatically in all containers.

Verify the endpoint is live:

bash

curl http://localhost:8000/v1/models

You should see a JSON response listing llama-70b as an available model. For advanced vLLM tuning including tensor parallelism, FP8 KV cache, and continuous batching config, see vLLM multi-GPU production deployment.

Step 3: Deploy Open WebUI

Generate a secret key and store it in a .env file before running:

bash

echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" > .env

bash

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  --env-file .env \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=none \
  -e WEBUI_AUTH=true \
  -e ENABLE_SIGNUP=true \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in a browser. The first account you create becomes the admin. Once the admin account exists, go to Admin Panel > Settings and disable signup to lock down new registrations.

Select your model from the dropdown in the top-left of the chat interface. If llama-70b appears, the vLLM connection is working.

Key environment variables:

Variable	Purpose	Example
`OPENAI_API_BASE_URL`	vLLM endpoint	`http://host.docker.internal:8000/v1`
`OPENAI_API_KEY`	Required but ignored by vLLM	Any non-empty string
`WEBUI_AUTH`	Enable multi-user auth	`true`
`WEBUI_SECRET_KEY`	Session signing key	Random 32-char string
`ENABLE_SIGNUP`	Allow new registrations	`false` after initial setup

Auth lockout warning: If you start with WEBUI_AUTH=false (single-user mode) and later change it to true, existing session cookies are invalidated. Set your auth mode before the first login.

Step 4: Alternative - Deploy LibreChat

LibreChat uses Docker Compose. Here is a minimal docker-compose.yml:

yaml

version: "3.8"
services:
  api:
    image: ghcr.io/danny-avila/librechat-dev:latest
    restart: always
    ports:
      - "3080:3080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      mongodb:
        condition: service_healthy
    env_file: .env
    volumes:
      - ./librechat.yaml:/app/librechat.yaml:ro
      - librechat-data:/app/client/public/images

  mongodb:
    image: mongo:6
    restart: always
    healthcheck:
      test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - mongo-data:/data/db

volumes:
  librechat-data:
  mongo-data:

The depends_on with condition: service_healthy is required. Without it, LibreChat starts before MongoDB is ready and fails with an intermittent connection error that is easy to miss.

Your librechat.yaml defines all LLM providers. A setup with vLLM as primary and Anthropic as fallback:

yaml

endpoints:
  custom:
    - name: "vLLM (self-hosted)"
      apiKey: "none"
      baseURL: "http://host.docker.internal:8000/v1"
      models:
        default: ["llama-70b"]
        fetch: false
      titleConvo: true
      titleModel: "llama-70b"

    - name: "Anthropic Claude"
      apiKey: "${ANTHROPIC_API_KEY}"
      baseURL: "https://api.anthropic.com/v1"
      models:
        default: ["claude-sonnet-4-5"]
        fetch: false

Users see both providers in the model dropdown and can switch between them per conversation. Your vLLM backend stays private; only the API key handling is routed through LibreChat.

Step 5: SSO for Team Access

Open WebUI OIDC

Store secrets in a .env file so they are not exposed in shell history or ps aux output:

bash

cat > .env << 'EOF'
WEBUI_SECRET_KEY=your-random-32-char-string
OAUTH_CLIENT_SECRET=your-client-secret
EOF

bash

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  --env-file .env \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=none \
  -e WEBUI_AUTH=true \
  -e ENABLE_SIGNUP=false \
  -e ENABLE_OAUTH_SIGNUP=true \
  -e OAUTH_CLIENT_ID=your-client-id \
  -e OPENID_PROVIDER_URL=https://your-provider.com/.well-known/openid-configuration \
  -e OAUTH_SCOPES="openid email profile" \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

This works with Keycloak, Okta, Google Workspace, and Azure AD. Any identity provider that supports the OIDC discovery endpoint at /.well-known/openid-configuration will work.

LibreChat OIDC

Add these to your LibreChat .env:

ALLOW_SOCIAL_LOGIN=true
OPENID_CLIENT_ID=your-client-id
OPENID_CLIENT_SECRET=your-client-secret
OPENID_ISSUER=https://your-provider.com
OPENID_SCOPE="openid email profile"
OPENID_CALLBACK_URL=/oauth/openid/callback

Rate Limiting

Open WebUI has a built-in rate limit env var:

GLOBAL_RATE_LIMIT_MAX=100

For nginx in front of Open WebUI:

nginx

limit_req_zone $binary_remote_addr zone=webui:10m rate=20r/s;

server {
    location / {
        limit_req zone=webui burst=40 nodelay;
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_read_timeout 600s;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

The proxy_buffering off line is critical for streaming. Without it, nginx buffers the response and users see a blank screen until generation is complete.

Audit Logging

Open WebUI logs all requests to stdout. Capture and forward them:

bash

docker logs --follow open-webui 2>&1 | tee -a /var/log/open-webui/access.log

Pipe to your SIEM or logging stack (Loki, Datadog, CloudWatch) from there.

Step 6: Wire Up RAG

Before starting this step, you need a running TEI embedding server and a vector database. For full setup instructions, see Self-Host Embeddings and Rerankers with TEI on GPU Cloud and Self-Host Vector Databases on GPU Cloud.

Open WebUI RAG

Once TEI and Qdrant are running, go to Admin Panel > Settings > Documents:

Embedding Model Endpoint: your TEI server URL (e.g., http://host.docker.internal:8080)
Vector DB: Qdrant at http://host.docker.internal:6333

Upload documents via the UI (the paperclip icon in chat) or via the REST API:

bash

curl -X POST http://localhost:3000/api/v1/documents \
  -H "Authorization: Bearer your-session-token" \
  -F "file=@report.pdf"

LibreChat RAG

Add these to your LibreChat .env:

RAG_OPENAI_BASEURL=http://host.docker.internal:8080/v1
RAG_OPENAI_API_KEY=none
EMBEDDINGS_PROVIDER=huggingfacetei

LibreChat sends documents to TEI for embedding and stores vectors in Qdrant automatically. The retrieval happens at query time using the same embedding model.

Embedding models are lightweight. A BGE-M3 or Qwen3-Embedding-0.6B model uses 1-2GB VRAM. On an H100 80GB running Llama 3.3 70B AWQ at ~38GB, there is comfortably room to co-locate the embedding model on the same GPU without measurable impact on inference latency.

Step 7: Optional Code Interpreter Sandbox

Open WebUI ships with a built-in Python code runner that executes in the container itself. It is convenient for quick data analysis but carries real risk in a multi-user setup: any user who can submit code can run arbitrary Python inside your container.

LibreChat can connect to E2B or Daytona for sandboxed code execution. For production multi-user environments, this is the right call. The AI agent code execution sandbox guide covers setting up E2B and Firecracker-based alternatives.

If you are running Open WebUI for a team, either disable the code runner entirely (ENABLE_CODE_EXECUTION=false) or use it only in single-user mode where you control who has access.

Concurrent User Benchmarks

Setup: Spheron H100 SXM5 80GB, vLLM 0.6+, Llama 3.3 70B AWQ, FP8 KV cache.

Concurrent users	Avg TTFT	Throughput (tok/s)	GPU util	P95 TTFT
10	~350ms	~1,800	65%	~600ms
50	~1.1s	~2,400	88%	~2.8s
100	~2.4s	~2,600	96%	~6.5s

Results are approximate on H100 SXM5 80GB with Llama 3.3 70B AWQ. Numbers vary with prompt and generation length and KV cache pressure.

At 50+ sustained concurrent users, two H100s with tensor parallelism (--tensor-parallel-size 2) is the safer configuration. TTFT stays under 1 second even at 100 users with the expanded KV cache capacity.

Cost vs ChatGPT Business

ChatGPT Business (formerly ChatGPT Team) is $25 per seat per month. Here is how that stacks up against a self-hosted H100 SXM5 setup at the current on-demand rate of $3.90/hr.

Seats	ChatGPT Business	H100 (24/7)	H100 (business hours)
10	$250	$2,808	$686
25	$625	$2,808	$686
40	$1,000	$2,808	$686
50	$1,250	$2,808	$686
75	$1,875	$2,808	$686
113	$2,825	$2,808	$686

Monthly GPU cost: $3.90/hr × 720 hours = $2,808 for 24/7 operation. For business-hours-only use (8 hours/day, 22 working days): $3.90/hr × 176 hours = $686/month.

Break-even at 24/7 uptime: about 113 seats. Break-even for business-hours use: about 28 seats.

If you use spot instances ($1.63/hr for H100 SXM5), the 24/7 monthly cost drops to $1,174 and break-even falls to around 47 seats. Note that spot can be preempted, terminating active sessions.

For smaller models on an L40S PCIe at $0.75/hr: 24/7 monthly cost is $540, breaking even against ChatGPT Business at around 22 seats.

If ChatGPT Enterprise is your comparison point ($60 per seat per month), the break-even drops to 47 seats at 24/7 on-demand operation.

Security Checklist

Network: place vLLM behind a private network; port 8000 must never be publicly accessible. Expose only Open WebUI or LibreChat via HTTPS.
Auth: disable ENABLE_SIGNUP after initial admin setup; enforce OIDC for all users in team deployments.
Secrets: never pass API keys in Docker run arguments; use --env-file or Docker secrets.
Model access: restrict which models users can select via Open WebUI's model permissions in Admin Panel > Models.
TLS: terminate SSL at nginx or Cloudflare; do not run chat frontends on plain HTTP in production.
Updates: pin image tags (not latest) in production; test upgrades on a staging instance before applying to production.

Troubleshooting

Streaming failures (blank responses or cutoff output)

Symptom: Open WebUI shows a blank chat bubble or output cuts off mid-sentence.

Cause: Reverse proxy (nginx) is buffering the server-sent event (SSE) stream.

Fix: Add these directives to your nginx location block:

nginx

proxy_http_version 1.1;
proxy_set_header Connection '';
proxy_buffering off;
proxy_read_timeout 600s;
proxy_cache off;

vLLM connection refused inside container

Symptom: Open WebUI shows "connection refused" or "network error" even though vLLM is running.

Cause: OPENAI_API_BASE_URL points to localhost, which resolves to the container's own network, not the host machine.

Fix: Use host.docker.internal instead of localhost:

OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1

On Linux, ensure the Open WebUI container was started with --add-host=host.docker.internal:host-gateway. This flag is required on Linux; macOS and Windows Docker Desktop handle it automatically. The vLLM container does not need this flag.

RAG returning irrelevant results

Symptom: Document retrieval returns passages unrelated to the query.

Cause: The embedding model used at indexing time does not match the one used at query time.

Fix: Ensure the same model is configured in TEI for both document upload (indexing) and the live query path. Changing the --model flag in TEI after documents have been indexed requires re-indexing all documents. Check your TEI startup command:

bash

docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-m3

If you are seeing degraded retrieval after an upgrade, the model name may have changed. Re-index affected document collections.

Build a Self-Hosted OpenAI-Compatible API with vLLM: Start here for the vLLM backend this post builds on top of.
Self-Host Embeddings and Rerankers: TEI on GPU Cloud: Add a production embedding pipeline to the RAG step above.
Self-Host Vector Databases on GPU Cloud: Qdrant, Milvus, and Weaviate colocation guide.
vLLM Multi-GPU Deployment 2026: Tensor parallelism, FP8, and production monitoring for the LLM backend.
GPU Memory Requirements for LLMs: VRAM sizing calculator for 7B to 685B models.
AI Agent Code Execution Sandbox: Wire up a secure code interpreter for the optional Step 7.
OpenClaw for agentic workflows: If you need more than a chat UI and want 50+ tool integrations with a fully agentic loop, this covers the OpenClaw self-hosted setup on the same GPU backend.

Teams running Open WebUI or LibreChat at scale need a reliable GPU backend that does not rate-limit or log your prompts. Spheron provides bare-metal H100 and L40S instances on-demand with no token fees. You pay for GPU time, not API calls.
Browse H100 capacity → | Check L40S availability → | View all GPU pricing →

STEPS / 06

Quick Setup Guide

Choose between Open WebUI and LibreChat
Compare the feature matrix: Open WebUI is simpler to set up with native Ollama support and a cleaner interface; LibreChat supports multiple simultaneous providers and is better for teams mixing self-hosted and cloud APIs. Choose Open WebUI for pure vLLM or Ollama setups; choose LibreChat for multi-provider flexibility.
Provision a Spheron H100 or L40S GPU instance
Log into app.spheron.ai, select H100 SXM5 80GB for 70B models or L40S for 8B-13B models, choose Ubuntu 22.04 with CUDA 12.4, set an SSH key, and start the instance. Verify the GPU with nvidia-smi after SSH access.
Start a vLLM OpenAI-compatible backend
Run the vLLM Docker container with --gpus all --ipc=host -p 8000:8000, specify your model with --model, and optionally add --quantization fp8 for memory efficiency. Verify the endpoint with curl http://localhost:8000/v1/models.
Deploy Open WebUI and connect it to vLLM
Run the Open WebUI Docker container with --add-host=host.docker.internal:host-gateway (required on Linux), OPENAI_API_BASE_URL set to http://host.docker.internal:8000/v1, and OPENAI_API_KEY set to any non-empty string. Store WEBUI_SECRET_KEY in a .env file and pass it via --env-file. Open http://localhost:3000 to create your admin account.
Configure SSO and team access controls
Set ENABLE_OAUTH_SIGNUP=true and configure OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET, and OPENID_PROVIDER_URL for your identity provider (Keycloak, Okta, Google Workspace, or Azure AD). Disable ENABLE_SIGNUP after the first admin account is created to prevent unauthorized registrations.
Wire up RAG with an embedding backend and vector DB
Start a TEI embedding server and a vector database, then configure Open WebUI via Admin Panel > Settings > Documents with the embedding endpoint URL, or set RAG_OPENAI_BASEURL and EMBEDDINGS_PROVIDER in LibreChat's environment. Upload documents via the UI or the POST /api/v1/documents API endpoint.

FAQ / 05

Frequently Asked Questions

Open WebUI is a self-hosted chat interface with 100K+ GitHub stars, built primarily for Ollama but with full OpenAI-compatible API support. It ships with chat history, model switching, and basic RAG out of the box. LibreChat is a more flexible multi-provider platform with 35K+ stars that lets you configure multiple LLM providers, both self-hosted and cloud, in a single dropdown. It is better for teams that want to mix vLLM with OpenAI or Anthropic fallbacks.

The frontend container is CPU-only and negligible. The GPU goes entirely to the vLLM backend. Llama 3.3 70B in AWQ quantization needs about 38GB VRAM, fitting on a single H100 80GB. For 50 simultaneous streaming users, one H100 SXM5 starts showing queuing at peak. Two H100s with tensor parallelism is the safer sizing for sustained 50-user concurrency with Llama 3.3 70B.

Yes. Set OPENAI_API_BASE_URL to your vLLM server address (e.g., http://host.docker.internal:8000/v1) and OPENAI_API_KEY to any non-empty string. Open WebUI treats any OpenAI-compatible endpoint as a first-class provider alongside Ollama.

Set ENABLE_OAUTH_SIGNUP=true and configure OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET, and OPENID_PROVIDER_URL in Open WebUI's environment variables. It works with Keycloak, Okta, Google Workspace, and Azure AD. After setting up OIDC, disable ENABLE_SIGNUP to prevent new users from creating local accounts.

ChatGPT Business costs $25 per seat per month. An H100 SXM5 on Spheron starts at $3.90/hr on-demand, which works out to about $2,808/month for 24/7 uptime. That breaks even at roughly 113 seats. If your team uses the GPU only during business hours (8 hrs/day, 5 days/week), the monthly cost drops to about $686 and the break-even falls to around 28 seats.

Open WebUI vs LibreChat: When to Pick Each

Architecture

Hardware Sizing

Step 1: Provision Your GPU Instance

Step 2: Start a vLLM Backend

Step 3: Deploy Open WebUI

Step 4: Alternative - Deploy LibreChat

Step 5: SSO for Team Access

Open WebUI OIDC

LibreChat OIDC

Rate Limiting

Audit Logging

Step 6: Wire Up RAG

Open WebUI RAG

LibreChat RAG

Step 7: Optional Code Interpreter Sandbox

Concurrent User Benchmarks

Cost vs ChatGPT Business

Security Checklist

Troubleshooting

Streaming failures (blank responses or cutoff output)

vLLM connection refused inside container

RAG returning irrelevant results

Related Guides

Quick Setup Guide

Choose between Open WebUI and LibreChat

Provision a Spheron H100 or L40S GPU instance

Start a vLLM OpenAI-compatible backend

Deploy Open WebUI and connect it to vLLM

Configure SSO and team access controls

Wire up RAG with an embedding backend and vector DB

Frequently Asked Questions

01What is Open WebUI and how does it differ from LibreChat?

02How much GPU do I need to run Open WebUI with Llama 3.3 70B for a team of 50?

03Can I use Open WebUI with a vLLM backend instead of Ollama?

04How do I add SSO and OIDC to Open WebUI for team access?

05When does self-hosting a ChatGPT alternative become cheaper than ChatGPT Business?

Build what's next.