ChatGPT Business costs $25 per seat per month (formerly ChatGPT Team at $30). At 25 people that's $7,500 per year, and every message your team sends goes through OpenAI's servers. This post walks through deploying Open WebUI or LibreChat as a private team chat interface, backed by a vLLM inference server on GPU cloud, with SSO, RAG, and a cost comparison against ChatGPT Business pricing.
Open WebUI vs LibreChat: When to Pick Each
Both are Docker-first, open-source chat frontends that work with any OpenAI-compatible API. The right choice depends on how your team uses LLMs.
Open WebUI (100K+ GitHub stars) started as the official Ollama web UI and has grown into a full chat platform. Native Ollama support means zero config if you're already running Ollama. For teams that want one vLLM endpoint and a clean interface, it's the fastest path from zero to working product. The built-in RAG pipeline, model permissions, and user management cover most small-to-midsize team needs without configuration overhead.
LibreChat (35K+ stars) is built around multi-provider flexibility. A single librechat.yaml file defines all your endpoints: your vLLM backend, an Anthropic Claude API key for fallback, an Azure OpenAI deployment for compliance, all visible in one model dropdown. Teams that need to mix self-hosted inference with cloud APIs, or that want stronger audit logging and plugin extensibility, should lean toward LibreChat.
| Feature | Open WebUI | LibreChat |
|---|---|---|
| Model support | OpenAI-compat, Ollama, native | OpenAI-compat, multi-provider YAML |
| Multi-user auth | Yes | Yes |
| SSO (OIDC/SAML) | OIDC via env vars | OIDC, social login |
| RAG pipeline | Built-in (docs upload) | Via RAG_OPENAI_BASEURL |
| Agents and tools | Yes (built-in tool use) | Yes (plugins) |
| Code interpreter | In-container Python runner | E2B, Daytona (external) |
| Audit logging | stdout logs | stdout + plugin hooks |
| Plugin ecosystem | Community tools | LibreChat plugins |
| Docker-first setup | Yes | Yes (Compose) |
| GitHub stars | 100K+ | 35K+ |
Bottom line: Use Open WebUI if you have a single vLLM or Ollama backend and want a fast setup. Use LibreChat if you're routing to multiple providers or need granular per-user provider access control.
Architecture
The stack has three layers: a lightweight frontend container, a GPU-intensive LLM backend, and optional supporting services for RAG.
[User Browser]
↓ HTTPS (nginx or Cloudflare)
[Open WebUI / LibreChat container] ← CPU-only, ~2 vCPU, 4GB RAM
↓ OpenAI-compatible REST API (port 8000)
[vLLM server] ← GPU-intensive, H100 or L40S
↓ (optional)
[TEI Embedding Server] ← lightweight, 1-2GB VRAM
[Qdrant / Milvus / Weaviate] ← vector DB for RAG
| Component | Compute | Port | Notes |
|---|---|---|---|
| Open WebUI or LibreChat | CPU only | 3000 or 3080 | Stateless; data in volume or Postgres |
| vLLM | GPU (H100/L40S) | 8000 | Never expose publicly |
| TEI embedding server | GPU (shared OK) | 8080 | Optional, for RAG |
| Vector DB (Qdrant) | CPU or GPU | 6333 | Optional, for RAG |
For a deep dive on the vLLM backend setup, see Build a Self-Hosted OpenAI-Compatible API with vLLM.
Hardware Sizing
The frontend container is negligible: 2 CPU cores and 4GB RAM cover thousands of idle sessions. The GPU is everything.
For sizing across a wider range of models, see GPU memory requirements for LLMs, which includes the full VRAM calculator. The table below covers the models most teams actually deploy as team chat backends:
| Model | VRAM (approx) | Recommended GPU | Max concurrent users |
|---|---|---|---|
| Llama 3.1 8B (FP16) | ~16GB | L40S | 30-40 |
| Llama 3.3 70B (AWQ) | ~38GB | H100 80GB | 20-25 |
| Llama 4 Scout 109B (INT4) | ~55GB | H100 SXM5 80GB | 10-15 |
"Concurrent" here means simultaneous streaming chats, not registered users. A 100-person team where 20 are actively chatting at once is a 20-concurrent-user workload.
For teams running Llama 3.3 70B or Llama 4 Scout, H100 rental on Spheron is the standard starting point. For smaller models at 8B-14B parameter scale, an L40S rental cuts the per-hour cost roughly in half.
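To make the registered-versus-concurrent distinction concrete, here is a minimal sketch that turns headcount into an estimated concurrent load and compares it with the upper bounds in the sizing table. The active_fraction parameter and both function names are assumptions introduced for this example, and the capacity figures are the approximate ones from the table, not guarantees.

```python
import math

# Upper bounds of the "Max concurrent users" column in the sizing table (approximate).
MAX_CONCURRENT = {
    "Llama 3.1 8B (FP16)": 40,
    "Llama 3.3 70B (AWQ)": 25,
    "Llama 4 Scout 109B (INT4)": 15,
}


def concurrent_users(team_size: int, active_fraction: float) -> int:
    """Estimate simultaneous streaming chats from headcount, rounded up."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(team_size * active_fraction, 9))


def fits_on_one_gpu(model: str, team_size: int, active_fraction: float) -> bool:
    """True if the estimated load is within the table's upper bound for the model."""
    return concurrent_users(team_size, active_fraction) <= MAX_CONCURRENT[model]
```

With the 100-person team from the example, an active fraction of 0.2 gives 20 concurrent users, which sits inside the 70B AWQ range but above the Llama 4 Scout range.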
Step 1: Provision Your GPU Instance
- Log into app.spheron.ai
- Select your GPU tier: H100 SXM5 80GB for 70B+ models, L40S PCIe for 8B-14B models
- Choose Ubuntu 22.04 with CUDA 12.4 pre-installed
- Set your SSH key and start the instance
- SSH in and verify:
nvidia-smi
Install Docker if not already present:
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER && newgrp docker
Spot vs on-demand: For a chat serving workload running 24/7 or during business hours, prefer on-demand instances. Spot instances can be reclaimed, which terminates all active chat sessions. Reserve spot for batch inference or offline jobs.
Current pricing (as of 17 May 2026):
| GPU | On-demand (per GPU/hr) | Spot (per GPU/hr) |
|---|---|---|
| H100 SXM5 80GB | from $3.90 | from $1.63 |
| L40S PCIe 48GB | from $0.75 | from $1.03 |
Pricing fluctuates based on GPU availability. The prices above were recorded on 17 May 2026 and may have changed. L40S spot is not currently discounted below on-demand. Check current GPU pricing for live rates.
Step 2: Start a vLLM Backend
docker run -d \
--gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:v0.6.4 \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--served-model-name llama-70b
vLLM exposes its API on host port 8000. The Open WebUI container (Step 3) connects to it using host.docker.internal:8000 from within its own network namespace. On Linux, the --add-host=host.docker.internal:host-gateway flag must be passed to the Open WebUI container (not vLLM) so the frontend can resolve the host address. On macOS and Windows Docker Desktop, host.docker.internal resolves automatically in all containers.
Verify the endpoint is live:
curl http://localhost:8000/v1/models
You should see a JSON response listing llama-70b as an available model. For advanced vLLM tuning including tensor parallelism, FP8 KV cache, and continuous batching config, see vLLM multi-GPU production deployment.
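If you would rather script the check than read the JSON by eye, the sketch below does the same thing as the curl call. It assumes the standard OpenAI-compatible list shape (a top-level data array of objects with an id field); the function names are made up for this example, and fetch_model_ids only works when the vLLM server is reachable from where you run it.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-compatible /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Query the models endpoint; requires the vLLM server to be running."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return served_model_ids(json.load(resp))


# Example use on the GPU host (not executed here):
#   if "llama-70b" not in fetch_model_ids():
#       raise SystemExit("vLLM is up but llama-70b is not being served")
```

Keeping the parsing separate from the network call makes it easy to reuse in a container healthcheck or a deploy script.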
Step 3: Deploy Open WebUI
Generate a secret key and store it in a .env file before running:
echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" > .env
docker run -d \
--name open-webui \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
--env-file .env \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=none \
-e WEBUI_AUTH=true \
-e ENABLE_SIGNUP=true \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in a browser. The first account you create becomes the admin. Once the admin account exists, go to Admin Panel > Settings and disable signup to lock down new registrations.
Select your model from the dropdown in the top-left of the chat interface. If llama-70b appears, the vLLM connection is working.
Key environment variables:
| Variable | Purpose | Example |
|---|---|---|
| OPENAI_API_BASE_URL | vLLM endpoint | http://host.docker.internal:8000/v1 |
| OPENAI_API_KEY | Required but ignored by vLLM | Any non-empty string |
| WEBUI_AUTH | Enable multi-user auth | true |
| WEBUI_SECRET_KEY | Session signing key | Random 32-char string |
| ENABLE_SIGNUP | Allow new registrations | false after initial setup |
Auth lockout warning: If you start with WEBUI_AUTH=false (single-user mode) and later change it to true, existing session cookies are invalidated. Set your auth mode before the first login.
Step 4: Alternative - Deploy LibreChat
LibreChat uses Docker Compose. Here is a minimal docker-compose.yml:
version: "3.8"
services:
  api:
    image: ghcr.io/danny-avila/librechat-dev:latest
    restart: always
    ports:
      - "3080:3080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      mongodb:
        condition: service_healthy
    env_file: .env
    volumes:
      - ./librechat.yaml:/app/librechat.yaml:ro
      - librechat-data:/app/client/public/images
  mongodb:
    image: mongo:6
    restart: always
    healthcheck:
      test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - mongo-data:/data/db
volumes:
  librechat-data:
  mongo-data:
The depends_on with condition: service_healthy is required. Without it, LibreChat starts before MongoDB is ready and fails with an intermittent connection error that is easy to miss.
Your librechat.yaml defines all LLM providers. A setup with vLLM as primary and Anthropic as fallback:
endpoints:
  custom:
    - name: "vLLM (self-hosted)"
      apiKey: "none"
      baseURL: "http://host.docker.internal:8000/v1"
      models:
        default: ["llama-70b"]
        fetch: false
      titleConvo: true
      titleModel: "llama-70b"
    - name: "Anthropic Claude"
      apiKey: "${ANTHROPIC_API_KEY}"
      baseURL: "https://api.anthropic.com/v1"
      models:
        default: ["claude-sonnet-4-5"]
        fetch: false
Users see both providers in the model dropdown and can switch between them per conversation. Your vLLM backend stays private; only the API key handling is routed through LibreChat.
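A typo in librechat.yaml tends to surface only as a missing entry in the dropdown, so a quick shape check before restarting the container can save a round trip. The sketch below works on the already-parsed config as a plain dict (loading the YAML, for example with PyYAML, is left out and assumed). The list of required keys covers only the fields used in the example above, not LibreChat's full schema, and the function name is hypothetical.

```python
# Fields every custom endpoint in the example above carries.
# This is an assumption for illustration, not LibreChat's complete schema.
REQUIRED_KEYS = ("name", "apiKey", "baseURL", "models")


def check_custom_endpoints(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks OK."""
    problems = []
    custom = config.get("endpoints", {}).get("custom", [])
    for index, endpoint in enumerate(custom):
        label = endpoint.get("name", f"custom[{index}]")
        for key in REQUIRED_KEYS:
            if key not in endpoint:
                problems.append(f"{label}: missing {key}")
        # An endpoint with no default models shows nothing selectable in the UI.
        if not endpoint.get("models", {}).get("default"):
            problems.append(f"{label}: models.default is empty")
    return problems
```

Running this in CI against the file you mount into the container is one way to catch a dropped baseURL before users do.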
Step 5: SSO for Team Access
Open WebUI OIDC
Store secrets in a .env file so they are not exposed in shell history or ps aux output:
cat > .env << 'EOF'
WEBUI_SECRET_KEY=your-random-32-char-string
OAUTH_CLIENT_SECRET=your-client-secret
EOF
docker run -d \
--name open-webui \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
--env-file .env \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
-e OPENAI_API_KEY=none \
-e WEBUI_AUTH=true \
-e ENABLE_SIGNUP=false \
-e ENABLE_OAUTH_SIGNUP=true \
-e OAUTH_CLIENT_ID=your-client-id \
-e OPENID_PROVIDER_URL=https://your-provider.com/.well-known/openid-configuration \
-e OAUTH_SCOPES="openid email profile" \
-v open-webui:/app/backend/data \
ghcr.io/open-webui/open-webui:main
This works with Keycloak, Okta, Google Workspace, and Azure AD. Any identity provider that supports the OIDC discovery endpoint at /.well-known/openid-configuration will work.
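Before pointing Open WebUI at a provider, it can help to confirm that the discovery document is complete. The sketch below lists the metadata fields that the OpenID Connect Discovery 1.0 specification marks as required and reports any that are absent from a parsed document. Fetching the document is left to you, and the helper names are assumptions for this example.

```python
# Fields marked REQUIRED by OpenID Connect Discovery 1.0
# (token_endpoint is required unless only the implicit flow is used).
REQUIRED_DISCOVERY_FIELDS = (
    "issuer",
    "authorization_endpoint",
    "token_endpoint",
    "jwks_uri",
    "response_types_supported",
    "subject_types_supported",
    "id_token_signing_alg_values_supported",
)


def discovery_url(issuer: str) -> str:
    """Build the well-known discovery URL from an issuer base URL."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def missing_discovery_fields(document: dict) -> list[str]:
    """Return required metadata fields that are absent from the document."""
    return [field for field in REQUIRED_DISCOVERY_FIELDS if field not in document]
```

An empty result does not prove the login flow will work, but a non-empty one usually explains a failing redirect faster than the container logs do.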
LibreChat OIDC
Add these to your LibreChat .env:
ALLOW_SOCIAL_LOGIN=true
OPENID_CLIENT_ID=your-client-id
OPENID_CLIENT_SECRET=your-client-secret
OPENID_ISSUER=https://your-provider.com
OPENID_SCOPE="openid email profile"
OPENID_CALLBACK_URL=/oauth/openid/callback
Rate Limiting
Open WebUI has a built-in rate limit env var:
GLOBAL_RATE_LIMIT_MAX=100
For nginx in front of Open WebUI:
limit_req_zone $binary_remote_addr zone=webui:10m rate=20r/s;
server {
    location / {
        limit_req zone=webui burst=40 nodelay;
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_read_timeout 600s;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
The proxy_buffering off line is critical for streaming. Without it, nginx buffers the response and users see a blank screen until generation is complete.
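The interaction between rate=20r/s, burst=40, and nodelay is easier to reason about with a toy model. The class below is a simplified sketch of leaky-bucket accounting for a single client IP. It is not nginx's implementation (which tracks excess in fixed-point units in shared memory), and the class name is made up for this example.

```python
class LimitReqModel:
    """Simplified leaky-bucket model of limit_req with nodelay for one client."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate      # sustained requests per second
        self.burst = burst    # extra requests tolerated above the rate
        self.excess = 0.0     # current backlog above the sustained rate
        self.last = None      # time of the last accepted request

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last is None:
            # The first request from a client is always accepted.
            self.last = now
            return True
        # The backlog drains at `rate` per second and grows by one per request.
        excess = max(0.0, self.excess - self.rate * (now - self.last) + 1.0)
        if excess > self.burst:
            return False  # rejected requests do not change the state
        self.excess = excess
        self.last = now
        return True
```

In this model, 50 simultaneous requests from one IP admit 41 (one plus the burst of 40), and one second later another 20 get through as the bucket drains. A browser tab streaming a single chat stays far below that, so the limit mainly bites scripted clients.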
Audit Logging
Open WebUI logs all requests to stdout. Capture and forward them:
docker logs --follow open-webui 2>&1 | tee -a /var/log/open-webui/access.log
Pipe to your SIEM or logging stack (Loki, Datadog, CloudWatch) from there.
Step 6: Wire Up RAG
Before starting this step, you need a running TEI embedding server and a vector database. For full setup instructions, see Self-Host Embeddings and Rerankers with TEI on GPU Cloud and Self-Host Vector Databases on GPU Cloud.
Open WebUI RAG
Once TEI and Qdrant are running, go to Admin Panel > Settings > Documents:
- Embedding Model Endpoint: your TEI server URL (e.g., http://host.docker.internal:8080)
- Vector DB: Qdrant at http://host.docker.internal:6333
Upload documents via the UI (the paperclip icon in chat) or via the REST API:
curl -X POST http://localhost:3000/api/v1/documents \
-H "Authorization: Bearer your-session-token" \
-F "file=@report.pdf"
LibreChat RAG
Add these to your LibreChat .env:
RAG_OPENAI_BASEURL=http://host.docker.internal:8080/v1
RAG_OPENAI_API_KEY=none
EMBEDDINGS_PROVIDER=huggingfacetei
LibreChat sends documents to TEI for embedding and stores vectors in Qdrant automatically. The retrieval happens at query time using the same embedding model.
Embedding models are lightweight. A BGE-M3 or Qwen3-Embedding-0.6B model uses 1-2GB VRAM. On an H100 80GB running Llama 3.3 70B AWQ at ~38GB, there is comfortably room to co-locate the embedding model on the same GPU without measurable impact on inference latency.
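The co-location claim comes down to simple arithmetic once you account for how much of the card vLLM reserves. The sketch below assumes vLLM is started with --gpu-memory-utilization 0.90 as in Step 2, so it claims 90% of the card for weights, activations, and KV cache, and the embedding server has to fit in what is left. The 38GB and 2GB figures are the approximate ones quoted above; the function names are hypothetical.

```python
def kv_cache_budget_gb(total_vram_gb: float, vllm_utilization: float,
                       weights_gb: float) -> float:
    """Memory inside vLLM's reserved share that is left after model weights."""
    return round(total_vram_gb * vllm_utilization - weights_gb, 2)


def colocation_headroom_gb(total_vram_gb: float, vllm_utilization: float,
                           embedding_vram_gb: float) -> float:
    """VRAM outside vLLM's reserved share after loading the embedding model."""
    reserved_by_vllm = total_vram_gb * vllm_utilization
    return round(total_vram_gb - reserved_by_vllm - embedding_vram_gb, 2)


# H100 80GB, vLLM at 0.90 utilization, ~38GB of AWQ weights, ~2GB embedding model.
print(kv_cache_budget_gb(80, 0.90, 38))
print(colocation_headroom_gb(80, 0.90, 2))
```

If you raise the utilization flag to squeeze out more KV cache, the headroom for the embedding server shrinks by the same amount, so the two settings need to be chosen together.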
Step 7: Optional Code Interpreter Sandbox
Open WebUI ships with a built-in Python code runner that executes in the container itself. It is convenient for quick data analysis but carries real risk in a multi-user setup: any user who can submit code can run arbitrary Python inside your container.
LibreChat can connect to E2B or Daytona for sandboxed code execution. For production multi-user environments, this is the right call. The AI agent code execution sandbox guide covers setting up E2B and Firecracker-based alternatives.
If you are running Open WebUI for a team, either disable the code runner entirely (ENABLE_CODE_EXECUTION=false) or use it only in single-user mode where you control who has access.
Concurrent User Benchmarks
Setup: Spheron H100 SXM5 80GB, vLLM 0.6+, Llama 3.3 70B AWQ, FP8 KV cache.
| Concurrent users | Avg TTFT | Throughput (tok/s) | GPU util | P95 TTFT |
|---|---|---|---|---|
| 10 | ~350ms | ~1,800 | 65% | ~600ms |
| 50 | ~1.1s | ~2,400 | 88% | ~2.8s |
| 100 | ~2.4s | ~2,600 | 96% | ~6.5s |
Results are approximate on H100 SXM5 80GB with Llama 3.3 70B AWQ. Numbers vary with prompt and generation length and KV cache pressure.
At 50+ sustained concurrent users, two H100s with tensor parallelism (--tensor-parallel-size 2) is the safer configuration. TTFT stays under 1 second even at 100 users with the expanded KV cache capacity.
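Aggregate throughput hides what each person actually experiences. Dividing the approximate throughput figures from the table by the number of concurrent users gives a rough per-user generation speed. This is a back-of-the-envelope sketch that assumes load is spread evenly across users, which real traffic rarely is.

```python
# Approximate aggregate throughput (tokens/sec) from the benchmark table above.
AGGREGATE_TOK_S = {10: 1800, 50: 2400, 100: 2600}


def per_user_tok_s(concurrent_users: int, aggregate_tok_s: float) -> float:
    """Average tokens per second each user sees, assuming an even split."""
    return aggregate_tok_s / concurrent_users


for users, aggregate in AGGREGATE_TOK_S.items():
    print(f"{users:>3} users: ~{per_user_tok_s(users, aggregate):.0f} tok/s each")
```

Going from 10 to 100 users raises aggregate throughput by less than half while the per-user rate falls from roughly 180 to roughly 26 tokens per second, which is the point where responses start to feel slow.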
Cost vs ChatGPT Business
ChatGPT Business (formerly ChatGPT Team) is $25 per seat per month. Here is how that stacks up against a self-hosted H100 SXM5 setup at the current on-demand rate of $3.90/hr.
| Seats | ChatGPT Business | H100 (24/7) | H100 (business hours) |
|---|---|---|---|
| 10 | $250 | $2,808 | $686 |
| 25 | $625 | $2,808 | $686 |
| 40 | $1,000 | $2,808 | $686 |
| 50 | $1,250 | $2,808 | $686 |
| 75 | $1,875 | $2,808 | $686 |
| 113 | $2,825 | $2,808 | $686 |
Monthly GPU cost: $3.90/hr × 720 hours = $2,808 for 24/7 operation. For business-hours-only use (8 hours/day, 22 working days): $3.90/hr × 176 hours = $686/month.
Break-even at 24/7 uptime: about 113 seats. Break-even for business-hours use: about 28 seats.
If you use spot instances ($1.63/hr for H100 SXM5), the 24/7 monthly cost drops to $1,174 and break-even falls to around 47 seats. Note that spot can be preempted, terminating active sessions.
For smaller models on an L40S PCIe at $0.75/hr: 24/7 monthly cost is $540, breaking even against ChatGPT Business at around 22 seats.
If ChatGPT Enterprise is your comparison point ($60 per seat per month), the break-even drops to 47 seats at 24/7 on-demand operation.
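The break-even figures in this section all follow from the same two formulas, so it is worth having them in a form you can rerun when prices change. The sketch below uses the hourly rates and seat prices quoted above; the function and constant names are assumptions for this example.

```python
import math

SEAT_PRICE_BUSINESS = 25.0    # USD per seat per month (ChatGPT Business)
SEAT_PRICE_ENTERPRISE = 60.0  # USD per seat per month, as quoted above
HOURS_24_7 = 720              # 30 days x 24 hours
HOURS_BUSINESS = 8 * 22       # 8 hours/day x 22 working days = 176


def monthly_gpu_cost(rate_per_hour: float, hours: int) -> float:
    """GPU rental cost for one month at a flat hourly rate."""
    return rate_per_hour * hours


def break_even_seats(monthly_cost: float, seat_price: float) -> int:
    """Smallest seat count whose subscription cost is at least the GPU bill."""
    return math.ceil(monthly_cost / seat_price)


scenarios = [
    ("H100 on-demand, 24/7", 3.90, HOURS_24_7, SEAT_PRICE_BUSINESS),
    ("H100 on-demand, business hours", 3.90, HOURS_BUSINESS, SEAT_PRICE_BUSINESS),
    ("H100 spot, 24/7", 1.63, HOURS_24_7, SEAT_PRICE_BUSINESS),
    ("L40S on-demand, 24/7", 0.75, HOURS_24_7, SEAT_PRICE_BUSINESS),
    ("H100 on-demand, 24/7 vs Enterprise", 3.90, HOURS_24_7, SEAT_PRICE_ENTERPRISE),
]
for label, rate, hours, seat_price in scenarios:
    cost = monthly_gpu_cost(rate, hours)
    print(f"{label}: ${cost:,.0f}/month, break-even at {break_even_seats(cost, seat_price)} seats")
```

Seat counts are rounded up because a fractional seat cannot be bought. Swap in your own hourly rate and usage hours to see where your team lands.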
Security Checklist
- Network: place vLLM behind a private network; port 8000 must never be publicly accessible. Expose only Open WebUI or LibreChat via HTTPS.
- Auth: disable ENABLE_SIGNUP after initial admin setup; enforce OIDC for all users in team deployments.
- Secrets: never pass API keys in Docker run arguments; use --env-file or Docker secrets.
- Model access: restrict which models users can select via Open WebUI's model permissions in Admin Panel > Models.
- TLS: terminate SSL at nginx or Cloudflare; do not run chat frontends on plain HTTP in production.
- Updates: pin image tags (not latest) in production; test upgrades on a staging instance before applying to production.
Troubleshooting
Streaming failures (blank responses or cutoff output)
Symptom: Open WebUI shows a blank chat bubble or output cuts off mid-sentence.
Cause: Reverse proxy (nginx) is buffering the server-sent event (SSE) stream.
Fix: Add these directives to your nginx location block:
proxy_http_version 1.1;
proxy_set_header Connection '';
proxy_buffering off;
proxy_read_timeout 600s;
proxy_cache off;
vLLM connection refused inside container
Symptom: Open WebUI shows "connection refused" or "network error" even though vLLM is running.
Cause: OPENAI_API_BASE_URL points to localhost, which resolves to the container's own network, not the host machine.
Fix: Use host.docker.internal instead of localhost:
OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1
On Linux, ensure the Open WebUI container was started with --add-host=host.docker.internal:host-gateway. This flag is required on Linux; macOS and Windows Docker Desktop handle it automatically. The vLLM container does not need this flag.
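This mistake is easy to catch mechanically before the container starts. The sketch below flags a base URL whose host is a loopback name, since inside a container that points back at the container itself. The function name and the set of hosts checked are assumptions for this example.

```python
from urllib.parse import urlparse

# Hostnames that resolve to the container itself when used inside a container.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def points_at_container_itself(base_url: str) -> bool:
    """True if the URL's host is a loopback address."""
    return urlparse(base_url).hostname in LOOPBACK_HOSTS


for url in ("http://localhost:8000/v1", "http://host.docker.internal:8000/v1"):
    verdict = "will NOT reach the host" if points_at_container_itself(url) else "ok"
    print(f"{url}: {verdict}")
```

A check like this fits naturally in whatever script generates your .env file.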
RAG returning irrelevant results
Symptom: Document retrieval returns passages unrelated to the query.
Cause: The embedding model used at indexing time does not match the one used at query time.
Fix: Ensure the same model is configured in TEI for both document upload (indexing) and the live query path. Changing the --model-id flag in TEI after documents have been indexed requires re-indexing all documents. Check your TEI startup command:
docker run --gpus all -p 8080:80 \
ghcr.io/huggingface/text-embeddings-inference:latest \
--model-id BAAI/bge-m3
If you are seeing degraded retrieval after an upgrade, the model name may have changed. Re-index affected document collections.
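One way to catch a mismatch early is to record which embedding model built each collection and compare it with what the server is running now. The sketch below assumes you stored the model ID yourself at indexing time (hypothetical bookkeeping, not something either frontend does for you) and that the TEI server's /info endpoint returns JSON with a model_id field. Treat both as assumptions to verify against your deployment.

```python
def needs_reindex(indexed_model_id: str, tei_info: dict) -> bool:
    """True if the running embedding model differs from the one used to index.

    `tei_info` is the parsed JSON from the embedding server's info endpoint;
    the `model_id` key is an assumption about that payload.
    """
    return tei_info.get("model_id") != indexed_model_id


# Example with a payload shaped like the assumed info response.
running = {"model_id": "BAAI/bge-m3"}
print(needs_reindex("BAAI/bge-m3", running))
print(needs_reindex("Qwen/Qwen3-Embedding-0.6B", running))
```

A missing model_id is treated as a mismatch here, which errs on the side of re-indexing rather than silently serving poor results.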
Related Guides
- Build a Self-Hosted OpenAI-Compatible API with vLLM: Start here for the vLLM backend this post builds on top of.
- Self-Host Embeddings and Rerankers: TEI on GPU Cloud: Add a production embedding pipeline to the RAG step above.
- Self-Host Vector Databases on GPU Cloud: Qdrant, Milvus, and Weaviate colocation guide.
- vLLM Multi-GPU Deployment 2026: Tensor parallelism, FP8, and production monitoring for the LLM backend.
- GPU Memory Requirements for LLMs: VRAM sizing calculator for 7B to 685B models.
- AI Agent Code Execution Sandbox: Wire up a secure code interpreter for the optional Step 7.
Teams running Open WebUI or LibreChat at scale need a reliable GPU backend that does not rate-limit or log your prompts. Spheron provides bare-metal H100 and L40S instances on-demand with no token fees. You pay for GPU time, not API calls.
Quick Setup Guide
Compare the feature matrix: Open WebUI is simpler to set up with native Ollama support and a cleaner interface; LibreChat supports multiple simultaneous providers and is better for teams mixing self-hosted and cloud APIs. Choose Open WebUI for pure vLLM or Ollama setups; choose LibreChat for multi-provider flexibility.
Log into app.spheron.ai, select H100 SXM5 80GB for 70B models or L40S for 8B-14B models, choose Ubuntu 22.04 with CUDA 12.4, set an SSH key, and start the instance. Verify the GPU with nvidia-smi after SSH access.
Run the vLLM Docker container with --gpus all --ipc=host -p 8000:8000, specify your model with --model, and optionally add --quantization fp8 for memory efficiency. Verify the endpoint with curl http://localhost:8000/v1/models.
Run the Open WebUI Docker container with --add-host=host.docker.internal:host-gateway (required on Linux), OPENAI_API_BASE_URL set to http://host.docker.internal:8000/v1, and OPENAI_API_KEY set to any non-empty string. Store WEBUI_SECRET_KEY in a .env file and pass it via --env-file. Open http://localhost:3000 to create your admin account.
Set ENABLE_OAUTH_SIGNUP=true and configure OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET, and OPENID_PROVIDER_URL for your identity provider (Keycloak, Okta, Google Workspace, or Azure AD). Disable ENABLE_SIGNUP after the first admin account is created to prevent unauthorized registrations.
Start a TEI embedding server and a vector database, then configure Open WebUI via Admin Panel > Settings > Documents with the embedding endpoint URL, or set RAG_OPENAI_BASEURL and EMBEDDINGS_PROVIDER in LibreChat's environment. Upload documents via the UI or the POST /api/v1/documents API endpoint.
Frequently Asked Questions
Open WebUI is a self-hosted chat interface with 100K+ GitHub stars, built primarily for Ollama but with full OpenAI-compatible API support. It ships with chat history, model switching, and basic RAG out of the box. LibreChat is a more flexible multi-provider platform with 35K+ stars that lets you configure multiple LLM providers, both self-hosted and cloud, in a single dropdown. It is better for teams that want to mix vLLM with OpenAI or Anthropic fallbacks.
The frontend container is CPU-only and negligible. The GPU goes entirely to the vLLM backend. Llama 3.3 70B in AWQ quantization needs about 38GB VRAM, fitting on a single H100 80GB. For 50 simultaneous streaming users, one H100 SXM5 starts showing queuing at peak. Two H100s with tensor parallelism is the safer sizing for sustained 50-user concurrency with Llama 3.3 70B.
Yes. Set OPENAI_API_BASE_URL to your vLLM server address (e.g., http://host.docker.internal:8000/v1) and OPENAI_API_KEY to any non-empty string. Open WebUI treats any OpenAI-compatible endpoint as a first-class provider alongside Ollama.
Set ENABLE_OAUTH_SIGNUP=true and configure OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET, and OPENID_PROVIDER_URL in Open WebUI's environment variables. It works with Keycloak, Okta, Google Workspace, and Azure AD. After setting up OIDC, disable ENABLE_SIGNUP to prevent new users from creating local accounts.
ChatGPT Business costs $25 per seat per month. An H100 SXM5 on Spheron starts at $3.90/hr on-demand, which works out to about $2,808/month for 24/7 uptime. That breaks even at roughly 113 seats. If your team uses the GPU only during business hours (8 hrs/day, 5 days/week), the monthly cost drops to about $686 and the break-even falls to around 28 seats.
