Tutorial

Self-Host Open WebUI and LibreChat on GPU Cloud: Production Guide (2026)

Back to BlogWritten by Mitrasish, Co-founderMay 17, 2026
Self Host Open WebUIOpen WebUI GPU CloudLibreChat Self HostedChatGPT Alternative Self HostedOpen WebUI vLLMSelf Hosted ChatGPT for TeamsLibreChat Multi-UserOpen WebUI Ollama ProductionPrivate ChatGPT CompanyGPU Cloud
Self-Host Open WebUI and LibreChat on GPU Cloud: Production Guide (2026)

ChatGPT Business costs $25 per seat per month (formerly ChatGPT Team at $30). At 25 people that's $7,500 per year, and every message your team sends goes through OpenAI's servers. This post walks through deploying Open WebUI or LibreChat as a private team chat interface, backed by a vLLM inference server on GPU cloud, with SSO, RAG, and a cost comparison against ChatGPT Business pricing.

Open WebUI vs LibreChat: When to Pick Each

Both are Docker-first, open-source chat frontends that work with any OpenAI-compatible API. The right choice depends on how your team uses LLMs.

Open WebUI (100K+ GitHub stars) started as the official Ollama web UI and has grown into a full chat platform. Native Ollama support means zero config if you're already running Ollama. For teams that want one vLLM endpoint and a clean interface, it's the fastest path from zero to working product. The built-in RAG pipeline, model permissions, and user management cover most small-to-midsize team needs without configuration overhead.

LibreChat (35K+ stars) is built around multi-provider flexibility. A single librechat.yaml file defines all your endpoints: your vLLM backend, an Anthropic Claude API key for fallback, an Azure OpenAI deployment for compliance, all visible in one model dropdown. Teams that need to mix self-hosted inference with cloud APIs, or that want stronger audit logging and plugin extensibility, should lean toward LibreChat.

FeatureOpen WebUILibreChat
Model supportOpenAI-compat, Ollama, nativeOpenAI-compat, multi-provider YAML
Multi-user authYesYes
SSO (OIDC/SAML)OIDC via env varsOIDC, social login
RAG pipelineBuilt-in (docs upload)Via RAG_OPENAI_BASEURL
Agents and toolsYes (built-in tool use)Yes (plugins)
Code interpreterIn-container Python runnerE2B, Daytona (external)
Audit loggingstdout logsstdout + plugin hooks
Plugin ecosystemCommunity toolsLibreChat plugins
Docker-first setupYesYes (Compose)
GitHub stars100K+35K+

Bottom line: Use Open WebUI if you have a single vLLM or Ollama backend and want a fast setup. Use LibreChat if you're routing to multiple providers or need granular per-user provider access control.

Architecture

The stack has three layers: a lightweight frontend container, a GPU-intensive LLM backend, and optional supporting services for RAG.

[User Browser]
    ↓ HTTPS (nginx or Cloudflare)
[Open WebUI / LibreChat container]   ← CPU-only, ~2 vCPU, 4GB RAM
    ↓ OpenAI-compatible REST API (port 8000)
[vLLM server]                         ← GPU-intensive, H100 or L40S
    ↓ (optional)
[TEI Embedding Server]               ← lightweight, 1-2GB VRAM
[Qdrant / Milvus / Weaviate]         ← vector DB for RAG
ComponentComputePortNotes
Open WebUI or LibreChatCPU only3000 or 3080Stateless; data in volume or Postgres
vLLMGPU (H100/L40S)8000Never expose publicly
TEI embedding serverGPU (shared OK)8080Optional, for RAG
Vector DB (Qdrant)CPU or GPU6333Optional, for RAG

For a deep dive on the vLLM backend setup, see Build a Self-Hosted OpenAI-Compatible API with vLLM.

Hardware Sizing

The frontend container is negligible: 2 CPU cores and 4GB RAM cover thousands of idle sessions. The GPU is everything.

For GPU sizing across model sizes, see GPU memory requirements for LLMs for the full VRAM calculator. The table below covers the models most teams actually deploy as team chat backends:

ModelVRAM (approx)Recommended GPUMax concurrent users
Llama 3.1 8B (FP16)~16GBL40S30-40
Llama 3.3 70B (AWQ)~38GBH100 80GB20-25
Llama 4 Scout 109B (INT4)~55GBH100 SXM5 80GB10-15

"Concurrent" here means simultaneous streaming chats, not registered users. A 100-person team where 20 are actively chatting at once is a 20-concurrent-user workload.

For teams running Llama 3.3 70B or Llama 4 Scout, H100 rental on Spheron is the standard starting point. For smaller models at 8B-14B parameter scale, an L40S rental cuts the per-hour cost roughly in half.

Step 1: Provision Your GPU Instance

  1. Log into app.spheron.ai
  2. Select your GPU tier: H100 SXM5 80GB for 70B+ models, L40S PCIe for 8B-14B models
  3. Choose Ubuntu 22.04 with CUDA 12.4 pre-installed
  4. Set your SSH key and start the instance
  5. SSH in and verify:
bash
nvidia-smi

Install Docker if not already present:

bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER && newgrp docker

Spot vs on-demand: For a chat serving workload running 24/7 or during business hours, prefer on-demand instances. Spot instances can be reclaimed, which terminates all active chat sessions. Reserve spot for batch inference or offline jobs.

Current pricing (as of 17 May 2026):

GPUOn-demand (per GPU/hr)Spot (per GPU/hr)
H100 SXM5 80GBfrom $3.90from $1.63
L40S PCIe 48GBfrom $0.75from $1.03

Pricing fluctuates based on GPU availability. The prices above are based on 17 May 2026 and may have changed. L40S spot is not currently discounted below on-demand. Check current GPU pricing → for live rates.

Step 2: Start a vLLM Backend

bash
docker run -d \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:v0.6.4 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --served-model-name llama-70b

vLLM exposes its API on host port 8000. The Open WebUI container (Step 3) connects to it using host.docker.internal:8000 from within its own network namespace. On Linux, the --add-host=host.docker.internal:host-gateway flag must be passed to the Open WebUI container (not vLLM) so the frontend can resolve the host address. On macOS and Windows Docker Desktop, host.docker.internal resolves automatically in all containers.

Verify the endpoint is live:

bash
curl http://localhost:8000/v1/models

You should see a JSON response listing llama-70b as an available model. For advanced vLLM tuning including tensor parallelism, FP8 KV cache, and continuous batching config, see vLLM multi-GPU production deployment.

Step 3: Deploy Open WebUI

Generate a secret key and store it in a .env file before running:

bash
echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" > .env
bash
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  --env-file .env \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=none \
  -e WEBUI_AUTH=true \
  -e ENABLE_SIGNUP=true \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in a browser. The first account you create becomes the admin. Once the admin account exists, go to Admin Panel > Settings and disable signup to lock down new registrations.

Select your model from the dropdown in the top-left of the chat interface. If llama-70b appears, the vLLM connection is working.

Key environment variables:

VariablePurposeExample
OPENAI_API_BASE_URLvLLM endpointhttp://host.docker.internal:8000/v1
OPENAI_API_KEYRequired but ignored by vLLMAny non-empty string
WEBUI_AUTHEnable multi-user authtrue
WEBUI_SECRET_KEYSession signing keyRandom 32-char string
ENABLE_SIGNUPAllow new registrationsfalse after initial setup

Auth lockout warning: If you start with WEBUI_AUTH=false (single-user mode) and later change it to true, existing session cookies are invalidated. Set your auth mode before the first login.

Step 4: Alternative - Deploy LibreChat

LibreChat uses Docker Compose. Here is a minimal docker-compose.yml:

yaml
version: "3.8"
services:
  api:
    image: ghcr.io/danny-avila/librechat-dev:latest
    restart: always
    ports:
      - "3080:3080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      mongodb:
        condition: service_healthy
    env_file: .env
    volumes:
      - ./librechat.yaml:/app/librechat.yaml:ro
      - librechat-data:/app/client/public/images

  mongodb:
    image: mongo:6
    restart: always
    healthcheck:
      test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - mongo-data:/data/db

volumes:
  librechat-data:
  mongo-data:

The depends_on with condition: service_healthy is required. Without it, LibreChat starts before MongoDB is ready and fails with an intermittent connection error that is easy to miss.

Your librechat.yaml defines all LLM providers. A setup with vLLM as primary and Anthropic as fallback:

yaml
endpoints:
  custom:
    - name: "vLLM (self-hosted)"
      apiKey: "none"
      baseURL: "http://host.docker.internal:8000/v1"
      models:
        default: ["llama-70b"]
        fetch: false
      titleConvo: true
      titleModel: "llama-70b"

    - name: "Anthropic Claude"
      apiKey: "${ANTHROPIC_API_KEY}"
      baseURL: "https://api.anthropic.com/v1"
      models:
        default: ["claude-sonnet-4-5"]
        fetch: false

Users see both providers in the model dropdown and can switch between them per conversation. Your vLLM backend stays private; only the API key handling is routed through LibreChat.

Step 5: SSO for Team Access

Open WebUI OIDC

Store secrets in a .env file so they are not exposed in shell history or ps aux output:

bash
cat > .env << 'EOF'
WEBUI_SECRET_KEY=your-random-32-char-string
OAUTH_CLIENT_SECRET=your-client-secret
EOF
bash
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  --env-file .env \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=none \
  -e WEBUI_AUTH=true \
  -e ENABLE_SIGNUP=false \
  -e ENABLE_OAUTH_SIGNUP=true \
  -e OAUTH_CLIENT_ID=your-client-id \
  -e OPENID_PROVIDER_URL=https://your-provider.com/.well-known/openid-configuration \
  -e OAUTH_SCOPES="openid email profile" \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

This works with Keycloak, Okta, Google Workspace, and Azure AD. Any identity provider that supports the OIDC discovery endpoint at /.well-known/openid-configuration will work.

LibreChat OIDC

Add these to your LibreChat .env:

ALLOW_SOCIAL_LOGIN=true
OPENID_CLIENT_ID=your-client-id
OPENID_CLIENT_SECRET=your-client-secret
OPENID_ISSUER=https://your-provider.com
OPENID_SCOPE="openid email profile"
OPENID_CALLBACK_URL=/oauth/openid/callback

Rate Limiting

Open WebUI has a built-in rate limit env var:

GLOBAL_RATE_LIMIT_MAX=100

For nginx in front of Open WebUI:

nginx
limit_req_zone $binary_remote_addr zone=webui:10m rate=20r/s;

server {
    location / {
        limit_req zone=webui burst=40 nodelay;
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_read_timeout 600s;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

The proxy_buffering off line is critical for streaming. Without it, nginx buffers the response and users see a blank screen until generation is complete.

Audit Logging

Open WebUI logs all requests to stdout. Capture and forward them:

bash
docker logs --follow open-webui 2>&1 | tee -a /var/log/open-webui/access.log

Pipe to your SIEM or logging stack (Loki, Datadog, CloudWatch) from there.

Step 6: Wire Up RAG

Before starting this step, you need a running TEI embedding server and a vector database. For full setup instructions, see Self-Host Embeddings and Rerankers with TEI on GPU Cloud and Self-Host Vector Databases on GPU Cloud.

Open WebUI RAG

Once TEI and Qdrant are running, go to Admin Panel > Settings > Documents:

  • Embedding Model Endpoint: your TEI server URL (e.g., http://host.docker.internal:8080)
  • Vector DB: Qdrant at http://host.docker.internal:6333

Upload documents via the UI (the paperclip icon in chat) or via the REST API:

bash
curl -X POST http://localhost:3000/api/v1/documents \
  -H "Authorization: Bearer your-session-token" \
  -F "file=@report.pdf"

LibreChat RAG

Add these to your LibreChat .env:

RAG_OPENAI_BASEURL=http://host.docker.internal:8080/v1
RAG_OPENAI_API_KEY=none
EMBEDDINGS_PROVIDER=huggingfacetei

LibreChat sends documents to TEI for embedding and stores vectors in Qdrant automatically. The retrieval happens at query time using the same embedding model.

Embedding models are lightweight. A BGE-M3 or Qwen3-Embedding-0.6B model uses 1-2GB VRAM. On an H100 80GB running Llama 3.3 70B AWQ at ~38GB, there is comfortably room to co-locate the embedding model on the same GPU without measurable impact on inference latency.

Step 7: Optional Code Interpreter Sandbox

Open WebUI ships with a built-in Python code runner that executes in the container itself. It is convenient for quick data analysis but carries real risk in a multi-user setup: any user who can submit code can run arbitrary Python inside your container.

LibreChat can connect to E2B or Daytona for sandboxed code execution. For production multi-user environments, this is the right call. The AI agent code execution sandbox guide covers setting up E2B and Firecracker-based alternatives.

If you are running Open WebUI for a team, either disable the code runner entirely (ENABLE_CODE_EXECUTION=false) or use it only in single-user mode where you control who has access.

Concurrent User Benchmarks

Setup: Spheron H100 SXM5 80GB, vLLM 0.6+, Llama 3.3 70B AWQ, FP8 KV cache.

Concurrent usersAvg TTFTThroughput (tok/s)GPU utilP95 TTFT
10~350ms~1,80065%~600ms
50~1.1s~2,40088%~2.8s
100~2.4s~2,60096%~6.5s

Results are approximate on H100 SXM5 80GB with Llama 3.3 70B AWQ. Numbers vary with prompt and generation length and KV cache pressure.

At 50+ sustained concurrent users, two H100s with tensor parallelism (--tensor-parallel-size 2) is the safer configuration. TTFT stays under 1 second even at 100 users with the expanded KV cache capacity.

Cost vs ChatGPT Business

ChatGPT Business (formerly ChatGPT Team) is $25 per seat per month. Here is how that stacks up against a self-hosted H100 SXM5 setup at the current on-demand rate of $3.90/hr.

SeatsChatGPT BusinessH100 (24/7)H100 (business hours)
10$250$2,808$686
25$625$2,808$686
40$1,000$2,808$686
50$1,250$2,808$686
75$1,875$2,808$686
113$2,825$2,808$686

Monthly GPU cost: $3.90/hr × 720 hours = $2,808 for 24/7 operation. For business-hours-only use (8 hours/day, 22 working days): $3.90/hr × 176 hours = $686/month.

Break-even at 24/7 uptime: about 113 seats. Break-even for business-hours use: about 28 seats.

If you use spot instances ($1.63/hr for H100 SXM5), the 24/7 monthly cost drops to $1,174 and break-even falls to around 47 seats. Note that spot can be preempted, terminating active sessions.

For smaller models on an L40S PCIe at $0.75/hr: 24/7 monthly cost is $540, breaking even against ChatGPT Business at around 22 seats.

If ChatGPT Enterprise is your comparison point ($60 per seat per month), the break-even drops to 47 seats at 24/7 on-demand operation.

Security Checklist

  • Network: place vLLM behind a private network; port 8000 must never be publicly accessible. Expose only Open WebUI or LibreChat via HTTPS.
  • Auth: disable ENABLE_SIGNUP after initial admin setup; enforce OIDC for all users in team deployments.
  • Secrets: never pass API keys in Docker run arguments; use --env-file or Docker secrets.
  • Model access: restrict which models users can select via Open WebUI's model permissions in Admin Panel > Models.
  • TLS: terminate SSL at nginx or Cloudflare; do not run chat frontends on plain HTTP in production.
  • Updates: pin image tags (not latest) in production; test upgrades on a staging instance before applying to production.

Troubleshooting

Streaming failures (blank responses or cutoff output)

Symptom: Open WebUI shows a blank chat bubble or output cuts off mid-sentence.

Cause: Reverse proxy (nginx) is buffering the server-sent event (SSE) stream.

Fix: Add these directives to your nginx location block:

nginx
proxy_http_version 1.1;
proxy_set_header Connection '';
proxy_buffering off;
proxy_read_timeout 600s;
proxy_cache off;

vLLM connection refused inside container

Symptom: Open WebUI shows "connection refused" or "network error" even though vLLM is running.

Cause: OPENAI_API_BASE_URL points to localhost, which resolves to the container's own network, not the host machine.

Fix: Use host.docker.internal instead of localhost:

OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1

On Linux, ensure the Open WebUI container was started with --add-host=host.docker.internal:host-gateway. This flag is required on Linux; macOS and Windows Docker Desktop handle it automatically. The vLLM container does not need this flag.

RAG returning irrelevant results

Symptom: Document retrieval returns passages unrelated to the query.

Cause: The embedding model used at indexing time does not match the one used at query time.

Fix: Ensure the same model is configured in TEI for both document upload (indexing) and the live query path. Changing the --model flag in TEI after documents have been indexed requires re-indexing all documents. Check your TEI startup command:

bash
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-m3

If you are seeing degraded retrieval after an upgrade, the model name may have changed. Re-index affected document collections.


Teams running Open WebUI or LibreChat at scale need a reliable GPU backend that does not rate-limit or log your prompts. Spheron provides bare-metal H100 and L40S instances on-demand with no token fees. You pay for GPU time, not API calls.

Rent H100 on Spheron → | Rent L40S → | View all GPU pricing →

STEPS / 06

Quick Setup Guide

  1. Choose between Open WebUI and LibreChat

    Compare the feature matrix: Open WebUI is simpler to set up with native Ollama support and a cleaner interface; LibreChat supports multiple simultaneous providers and is better for teams mixing self-hosted and cloud APIs. Choose Open WebUI for pure vLLM or Ollama setups; choose LibreChat for multi-provider flexibility.

  2. Provision a Spheron H100 or L40S GPU instance

    Log into app.spheron.ai, select H100 SXM5 80GB for 70B models or L40S for 8B-13B models, choose Ubuntu 22.04 with CUDA 12.4, set an SSH key, and start the instance. Verify the GPU with nvidia-smi after SSH access.

  3. Start a vLLM OpenAI-compatible backend

    Run the vLLM Docker container with --gpus all --ipc=host -p 8000:8000, specify your model with --model, and optionally add --quantization fp8 for memory efficiency. Verify the endpoint with curl http://localhost:8000/v1/models.

  4. Deploy Open WebUI and connect it to vLLM

    Run the Open WebUI Docker container with --add-host=host.docker.internal:host-gateway (required on Linux), OPENAI_API_BASE_URL set to http://host.docker.internal:8000/v1, and OPENAI_API_KEY set to any non-empty string. Store WEBUI_SECRET_KEY in a .env file and pass it via --env-file. Open http://localhost:3000 to create your admin account.

  5. Configure SSO and team access controls

    Set ENABLE_OAUTH_SIGNUP=true and configure OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET, and OPENID_PROVIDER_URL for your identity provider (Keycloak, Okta, Google Workspace, or Azure AD). Disable ENABLE_SIGNUP after the first admin account is created to prevent unauthorized registrations.

  6. Wire up RAG with an embedding backend and vector DB

    Start a TEI embedding server and a vector database, then configure Open WebUI via Admin Panel > Settings > Documents with the embedding endpoint URL, or set RAG_OPENAI_BASEURL and EMBEDDINGS_PROVIDER in LibreChat's environment. Upload documents via the UI or the POST /api/v1/documents API endpoint.

FAQ / 05

Frequently Asked Questions

Open WebUI is a self-hosted chat interface with 100K+ GitHub stars, built primarily for Ollama but with full OpenAI-compatible API support. It ships with chat history, model switching, and basic RAG out of the box. LibreChat is a more flexible multi-provider platform with 35K+ stars that lets you configure multiple LLM providers, both self-hosted and cloud, in a single dropdown. It is better for teams that want to mix vLLM with OpenAI or Anthropic fallbacks.

The frontend container is CPU-only and negligible. The GPU goes entirely to the vLLM backend. Llama 3.3 70B in AWQ quantization needs about 38GB VRAM, fitting on a single H100 80GB. For 50 simultaneous streaming users, one H100 SXM5 starts showing queuing at peak. Two H100s with tensor parallelism is the safer sizing for sustained 50-user concurrency with Llama 3.3 70B.

Yes. Set OPENAI_API_BASE_URL to your vLLM server address (e.g., http://host.docker.internal:8000/v1) and OPENAI_API_KEY to any non-empty string. Open WebUI treats any OpenAI-compatible endpoint as a first-class provider alongside Ollama.

Set ENABLE_OAUTH_SIGNUP=true and configure OAUTH_CLIENT_ID, OAUTH_CLIENT_SECRET, and OPENID_PROVIDER_URL in Open WebUI's environment variables. It works with Keycloak, Okta, Google Workspace, and Azure AD. After setting up OIDC, disable ENABLE_SIGNUP to prevent new users from creating local accounts.

ChatGPT Business costs $25 per seat per month. An H100 SXM5 on Spheron starts at $3.90/hr on-demand, which works out to about $2,808/month for 24/7 uptime. That breaks even at roughly 113 seats. If your team uses the GPU only during business hours (8 hrs/day, 5 days/week), the monthly cost drops to about $686 and the break-even falls to around 28 seats.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.