Tutorial

Migrate from Hugging Face TGI to vLLM or SGLang on GPU Cloud: A 2026 Move-Off Guide

migrate from TGI to vLLMTGI deprecated alternativeHugging Face TGI maintenance modeTGI to SGLang migrationvLLMSGLangLLM Inference MigrationGPU Cloud
Migrate from Hugging Face TGI to vLLM or SGLang on GPU Cloud: A 2026 Move-Off Guide

Hugging Face announced TGI maintenance mode in December 2025 and archived the GitHub repo read-only on 21 Mar 2026. No new model architectures, no new features, no performance work, and no further bug fixes. If you have teams deploying newer models like Llama 4 or Qwen 3 and hitting model support gaps, the official guidance is to migrate to vLLM or SGLang. For background on TGI's current deployment setup before you move, the TGI production deployment guide covers what it looked like when TGI was actively developed. If you need to decide between vLLM and SGLang before reading further, the vLLM vs SGLang head-to-head covers throughput and latency numbers at four concurrency levels.

What Changed with TGI in 2026

TGI maintenance mode originally meant bugs would get fixed and security patches would land, but no new features. The TGI GitHub repo was archived read-only on 21 Mar 2026, ending even that. The practical consequence for teams running TGI in production:

Model architecture support stops. TGI only supports the model architectures already in its codebase. Llama (all generations through Llama 3.3), Qwen 2/3, Mistral/Mixtral, Falcon, Gemma, Phi, and Command-R are covered. New architectures introduced after TGI's maintenance transition will not get support. Llama 4 Scout and Maverick already shipped and TGI's frozen codebase has no support for their architecture today. When Qwen 4 or DeepSeek V4 ship, TGI users hit the same wall. vLLM and SGLang add architecture support within days of model releases.

Kernel improvements stop. FP8 Tensor Core optimization, attention kernel improvements, and speculative decoding upgrades are ongoing in vLLM and SGLang. TGI's performance stays fixed at the maintenance-mode cutoff. Teams on vLLM get throughput improvements passively as new versions ship.

The recommended migration path. Hugging Face's own docs point to vLLM as the primary migration target for most workloads and SGLang for multi-turn agent workloads with shared prefix structure.

What still works in TGI: existing models load fine, the OpenAI-compatible /v1/chat/completions API remains intact, Hugging Face Hub token integration works, and the Prometheus metrics endpoint at /metrics stays operational. Nothing breaks immediately. The issue is the slow drift as newer models and optimizations only land in vLLM and SGLang.

Decision: vLLM or SGLang?

CriteriavLLMSGLang
Broadest model supportYesSlightly narrower
No compile stepYesYes
Multi-turn / RAG prefix cachingAPC (opt-in via flag)RadixAttention (default, token-level granularity)
Structured JSON outputxgrammarxgrammar + grammar-cache reuse
MoE model optimizationMRV2 + EPLBDeepEP/DeepGEMM kernels
Operational simplicitySimpler configMore cache tuning knobs

Default to vLLM. It covers the widest model set, the configuration is the most direct TGI analog, and the OpenAI-compatible API surface is identical. Switch to SGLang when your workload has prefix overlap above 60%: multi-turn agent conversations, RAG pipelines where the same documents appear in many requests, or chatbots with a fixed system prompt across all users. In those cases, SGLang's RadixAttention delivers 20-40% lower TTFT by reusing KV cache for shared prefixes instead of recomputing them per request.

TGI Flag Translation Table

Every flag you use in your current TGI launch command has an equivalent. The table below covers the complete set of commonly used TGI flags.

TGI FlagvLLM EquivalentSGLang EquivalentNotes
--model-id--model--model-pathSame Hugging Face Hub ID format works
--num-shard N--tensor-parallel-size N--tp NSame semantics: split model across N GPUs
--quantize fp8--quantization fp8--quantization fp8H100+ hardware FP8; use on Hopper and Blackwell
--quantize bitsandbytes--quantization bitsandbytesNot supportedConvert checkpoint to AWQ before migrating
--quantize awq--quantization awq--quantization awqAWQ checkpoints are portable across engines
--quantize gptq--quantization gptq--quantization gptqWorks without conversion
--quantize eetqNo direct equivalentNo direct equivalentEETQ is TGI-specific; re-quantize to AWQ or GPTQ
--max-input-length--max-model-len--context-lengthSets max context window
--max-total-tokens--max-model-len--context-lengthBoth TGI flags map to the same vLLM/SGLang flag
--max-batch-prefill-tokens--max-num-batched-tokens--chunked-prefill-sizeCaps tokens processed per prefill step
--max-concurrent-requests--max-num-seqs--max-running-requestsIn-flight request cap
--speculate N--speculative-config '{"method":"draft_model","model":"<draft-model>","num_speculative_tokens":N}'--num-speculative-tokens NvLLM uses a single JSON config object; TGI uses Medusa heads baked into the model
HUGGING_FACE_HUB_TOKEN envHUGGING_FACE_HUB_TOKEN envHUGGING_FACE_HUB_TOKEN envSame env var; no change needed
--hostname 0.0.0.0--host 0.0.0.0--host 0.0.0.0Minor rename
--port 8080--port 8000--port 8000Default port changes; update client base URLs
--disable-flash-attnNot neededNot neededFlash attention is default in both; no disable flag needed
--shm-size 1g (Docker flag)--ipc=host (Docker flag)--ipc=host (Docker flag)Critical substitution. Using --shm-size without --ipc=host causes CUDA errors under load in multi-process vLLM/SGLang. Replace the Docker flag entirely.

Two edge cases to call out explicitly:

EETQ checkpoints. EETQ is a TGI-specific quantization format. If your current deployment uses --quantize eetq, you cannot pass the checkpoint directly to vLLM or SGLang. You need to re-quantize the weights to AWQ or GPTQ format first using the original full-precision checkpoint. This is the one migration path that requires model conversion work.

Speculative decoding with Medusa. TGI's --speculate N works when Medusa draft heads are baked into the model checkpoint. Current vLLM uses a single --speculative-config JSON, e.g. --speculative-config '{"method":"draft_model","model":"<draft_model_id>","num_speculative_tokens":N}'. If you used Medusa-integrated TGI checkpoints, you need a standalone draft model to enable speculative decoding in vLLM.

Step-by-Step Migration to vLLM

Here is the full before-and-after for a Llama 3.3 70B deployment at FP8 on a single H100 SXM5.

Before (TGI):

bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --quantize fp8 \
  --num-shard 1 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 4096

After (vLLM):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=hf_xxx \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 --port 8000

Three things changed from the TGI command:

  1. --shm-size 1g replaced with --ipc=host. This is not optional. Multi-process vLLM uses CUDA IPC handles for shared memory across worker processes. Without --ipc=host, you hit a CUDA shared memory error the moment vLLM tries to coordinate GPU workers under load.
  2. Port is now 8000, not 8080. TGI internally runs on port 80 and most setups map it to host port 8080. vLLM defaults to port 8000 internally. Update your load balancer, health check probes, and any client code that hard-codes the port.
  3. --gpu-memory-utilization 0.92 is new. TGI handles VRAM budgeting internally. vLLM exposes this as an explicit flag. Setting it to 0.92 leaves 8% for CUDA context and reserves overhead for KV cache.

Client-side change:

python
# Before
client = openai.OpenAI(base_url="http://your-server:8080/v1", api_key="none")

# After
client = openai.OpenAI(base_url="http://your-server:8000/v1", api_key="none")

The endpoint path (/v1/chat/completions), request format, and response format stay identical. Only the port changes if you kept 8080 in TGI.

Gated models. vLLM reads HUGGING_FACE_HUB_TOKEN and handles gated model downloads the same way TGI does. For very large models (405B+), pre-download via huggingface-cli download meta-llama/Llama-3.1-405B-Instruct before running Docker. Container-time downloads of 400GB+ models can time out on slow connections.

--tensor-parallel-size must match available GPUs. TGI auto-detects GPU topology from --num-shard. vLLM's --tensor-parallel-size must exactly match the number of GPUs visible inside Docker. If you pass --tensor-parallel-size 2 but only one GPU is exposed, vLLM fails at startup with a NCCL error rather than degrading gracefully.

Step-by-Step Migration to SGLang

Same Llama 3.3 70B example:

After (SGLang):

bash
docker run --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=hf_xxx \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --tp 1 \
  --context-length 8192 \
  --chunked-prefill-size 4096 \
  --mem-fraction-static 0.92 \
  --host 0.0.0.0 --port 8000

SGLang uses --model-path instead of --model. The --mem-fraction-static flag is the equivalent of vLLM's --gpu-memory-utilization: it controls the fraction of GPU memory reserved for static KV cache storage. Set it to 0.88-0.92 as a starting point and tune based on actual memory pressure. For a production SGLang setup covering load balancing, rolling restarts, and router configuration, see the SGLang production deployment guide.

SGLang-specific advantage: RadixAttention is on by default. You do not need to opt in. If your workload has a fixed system prompt across all requests (common for chatbots, API gateways, and agent frameworks), every request after the first benefits from cached KV activations for the shared prefix. Monitor cache effectiveness via the /metrics endpoint using the sglang_cache_hit_rate metric. A rate above 70% means you are getting material TTFT reductions; below 20% means your prompts are too unique and vLLM would serve equally well.

Monitoring cache hit rate:

bash
# Check RadixAttention effectiveness
curl http://localhost:8000/metrics | grep sglang_cache_hit_rate

Validating the Migration

Do not skip this. Run a benchmark against both endpoints before switching production traffic.

Benchmark script (vLLM's built-in tool):

bash
# Against old TGI endpoint
python vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --base-url http://localhost:8080 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 200 \
  --request-rate 10

# Against new vLLM endpoint
python vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 200 \
  --request-rate 10

Metrics to compare:

MetricWhat it tells you
TTFT p50/p95Time to first token; lower is better for interactive workloads
ITL p50/p95Inter-token latency; lower is better for streaming
Throughput (tokens/sec)Total output rate; higher is better for batch
GPU utilizationHow much of the GPU vLLM/SGLang is actually using

Expected outcomes: vLLM typically matches or exceeds TGI throughput at equivalent concurrency on supported models. SGLang shows TTFT improvement specifically when prefix overlap exists; on fully unique prompts, the difference is under 5%.

Cost per million tokens (using live pricing as of 29 Jun 2026):

H100 SXM5 at $4.06/hr on-demand. If a benchmark run shows 1,800 tokens/sec throughput:

cost per million tokens = (4.06 / 3600) / (1800 / 1_000_000)
                        = $0.00113/sec / 0.0018M tokens/sec
                        = $0.628 per million tokens

At $4.06/hr, the H100 SXM5 delivers roughly $0.63 per million output tokens at 1,800 tok/sec. For comparison, A100 80G SXM4 at $1.69/hr running a smaller 13B model at 3,200 tok/sec:

cost per million tokens = (1.69 / 3600) / (3200 / 1_000_000)
                        = $0.147 per million tokens

Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Running vLLM or SGLang on Spheron GPU Cloud

Provision a GPU instance from app.spheron.ai. Spheron aggregates compute from 5+ providers into a single catalog. You pick the GPU, region, and billing model (on-demand or spot), then SSH in.

Recommended GPUs for common model sizes (on-demand pricing as of 29 Jun 2026):

Model sizePrecisionRecommended GPUOn-demand rate
70B denseFP8H100 SXM5 on Spheron$4.06/hr
70B denseBF162x H100 SXM5$8.12/hr
13B-34BBF16A100 80G on Spheron$1.69/hr
7B-13BBF16L40S on Spheron$0.96/hr
7B-13BBF16A100$1.48-1.69/hr

The vLLM and SGLang Docker commands in this guide work on Spheron instances without modification. Pull the image, run the command with your flags, and the endpoint is live. Per-minute billing means you pay only while the container runs. Spin up a second instance during peak traffic, release it after: the cost scales with actual use.

For setup guides per model, the Spheron LLM quick-start docs cover instance provisioning, Docker setup, and model-specific configurations.

KV Cache and Autoscaling After the Move

vLLM prefix caching. Enable with --enable-prefix-caching. Useful when you have a shared system prompt across all requests. Unlike SGLang's RadixAttention (which is on by default), vLLM's prefix cache is opt-in. Tune --gpu-memory-utilization to leave enough VRAM headroom for KV cache growth under load. If cache fills, vLLM evicts the oldest entries; monitor with the vllm:kv_cache_usage_perc Prometheus metric. For a deeper look at how continuous batching and PagedAttention manage KV cache memory under load, see LLM serving optimization: continuous batching and paged attention.

SGLang cache monitoring. Check sglang_cache_hit_rate on the /metrics endpoint. A cache hit rate above 70% means RadixAttention is delivering meaningful TTFT reductions for your workload. Tune --mem-fraction-static to give more or less VRAM to the static KV cache.

Autoscaling on Spheron. Per-minute billing makes spinning up extra instances during peak traffic economically viable. A simple approach: maintain a base instance count for steady traffic, add instances when queue depth crosses a threshold, and release them after the peak. Use an Nginx upstream with health-check-based routing across the pool:

nginx
upstream vllm_pool {
    server instance-1:8000;
    server instance-2:8000;
    server instance-3:8000;
}

Each Spheron instance runs the same Docker command. Adding a new instance to the pool takes under two minutes from provision to ready.


If you're re-platforming off TGI, Spheron gives you bare-metal H100, A100, and L40S instances with per-minute billing and zero infrastructure management. The vLLM and SGLang Docker commands in this guide work on Spheron instances without modification.

H100 SXM5 on Spheron | A100 GPU pricing | View all GPU pricing

STEPS / 06

Quick Setup Guide

  1. Audit your current TGI configuration

    List every flag in your current TGI launch command: model ID, quantization type, number of shards, max tokens, batch prefill tokens, and any HUGGING_FACE_HUB_TOKEN usage. This audit drives the flag translation step. Save the current command before changing anything.

  2. Choose your migration target: vLLM or SGLang

    Pick vLLM if you want the broadest model support and the simplest drop-in replacement for TGI. Pick SGLang if your workload involves multi-turn agents, RAG with repeated document contexts, or structured JSON output at scale - RadixAttention reduces TTFT by 20-40% on prefix-heavy traffic.

  3. Translate your TGI flags to vLLM

    Key translations: --model-id becomes --model, --num-shard becomes --tensor-parallel-size, --quantize fp8 becomes --quantization fp8, --max-input-length becomes --max-model-len, --max-batch-prefill-tokens maps to --max-num-batched-tokens, --max-total-tokens maps to --max-model-len. Port stays on 8080 or change to 8000 (vLLM default). Run: docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest --model <model-id> --quantization fp8 --tensor-parallel-size <N>.

  4. Translate your TGI flags to SGLang

    Key translations: --model-id becomes --model-path, --num-shard becomes --tp, --quantize fp8 becomes --quantization fp8, --max-input-length becomes --context-length, --max-batch-prefill-tokens maps to --chunked-prefill-size. Run: docker run --gpus all --ipc=host -p 8000:8000 lmsysorg/sglang:latest python -m sglang.launch_server --model-path <model-id> --quantization fp8 --tp <N> --context-length <L>.

  5. Validate throughput and latency after migration

    Run a benchmark against both the old TGI endpoint and the new vLLM/SGLang endpoint using the same prompt set: use vLLM's benchmark_serving.py or wrk2/hey for load generation. Measure TTFT p50/p95, ITL p50/p95, and tokens/sec at your production concurrency. Compare cost per million tokens: (hourly_rate / 3600) / (tokens_per_sec / 1_000_000).

  6. Deploy on Spheron GPU cloud

    Provision an H100 SXM5 or A100 instance from app.spheron.ai. SSH in, verify with nvidia-smi, then run the translated vLLM or SGLang Docker command. Use on-demand instances for always-on serving or spot instances for batch workloads at lower cost. Per-minute billing means you only pay for the time the container is running.

FAQ / 05

Frequently Asked Questions

Hugging Face announced TGI maintenance mode in December 2025 and archived the GitHub repo (read-only) on 21 Mar 2026. That means no new features, no new model architecture support, and no further bug fixes or security patches. Hugging Face's own documentation now recommends migrating to vLLM for general-purpose serving or SGLang for multi-turn and RAG workloads.

In vLLM, use --quantization instead of --quantize. The value names differ slightly: TGI's 'fp8' maps to vLLM's 'fp8', 'bitsandbytes' maps to 'bitsandbytes', and 'awq' maps to 'awq'. TGI's 'eetq' has no direct vLLM equivalent; use 'gptq' or 'awq' checkpoint formats instead.

Yes. Both vLLM and SGLang expose a /v1/chat/completions endpoint that is OpenAI-compatible. Your client code, API keys, and request format stay the same. The only change is the base URL if your port or host changes.

Choose vLLM if you need the broadest model support, no compilation step, and the simplest operational path. Choose SGLang if your workload has high prefix overlap (multi-turn agents, RAG pipelines with shared documents) because RadixAttention delivers 20-40% lower TTFT in those cases.

For a single-GPU deployment, the actual migration - pulling the vLLM Docker image, translating the launch flags, and verifying the endpoint - takes under 30 minutes. Most of the time goes to the Docker pull. The flag translation table in this post covers every common TGI flag with its vLLM and SGLang equivalent.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.