TGI is Hugging Face's native serving engine. It talks directly to the Hugging Face Hub, handles gated model downloads without a manual token-passing dance, and has been quietly accumulating production features since 2022. The deployment details, however, are scattered across GitHub issues, version changelogs, and outdated tutorials. This guide consolidates them. If you're comparing engines before picking one, the vLLM vs TensorRT-LLM vs SGLang benchmarks cover throughput and latency numbers across the three dominant alternatives.
What Is TGI and Why Hugging Face Built It
TGI's core is written in Rust, with Python bindings exposing the server interface. The Rust core handles token streaming with low per-token overhead. Python sits above it for model loading, sampling logic, and OpenAI-compatible API routing.
The main thing TGI does that no other framework matches out of the box: it speaks Hugging Face Hub natively. Pass HUGGING_FACE_HUB_TOKEN and a model ID, and TGI downloads the model, validates the license, and handles gated access automatically. No pre-download step, no manual sharding, no file format conversion.
TGI has native CUDA graph optimization for a specific set of supported model architectures: Llama (all generations), Qwen 2/3, Mistral/Mixtral, Falcon, Command-R, Gemma, and Phi. Outside this list, TGI falls back to eager execution without CUDA graphs, which reduces throughput and increases TTFT compared to the optimized path.
Where TGI sits relative to the alternatives:
| Engine | Core innovation | Best workload | Requires |
|---|---|---|---|
| TGI | Rust streaming core, native HF Hub | HF gated models, low-to-medium concurrency | HF token for gated models |
| vLLM | PagedAttention, continuous batching | High-throughput batch inference | Python runtime |
| SGLang | RadixAttention prefix cache | Agentic multi-turn with shared prefixes | Python runtime |
At low-to-medium concurrency (under 64 requests), TGI and vLLM deliver similar throughput for supported models. Above 64 concurrent requests, vLLM's PagedAttention memory management typically pulls ahead because it wastes less VRAM per request. If shared prefixes are your bottleneck, SGLang's RadixAttention wins regardless of concurrency.
Hardware and VRAM Requirements
VRAM math is straightforward: count bytes per parameter. BF16 is 2 bytes per param, FP8 is 1 byte, INT4 is 0.5 bytes. A 70B model in BF16 needs 140GB. In FP8, 70GB. You also need headroom for KV cache (typically 10-20% of the total VRAM budget for 8K context). For the full derivation, see the GPU memory requirements for LLMs guide.
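The byte-counting above can be sketched in a few lines. This is a back-of-envelope estimate only; the 15% KV-cache headroom is an assumption at the low end of the 10-20% range cited above, and real usage depends on context length and batch size.

```python
def vram_estimate_gb(params_b: float, bytes_per_param: float,
                     kv_headroom: float = 0.15) -> float:
    """Rough VRAM estimate: weight bytes plus KV-cache headroom.
    One billion params at 1 byte each is 1 GB of weights."""
    weights_gb = params_b * bytes_per_param
    return weights_gb * (1 + kv_headroom)

# 70B in BF16 (2 bytes/param): 140 GB of weights, ~161 GB with headroom
print(round(vram_estimate_gb(70, 2.0), 1))
# 70B in FP8 (1 byte/param): ~80.5 GB with headroom, which is why a
# single 80GB H100 is "tight" and needs the KV cache constrained
print(round(vram_estimate_gb(70, 1.0), 1))
```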
| Model | Params (active) | Precision | Min VRAM | Recommended GPU | Multi-shard config |
|---|---|---|---|---|---|
| Llama 3.3 70B | 70B dense | FP8 | 75GB | H100 SXM5 80GB | 1x (tight), 2x (comfortable) |
| Llama 3.3 70B | 70B dense | BF16 | 145GB | 2x H100 SXM5 | --num-shard 2 |
| Llama 4 Scout | 17B active / 109B total MoE | FP8 | 130GB | 2x H100 SXM5 | --num-shard 2 |
| Llama 4 Scout | 17B active / 109B total MoE | BF16 | 220GB | 2x H200 SXM5 | --num-shard 2 |
| Qwen3 30B-A3B | 3B active / 30B MoE | BF16 | 62GB | A100 SXM4 80GB | 1x |
| Mistral Small 3.1 24B | 24B dense | BF16 | 50GB | A100 SXM4 80GB | 1x |
| Mistral Large 2 123B | 123B dense | FP8 | 130GB | 2x H100 SXM5 | --num-shard 2 |
Important note on Llama 4 Scout --num-shard values: Scout's architecture requires --num-shard to evenly divide the number of key-value heads. Valid values for Scout are 2, 4, and 8. Using --num-shard 3 or other non-divisors causes TGI to fail at startup with a shape mismatch error that does not always clearly state the cause.
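The divisibility rule is easy to check before launch. A minimal sketch: the KV head count of 8 here is inferred from the valid shard values listed above, not read from the model's config file, so verify against the actual checkpoint.

```python
def valid_shard_counts(num_kv_heads: int, max_gpus: int = 8) -> list[int]:
    """Shard counts that evenly divide the model's key-value heads.
    Other values trigger a shape mismatch at TGI startup."""
    return [n for n in range(1, max_gpus + 1) if num_kv_heads % n == 0]

# Assuming 8 KV heads (consistent with the valid values 2/4/8 above):
print(valid_shard_counts(8))  # [1, 2, 4, 8]
```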
Setting Up Your Spheron Instance
Step 1: Provision the GPU
For 70B at FP8, an H100 SXM5 rental is the tightest single-card option that fits Llama 3.3 70B. You need to constrain the KV cache (use --max-input-length and --max-total-tokens to limit context) to keep the total footprint under 80GB. For 70B at BF16, provision a 2x H100 instance. For budget 7B-13B work, L40S instances on Spheron cost less per hour than A100 and have enough memory for 13B at BF16 with room for KV cache.
Log in to app.spheron.ai, select your GPU model, region, and provider, then launch. SSH into the instance once it's running.
Step 2: Verify GPU Access
```bash
nvidia-smi
```
Confirm the GPU count matches what you provisioned, VRAM appears at expected capacity, and the driver version is recent (590.x or later for H100/H200).
Step 3: Verify Docker GPU Access
```bash
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
This should print the same GPU table as the host. If Docker cannot see the GPUs, the NVIDIA Container Toolkit is not configured:
```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
Single-GPU TGI Deployment
The following deploys Llama 3.3 70B Instruct at FP8 on a single H100. We pin to TGI version 3.3.0 as a known-good checkpoint. The latest tag advances regularly and can break working configurations. The 3.3.x patch line is backward compatible, so bumping to 3.3.7 (the current patch release) is safe.
```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --quantize fp8 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 4096
```
Flag-by-flag breakdown:
- `--gpus all`: expose all host GPUs to the container.
- `--shm-size 1g`: required. TGI uses CUDA IPC for inter-process communication. Without this flag, TGI appears to work on the first few requests but fails silently under load with CUDA IPC errors. This is one of the most common production footguns.
- `--ipc=host`: allows shared memory access between GPU worker processes. Required when `--num-shard > 1`; safe to include for single-GPU deployments.
- `--quantize fp8`: uses FP8 Tensor Cores on H100 and Blackwell. Not supported on A100, which lacks hardware FP8 Tensor Cores. Use `--quantize bitsandbytes` (INT8) or `--quantize gptq`/`--quantize awq` on A100 instead.
- `--max-input-length 4096`: maximum tokens per input prompt. Controls prefill memory reservation.
- `--max-total-tokens 8192`: input + output token budget per request. The KV cache allocation is sized against this.
- `--max-batch-prefill-tokens 4096`: total token budget for prefill across the current batch. Higher values increase throughput at the cost of TTFT for other waiting requests.
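The three token-budget flags interact, and a misconfiguration only surfaces at request time. A minimal pre-launch sanity check; note that the rule requiring a single max-size prompt to fit inside the prefill budget reflects how you would usually want to configure these flags, not a hard TGI constraint:

```python
def check_token_budgets(max_input_length: int, max_total_tokens: int,
                        max_batch_prefill_tokens: int) -> list[str]:
    """Sanity-check the token-budget flags before launching TGI."""
    problems = []
    if max_input_length >= max_total_tokens:
        problems.append("max-input-length must be below max-total-tokens "
                        "(the difference is the output budget per request)")
    if max_batch_prefill_tokens < max_input_length:
        problems.append("max-batch-prefill-tokens below max-input-length: "
                        "a single max-size prompt cannot be prefilled in one pass")
    return problems

# The values from the launch command above pass both checks
print(check_token_budgets(4096, 8192, 4096))  # []
```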
Health check (TGI returns 503 during model loading, 200 when ready):
```bash
curl http://localhost:8080/health
```
Test inference:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is flash attention?"}],
    "max_tokens": 256
  }'
```
Multi-GPU Deployment with --num-shard
--num-shard N tells TGI to split the model across N GPUs using tensor parallelism. TGI partitions attention heads and MLP layers across the shard count; communication happens through NCCL all-reduce operations between GPUs.
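Because the weights are partitioned roughly evenly, per-GPU memory under tensor parallelism follows directly from the byte counting earlier. A sketch; it ignores small tensors that are replicated rather than sharded (embeddings, norms), so treat the result as a floor:

```python
def per_gpu_weights_gb(params_b: float, bytes_per_param: float,
                       num_shard: int) -> float:
    """Approximate weight memory per GPU when TGI splits a model
    across num_shard GPUs with tensor parallelism."""
    return params_b * bytes_per_param / num_shard

# 70B at BF16 across 2 GPUs: ~70 GB of weights per card, leaving
# roughly 10 GB per H100 for KV cache and activations
print(per_gpu_weights_gb(70, 2.0, 2))  # 70.0
```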
Two-GPU deployment (70B at BF16, 2x H100):
```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --num-shard 2
```
Four-GPU deployment (130B+ models):
```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id mistralai/Mistral-Large-Instruct-2407 \
  --num-shard 4 \
  --quantize fp8
```
NVLink (present on SXM instances) dramatically reduces all-reduce latency. On a 2x H100 SXM5 node with NVLink, the all-reduce for a 70B model's attention heads takes under 10 microseconds. On PCIe setups without NVLink, the same operation goes over the PCIe bus and takes 60-100 microseconds, which adds measurable TTFT overhead at every layer boundary. For --num-shard > 2, the NVLink advantage compounds: if you are running --num-shard 4 or higher, the latency difference between SXM and PCIe instances is significant.
A single H200 (141GB) handles Llama 3.3 70B at BF16 on one card, eliminating the need for multi-shard communication overhead entirely. For models that fit, single-card is almost always faster than two-card.
To debug NCCL communication issues, add -e NCCL_DEBUG=INFO to the Docker command. NCCL prints collective operation timing and error details to stdout.
TGI-Specific Optimizations
Flash Attention
TGI enables Flash Attention 2 by default on any GPU where the kernel is available: H100, A100, L40S, and most Ampere+ cards. No flag is needed. TGI auto-detects the CUDA compute capability at startup and loads the appropriate attention kernel.
If you see unexpected memory behavior or need to isolate an attention-related bug, disable it explicitly:
```bash
--disable-flash-attention
```
This forces fallback to standard scaled dot-product attention, which is correct but slower and uses significantly more VRAM at long context lengths.
Quantization: FP8, GPTQ, AWQ, and bitsandbytes
TGI supports four quantization modes:
FP8 (--quantize fp8): hardware-native on H100 and Blackwell (B100, B200). Approximately 50% VRAM reduction relative to BF16, with roughly 1.5x throughput improvement on supported models. Do not use on A100. A100 lacks FP8 Tensor Cores; TGI will either error or fall back to a software path that is slower than BF16.
GPTQ (--quantize gptq): works on any CUDA GPU including A100 and RTX series. Requires a pre-quantized model checkpoint (the original model must have been quantized and saved in GPTQ format before you load it). Good option when you already have a GPTQ checkpoint and need to serve it without reprocessing.
AWQ (--quantize awq): lower quality loss than GPTQ at similar 4-bit compression. TGI supports AWQ kernels natively. For the quantization process itself, see the AWQ quantization guide.
bitsandbytes (--quantize bitsandbytes): on-the-fly INT8 quantization during model load. Works on any GPU. Slower than FP8 and AWQ at inference time, but requires no pre-quantized checkpoint. Useful for A100 when VRAM is tight and you don't have a GPTQ/AWQ checkpoint ready.
Continuous Batching and Prefill Tokens
--max-batch-prefill-tokens controls the total token budget allocated to the prefill phase per batch cycle. Lower values (4096) prioritize TTFT for interactive use. Higher values (16384) increase throughput at the cost of TTFT for other queued requests.
--waiting-served-ratio (default: 1.2) controls how aggressively TGI waits to form larger batches before serving. At 1.2, TGI will wait slightly longer than the current serving rate to allow more requests to join the batch. Raise to 2.0 for batch throughput workloads where request latency is less important than overall tokens-per-second. For a deeper look at continuous batching mechanics and how these parameters interact, see LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill.
Speculative Decoding
--speculate N enables speculative decoding with N draft tokens. TGI checks each speculative token against the base model and accepts runs of correct tokens without a full forward pass, reducing ITL at low concurrency.
Speculative decoding helps most when concurrency is below 8. Above that, the overhead of evaluating draft tokens per request starts to outweigh the gains. At high concurrency, the batch is already full, so there is nothing to speculate into.
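The expected gain from speculative decoding has a simple closed form under a standard simplifying assumption: if each draft token is accepted independently with probability p, the expected tokens emitted per base-model forward pass is (1 - p^(n+1)) / (1 - p). Real acceptance rates are not i.i.d., so treat this as an upper-bound intuition rather than a prediction:

```python
def expected_tokens_per_step(accept_rate: float, speculate_n: int) -> float:
    """Expected tokens emitted per base-model forward pass, assuming
    each of the n draft tokens is accepted independently with
    probability accept_rate: (1 - p^(n+1)) / (1 - p)."""
    p, n = accept_rate, speculate_n
    if p == 1.0:
        return float(n + 1)  # every draft accepted, plus the bonus token
    return (1 - p ** (n + 1)) / (1 - p)

# At a 70% acceptance rate, --speculate 3 yields roughly 2.5 tokens
# per forward pass instead of 1
print(round(expected_tokens_per_step(0.7, 3), 2))
```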
When the model checkpoint includes bundled Medusa heads, TGI uses them automatically with --speculate N. To use a separate small draft model instead:
```bash
--draft-model-id meta-llama/Llama-3.2-1B-Instruct \
--speculate 3
```
For a full production treatment of speculative decoding across engines, see the speculative decoding production guide.
Tensor Parallelism Notes
Add --ipc=host to the Docker command for every multi-GPU deployment. This gives GPU worker processes access to host shared memory for NCCL operations. Missing it causes intermittent failures on large all-reduce operations that only appear under sustained load.
To diagnose NCCL errors:
```bash
-e NCCL_DEBUG=INFO
```
This outputs collective operation names, sizes, and timing. Look for NCCL WARN or NCCL error lines in the container stdout.
TGI vs vLLM vs SGLang: Benchmark Comparison
The numbers below are representative of the throughput and latency characteristics described in published benchmarks from Hugging Face, LMSys, and community testing on H100 hardware. These are not measurements from a single controlled run on Spheron infrastructure. Treat them as directional guidance, not production guarantees. For exhaustive head-to-head measurements on Spheron H100 hardware, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.
All figures assume Llama 3.3 70B at FP8, single H100 SXM5.
| Metric | TGI | vLLM | SGLang |
|---|---|---|---|
| Throughput (128 concurrent, tok/s) | ~1,600 | ~2,400 | ~1,900 |
| TTFT (single request, ms) | ~95 | ~110 | ~112 |
| p99 latency (32 concurrent, ms) | ~820 | ~780 | ~750 |
| Prefix cache | No native prefix cache | APC (opt-in flag) | RadixAttention (default) |
| Flash Attention | FA2 (auto) | FA3/FA2 | FA2 |
| HF Hub integration | Native, automatic | Manual download | Manual download |
| Prometheus metrics | Built-in, always on | Built-in | Requires --enable-metrics |
TGI leads on single-request TTFT (lowest latency for a single isolated request) because the Rust core has lower per-token overhead than Python-based engines at near-zero concurrency. At 128 concurrent requests, vLLM's PagedAttention memory management pulls ahead on throughput. SGLang closes the gap on p99 latency for workloads where requests share prefixes, but the numbers above use unique prompts throughout.
The practical summary: TGI is the fastest engine per request at single-digit concurrency. vLLM wins on aggregate throughput at high concurrency. SGLang wins on TTFT whenever shared prefixes are in play.
Production Hardening
Health Checks
TGI's /health endpoint returns 503 during model loading and 200 when the server is ready to serve requests. For Docker deployments:
```bash
docker run \
  --health-cmd "curl -f http://localhost:80/health || exit 1" \
  --health-interval 15s \
  --health-start-period 120s \
  --health-retries 3 \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --quantize fp8
```
--health-start-period 120s gives TGI time to download and load the model before Docker starts evaluating health. A 70B model takes 60-90 seconds to load from disk on a fast NVMe drive after the first download.
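Deploy scripts often need the same readiness gate outside Docker. A minimal polling sketch using only the standard library; the URL and timings are placeholders to adjust for your instance:

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url: str = "http://localhost:8080/health",
                     timeout_s: int = 300, interval_s: int = 5,
                     probe=None) -> bool:
    """Poll TGI's /health endpoint until it returns 200 or the timeout
    expires. `probe` is injectable for testing; the default issues a
    real GET and treats connection errors as 'still booting'."""
    def default_probe(u):
        try:
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status
        except (urllib.error.URLError, OSError):
            return None  # connection refused while the container boots
    probe = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url) == 200:
            return True
        time.sleep(interval_s)  # 503s and refused connections land here
    return False
```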
Prometheus Metrics
TGI exposes /metrics at the same port as the API, automatically, with no configuration flag needed. Key metrics and alert thresholds:
| Metric | Alert condition | Action |
|---|---|---|
| `tgi_queue_size` | > 50 | Add capacity or enable rate limiting |
| `tgi_batch_current_size` | Approaching batch cap | Scale horizontally, or tune `--max-batch-total-tokens` / `--max-concurrent-requests` |
| `tgi_request_queue_duration` | p99 rising | Queue pressure building: lower `--max-batch-prefill-tokens` or scale out |
| `tgi_request_mean_time_per_token_duration` | > 100ms | GPU saturation or decode contention |
Hugging Face documents the full TGI metrics reference, which you can build Grafana panels on top of.
Pair the TGI metrics with nvidia-smi dmon -s pum -d 10 or DCGM for GPU-level memory and utilization. TGI's queue metrics tell you about request pressure; GPU metrics tell you about hardware saturation.
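For ad-hoc checks without a full Prometheus server, the /metrics payload is plain text and easy to parse. A deliberately minimal sketch, not a replacement for a real Prometheus setup; it handles simple gauge/counter lines and ignores labeled variants:

```python
def parse_metric(metrics_text: str, name: str):
    """Pull a single unlabeled gauge/counter value out of Prometheus
    text exposition format (what TGI serves at /metrics)."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        # require an exact name followed by a space or label brace,
        # so 'tgi_queue' does not match 'tgi_queue_size'
        if line.startswith(name) and line[len(name):len(name) + 1] in (" ", "{"):
            return float(line.rsplit(" ", 1)[-1])
    return None

sample = """# TYPE tgi_queue_size gauge
tgi_queue_size 12
tgi_batch_current_size 4"""
print(parse_metric(sample, "tgi_queue_size"))  # 12.0
```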
Request Queueing and Backpressure
TGI queues incoming requests and returns 429s once --max-concurrent-requests is exceeded. For production, put TGI behind nginx or an L7 load balancer for connection-level queueing and graceful backpressure:
```nginx
upstream tgi_backend {
    server 127.0.0.1:8080;
    keepalive 64;
}

server {
    listen 443 ssl;
    ssl_certificate /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;

    location / {
        proxy_pass http://tgi_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_read_timeout 300;
        proxy_send_timeout 300;
    }
}
```
For horizontal scaling, TGI instances are stateless. Run multiple instances behind a load balancer; each instance holds its own KV cache but request routing is independent.
Autoscaling
TGI has no native autoscaling. Wire Kubernetes HPA to tgi_queue_size via the Prometheus adapter:
```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: tgi_queue_size
      target:
        type: AverageValue
        averageValue: "10"
```
Scale out when average queue depth per pod exceeds 10. For the full Kubernetes GPU autoscaling setup including node pool management and spot instance integration, see the Kubernetes GPU orchestration guide.
GPU Pricing on Spheron
| GPU | VRAM | TGI Use Case | On-Demand | Spot |
|---|---|---|---|---|
| RTX 4090 | 24GB | 7B FP16, 13B GPTQ | from $0.79/hr | N/A |
| L40S | 48GB | 13B FP16, 34B AWQ | from $0.72/hr | N/A |
| A100 SXM4 80GB | 80GB | 30B-40B FP16, 70B GPTQ/AWQ | from $1.64/hr | N/A |
| H100 SXM5 80GB | 80GB | 70B FP8, 130B with --num-shard 2 | from $4.34/hr | N/A |
| H200 SXM5 | 141GB | 70B FP16 single-card, 130B FP8 | from $4.54/hr | N/A |
Pricing fluctuates based on GPU availability. The prices above are based on 24 Apr 2026 and may have changed. Check current GPU pricing for live rates.
For context on cost-per-token across these GPUs at production scale, see the AI inference cost economics guide.
When to Pick TGI Over vLLM or SGLang
| Criteria | Use TGI | Use vLLM | Use SGLang |
|---|---|---|---|
| Model source | Hugging Face Hub (gated) | Any (local or Hub) | Any |
| Concurrency level | Low-medium (under 64 req) | High (over 64 req) | Multi-turn agents |
| Prefix reuse | No specific requirement | Some prefix sharing | High prefix overlap |
| Metrics out-of-box | Yes (always on) | Yes (always on) | Requires --enable-metrics |
| Quantization priority | GPTQ/AWQ (existing checkpoints) | FP8 / AWQ | FP8 / AWQ |
| A100 FP8 support | No (use GPTQ/AWQ/bitsandbytes) | No | No |
| Streaming latency at 1 req | Best (Rust core) | Competitive | Competitive |
For the full framework decision path including how TGI, vLLM, llama.cpp, and Ollama compare across deployment scenarios, see the LLM deployment guide.
TGI runs cheapest on bare-metal GPU cloud where you control the serving stack without managed API markup. Rent H100 on Spheron | Rent H200 on Spheron | View all GPU pricing
