Tutorial

Deploy Hugging Face TGI on GPU Cloud: Production Text Generation Inference Setup Guide (2026)

Written by Mitrasish, Co-founder · Apr 24, 2026
Tags: Hugging Face TGI · Text Generation Inference · TGI Deployment · LLM Inference · Hugging Face Hub · Flash Attention · H100 · vLLM vs TGI · TGI Multi-GPU · LLM Quantization

TGI is Hugging Face's native serving engine. It talks directly to the Hugging Face Hub, handles gated model downloads without a manual token-passing dance, and has been quietly accumulating production features since 2022. The deployment details are scattered across GitHub issues, version changelogs, and outdated tutorials. This guide consolidates them. If you're comparing engines before picking one, the vLLM vs TensorRT-LLM vs SGLang benchmarks cover throughput and latency numbers across the three dominant alternatives.

What Is TGI and Why Hugging Face Built It

TGI's core is written in Rust, with Python bindings exposing the server interface. The Rust core handles token streaming with low per-token overhead. Python sits above it for model loading, sampling logic, and OpenAI-compatible API routing.

The main thing TGI does that no other framework matches out of the box: it speaks Hugging Face Hub natively. Pass HUGGING_FACE_HUB_TOKEN and a model ID, and TGI downloads the model, validates the license, and handles gated access automatically. No pre-download step, no manual sharding, no file format conversion.

TGI has native CUDA graph optimization for a specific set of supported model architectures: Llama (all generations), Qwen 2/3, Mistral/Mixtral, Falcon, Command-R, Gemma, and Phi. Outside this list, TGI falls back to eager execution without CUDA graphs, which reduces throughput and increases TTFT compared to the optimized path.

Where TGI sits relative to the alternatives:

| Engine | Core innovation | Best workload | Requires |
| --- | --- | --- | --- |
| TGI | Rust streaming core, native HF Hub | HF gated models, low-to-medium concurrency | HF token for gated models |
| vLLM | PagedAttention, continuous batching | High-throughput batch inference | Python runtime |
| SGLang | RadixAttention prefix cache | Agentic multi-turn with shared prefixes | Python runtime |

At low-to-medium concurrency (under 64 requests), TGI and vLLM deliver similar throughput for supported models. Above 64 concurrent requests, vLLM's PagedAttention memory management typically pulls ahead because it wastes less VRAM per request. If shared prefixes are your bottleneck, SGLang's RadixAttention wins regardless of concurrency.

Hardware and VRAM Requirements

VRAM math is straightforward: count bytes per parameter. BF16 is 2 bytes per param, FP8 is 1 byte, INT4 is 0.5 bytes. A 70B model in BF16 needs 140GB. In FP8, 70GB. You also need headroom for KV cache (typically 10-20% of the total VRAM budget for 8K context). For the full derivation, see the GPU memory requirements for LLMs guide.
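
The arithmetic above can be sketched as a quick estimator. The bytes-per-parameter constants follow the rule of thumb in this section, and the 15% KV-cache headroom default and helper name are our own illustrative choices:

```python
# Back-of-envelope VRAM estimate: weights footprint plus KV-cache headroom.
# Constants follow the per-parameter byte counts described above; real
# usage varies by engine, context length, and batch configuration.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str,
                     kv_headroom: float = 0.15) -> float:
    """Weights in GB plus a fractional headroom for the KV cache."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + kv_headroom)

# 70B in BF16: 140GB of weights, ~161GB with 15% KV headroom
print(round(estimate_vram_gb(70, "bf16"), 1))
# 70B in FP8: 70GB of weights, ~80.5GB with headroom -- hence "tight" on 80GB
print(round(estimate_vram_gb(70, "fp8"), 1))
```

Note that the FP8 figure lands just above 80GB, which is why a single H100 only fits 70B FP8 once the KV cache is capped.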

| Model | Params (active) | Precision | Min VRAM | Recommended GPU | Multi-shard config |
| --- | --- | --- | --- | --- | --- |
| Llama 3.3 70B | 70B dense | FP8 | 75GB | H100 SXM5 80GB | 1x (tight), 2x (comfortable) |
| Llama 3.3 70B | 70B dense | BF16 | 145GB | 2x H100 SXM5 | --num-shard 2 |
| Llama 4 Scout | 17B active / 109B total MoE | FP8 | 130GB | 2x H100 SXM5 | --num-shard 2 |
| Llama 4 Scout | 17B active / 109B total MoE | BF16 | 220GB | 2x H200 SXM5 | --num-shard 2 |
| Qwen3 30B-A3B | 3B active / 30B MoE | BF16 | 62GB | A100 SXM4 80GB | 1x |
| Mistral Small 3.1 24B | 24B dense | BF16 | 50GB | A100 SXM4 80GB | 1x |
| Mistral Large 2 123B | 123B dense | FP8 | 130GB | 2x H100 SXM5 | --num-shard 2 |

Important note on Llama 4 Scout --num-shard values: Scout's architecture requires --num-shard to evenly divide the number of key-value heads. Valid values for Scout are 2, 4, and 8. Using --num-shard 3 or other non-divisors causes TGI to fail at startup with a shape mismatch error that does not always clearly state the cause.
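
A launch-script guard can catch a bad shard count before TGI throws its opaque shape error. The KV head count below is an assumed illustrative value, not read from the Scout config:

```python
# Sanity-check a --num-shard value against a model's KV head count before
# launching TGI: tensor parallelism must split the heads evenly.
# SCOUT_KV_HEADS = 8 is an assumption for illustration only.

def valid_num_shard(num_kv_heads: int, num_shard: int) -> bool:
    """True if the shard count evenly divides the KV heads."""
    return num_kv_heads % num_shard == 0

SCOUT_KV_HEADS = 8  # assumed value; check the model's config.json

for shard in (2, 3, 4, 8):
    print(shard, valid_num_shard(SCOUT_KV_HEADS, shard))
# 2 and 4 and 8 pass; 3 is the non-divisor that fails at startup
```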

Setting Up Your Spheron Instance

Step 1: Provision the GPU

For 70B at FP8, an H100 SXM5 rental is the tightest single-card option that fits Llama 3.3 70B. You need to cap the KV cache (via --max-input-length and --max-total-tokens) to keep total usage below 80GB. For 70B at BF16, provision a 2x H100 instance. For budget 7B-13B work, L40S instances on Spheron cost less per hour than A100 and have enough memory for 13B at BF16 with room for KV cache.

Log in to app.spheron.ai, select your GPU model, region, and provider, then launch. SSH into the instance once it's running.

Step 2: Verify GPU Access

```bash
nvidia-smi
```

Confirm the GPU count matches what you provisioned, VRAM appears at expected capacity, and the driver version is recent (590.x or later for H100/H200).

Step 3: Verify Docker GPU Access

```bash
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

Should print the same GPU table as the host. If Docker cannot see the GPUs, the NVIDIA Container Toolkit is not configured:

```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Single-GPU TGI Deployment

The following deploys Llama 3.3 70B Instruct at FP8 on a single H100. We pin to TGI version 3.3.0 as a known-good checkpoint. The latest tag advances regularly and can break working configurations. The 3.3.x patch line is backward compatible, so bumping to 3.3.7 (the current patch release) is safe.

```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --quantize fp8 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 4096
```

Flag-by-flag breakdown:

  • --gpus all: expose all host GPUs to the container
  • --shm-size 1g: required. TGI uses CUDA IPC for inter-process communication. Without this flag, TGI appears to work on the first few requests but fails silently under load with CUDA IPC errors. This is one of the most common production footguns.
  • --ipc=host: allows shared memory access between GPU worker processes. Required when --num-shard > 1; safe to include for single-GPU deployments
  • --quantize fp8: uses FP8 Tensor Cores on H100 and Blackwell. Not supported on A100. A100 lacks hardware FP8 Tensor Cores. Use --quantize bitsandbytes (INT8) or --quantize gptq/--quantize awq on A100 instead.
  • --max-input-length 4096: maximum tokens per input prompt. Controls prefill memory reservation.
  • --max-total-tokens 8192: input + output token budget per request. The KV cache allocation is sized against this.
  • --max-batch-prefill-tokens 4096: total token budget for prefill across the current batch. Higher values increase throughput at the cost of TTFT for other waiting requests.

Health check (TGI returns 503 during model loading, 200 when ready):

```bash
curl http://localhost:8080/health
```
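
In a deploy script, that health check usually becomes a polling loop. A minimal stdlib-only sketch (the function name and defaults are ours):

```python
# Poll TGI's /health endpoint until the model finishes loading.
# TGI returns 503 while loading and 200 when ready, so a retry loop
# works as a readiness gate before routing traffic.
import time
import urllib.request
import urllib.error

def wait_until_ready(url: str = "http://localhost:8080/health",
                     timeout_s: int = 600, interval_s: int = 5) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # 503 or connection refused: still loading
        time.sleep(interval_s)
    return False
```

Budget the timeout generously on first boot, since the initial Hub download dominates loading time.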

Test inference:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is flash attention?"}],
    "max_tokens": 256
  }'
```
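
Any OpenAI-compatible client can hit the same route. Here is a stdlib-only sketch that builds the identical request from Python; `build_request` is a hypothetical helper, not part of TGI:

```python
# Build the same chat request as the curl example, stdlib only.
# The "tgi" model name is a placeholder: TGI serves whatever model
# it loaded regardless of the name in the payload.
import json
import urllib.request

def build_request(prompt: str, base_url: str = "http://localhost:8080"):
    payload = {
        "model": "tgi",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("What is flash attention?")
# urllib.request.urlopen(req) would return the JSON completion once TGI is up
print(req.full_url)  # http://localhost:8080/v1/chat/completions
```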

Multi-GPU Deployment with --num-shard

--num-shard N tells TGI to split the model across N GPUs using tensor parallelism. TGI partitions attention heads and MLP layers across the shard count; communication happens through NCCL all-reduce operations between GPUs.

Two-GPU deployment (70B at BF16, 2x H100):

```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --num-shard 2
```

Four-GPU deployment (130B+ models):

```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id mistralai/Mistral-Large-Instruct-2407 \
  --num-shard 4 \
  --quantize fp8
```

NVLink (present on SXM instances) dramatically reduces all-reduce latency. On a 2x H100 SXM5 node with NVLink, the all-reduce for a 70B model's attention heads takes under 10 microseconds. On PCIe setups without NVLink, the same operation goes over the PCIe bus and takes 60-100 microseconds, which adds measurable TTFT overhead at every layer boundary. The NVLink advantage compounds as shard count grows: at --num-shard 4 or higher, the latency gap between SXM and PCIe instances is significant.

A single H200 (141GB) handles Llama 3.3 70B at BF16 on one card, eliminating the need for multi-shard communication overhead entirely. For models that fit, single-card is almost always faster than two-card.

To debug NCCL communication issues, add -e NCCL_DEBUG=INFO to the Docker command. NCCL prints collective operation timing and error details to stdout.

TGI-Specific Optimizations

Flash Attention

TGI enables Flash Attention 2 by default on any GPU where the kernel is available: H100, A100, L40S, and most Ampere+ cards. No flag is needed. TGI auto-detects the CUDA compute capability at startup and loads the appropriate attention kernel.

If you see unexpected memory behavior or need to isolate an attention-related bug, disable it explicitly:

```bash
--disable-flash-attention
```

This forces fallback to standard scaled dot-product attention, which is correct but slower and uses significantly more VRAM at long context lengths.

Quantization: FP8, GPTQ, AWQ, and bitsandbytes

TGI supports four quantization modes:

FP8 (--quantize fp8): hardware-native on H100 and Blackwell (B100, B200). Approximately 50% VRAM reduction relative to BF16, with roughly 1.5x throughput improvement on supported models. Do not use on A100. A100 lacks FP8 Tensor Cores; TGI will either error or fall back to a software path that is slower than BF16.

GPTQ (--quantize gptq): works on any CUDA GPU including A100 and RTX series. Requires a pre-quantized model checkpoint (the original model must have been quantized and saved in GPTQ format before you load it). Good option when you already have a GPTQ checkpoint and need to serve it without reprocessing.

AWQ (--quantize awq): lower quality loss than GPTQ at similar 4-bit compression. TGI supports AWQ kernels natively. For the quantization process itself, see the AWQ quantization guide.

bitsandbytes (--quantize bitsandbytes): on-the-fly INT8 quantization during model load. Works on any GPU. Slower than FP8 and AWQ at inference time, but requires no pre-quantized checkpoint. Useful for A100 when VRAM is tight and you don't have a GPTQ/AWQ checkpoint ready.
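
The compatibility rules above can be encoded in a small dispatch helper. The GPU list and function are illustrative, not an official TGI mapping, and real capability detection should query compute capability rather than a name list:

```python
# Map GPU family to a safe --quantize choice, encoding the rules above:
# FP8 needs H100/Blackwell Tensor Cores; A100 uses a pre-quantized
# GPTQ/AWQ checkpoint if one exists, otherwise on-the-fly bitsandbytes.
# The FP8_CAPABLE set is an illustrative assumption, not exhaustive.

FP8_CAPABLE = {"H100", "H200", "B100", "B200"}

def pick_quantize(gpu: str, has_prequantized_checkpoint: bool) -> str:
    if gpu in FP8_CAPABLE:
        return "fp8"
    if has_prequantized_checkpoint:
        return "awq"          # or "gptq", depending on the checkpoint format
    return "bitsandbytes"     # on-the-fly INT8, no checkpoint required

print(pick_quantize("H100", False))   # fp8
print(pick_quantize("A100", True))    # awq
print(pick_quantize("A100", False))   # bitsandbytes
```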

Continuous Batching and Prefill Tokens

--max-batch-prefill-tokens controls the total token budget allocated to the prefill phase per batch cycle. Lower values (4096) prioritize TTFT for interactive use. Higher values (16384) increase throughput at the cost of TTFT for other queued requests.

--waiting-served-ratio (default: 1.2) controls how aggressively TGI waits to form larger batches before serving. At 1.2, TGI will wait slightly longer than the current serving rate to allow more requests to join the batch. Raise to 2.0 for batch throughput workloads where request latency is less important than overall tokens-per-second. For a deeper look at continuous batching mechanics and how these parameters interact, see LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill.

Speculative Decoding

--speculate N enables speculative decoding with N draft tokens. TGI checks each speculative token against the base model and accepts runs of correct tokens without a full forward pass, reducing ITL at low concurrency.

Speculative decoding helps most when concurrency is below 8. Above that, the overhead of evaluating draft tokens per request starts to outweigh the gains. At high concurrency, the batch is already full, so there is nothing to speculate into.
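
A standard back-of-envelope model makes the low-concurrency sweet spot concrete: with per-token draft acceptance rate alpha and N draft tokens, the expected tokens accepted per base-model forward pass follows a geometric series. This is an idealized estimate that ignores draft-model and verification overhead, which is exactly what erodes the win at high concurrency:

```python
# Idealized speculative decoding gain: expected tokens accepted per
# base-model forward pass, assuming independent per-token acceptance
# with probability alpha and N draft tokens per step.
#   E = (1 - alpha**(N+1)) / (1 - alpha)

def expected_tokens_per_step(alpha: float, n_draft: int) -> float:
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1 - alpha ** (n_draft + 1)) / (1 - alpha)

# --speculate 3 with an 80% acceptance rate: ~2.95 tokens per step,
# i.e. close to a 3x reduction in base-model forward passes
print(round(expected_tokens_per_step(0.8, 3), 2))
```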

When the model checkpoint includes bundled Medusa heads, TGI uses them automatically with --speculate N. To use a separate small draft model instead:

```bash
--draft-model-id meta-llama/Llama-3.2-1B-Instruct \
--speculate 3
```

For a full production treatment of speculative decoding across engines, see the speculative decoding production guide.

Tensor Parallelism Notes

Add --ipc=host to the Docker command for every multi-GPU deployment. This gives GPU worker processes access to host shared memory for NCCL operations. Missing it causes intermittent failures on large all-reduce operations that only appear under sustained load.

To diagnose NCCL errors:

```bash
-e NCCL_DEBUG=INFO
```

Outputs collective operation names, sizes, and timing. Look for NCCL WARN or NCCL error lines in the container stdout.

TGI vs vLLM vs SGLang: Benchmark Comparison

The numbers below are representative of the throughput and latency characteristics described in published benchmarks from Hugging Face, LMSys, and community testing on H100 hardware. These are not measurements from a single controlled run on Spheron infrastructure. Treat them as directional guidance, not production guarantees. For exhaustive head-to-head measurements on Spheron H100 hardware, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

All figures assume Llama 3.3 70B at FP8, single H100 SXM5.

| Metric | TGI | vLLM | SGLang |
| --- | --- | --- | --- |
| Throughput (128 concurrent, tok/s) | ~1,600 | ~2,400 | ~1,900 |
| TTFT (single request, ms) | ~95 | ~110 | ~112 |
| p99 latency (32 concurrent, ms) | ~820 | ~780 | ~750 |
| Prefix cache | No native prefix cache | APC (opt-in flag) | RadixAttention (default) |
| Flash Attention | FA2 (auto) | FA3/FA2 | FA2 |
| HF Hub integration | Native, automatic | Manual download | Manual download |
| Prometheus metrics | Built-in, always on | Built-in | Requires --enable-metrics |

TGI leads on single-request TTFT (lowest latency for a single isolated request) because the Rust core has lower per-token overhead than Python-based engines at near-zero concurrency. At 128 concurrent requests, vLLM's PagedAttention memory management pulls ahead on throughput. SGLang closes the gap on p99 latency for workloads where requests share prefixes, but the numbers above use unique prompts throughout.

The practical summary: TGI is the fastest engine per request at single-digit concurrency. vLLM wins on aggregate throughput at high concurrency. SGLang wins on TTFT whenever shared prefixes are in play.

Production Hardening

Health Checks

TGI's /health endpoint returns 503 during model loading and 200 when the server is ready to serve requests. For Docker deployments:

```bash
docker run \
  --health-cmd "curl -f http://localhost:80/health || exit 1" \
  --health-interval 15s \
  --health-start-period 120s \
  --health-retries 3 \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --quantize fp8
```

--health-start-period 120s gives TGI time to download and load the model before Docker starts evaluating health. A 70B model takes 60-90 seconds to load from disk on a fast NVMe drive after the first download.

Prometheus Metrics

TGI exposes /metrics at the same port as the API, automatically, with no configuration flag needed. Key metrics and alert thresholds:

| Metric | Alert condition | Action |
| --- | --- | --- |
| tgi_queue_size | > 50 | Add capacity or enable rate limiting |
| tgi_batch_current_size | Approaching batch cap | Scale horizontally (tune --max-batch-total-tokens or --max-concurrent-requests) |
| tgi_request_queue_duration | p99 rising | Queue pressure building; lower --max-batch-prefill-tokens or scale out |
| tgi_request_mean_time_per_token_duration | > 100ms | GPU overload/saturation or decode contention |

Hugging Face documents the full TGI metrics reference, which you can build Grafana panels on top of.

Pair the TGI metrics with nvidia-smi dmon -s pum -d 10 or DCGM for GPU-level memory and utilization. TGI's queue metrics tell you about request pressure; GPU metrics tell you about hardware saturation.
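
A minimal scrape of the /metrics text format is enough for a cron-style check. `parse_metric` is a naive illustrative parser, and the sample body below is fabricated for the example, not captured TGI output:

```python
# Parse a Prometheus text-format body for a named gauge value.
# Feed parse_metric the response body of GET /metrics; it returns the
# last whitespace-separated field of the first matching sample line.

def parse_metric(body: str, name: str):
    for line in body.splitlines():
        if line.startswith(name) and not line.startswith("#"):
            return float(line.rsplit(" ", 1)[-1])
    return None

# Fabricated sample body for illustration
sample = """\
# TYPE tgi_queue_size gauge
tgi_queue_size 42
# TYPE tgi_batch_current_size gauge
tgi_batch_current_size 7
"""

queue = parse_metric(sample, "tgi_queue_size")
print(queue)                              # 42.0
print(queue is not None and queue > 50)   # False: below the alert threshold
```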

Request Queueing and Backpressure

TGI queues incoming requests and returns HTTP 429 once the --max-concurrent-requests limit is hit. For production, put TGI behind nginx or an L7 load balancer for connection-level queueing and graceful backpressure:

```nginx
upstream tgi_backend {
    server 127.0.0.1:8080;
    keepalive 64;
}

server {
    listen 443 ssl;
    ssl_certificate /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;
    location / {
        proxy_pass http://tgi_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_read_timeout 300;
        proxy_send_timeout 300;
    }
}
```

For horizontal scaling, TGI instances are stateless. Run multiple instances behind a load balancer; each instance holds its own KV cache but request routing is independent.

Autoscaling

TGI has no native autoscaling. Wire Kubernetes HPA to tgi_queue_size via the Prometheus adapter:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: tgi_queue_size
      target:
        type: AverageValue
        averageValue: "10"
```

Scale out when average queue depth per pod exceeds 10. For the full Kubernetes GPU autoscaling setup including node pool management and spot instance integration, see the Kubernetes GPU orchestration guide.

GPU Pricing on Spheron

| GPU | VRAM | TGI Use Case | On-Demand | Spot |
| --- | --- | --- | --- | --- |
| RTX 4090 | 24GB | 7B FP16, 13B GPTQ | from $0.79/hr | N/A |
| L40S | 48GB | 13B FP16, 34B AWQ | from $0.72/hr | N/A |
| A100 SXM4 80GB | 80GB | 30B-40B FP16, 70B GPTQ/AWQ | from $1.64/hr | N/A |
| H100 SXM5 80GB | 80GB | 70B FP8, 130B with --num-shard 2 | from $4.34/hr | N/A |
| H200 SXM5 | 141GB | 70B FP16 single-card, 130B FP8 | from $4.54/hr | N/A |

Pricing fluctuates based on GPU availability. The prices above are as of 24 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For context on cost-per-token across these GPUs at production scale, see the AI inference cost economics guide.

When to Pick TGI Over vLLM or SGLang

| Criteria | Use TGI | Use vLLM | Use SGLang |
| --- | --- | --- | --- |
| Model source | Hugging Face Hub (gated) | Any (local or Hub) | Any |
| Concurrency level | Low-medium (under 64 req) | High (over 64 req) | Multi-turn agents |
| Prefix reuse | No specific requirement | Some prefix sharing | High prefix overlap |
| Metrics out-of-box | Yes (always on) | Yes (always on) | Requires --enable-metrics |
| Quantization priority | GPTQ/AWQ (existing checkpoints) | FP8 / AWQ | FP8 / AWQ |
| A100 FP8 support | No (use GPTQ/AWQ/bitsandbytes) | No | No |
| Streaming latency at 1 req | Best (Rust core) | Competitive | Competitive |

For the full framework decision path including how TGI, vLLM, llama.cpp, and Ollama compare across deployment scenarios, see the LLM deployment guide.

TGI runs cheapest on bare-metal GPU cloud where you control the serving stack without managed API markup. Rent H100 on Spheron | Rent H200 on Spheron | View all GPU pricing

Get started on Spheron

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.