Tutorial

Deploy Hugging Face TGI on GPU Cloud: Production Text Generation Inference Setup Guide (2026)

Written by Mitrasish, Co-founder · Apr 24, 2026
Tags: Hugging Face TGI · Text Generation Inference · TGI Deployment · LLM Inference · Hugging Face Hub · Flash Attention · H100 · vLLM vs TGI · TGI Multi-GPU · LLM Quantization

TGI is Hugging Face's native serving engine. It talks directly to the Hugging Face Hub, handles gated model downloads without a manual token-passing dance, and has been quietly accumulating production features since 2022. The deployment details are scattered across GitHub issues, version changelogs, and outdated tutorials. This guide consolidates them. If you're comparing engines before picking one, the vLLM vs TensorRT-LLM vs SGLang benchmarks cover throughput and latency numbers across the three dominant alternatives.

What Is TGI and Why Hugging Face Built It

TGI's core is written in Rust, with Python bindings exposing the server interface. The Rust core handles token streaming with low per-token overhead. Python sits above it for model loading, sampling logic, and OpenAI-compatible API routing.

The main thing TGI does that no other framework matches out of the box: it speaks Hugging Face Hub natively. Pass HUGGING_FACE_HUB_TOKEN and a model ID, and TGI downloads the model, validates the license, and handles gated access automatically. No pre-download step, no manual sharding, no file format conversion.

TGI has native CUDA graph optimization for a specific set of supported model architectures: Llama (all generations), Qwen 2/3, Mistral/Mixtral, Falcon, Command-R, Gemma, and Phi. Outside this list, TGI falls back to eager execution without CUDA graphs, which reduces throughput and increases TTFT compared to the optimized path.

Where TGI sits relative to the alternatives:

| Engine | Core innovation | Best workload | Requires |
| --- | --- | --- | --- |
| TGI | Rust streaming core, native HF Hub | HF gated models, low-to-medium concurrency | HF token for gated models |
| vLLM | PagedAttention, continuous batching | High-throughput batch inference | Python runtime |
| SGLang | RadixAttention prefix cache | Agentic multi-turn with shared prefixes | Python runtime |

At low-to-medium concurrency (under 64 requests), TGI and vLLM deliver similar throughput for supported models. Above 64 concurrent requests, vLLM's PagedAttention memory management typically pulls ahead because it wastes less VRAM per request. If shared prefixes are your bottleneck, SGLang's RadixAttention wins regardless of concurrency.

Hardware and VRAM Requirements

VRAM math is straightforward: count bytes per parameter. BF16 is 2 bytes per param, FP8 is 1 byte, INT4 is 0.5 bytes. A 70B model in BF16 needs 140GB. In FP8, 70GB. You also need headroom for KV cache (typically 10-20% of the total VRAM budget for 8K context). For the full derivation, see the GPU memory requirements for LLMs guide.
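
The arithmetic above can be sketched as a quick estimator. The bytes-per-parameter constants follow the rule of thumb in this section, and the 15% KV-cache headroom default and helper name are our own illustrative choices:

```python
# Back-of-envelope VRAM estimate: weights footprint plus KV-cache headroom.
# Constants follow the per-parameter byte counts described above; real
# usage varies by engine, context length, and batch configuration.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str,
                     kv_headroom: float = 0.15) -> float:
    """Weights in GB plus a fractional headroom for the KV cache."""
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb * (1.0 + kv_headroom)

# 70B in BF16: 140GB of weights, ~161GB with 15% KV headroom
print(round(estimate_vram_gb(70, "bf16"), 1))
# 70B in FP8: 70GB of weights, ~80.5GB with headroom -- hence "tight" on 80GB
print(round(estimate_vram_gb(70, "fp8"), 1))
```

Note that the FP8 figure lands just above 80GB, which is why a single H100 only fits 70B FP8 once the KV cache is capped.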

| Model | Params (active) | Precision | Min VRAM | Recommended GPU | Multi-shard config |
| --- | --- | --- | --- | --- | --- |
| Llama 3.3 70B | 70B dense | FP8 | 75GB | H100 SXM5 80GB | 1x (tight), 2x (comfortable) |
| Llama 3.3 70B | 70B dense | BF16 | 145GB | 2x H100 SXM5 | --num-shard 2 |
| Llama 4 Scout | 17B active / 109B total MoE | FP8 | 130GB | 2x H100 SXM5 | --num-shard 2 |
| Llama 4 Scout | 17B active / 109B total MoE | BF16 | 220GB | 2x H200 SXM5 | --num-shard 2 |
| Qwen3 30B-A3B | 3B active / 30B MoE | BF16 | 62GB | A100 SXM4 80GB | 1x |
| Mistral Small 3.1 24B | 24B dense | BF16 | 50GB | A100 SXM4 80GB | 1x |
| Mistral Large 2 123B | 123B dense | FP8 | 130GB | 2x H100 SXM5 | --num-shard 2 |

Important note on Llama 4 Scout --num-shard values: Scout's architecture requires --num-shard to evenly divide the number of key-value heads. Valid values for Scout are 2, 4, and 8. Using --num-shard 3 or other non-divisors causes TGI to fail at startup with a shape mismatch error that does not always clearly state the cause.
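
A launch-script guard can catch a bad shard count before TGI throws its opaque shape error. The KV head count below is an assumed illustrative value, not read from the Scout config:

```python
# Sanity-check a --num-shard value against a model's KV head count before
# launching TGI: tensor parallelism must split the heads evenly.
# SCOUT_KV_HEADS = 8 is an assumption for illustration only.

def valid_num_shard(num_kv_heads: int, num_shard: int) -> bool:
    """True if the shard count evenly divides the KV heads."""
    return num_kv_heads % num_shard == 0

SCOUT_KV_HEADS = 8  # assumed value; check the model's config.json

for shard in (2, 3, 4, 8):
    print(shard, valid_num_shard(SCOUT_KV_HEADS, shard))
# 2 and 4 and 8 pass; 3 is the non-divisor that fails at startup
```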

Setting Up Your Spheron Instance

Step 1: Provision the GPU

For 70B at FP8, an H100 SXM5 rental is the tightest single-card option that fits Llama 3.3 70B. You need to cap the KV cache (via --max-input-length and --max-total-tokens) to keep total usage below 80GB. For 70B at BF16, provision a 2x H100 instance. For budget 7B-13B work, L40S instances on Spheron cost less per hour than A100 and have enough memory for 13B at BF16 with room for KV cache.

Log in to app.spheron.ai, select your GPU model, region, and provider, then launch. SSH into the instance once it's running.

Step 2: Verify GPU Access

```bash
nvidia-smi
```

Confirm the GPU count matches what you provisioned, VRAM appears at expected capacity, and the driver version is recent (590.x or later for H100/H200).

Step 3: Verify Docker GPU Access

```bash
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

Should print the same GPU table as the host. If Docker cannot see the GPUs, the NVIDIA Container Toolkit is not configured:

```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Single-GPU TGI Deployment

The following deploys Llama 3.3 70B Instruct at FP8 on a single H100. We pin to TGI version 3.3.0 as a known-good checkpoint. The latest tag advances regularly and can break working configurations. The 3.3.x patch line is backward compatible, so bumping to 3.3.7 (the current patch release) is safe.

```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --quantize fp8 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 4096
```

Flag-by-flag breakdown:

  • --gpus all: expose all host GPUs to the container
  • --shm-size 1g: required. TGI uses CUDA IPC for inter-process communication. Without this flag, TGI appears to work on the first few requests but fails silently under load with CUDA IPC errors. This is one of the most common production footguns.
  • --ipc=host: allows shared memory access between GPU worker processes. Required when --num-shard > 1; safe to include for single-GPU deployments
  • --quantize fp8: uses FP8 Tensor Cores on H100 and Blackwell. Not supported on A100. A100 lacks hardware FP8 Tensor Cores. Use --quantize bitsandbytes (INT8) or --quantize gptq/--quantize awq on A100 instead.
  • --max-input-length 4096: maximum tokens per input prompt. Controls prefill memory reservation.
  • --max-total-tokens 8192: input + output token budget per request. The KV cache allocation is sized against this.
  • --max-batch-prefill-tokens 4096: total token budget for prefill across the current batch. Higher values increase throughput at the cost of TTFT for other waiting requests.

Health check (TGI returns 503 during model loading, 200 when ready):

```bash
curl http://localhost:8080/health
```
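
In a deploy script, that health check usually becomes a polling loop. A minimal stdlib-only sketch (the function name and defaults are ours):

```python
# Poll TGI's /health endpoint until the model finishes loading.
# TGI returns 503 while loading and 200 when ready, so a retry loop
# works as a readiness gate before routing traffic.
import time
import urllib.request
import urllib.error

def wait_until_ready(url: str = "http://localhost:8080/health",
                     timeout_s: int = 600, interval_s: int = 5) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # 503 or connection refused: still loading
        time.sleep(interval_s)
    return False
```

Budget the timeout generously on first boot, since the initial Hub download dominates loading time.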

Test inference:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is flash attention?"}],
    "max_tokens": 256
  }'
```
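
Any OpenAI-compatible client can hit the same route. Here is a stdlib-only sketch that builds the identical request from Python; `build_request` is a hypothetical helper, not part of TGI:

```python
# Build the same chat request as the curl example, stdlib only.
# The "tgi" model name is a placeholder: TGI serves whatever model
# it loaded regardless of the name in the payload.
import json
import urllib.request

def build_request(prompt: str, base_url: str = "http://localhost:8080"):
    payload = {
        "model": "tgi",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("What is flash attention?")
# urllib.request.urlopen(req) would return the JSON completion once TGI is up
print(req.full_url)  # http://localhost:8080/v1/chat/completions
```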

Multi-GPU Deployment with --num-shard

--num-shard N tells TGI to split the model across N GPUs using tensor parallelism. TGI partitions attention heads and MLP layers across the shard count; communication happens through NCCL all-reduce operations between GPUs.

Two-GPU deployment (70B at BF16, 2x H100):

```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --num-shard 2
```

Four-GPU deployment (130B+ models):

```bash
docker run \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id mistralai/Mistral-Large-Instruct-2407 \
  --num-shard 4 \
  --quantize fp8
```

NVLink (present on SXM instances) dramatically reduces all-reduce latency. On a 2x H100 SXM5 node with NVLink, the all-reduce for a 70B model's attention heads takes under 10 microseconds. On PCIe setups without NVLink, the same operation goes over the PCIe bus and takes 60-100 microseconds, which adds measurable TTFT overhead at every layer boundary. The NVLink advantage compounds as shard count grows: at --num-shard 4 or higher, the latency gap between SXM and PCIe instances is significant.

A single H200 (141GB) handles Llama 3.3 70B at BF16 on one card, eliminating the need for multi-shard communication overhead entirely. For models that fit, single-card is almost always faster than two-card.

To debug NCCL communication issues, add -e NCCL_DEBUG=INFO to the Docker command. NCCL prints collective operation timing and error details to stdout.

TGI-Specific Optimizations

Flash Attention

TGI enables Flash Attention 2 by default on any GPU where the kernel is available: H100, A100, L40S, and most Ampere+ cards. No flag is needed. TGI auto-detects the CUDA compute capability at startup and loads the appropriate attention kernel.

If you see unexpected memory behavior or need to isolate an attention-related bug, disable it explicitly:

```bash
--disable-flash-attention
```

This forces fallback to standard scaled dot-product attention, which is correct but slower and uses significantly more VRAM at long context lengths.

Quantization: FP8, GPTQ, AWQ, and bitsandbytes

TGI supports four quantization modes:

FP8 (--quantize fp8): hardware-native on H100 and Blackwell (B100, B200). Approximately 50% VRAM reduction relative to BF16, with roughly 1.5x throughput improvement on supported models. Do not use on A100. A100 lacks FP8 Tensor Cores; TGI will either error or fall back to a software path that is slower than BF16.

GPTQ (--quantize gptq): works on any CUDA GPU including A100 and RTX series. Requires a pre-quantized model checkpoint (the original model must have been quantized and saved in GPTQ format before you load it). Good option when you already have a GPTQ checkpoint and need to serve it without reprocessing.

AWQ (--quantize awq): lower quality loss than GPTQ at similar 4-bit compression. TGI supports AWQ kernels natively. For the quantization process itself, see the AWQ quantization guide.

bitsandbytes (--quantize bitsandbytes): on-the-fly INT8 quantization during model load. Works on any GPU. Slower than FP8 and AWQ at inference time, but requires no pre-quantized checkpoint. Useful for A100 when VRAM is tight and you don't have a GPTQ/AWQ checkpoint ready.
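
The compatibility rules above can be encoded in a small dispatch helper. The GPU list and function are illustrative, not an official TGI mapping, and real capability detection should query compute capability rather than a name list:

```python
# Map GPU family to a safe --quantize choice, encoding the rules above:
# FP8 needs H100/Blackwell Tensor Cores; A100 uses a pre-quantized
# GPTQ/AWQ checkpoint if one exists, otherwise on-the-fly bitsandbytes.
# The FP8_CAPABLE set is an illustrative assumption, not exhaustive.

FP8_CAPABLE = {"H100", "H200", "B100", "B200"}

def pick_quantize(gpu: str, has_prequantized_checkpoint: bool) -> str:
    if gpu in FP8_CAPABLE:
        return "fp8"
    if has_prequantized_checkpoint:
        return "awq"          # or "gptq", depending on the checkpoint format
    return "bitsandbytes"     # on-the-fly INT8, no checkpoint required

print(pick_quantize("H100", False))   # fp8
print(pick_quantize("A100", True))    # awq
print(pick_quantize("A100", False))   # bitsandbytes
```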

Continuous Batching and Prefill Tokens

--max-batch-prefill-tokens controls the total token budget allocated to the prefill phase per batch cycle. Lower values (4096) prioritize TTFT for interactive use. Higher values (16384) increase throughput at the cost of TTFT for other queued requests.

--waiting-served-ratio (default: 1.2) controls how aggressively TGI waits to form larger batches before serving. At 1.2, TGI will wait slightly longer than the current serving rate to allow more requests to join the batch. Raise to 2.0 for batch throughput workloads where request latency is less important than overall tokens-per-second. For a deeper look at continuous batching mechanics and how these parameters interact, see LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill.

Speculative Decoding

--speculate N enables speculative decoding with N draft tokens. TGI checks each speculative token against the base model and accepts runs of correct tokens without a full forward pass, reducing ITL at low concurrency.

Speculative decoding helps most when concurrency is below 8. Above that, the overhead of evaluating draft tokens per request starts to outweigh the gains. At high concurrency, the batch is already full, so there is nothing to speculate into.
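
A standard back-of-envelope model makes the low-concurrency sweet spot concrete: with per-token draft acceptance rate alpha and N draft tokens, the expected tokens accepted per base-model forward pass follows a geometric series. This is an idealized estimate that ignores draft-model and verification overhead, which is exactly what erodes the win at high concurrency:

```python
# Idealized speculative decoding gain: expected tokens accepted per
# base-model forward pass, assuming independent per-token acceptance
# with probability alpha and N draft tokens per step.
#   E = (1 - alpha**(N+1)) / (1 - alpha)

def expected_tokens_per_step(alpha: float, n_draft: int) -> float:
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1 - alpha ** (n_draft + 1)) / (1 - alpha)

# --speculate 3 with an 80% acceptance rate: ~2.95 tokens per step,
# i.e. close to a 3x reduction in base-model forward passes
print(round(expected_tokens_per_step(0.8, 3), 2))
```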

When the model checkpoint includes bundled Medusa heads, TGI uses them automatically with --speculate N. To use a separate small draft model instead:

```bash
--draft-model-id meta-llama/Llama-3.2-1B-Instruct \
--speculate 3
```

For a full production treatment of speculative decoding across engines, see the speculative decoding production guide.

Tensor Parallelism Notes

Add --ipc=host to the Docker command for every multi-GPU deployment. This gives GPU worker processes access to host shared memory for NCCL operations. Missing it causes intermittent failures on large all-reduce operations that only appear under sustained load.

To diagnose NCCL errors:

```bash
-e NCCL_DEBUG=INFO
```

Outputs collective operation names, sizes, and timing. Look for NCCL WARN or NCCL error lines in the container stdout.

TGI vs vLLM vs SGLang: Benchmark Comparison

The numbers below are representative of the throughput and latency characteristics described in published benchmarks from Hugging Face, LMSys, and community testing on H100 hardware. These are not measurements from a single controlled run on Spheron infrastructure. Treat them as directional guidance, not production guarantees. For exhaustive head-to-head measurements on Spheron H100 hardware, see the vLLM vs TensorRT-LLM vs SGLang benchmarks.

All figures assume Llama 3.3 70B at FP8, single H100 SXM5.

| Metric | TGI | vLLM | SGLang |
| --- | --- | --- | --- |
| Throughput (128 concurrent, tok/s) | ~1,600 | ~2,400 | ~1,900 |
| TTFT (single request, ms) | ~95 | ~110 | ~112 |
| p99 latency (32 concurrent, ms) | ~820 | ~780 | ~750 |
| Prefix cache | No native prefix cache | APC (opt-in flag) | RadixAttention (default) |
| Flash Attention | FA2 (auto) | FA3/FA2 | FA2 |
| HF Hub integration | Native, automatic | Manual download | Manual download |
| Prometheus metrics | Built-in, always on | Built-in | Requires --enable-metrics |

TGI leads on single-request TTFT (lowest latency for a single isolated request) because the Rust core has lower per-token overhead than Python-based engines at near-zero concurrency. At 128 concurrent requests, vLLM's PagedAttention memory management pulls ahead on throughput. SGLang closes the gap on p99 latency for workloads where requests share prefixes, but the numbers above use unique prompts throughout.

The practical summary: TGI is the fastest engine per request at single-digit concurrency. vLLM wins on aggregate throughput at high concurrency. SGLang wins on TTFT whenever shared prefixes are in play.

Production Hardening

Health Checks

TGI's /health endpoint returns 503 during model loading and 200 when the server is ready to serve requests. For Docker deployments:

```bash
docker run \
  --health-cmd "curl -f http://localhost:80/health || exit 1" \
  --health-interval 15s \
  --health-start-period 120s \
  --health-retries 3 \
  --gpus all \
  --shm-size 1g \
  --ipc=host \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  ghcr.io/huggingface/text-generation-inference:3.3.0 \
  --model-id meta-llama/Llama-3.3-70B-Instruct \
  --quantize fp8
```

--health-start-period 120s gives TGI time to download and load the model before Docker starts evaluating health. A 70B model takes 60-90 seconds to load from disk on a fast NVMe drive after the first download.

Prometheus Metrics

TGI exposes /metrics at the same port as the API, automatically, with no configuration flag needed. Key metrics and alert thresholds:

| Metric | Alert condition | Action |
| --- | --- | --- |
| tgi_queue_size | > 50 | Add capacity or enable rate limiting |
| tgi_batch_current_size | Approaching batch cap | Scale horizontally (tune --max-batch-total-tokens or --max-concurrent-requests) |
| tgi_request_queue_duration | p99 rising | Queue pressure building; lower --max-batch-prefill-tokens or scale out |
| tgi_request_mean_time_per_token_duration | > 100ms | GPU overload/saturation or decode contention |

Hugging Face documents the full TGI metrics reference, which you can build Grafana panels on top of.

Pair the TGI metrics with nvidia-smi dmon -s pum -d 10 or DCGM for GPU-level memory and utilization. TGI's queue metrics tell you about request pressure; GPU metrics tell you about hardware saturation.
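
A minimal scrape of the /metrics text format is enough for a cron-style check. `parse_metric` is a naive illustrative parser, and the sample body below is fabricated for the example, not captured TGI output:

```python
# Parse a Prometheus text-format body for a named gauge value.
# Feed parse_metric the response body of GET /metrics; it returns the
# last whitespace-separated field of the first matching sample line.

def parse_metric(body: str, name: str):
    for line in body.splitlines():
        if line.startswith(name) and not line.startswith("#"):
            return float(line.rsplit(" ", 1)[-1])
    return None

# Fabricated sample body for illustration
sample = """\
# TYPE tgi_queue_size gauge
tgi_queue_size 42
# TYPE tgi_batch_current_size gauge
tgi_batch_current_size 7
"""

queue = parse_metric(sample, "tgi_queue_size")
print(queue)                              # 42.0
print(queue is not None and queue > 50)   # False: below the alert threshold
```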

Request Queueing and Backpressure

TGI queues incoming requests and returns HTTP 429 once the --max-concurrent-requests limit is hit. For production, put TGI behind nginx or an L7 load balancer for connection-level queueing and graceful backpressure:

```nginx
upstream tgi_backend {
    server 127.0.0.1:8080;
    keepalive 64;
}

server {
    listen 443 ssl;
    ssl_certificate /path/to/fullchain.pem;
    ssl_certificate_key /path/to/privkey.pem;
    location / {
        proxy_pass http://tgi_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_read_timeout 300;
        proxy_send_timeout 300;
    }
}
```

For horizontal scaling, TGI instances are stateless. Run multiple instances behind a load balancer; each instance holds its own KV cache but request routing is independent.

Autoscaling

TGI has no native autoscaling. Wire Kubernetes HPA to tgi_queue_size via the Prometheus adapter:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: tgi_queue_size
      target:
        type: AverageValue
        averageValue: "10"
```

Scale out when average queue depth per pod exceeds 10. For the full Kubernetes GPU autoscaling setup including node pool management and spot instance integration, see the Kubernetes GPU orchestration guide.

GPU Pricing on Spheron

| GPU | VRAM | TGI Use Case | On-Demand | Spot |
| --- | --- | --- | --- | --- |
| RTX 4090 | 24GB | 7B FP16, 13B GPTQ | from $0.79/hr | N/A |
| L40S | 48GB | 13B FP16, 34B AWQ | from $0.72/hr | N/A |
| A100 SXM4 80GB | 80GB | 30B-40B FP16, 70B GPTQ/AWQ | from $1.64/hr | N/A |
| H100 SXM5 80GB | 80GB | 70B FP8, 130B with --num-shard 2 | from $4.34/hr | N/A |
| H200 SXM5 | 141GB | 70B FP16 single-card, 130B FP8 | from $4.54/hr | N/A |

Pricing fluctuates based on GPU availability. The prices above are as of 24 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For context on cost-per-token across these GPUs at production scale, see the AI inference cost economics guide.

When to Pick TGI Over vLLM or SGLang

| Criteria | Use TGI | Use vLLM | Use SGLang |
| --- | --- | --- | --- |
| Model source | Hugging Face Hub (gated) | Any (local or Hub) | Any |
| Concurrency level | Low-medium (under 64 req) | High (over 64 req) | Multi-turn agents |
| Prefix reuse | No specific requirement | Some prefix sharing | High prefix overlap |
| Metrics out-of-box | Yes (always on) | Yes (always on) | Requires --enable-metrics |
| Quantization priority | GPTQ/AWQ (existing checkpoints) | FP8 / AWQ | FP8 / AWQ |
| A100 FP8 support | No (use GPTQ/AWQ/bitsandbytes) | No | No |
| Streaming latency at 1 req | Best (Rust core) | Competitive | Competitive |

For the full framework decision path including how TGI, vLLM, llama.cpp, and Ollama compare across deployment scenarios, see the LLM deployment guide.

TGI runs cheapest on bare-metal GPU cloud where you control the serving stack without managed API markup. Rent H100 on Spheron | Rent H200 on Spheron | View all GPU pricing

Get started on Spheron

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.