Tutorial

NVIDIA Dynamo 1.0: Disaggregated LLM Inference Deployment Guide (2026)

Written by Mitrasish, Co-founder · Mar 25, 2026
Tags: NVIDIA Dynamo, Disaggregated Inference, LLM Inference, vLLM, H100, GPU Cloud, Inference Optimization

NVIDIA Dynamo 1.0 went GA on March 16, 2026 at GTC. Disaggregated inference is not a future trend: NVIDIA is shipping it now with up to 7x reported throughput gains (per NVIDIA's published Dynamo benchmarks on DeepSeek R1 on Blackwell). If you need the basics of vLLM setup before adding Dynamo on top, start with vLLM Multi-GPU Production Deployment 2026.

What Is NVIDIA Dynamo

Dynamo is an open-source distributed inference framework. It sits above vLLM as an orchestration layer and routes prefill and decode work to dedicated worker pools rather than running both phases on the same GPU. The NIXL KV cache transport layer handles moving key-value tensors between prefill and decode workers over NVLink or InfiniBand.

The three core components shipped with Dynamo 1.0:

  • dynamo-router: accepts OpenAI-compatible API requests, routes them to the right workers, and tracks KV cache placement
  • dynamo-worker: wraps vLLM (or another backend) and exposes a prefill or decode role
  • NIXL: a low-latency KV cache transfer protocol optimized for NVLink, InfiniBand RDMA, and fallback TCP

For context on the inference engine landscape Dynamo sits above, see the vLLM vs TensorRT-LLM vs SGLang comparison.

Disaggregated Inference: Prefill and Decode Are Different Problems

The Two Phases of LLM Generation

Every LLM generation request has two distinct phases. Prefill processes all input tokens at once in a single forward pass. Because attention is computed over the full prompt, prefill FLOPs grow quadratically with prompt length (while KV cache memory grows only linearly), making this phase heavily compute-bound.

Decode generates one token per step, reading the KV cache for every prior token at each step. With a 4K-token context and 256 output tokens, that means 256 separate KV cache reads covering 4K+ tokens each. Decode is dominated by memory bandwidth, not compute.

In monolithic serving, both phases share the same GPU. A long prefill request blocks decode batches from continuing, which inflates latency for other users in the queue. This head-of-line blocking is what disaggregation solves.
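To see why decode is bandwidth-bound, here is a rough lower-bound sketch in Python, assuming Llama-3.1-70B-class dimensions (80 layers, 8 KV heads, head dim 128) with FP8 weights and KV cache. Real engines overlap and batch memory traffic, so treat this as a floor, not a prediction:

```python
# Back-of-envelope memory-bandwidth floor for one decode step.
# Assumed dims roughly match Llama 3.1 70B at FP8 (80 layers, 8 KV heads,
# head_dim 128, GQA). Real engines overlap reads; this is a lower bound.

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=1):
    # K and V tensors for every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def decode_step_floor_ms(context_len, batch, weight_bytes=70e9, hbm_bw=3.35e12):
    """Lower bound on one decode step: read all weights once (shared
    across the batch) plus each sequence's full KV cache."""
    bytes_read = weight_bytes + batch * context_len * kv_bytes_per_token()
    return bytes_read / hbm_bw * 1000

print(kv_bytes_per_token())                     # 163840 bytes ~ 160 KB/token
print(round(decode_step_floor_ms(4096, 1), 1))  # 21.1 ms -> under 50 tok/s at batch 1
```

At batch 1, weight reads dominate; batching amortizes them across sequences, which is why decode throughput climbs with concurrency until KV traffic or HBM capacity becomes the limit.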

Why Separating Them Improves Both

Dedicated prefill workers can process incoming prompts at full compute utilization without waiting for decode batches to drain. Dedicated decode workers maintain consistent token generation throughput without being interrupted by compute-heavy prefill spikes.

This also lets you match hardware to the bottleneck. Prefill nodes benefit from raw compute (H100 SXM5 at 3,958 TFLOPS FP8). Decode nodes benefit from memory bandwidth (A100 80G at 2 TB/s is competitive with H100 PCIe at a lower hourly rate).

Phase   | Bottleneck       | Ops per Step               | Best GPU Trait
Prefill | Compute (FLOPs)  | Full attention over prompt | High TFLOPS
Decode  | Memory bandwidth | KV cache reads per token   | High HBM BW
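The "best GPU trait" column follows from the roofline ridge point: peak FLOPS divided by peak bandwidth, i.e. the arithmetic intensity a kernel must exceed to be compute-bound. A quick sketch, using the H100 SXM5 specs from this post and assuming A100 dense FP16 tensor throughput of 312 TFLOPS:

```python
# Roofline "ridge point": FLOPs per byte a GPU must sustain to become
# compute-bound. H100 SXM5 figures are from the table above; the A100
# dense FP16 tensor figure (312 TFLOPS) is an assumption for illustration.

def ridge_point(tflops, tb_per_s):
    return (tflops * 1e12) / (tb_per_s * 1e12)   # FLOPs per byte

h100_sxm5 = ridge_point(3958, 3.35)   # ~1181 FLOPs/byte
a100_80g  = ridge_point(312, 2.0)     # 156 FLOPs/byte

# Prefill's large batched matmuls run at hundreds of FLOPs/byte, so high
# TFLOPS pays off. Decode reads weights + KV per generated token at only a
# few FLOPs/byte -- far below either ridge point -- so bandwidth decides.
print(round(h100_sxm5), round(a100_80g))
```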

GPU Hardware Requirements

Prefill Nodes

H100 SXM5 is the natural choice for prefill: 3,958 TFLOPS FP8, 3.35 TB/s HBM3. For workloads with very long prompts (128K+ tokens), the H200 SXM5 with 141 GB HBM3e gives you the capacity to hold much larger KV caches before evicting.

For a comparison of H100, H200, and other GPUs for inference workloads, see Best GPU for AI Inference in 2026.

Rent prefill nodes: H100 GPU rental | H200 GPU rental

Decode Nodes

H100 PCIe and A100 80G are both solid choices for decode workers. The A100 80G SXM4 from $1.05/hr on Spheron is the most cost-effective option for decode. Its 2 TB/s HBM2e bandwidth handles token generation well, and the lower hourly rate compounds across the many decode workers a production cluster needs.

Mixing GPU tiers is a legitimate production strategy: H100 SXM5 for prefill, A100 80G for decode.

GPU           | Role                   | Spheron Price (lowest) | Why
H100 SXM5     | Prefill                | from $2.40/hr          | 3,958 TFLOPS FP8, 3.35 TB/s HBM3
H200 SXM5     | Prefill (long context) | from $1.72/hr (spot)   | 141 GB HBM3e handles 128K+ prompts
H100 PCIe     | Decode                 | from $2.01/hr          | 2 TB/s HBM3, cost-efficient decode
A100 80G SXM4 | Decode                 | from $1.05/hr          | 2 TB/s HBM2e, lowest cost per decode

Pricing fluctuates based on GPU availability. The prices above are based on 25 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Step-by-Step: Deploy Dynamo with vLLM on Spheron

Step 1: Provision GPU Instances

Rent at least two H100 SXM5 instances for prefill and two H100 PCIe or A100 80G instances for decode on Spheron. All instances need to be on the same high-speed network fabric so NIXL KV transfers stay low-latency. Spheron bare-metal instances get dedicated public IPs with no NAT overhead, which matters for NIXL's KV transfer performance. See the Spheron instance types guide for bare-metal vs VM options, and Spheron GPU pricing for current rates.

Step 2: Install Dynamo and vLLM

```bash
# Pull the official Dynamo container (NVIDIA NGC)
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0

# Or install via pip (requires CUDA 12.9+)
pip install "ai-dynamo[vllm]" "vllm>=0.16.0"
```

Step 3: Configure the Topology

```yaml
cluster:
  backend: vllm
  model: meta-llama/Llama-3.1-70B-Instruct
  dtype: fp8

prefill:
  workers: 2
  gpus_per_worker: 1
  kv_transfer: nixl

decode:
  workers: 4
  gpus_per_worker: 1
  max_num_seqs: 256

router:
  port: 8000
  strategy: least-kv-pressure
```

Step 4: Start Prefill Workers

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode prefill \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
  --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
```

Step 5: Start Decode Workers

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```

Step 6: Start the Router

```bash
python3 -m dynamo.frontend --port 8000

# Test the endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct",
       "prompt": "Explain disaggregated inference in one sentence.",
       "max_tokens": 100}'
```
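The same smoke test from Python, using only the standard library. This assumes the router follows the standard OpenAI completions response shape (`choices[0].text`), as the curl example suggests; adjust `base_url` for your deployment:

```python
# Minimal Python client for the router's OpenAI-compatible endpoint.
# Assumes the standard OpenAI completions response schema.
import json
import urllib.request

def build_payload(prompt, model="meta-llama/Llama-3.1-70B-Instruct",
                  max_tokens=100):
    # Body matches the curl example above
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt, base_url="http://localhost:8000", **kw):
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt, **kw)).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("Explain disaggregated inference in one sentence."))
```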

Benchmarks: Disaggregated vs Monolithic Inference

The numbers below use Llama 3.1 70B FP8 on H100 SXM5 bare-metal on Spheron, consistent with the methodology from the vLLM vs TensorRT-LLM vs SGLang benchmarks post.

Configuration              | Concurrency | Throughput (tok/s) | TTFT (ms) | Cost/1M tokens
vLLM monolithic (1x H100)  | 32          | ~820               | ~320      | ~$2.80
Dynamo disaggregated (2+2) | 32          | ~3,200             | ~85       | ~$1.90
Dynamo disaggregated (4+8) | 128         | ~5,200             | ~90       | ~$1.60

TTFT drops sharply in the disaggregated setup because prefill nodes are never interrupted by decode work. Once prefill completes and the KV cache transfers to a decode worker, the decode worker starts generating tokens immediately without any queued prefill work blocking it.

Methodology note: These are representative estimates based on NVIDIA-published Dynamo benchmarks and vLLM baseline numbers, not direct Spheron-measured results. Always run your own benchmarks with your specific model and request distribution before committing to a topology.

Dynamo vs Manual Disaggregation with SGLang and TensorRT-LLM

SGLang RadixAttention vs Dynamo KV Routing

SGLang's RadixAttention handles repeated prefixes on a single node: when multiple requests share a common prefix (system prompt, RAG context, few-shot examples), SGLang computes attention once and caches it. Dynamo routes across nodes but does not focus on prefix sharing.

Dynamo wins at scale with multi-node deployments. SGLang wins for single-node setups where shared prefixes are common. They solve adjacent problems. See the SGLang and vLLM comparison for single-node benchmarks.

TensorRT-LLM Disaggregation vs Dynamo

TensorRT-LLM's Executor API has a disaggregated mode, but it requires ahead-of-time engine compilation. The compile step takes 20-30 minutes for a 70B model on H100 and must be repeated when you update the model or change batch size configurations.

Dynamo uses vLLM as the backend (more backends coming), which loads models directly without compilation. For teams that update models frequently or want faster iteration, Dynamo's flexibility matters more than TRT-LLM's maximum throughput ceiling.

Approach                | Best For                            | Limitation
Dynamo + vLLM           | Multi-node disaggregated serving    | Requires NIXL-capable interconnect
SGLang (single-node)    | Shared-prefix workloads, simplicity | No cross-node KV routing
TRT-LLM disaggregated   | Maximum throughput, fixed workload  | Slow compilation, less flexible
Manual routing (custom) | Specialized architectures           | High maintenance cost

Cost Analysis: Disaggregated Serving vs Monolithic

Mixing H100 SXM5 for prefill with A100 80G for decode cuts hourly cost compared to all-H100 monolithic setups, while matching or exceeding throughput.

Setup                                       | GPUs                        | Hourly Cost | Cost/1M tokens (est.)
Monolithic vLLM (4x H100 SXM5)              | 4x H100 SXM5 @ $2.40/hr     | $9.60/hr    | ~$1.80
Dynamo: 2 prefill H100 + 4 decode A100      | 2x H100 SXM5 + 4x A100 80G  | $9.00/hr    | ~$1.10
Dynamo: 2 prefill H100 + 4 decode H100 PCIe | 2x H100 SXM5 + 4x H100 PCIe | $12.84/hr   | ~$0.85

Pricing fluctuates based on GPU availability. The prices above are based on 25 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Actual savings depend on your prompt-to-output ratio. Long prompts relative to output favor a larger prefill pool. Short prompts with long outputs favor more decode capacity.
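The cost-per-token column reduces to simple arithmetic: hourly cluster cost divided by millions of tokens served per hour. A minimal sketch of that formula; the table's estimates also fold in utilization assumptions, so this is the raw calculation, not a reproduction of those figures:

```python
# Cost per 1M generated tokens from hourly cluster cost and sustained
# throughput. Utilization assumptions are deliberately left out; plug in
# your own measured tok/s.

def cost_per_million_tokens(hourly_usd, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / (tokens_per_hour / 1_000_000)

# A cluster billing $3.60/hr sustaining 1,000 tok/s serves 3.6M tokens/hr:
print(cost_per_million_tokens(3.60, 1000))   # 1.0 -> $1.00 per 1M tokens
```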

How Dynamo Cuts GPU Spend

In a monolithic setup, prefill and decode compete for the same GPU. Prefill work is bursty and compute-intensive, which means GPU utilization is uneven. With Dynamo, decode workers spend nearly all their time on token generation with minimal idle time waiting for CPU scheduling or competing with prefill.

This means you can run decode workers at higher average GPU utilization than monolithic setups. Budget decode workers (A100 at $1.05/hr) handle high-volume token generation; expensive prefill slots process prompts faster and free up sooner.

For comparison: H100 instances on AWS and Google Cloud carry a significant markup over bare-metal rates. A 6-GPU Dynamo cluster on Spheron (2x H100 SXM5 + 4x A100) costs $9.00/hr. Running the same GPU count on the major cloud providers typically costs several times more, and that gap compounds across the lifetime of a production deployment.

Production Checklist

  1. Network topology - Ensure prefill and decode workers are on the same fabric. NVLink for single-node, InfiniBand or 400GbE for multi-node. NIXL falls back to TCP but with significant latency penalty. See Spheron network configuration for port and firewall setup on bare-metal instances.
  2. KV cache sizing - Each prefill worker transfers the full KV cache for every request. Size decode worker VRAM so that max_num_seqs × max_model_len × KV-cache bytes per token (a function of layer count, KV heads, head dim, and KV dtype) fits within HBM alongside the model weights.
  3. Prefill-to-decode ratio - Start with 1:2 (1 prefill per 2 decode workers) and profile. Heavy prompt workloads may need 1:1.
  4. Autoscaling - Dynamo's router exposes queue depth metrics per worker type. Set up autoscaling rules separately for prefill and decode pools.
  5. Monitoring - See the GPU monitoring guide for Prometheus and Grafana setup, including Dynamo-specific metrics.
  6. Model warmup - Run 10-20 warmup requests before putting the router behind a load balancer. Cold KV transfers inflate TTFT until caches are warm.
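Checklist item 2 can be made numerical. A worst-case bound sketch, assuming Llama-3.1-70B-class dimensions (80 layers, 8 KV heads, head dim 128) and FP8 KV at 1 byte per value; paged allocators commit blocks only for live tokens, so real usage tracks actual sequence lengths rather than this bound:

```python
# Worst-case KV cache VRAM for a decode worker (checklist item 2).
# Dims are an assumption matching Llama 3.1 70B; adjust for your model.

def kv_cache_worst_case_gb(max_num_seqs, max_model_len,
                           layers=80, kv_heads=8, head_dim=128, dtype_bytes=1):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K + V
    return max_num_seqs * max_model_len * per_token / 1e9

# max_num_seqs=256 at a full 4K context would need ~172 GB in the worst
# case -- far more than one 80 GB card, which is why profiling real
# sequence lengths (not just the config ceiling) matters.
print(round(kv_cache_worst_case_gb(256, 4096)))
```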

Disaggregated inference is the most significant architectural shift in LLM serving in 2026. Dynamo makes it accessible without building custom routing infrastructure. For further cost reduction strategies, see the GPU cost optimization playbook.

Spheron provides bare-metal H100 and A100 GPU instances with the high-speed interconnects Dynamo's NIXL protocol requires. No Kubernetes overhead, no shared-tenancy latency. Spin up prefill and decode nodes separately and pay only for what you use.

Rent H100 for Dynamo → | View A100 pricing → | See all GPU pricing →

Deploy Dynamo on Spheron →
