NVIDIA Dynamo 1.0 went GA on March 16, 2026 at GTC. Disaggregated inference is not a future trend: NVIDIA is shipping it now with up to 7x reported throughput gains (per NVIDIA's published Dynamo benchmarks on DeepSeek R1 on Blackwell). If you need the basics of vLLM setup before adding Dynamo on top, start with vLLM Multi-GPU Production Deployment 2026.
What Is NVIDIA Dynamo
Dynamo is an open-source distributed inference framework. It sits above vLLM as an orchestration layer and routes prefill and decode work to dedicated worker pools rather than running both phases on the same GPU. The NIXL KV cache transport layer handles moving key-value tensors between prefill and decode workers over NVLink or InfiniBand.
The three core components shipped with Dynamo 1.0:
- dynamo-router: accepts OpenAI-compatible API requests, routes them to the right workers, and tracks KV cache placement
- dynamo-worker: wraps vLLM (or another backend) and exposes a prefill or decode role
- NIXL: a low-latency KV cache transfer protocol optimized for NVLink, InfiniBand RDMA, and fallback TCP
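To make the router's job concrete, here is a toy sketch of KV-aware worker selection. It is not Dynamo's routing code: the `DecodeWorker` class, its block counters, and `pick_decode_worker` are hypothetical names, and the only assumption is that each worker reports how full its KV cache is.

```python
from dataclasses import dataclass


@dataclass
class DecodeWorker:
    """Hypothetical view of one decode worker as a router might track it."""
    name: str
    kv_blocks_used: int
    kv_blocks_total: int

    @property
    def kv_pressure(self) -> float:
        """Fraction of this worker's KV cache blocks currently in use."""
        return self.kv_blocks_used / self.kv_blocks_total


def pick_decode_worker(workers: list[DecodeWorker]) -> DecodeWorker:
    """Return the worker with the lowest KV pressure (ties go to the first)."""
    if not workers:
        raise ValueError("no decode workers registered")
    return min(workers, key=lambda w: w.kv_pressure)


pool = [
    DecodeWorker("decode-0", kv_blocks_used=900, kv_blocks_total=1000),
    DecodeWorker("decode-1", kv_blocks_used=250, kv_blocks_total=1000),
    DecodeWorker("decode-2", kv_blocks_used=600, kv_blocks_total=1000),
]
chosen = pick_decode_worker(pool)
print(chosen.name)  # decode-1
```

A real router also has to account for where a request's KV blocks already live, which is why it tracks cache placement rather than load alone.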
The same NIXL transport layer that moves KV blocks between Dynamo workers can also tier KV cache to NVMe storage. For a software-level deployment of this pattern today using LMCache, see NVMe KV Cache Offloading for LLM Inference. For the hardware-accelerated version of this same NVMe backend with BlueField-4 DPU integration, see the guide on ICMSP integration with Dynamo and how cuFile GPU-direct NVMe fits into the NIXL stack.
For context on the inference engine landscape Dynamo sits above, see the vLLM vs TensorRT-LLM vs SGLang comparison.
For teams running Kubernetes who want disaggregation as a first-class CNCF project rather than an NVIDIA-maintained layer, see the llm-d Kubernetes disaggregated inference guide.
Disaggregated Inference: Prefill and Decode Are Different Problems
The Two Phases of LLM Generation
Every LLM generation request has two distinct phases. Prefill processes all input tokens at once in a single forward pass. Because attention is computed over the full prompt, prefill is highly compute-intensive: the linear layers cost FLOPs proportional to the prompt token count n, and attention adds a term that grows as O(n²).
Decode generates one token per step, reading the KV cache for every prior token at each step. With a 4K-token context and 256 output tokens, that means 256 separate KV cache reads covering 4K+ tokens each. Decode is dominated by memory bandwidth, not compute.
In monolithic serving, both phases share the same GPU. A long prefill request blocks decode batches from continuing, which inflates latency for other users in the queue. This head-of-line blocking is what disaggregation solves.
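The decode arithmetic above can be sketched in a few lines. The model shape is an assumption taken from the public Llama 3.1 70B config (80 layers, 8 KV heads, head dimension 128), and the FP16 KV cache is chosen only for illustration; swap in your own values.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache stored per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def decode_kv_traffic_bytes(context_tokens: int, output_tokens: int,
                            per_token_bytes: int) -> int:
    """Total KV bytes read across all decode steps of one request.

    Step i (0-based) reads the KV entries for context_tokens + i tokens.
    """
    tokens_read = sum(context_tokens + i for i in range(output_tokens))
    return tokens_read * per_token_bytes


# Assumed shape: Llama 3.1 70B, FP16 KV cache (2 bytes per element).
PER_TOKEN = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128,
                               bytes_per_elem=2)
print(PER_TOKEN)  # 327680 bytes, i.e. 320 KiB per token

traffic = decode_kv_traffic_bytes(context_tokens=4096, output_tokens=256,
                                  per_token_bytes=PER_TOKEN)
print(f"{traffic / 2**30:.1f} GiB of KV reads for one request")
```

Each of those reads does very little math per byte fetched, which is why decode speed tracks memory bandwidth rather than TFLOPS.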
Why Separating Them Improves Both
Dedicated prefill workers can process incoming prompts at full compute utilization without waiting for decode batches to drain. Dedicated decode workers maintain consistent token generation throughput without being interrupted by compute-heavy prefill spikes.
This also lets you match hardware to the bottleneck. Prefill nodes benefit from raw compute (H100 SXM5 at 3,958 TFLOPS FP8 with sparsity). Decode nodes benefit from memory bandwidth (A100 80G at 2 TB/s matches H100 PCIe on bandwidth at a lower hourly rate).
| Phase | Bottleneck | Ops per Step | Best GPU Trait |
|---|---|---|---|
| Prefill | Compute (FLOPs) | Full attention over prompt | High TFLOPS |
| Decode | Memory bandwidth | KV cache reads per token | High HBM BW |
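One way to reason about the table is a roofline-style check: compare a phase's arithmetic intensity (FLOPs per byte moved) with the GPU's ridge point (peak FLOPs divided by memory bandwidth). The sketch below uses the H100 SXM5 figures quoted in this article and a deliberately simplified intensity model (about 2 FLOPs per FP8 weight byte per token processed, ignoring KV cache and activation traffic), so treat the results as directional only.

```python
def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs per byte at which a GPU shifts from memory-bound to compute-bound."""
    return peak_flops / mem_bandwidth


def bottleneck(arithmetic_intensity: float, peak_flops: float,
               mem_bandwidth: float) -> str:
    """Classify a workload by comparing its intensity with the ridge point."""
    if arithmetic_intensity >= ridge_point(peak_flops, mem_bandwidth):
        return "compute"
    return "memory bandwidth"


# Peak figures quoted in this article for H100 SXM5.
H100_FP8_FLOPS = 3958e12   # FLOP/s
H100_HBM_BW = 3.35e12      # bytes/s

# Simplifying assumption: each FP8 weight byte is used for ~2 FLOPs per token
# processed in the same forward pass.
prefill_intensity = 2 * 4096   # one pass over a 4K-token prompt
decode_intensity = 2 * 32      # one decode step for a batch of 32 sequences

print(bottleneck(prefill_intensity, H100_FP8_FLOPS, H100_HBM_BW))  # compute
print(bottleneck(decode_intensity, H100_FP8_FLOPS, H100_HBM_BW))   # memory bandwidth
```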
GPU Hardware Requirements
Prefill Nodes
H100 SXM5 is the natural choice for prefill: 3,958 TFLOPS FP8 (with sparsity), 3.35 TB/s HBM3. For workloads with very long prompts (128K+ tokens), the H200 SXM5 with 141 GB HBM3e gives you the capacity to hold much larger KV caches before evicting.
For a comparison of H100, H200, and other GPUs for inference workloads, see Best GPU for AI Inference in 2026.
Rent prefill nodes: H100 GPU rental | H200 GPU rental
Decode Nodes
H100 PCIe and A100 80G are both solid choices for decode workers. The A100 80G SXM4 from $1.05/hr on Spheron is the most cost-effective option for decode. Its 2 TB/s HBM2e bandwidth handles token generation well, and the lower hourly rate compounds across the many decode workers a production cluster needs.
Mixing GPU tiers is a legitimate production strategy: H100 SXM5 for prefill, A100 80G for decode. For a full breakdown of which GPU pairs deliver the best cost-to-throughput ratio in a heterogeneous setup, see our heterogeneous GPU inference guide.
| GPU | Role | Spheron Price (lowest) | Why |
|---|---|---|---|
| H100 SXM5 | Prefill | from $2.40/hr | 3,958 TFLOPS FP8 (with sparsity), 3.35 TB/s HBM3 |
| H200 SXM5 | Prefill (long context) | from $1.72/hr (spot) | 141GB HBM3e handles 128K+ prompts |
| H100 PCIe | Decode | from $2.01/hr | 2 TB/s HBM2e, cost-efficient decode |
| A100 80G SXM4 | Decode | from $1.05/hr | 2 TB/s HBM2e, lowest cost per decode |
Pricing fluctuates based on GPU availability. The prices above are based on 25 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Step-by-Step: Deploy Dynamo with vLLM on Spheron
Step 1: Provision GPU Instances
Rent at least two H100 SXM5 instances for prefill and two H100 PCIe or A100 80G instances for decode on Spheron. All instances need to be on the same high-speed network fabric so NIXL KV transfers stay low-latency. Spheron bare-metal instances get dedicated public IPs with no NAT overhead, which matters for NIXL's KV transfer performance. See the Spheron instance types guide for bare-metal vs VM options, and Spheron GPU pricing for current rates.
Step 2: Install Dynamo and vLLM
# Pull the official Dynamo container (NVIDIA NGC)
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
# Or install via pip (requires CUDA 12.9+)
pip install "ai-dynamo[vllm]" "vllm>=0.16.0"
Step 3: Configure the Topology
cluster:
  backend: vllm
  model: meta-llama/Llama-3.1-70B-Instruct
  dtype: fp8
prefill:
  workers: 2
  gpus_per_worker: 1
  kv_transfer: nixl
decode:
  workers: 4
  gpus_per_worker: 1
  max_num_seqs: 256
router:
  port: 8000
  strategy: least-kv-pressure
Step 4: Start Prefill Workers
# Run on each prefill node
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode prefill \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
  --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
Step 5: Start Decode Workers
# Run on each decode node
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
For a deep dive into NIXL's architecture and multi-node setup, see NVIDIA NIXL: KV Cache Transfers for Disaggregated Inference.
Step 6: Start the Router
python3 -m dynamo.frontend --port 8000
# Test the endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct",
       "prompt": "Explain disaggregated inference in one sentence.",
       "max_tokens": 100}'
Benchmarks: Disaggregated vs Monolithic Inference
The numbers below use Llama 3.1 70B FP8 on H100 SXM5 bare-metal on Spheron, consistent with the methodology from the vLLM vs TensorRT-LLM vs SGLang benchmarks post.
| Configuration | Concurrency | Throughput (tok/s) | TTFT (ms) | Cost/1M tokens |
|---|---|---|---|---|
| vLLM monolithic (1x H100) | 32 | ~820 | ~320 | ~$2.80 |
| Dynamo disaggregated (2+2) | 32 | ~3,200 | ~85 | ~$1.90 |
| Dynamo disaggregated (4+8) | 128 | ~5,200 | ~90 | ~$1.60 |
TTFT drops sharply in the disaggregated setup because prefill nodes are never interrupted by decode work. Once prefill completes and the KV cache transfers to a decode worker, the decode worker starts generating tokens immediately without any queued prefill work blocking it.
Methodology note: These are representative estimates based on NVIDIA-published Dynamo benchmarks and vLLM baseline numbers, not direct Spheron-measured results. Always run your own benchmarks with your specific model and request distribution before committing to a topology.
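If you do run your own benchmarks, the two numbers in the table reduce to simple timestamp arithmetic. The helper below is a minimal sketch that assumes you record the send time and the arrival time of every streamed token yourself; `summarize_request` and `summarize_run` are hypothetical names, and the timestamps in the example are synthetic, not measurements.

```python
import statistics


def summarize_request(send_time: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT (ms) and decode rate (tok/s) for one streamed request."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft_ms = (token_times[0] - send_time) * 1000.0
    decode_span = token_times[-1] - token_times[0]
    # Tokens after the first one, divided by the time they took to arrive.
    decode_tok_s = (len(token_times) - 1) / decode_span if decode_span > 0 else 0.0
    return {"ttft_ms": ttft_ms, "decode_tok_s": decode_tok_s}


def summarize_run(requests: list[tuple[float, list[float]]]) -> dict[str, float]:
    """Aggregate per-request stats into median TTFT and mean decode rate."""
    per_request = [summarize_request(sent, times) for sent, times in requests]
    return {
        "p50_ttft_ms": statistics.median(r["ttft_ms"] for r in per_request),
        "mean_decode_tok_s": statistics.fmean(r["decode_tok_s"] for r in per_request),
    }


# Synthetic timestamps in seconds, for illustration only.
run = summarize_run([
    (0.0, [0.085, 0.105, 0.125, 0.145, 0.165]),
    (1.0, [1.095, 1.120, 1.145]),
])
print(run)
```

Measuring TTFT per request rather than as a cluster average matters here, because head-of-line blocking shows up in the tail before it moves the mean.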
Dynamo vs Manual Disaggregation with SGLang and TensorRT-LLM
SGLang RadixAttention vs Dynamo KV Routing
SGLang's RadixAttention handles repeated prefixes on a single node: when multiple requests share a common prefix (system prompt, RAG context, few-shot examples), SGLang computes attention once and caches it. Dynamo routes across nodes but does not focus on prefix sharing.
Dynamo wins at scale with multi-node deployments. SGLang wins for single-node setups where shared prefixes are common. They solve adjacent problems. See the SGLang and vLLM comparison for single-node benchmarks.
For a framework-agnostic setup guide covering vLLM and SGLang disaggregation without Dynamo, see Prefill-Decode Disaggregation on GPU Cloud.
TensorRT-LLM Disaggregation vs Dynamo
TensorRT-LLM's Executor API has a disaggregated mode, but it requires ahead-of-time engine compilation. The compile step takes 20-30 minutes for a 70B model on H100 and must be repeated when you update the model or change batch size configurations.
Dynamo uses vLLM as the backend (more backends coming), which loads models directly without compilation. For teams that update models frequently or want faster iteration, Dynamo's flexibility matters more than TRT-LLM's maximum throughput ceiling.
| Approach | Best For | Limitation |
|---|---|---|
| Dynamo + vLLM | Multi-node disaggregated serving | Requires NIXL-capable interconnect |
| SGLang (single-node) | Shared-prefix workloads, simplicity | No cross-node KV routing |
| TRT-LLM disaggregated | Maximum throughput, fixed workload | Slow compilation, less flexible |
| Manual routing (custom) | Specialized architectures | High maintenance cost |
Cost Analysis: Disaggregated Serving vs Monolithic
Mixing H100 SXM5 for prefill with A100 80G for decode cuts hourly cost compared to all-H100 monolithic setups, while matching or exceeding throughput.
| Setup | GPUs | Hourly Cost | Cost/1M tokens (est.) |
|---|---|---|---|
| Monolithic vLLM (4x H100 SXM5) | 4x H100 SXM5 @ $2.40/hr | $9.60/hr | ~$1.80 |
| Dynamo: 2 prefill H100 + 4 decode A100 | 2x H100 + 4x A100 80G | $9.00/hr | ~$1.10 |
| Dynamo: 2 prefill H100 + 4 decode H100 PCIe | 2x H100 SXM5 + 4x H100 PCIe | $12.84/hr | ~$0.85 |
Pricing fluctuates based on GPU availability. The prices above are based on 25 Mar 2026 and may have changed. Check current GPU pricing for live rates.
Actual savings depend on your prompt-to-output ratio. Long prompts relative to output favor a larger prefill pool. Short prompts with long outputs favor more decode capacity.
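The prompt-to-output trade-off can be turned into a rough first guess at pool sizes. This is a back-of-the-envelope sketch, not a Dynamo feature: `size_pools` is a hypothetical helper, and the per-worker throughput figures in the example are placeholders you would replace with numbers profiled on your own hardware.

```python
import math


def size_pools(requests_per_s: float, prompt_tokens: int, output_tokens: int,
               prefill_tok_s_per_worker: float,
               decode_tok_s_per_worker: float) -> tuple[int, int]:
    """Estimate (prefill_workers, decode_workers) needed to keep up with load."""
    prefill_demand = requests_per_s * prompt_tokens   # prompt tokens/s to ingest
    decode_demand = requests_per_s * output_tokens    # output tokens/s to generate
    prefill = max(1, math.ceil(prefill_demand / prefill_tok_s_per_worker))
    decode = max(1, math.ceil(decode_demand / decode_tok_s_per_worker))
    return prefill, decode


# Placeholder per-worker throughputs (hypothetical, not benchmark results).
PREFILL_TOK_S = 20_000
DECODE_TOK_S = 1_000

# Long prompts, short answers (for example, RAG): the prefill pool dominates.
print(size_pools(4, 8000, 200, PREFILL_TOK_S, DECODE_TOK_S))   # (2, 1)

# Short prompts, long answers (for example, chat): the decode pool dominates.
print(size_pools(4, 500, 1500, PREFILL_TOK_S, DECODE_TOK_S))   # (1, 6)
```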
How Dynamo Cuts GPU Spend
In a monolithic setup, prefill and decode compete for the same GPU. Prefill work is bursty and compute-intensive, which means GPU utilization is uneven. With Dynamo, decode workers spend nearly all their time on token generation with minimal idle time waiting for CPU scheduling or competing with prefill.
This means you can run decode workers at higher average GPU utilization than monolithic setups. Budget decode workers (A100 at $1.05/hr) handle high-volume token generation; expensive prefill slots process prompts faster and free up sooner.
For comparison: H100 instances on AWS and Google Cloud carry a significant markup over bare-metal rates. A 6-GPU Dynamo cluster on Spheron (2x H100 SXM5 + 4x A100) costs $9.00/hr. Running the same GPU count on the major cloud providers typically costs several times more, and that gap compounds across the lifetime of a production deployment.
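The hourly figures in the cost table are straightforward to reproduce and extend to other mixes. The sketch below hard-codes the "from" prices quoted in this article (25 Mar 2026), which will drift; `hourly_cost` and `monthly_cost` are illustrative helpers, and the 30-day month is an assumption.

```python
# "From" prices quoted in this article, in USD per GPU-hour.
PRICE_PER_HOUR = {"H100 SXM5": 2.40, "H100 PCIe": 2.01, "A100 80G": 1.05}


def hourly_cost(gpus: dict[str, int]) -> float:
    """Total USD per hour for a cluster given as {gpu_type: count}."""
    return round(sum(PRICE_PER_HOUR[gpu] * count for gpu, count in gpus.items()), 2)


def monthly_cost(gpus: dict[str, int], hours_per_day: int = 24, days: int = 30) -> float:
    """Projected USD for continuous operation over an assumed 30-day month."""
    return round(hourly_cost(gpus) * hours_per_day * days, 2)


monolithic = {"H100 SXM5": 4}
mixed = {"H100 SXM5": 2, "A100 80G": 4}
all_h100 = {"H100 SXM5": 2, "H100 PCIe": 4}

print(hourly_cost(monolithic))  # 9.6
print(hourly_cost(mixed))       # 9.0
print(hourly_cost(all_h100))    # 12.84
print(monthly_cost(mixed))      # 6480.0
```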
Production Checklist
- Network topology - Ensure prefill and decode workers are on the same fabric. NVLink for single-node, InfiniBand or 400GbE for multi-node. NIXL falls back to TCP but with significant latency penalty. See Spheron network configuration for port and firewall setup on bare-metal instances.
- KV cache sizing - Each prefill worker transfers the full KV cache for every request. Size decode worker VRAM so that max_num_seqs * max_model_len * per-token KV bytes (set by layer count, KV heads, head dimension, and kv_cache_dtype) fits within the HBM left over after model weights.
- Prefill-to-decode ratio - Start with 1:2 (1 prefill per 2 decode workers) and profile. Heavy prompt workloads may need 1:1.
- Autoscaling - Dynamo's router exposes queue depth metrics per worker type. Set up autoscaling rules separately for prefill and decode pools.
- Monitoring - See the GPU monitoring guide for Prometheus and Grafana setup, including Dynamo-specific metrics.
- Model warmup - Run 10-20 warmup requests before putting the router behind a load balancer. Cold KV transfers inflate TTFT until caches are warm.
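For the KV cache sizing item, a worst-case capacity check is a useful sanity test before launch. The sketch below assumes the Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) with an FP8 KV cache; the 35 GB per-GPU weight share and the 90% usable-HBM headroom are hypothetical inputs. It bounds the case where every sequence reaches max_model_len, which paged KV allocation rarely hits in practice.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Per-token KV footprint: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_full_length_seqs(hbm_bytes: int, weight_bytes: int, max_model_len: int,
                         per_token_bytes: int, headroom: float = 0.9) -> int:
    """How many sequences of max_model_len tokens fit in the HBM left after weights."""
    budget = int(hbm_bytes * headroom) - weight_bytes
    if budget <= 0:
        return 0
    return budget // (max_model_len * per_token_bytes)


# Assumed shape: Llama 3.1 70B with an FP8 KV cache (1 byte per element).
FP8_PER_TOKEN = kv_bytes_per_token(80, 8, 128, bytes_per_elem=1)

# Hypothetical decode worker: 80 GB of HBM with 35 GB of it holding weights.
fit = max_full_length_seqs(hbm_bytes=80 * 10**9, weight_bytes=35 * 10**9,
                           max_model_len=4096, per_token_bytes=FP8_PER_TOKEN)
print(FP8_PER_TOKEN, fit)
```

If the result is far below your configured max_num_seqs, either lower max_model_len, add decode workers, or accept that long requests will queue.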
Disaggregated inference is the most significant architectural shift in LLM serving in 2026. Dynamo makes it accessible without building custom routing infrastructure. NIM provides a simpler single-container approach when disaggregated prefill/decode isn't required; see the full NIM deployment guide. For further cost reduction strategies, see the GPU cost optimization playbook.
Spheron provides bare-metal H100 and A100 GPU instances with the high-speed interconnects Dynamo's NIXL protocol requires. No Kubernetes overhead, no shared-tenancy latency. Spin up prefill and decode nodes separately and pay only for what you use.
Quick Setup Guide
Rent at least two H100 SXM5 or H100 PCIe instances on Spheron: one designated as a prefill worker and one as a decode worker. For production, use 2-4 prefill nodes and 4-8 decode nodes depending on your prompt-to-output length ratio.
Pull the official Dynamo container image from NVIDIA NGC or install via pip. Ensure vLLM 0.16+ is installed as the backend engine. Verify CUDA 12.9+ (driver 575.51.03+) and NVLink or high-bandwidth interconnect between nodes.
Set the DYNAMO_PREFILL_WORKERS and DYNAMO_DECODE_WORKERS environment variables or use the dynamo.yaml config file to define worker counts, GPU assignments, and the KV cache transfer protocol (NIXL, which runs over NVLink within a node and RDMA over InfiniBand across nodes).
Launch prefill workers using python3 -m dynamo.vllm with --disaggregation-mode prefill and the target model. Prefill workers process incoming prompts in parallel and transfer KV caches to decode workers upon completion.
Launch decode workers using python3 -m dynamo.vllm with --disaggregation-mode decode. These receive KV caches from prefill workers and continue autoregressive generation. Set --max-num-seqs to tune decode batch size for your latency target.
Launch the Dynamo router process, which accepts OpenAI-compatible API requests, routes prefill work to prefill workers, tracks KV cache locations, and routes decode continuation to the correct decode worker. Test with a curl request to the router endpoint.
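The same smoke test can be scripted instead of run through curl. This is a minimal standard-library sketch that assumes the router is listening on localhost:8000 and returns the usual OpenAI-style completions response; `build_completion_request` and `complete` are illustrative helper names, not part of Dynamo.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str,
             max_tokens: int = 100, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running router)."""
    req = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the OpenAI completions response shape: choices[0].text
    return body["choices"][0]["text"]


req = build_completion_request(
    "http://localhost:8000",
    "meta-llama/Llama-3.1-70B-Instruct",
    "Explain disaggregated inference in one sentence.",
)
print(req.get_method(), req.full_url)
# With the router up: print(complete("http://localhost:8000", "<model>", "<prompt>"))
```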
Frequently Asked Questions
What is NVIDIA Dynamo?
NVIDIA Dynamo is an open-source distributed inference framework that disaggregates LLM serving into separate prefill and decode worker pools. It routes requests intelligently, manages KV cache transfer between nodes, and can deliver up to 7x higher throughput compared to monolithic serving.
What is disaggregated inference?
Disaggregated inference splits the two phases of LLM generation, prefill (processing the input prompt) and decode (generating each output token), onto separate GPU pools. This lets you scale each phase independently and match the hardware to the workload characteristics of each phase.
Which GPUs work best for prefill and decode nodes?
Prefill nodes are compute-bound and benefit from high-FLOPs GPUs like H100 SXM5. Decode nodes are memory-bandwidth-bound and work well with H100 PCIe or A100 80G. You can mix GPU tiers to reduce cost without sacrificing throughput.
How much faster is Dynamo than monolithic serving?
In NVIDIA's benchmarks, Dynamo with disaggregated serving delivers up to 7x higher throughput (measured on DeepSeek R1 on Blackwell). Time to first token (TTFT) drops significantly because dedicated prefill nodes are never interrupted by decode work.
Does Dynamo work with vLLM?
Yes. Dynamo's primary integration is with vLLM. Dynamo sits above vLLM as an orchestration and routing layer, dispatching prefill work to prefill workers and decode continuation to decode workers, with NIXL handling KV cache transfer between them.
