NVIDIA Dynamo 1.0: Disaggregated LLM Inference Deployment Guide (2026)

NVIDIA Dynamo 1.0 went GA on March 16, 2026 at GTC. Disaggregated inference is not a future trend: NVIDIA is shipping it now with up to 7x reported throughput gains (per NVIDIA's published Dynamo benchmarks on DeepSeek R1 on Blackwell). If you need the basics of vLLM setup before adding Dynamo on top, start with vLLM Multi-GPU Production Deployment 2026.

What Is NVIDIA Dynamo

Dynamo is an open-source distributed inference framework. It sits above vLLM as an orchestration layer and routes prefill and decode work to dedicated worker pools rather than running both phases on the same GPU. The NIXL KV cache transport layer handles moving key-value tensors between prefill and decode workers over NVLink or InfiniBand.

The three core components shipped with Dynamo 1.0:

dynamo-router: accepts OpenAI-compatible API requests, routes them to the right workers, and tracks KV cache placement
dynamo-worker: wraps vLLM (or another backend) and exposes a prefill or decode role
NIXL: a low-latency KV cache transfer protocol optimized for NVLink, InfiniBand RDMA, and fallback TCP
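
To make the router's placement role concrete, here is a toy sketch of one possible policy: send new decode work to the worker whose KV cache is least full. The `DecodeWorker` class, its fields, and `pick_decode_worker` are hypothetical names chosen for illustration; this is not Dynamo's routing code, only a picture of the kind of decision a KV-aware router makes.

```python
from dataclasses import dataclass


@dataclass
class DecodeWorker:
    """Hypothetical view of a decode worker as a router might track it."""
    name: str
    kv_blocks_used: int
    kv_blocks_total: int

    @property
    def kv_pressure(self) -> float:
        # Fraction of this worker's KV cache blocks currently occupied.
        return self.kv_blocks_used / self.kv_blocks_total


def pick_decode_worker(workers: list) -> DecodeWorker:
    """Return the worker with the lowest KV cache occupancy."""
    if not workers:
        raise ValueError("no decode workers registered")
    return min(workers, key=lambda w: w.kv_pressure)


# Made-up occupancy numbers for three decode workers.
workers = [
    DecodeWorker("decode-0", kv_blocks_used=900, kv_blocks_total=1000),
    DecodeWorker("decode-1", kv_blocks_used=300, kv_blocks_total=1000),
    DecodeWorker("decode-2", kv_blocks_used=650, kv_blocks_total=1000),
]
chosen = pick_decode_worker(workers)
print(chosen.name)  # decode-1
```

A real router would also weigh queue depth and where a request's KV blocks already live, so treat this as the simplest version of the idea.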

The same NIXL transport layer that moves KV blocks between Dynamo workers can also tier KV cache to NVMe storage. For a software-level deployment of this pattern today using LMCache, see NVMe KV Cache Offloading for LLM Inference. For the hardware-accelerated version of this same NVMe backend with BlueField-4 DPU integration, see the guide on ICMSP integration with Dynamo and how cuFile GPU-direct NVMe fits into the NIXL stack.

For context on the inference engine landscape Dynamo sits above, see the vLLM vs TensorRT-LLM vs SGLang comparison.

For teams running Kubernetes who want disaggregation as a first-class CNCF project rather than an NVIDIA-maintained layer, see the llm-d Kubernetes disaggregated inference guide.

Disaggregated Inference: Prefill and Decode Are Different Problems

The Two Phases of LLM Generation

Every LLM generation request has two distinct phases. Prefill processes all input tokens at once in a single forward pass. Because attention is computed over the full prompt, prefill is highly compute-intensive: dense-layer FLOPs grow linearly with prompt token count and attention FLOPs grow quadratically, while the KV cache it produces grows linearly (O(n)) in memory.

Decode generates one token per step, reading the KV cache for every prior token at each step. With a 4K-token context and 256 output tokens, that means 256 separate KV cache reads covering 4K+ tokens each. Decode is dominated by memory bandwidth, not compute.
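
The 4K-context, 256-token example above can be turned into a rough byte count. The sketch below assumes the Llama 3.1 70B attention shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and a 16-bit KV cache; the function names are illustrative, and the result covers one sequence only.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def decode_kv_bytes_read(prompt_tokens: int, output_tokens: int,
                         per_token_bytes: int) -> int:
    """Total KV bytes read while decoding one sequence.

    Step i (0-based) attends over the prompt plus the i tokens generated so far.
    """
    return sum((prompt_tokens + i) * per_token_bytes for i in range(output_tokens))


# Assumed shape: Llama 3.1 70B, 16-bit KV cache (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128,
                               bytes_per_elem=2)
first_step = 4096 * per_token
total = decode_kv_bytes_read(prompt_tokens=4096, output_tokens=256,
                             per_token_bytes=per_token)

print(f"KV per token: {per_token / 1024:.0f} KiB")
print(f"First decode step reads: {first_step / 2**30:.2f} GiB")
print(f"256-token decode reads: {total / 2**30:.0f} GiB")
```

Under these assumptions each decode step moves more than a gibibyte of KV data for a single sequence while doing very little arithmetic on it, which is why the phase tracks memory bandwidth.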

In monolithic serving, both phases share the same GPU. A long prefill request blocks decode batches from continuing, which inflates latency for other users in the queue. This head-of-line blocking is what disaggregation solves.

Why Separating Them Improves Both

Dedicated prefill workers can process incoming prompts at full compute utilization without waiting for decode batches to drain. Dedicated decode workers maintain consistent token generation throughput without being interrupted by compute-heavy prefill spikes.

This also lets you match hardware to the bottleneck. Prefill nodes benefit from raw compute (H100 SXM5 at 3,958 TFLOPS FP8 with sparsity, roughly half that for dense math). Decode nodes benefit from memory bandwidth (A100 80G at 2 TB/s is competitive with H100 PCIe at a lower hourly rate).

Phase	Bottleneck	Ops per Step	Best GPU Trait
Prefill	Compute (FLOPs)	Full attention over prompt	High TFLOPS
Decode	Memory bandwidth	KV cache reads per token	High HBM BW
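
A back-of-the-envelope roofline makes the table concrete. The sketch below uses two common approximations, stated here as assumptions: prefill costs about 2 FLOPs per parameter per prompt token (attention FLOPs ignored), and each decode step streams the model weights plus the KV cache from HBM once. The model size (70B parameters, FP8 weights), the dense FP8 peak, and the 4K-token KV size are illustrative inputs, and the results are idealized floors rather than predictions.

```python
def prefill_floor_seconds(params: float, prompt_tokens: int,
                          peak_flops: float) -> float:
    """Compute-bound floor: ~2 FLOPs per parameter per prompt token."""
    return 2.0 * params * prompt_tokens / peak_flops


def decode_step_floor_seconds(weight_bytes: float, kv_bytes: float,
                              bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-bound floor: one pass over weights and KV cache per step."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s


PARAMS = 70e9                 # assumed 70B-parameter model
WEIGHT_BYTES = 70e9           # FP8 weights, 1 byte per parameter
DENSE_FP8_FLOPS = 1.979e15    # assumed H100 SXM5 dense FP8 peak
BW_3_35_TBS = 3.35e12         # H100 SXM5 HBM3
BW_2_TBS = 2.0e12             # A100 80G / H100 PCIe class
KV_BYTES_4K = 4096 * 327_680  # 4K tokens of 16-bit KV for a Llama 70B shape

prefill_s = prefill_floor_seconds(PARAMS, 4096, DENSE_FP8_FLOPS)
decode_fast = decode_step_floor_seconds(WEIGHT_BYTES, KV_BYTES_4K, BW_3_35_TBS)
decode_slow = decode_step_floor_seconds(WEIGHT_BYTES, KV_BYTES_4K, BW_2_TBS)

print(f"Prefill floor (4K prompt): {prefill_s * 1000:.0f} ms")
print(f"Decode step floor at 3.35 TB/s: {decode_fast * 1000:.1f} ms")
print(f"Decode step floor at 2 TB/s: {decode_slow * 1000:.1f} ms")
```

Batching amortizes the weight read across sequences, so real decode throughput depends heavily on batch size; the point of the sketch is only that prefill time moves with FLOPS and decode step time moves with bandwidth.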

GPU Hardware Requirements

Prefill Nodes

H100 SXM5 is the natural choice for prefill: 3,958 TFLOPS FP8 (with sparsity), 3.35 TB/s HBM3. For workloads with very long prompts (128K+ tokens), the H200 SXM5 with 141 GB HBM3e gives you the capacity to hold much larger KV caches before evicting.

For a comparison of H100, H200, and other GPUs for inference workloads, see Best GPU for AI Inference in 2026.

Decode Nodes

H100 PCIe and A100 80G are both solid choices for decode workers. The A100 80G SXM4 from $1.05/hr on Spheron is the most cost-effective option for decode. Its 2 TB/s HBM2e bandwidth handles token generation well, and the lower hourly rate compounds across the many decode workers a production cluster needs.

Mixing GPU tiers is a legitimate production strategy: H100 SXM5 for prefill, A100 80G for decode. For a full breakdown of which GPU pairs deliver the best cost-to-throughput ratio in a heterogeneous setup, see our heterogeneous GPU inference guide.

GPU	Role	Spheron Price (lowest)	Why
H100 SXM5	Prefill	from $2.40/hr	3,958 TFLOPS FP8 (with sparsity), 3.35 TB/s HBM3
H200 SXM5	Prefill (long context)	from $1.72/hr (spot)	141GB HBM3e handles 128K+ prompts
H100 PCIe	Decode	from $2.01/hr	2 TB/s HBM2e, cost-efficient decode
A100 80G SXM4	Decode	from $1.05/hr	2 TB/s HBM2e, lowest cost per decode

Pricing fluctuates based on GPU availability. The prices above are as of 25 Mar 2026 and may have changed. Check current GPU pricing for live rates.

Step-by-Step: Deploy Dynamo with vLLM on Spheron

Step 1: Provision GPU Instances

Rent at least two H100 SXM5 instances for prefill and two H100 PCIe or A100 80G instances for decode on Spheron. All instances need to be on the same high-speed network fabric so NIXL KV transfers stay low-latency. Spheron bare-metal instances get dedicated public IPs with no NAT overhead, which matters for NIXL's KV transfer performance. See the Spheron instance types guide for bare-metal vs VM options, and Spheron GPU pricing for current rates.

Step 2: Install Dynamo and vLLM

bash

# Pull the official Dynamo container (NVIDIA NGC)
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0

# Or install via pip (requires CUDA 12.9+)
pip install "ai-dynamo[vllm]" "vllm>=0.16.0"
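
After a pip install, a quick preflight check can confirm the packages are present and that vLLM meets the 0.16 minimum. The sketch below uses only the standard library; the distribution names `ai-dynamo` and `vllm` are taken from the pip command above, and the helper names are illustrative.

```python
from importlib import metadata
from typing import Optional


def version_tuple(version: str) -> tuple:
    """Turn '0.16.1' or '0.16.0rc1' into a comparable (major, minor, patch) tuple."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # treat '0.16' the same as '0.16.0'
        parts.append(0)
    return tuple(parts)


def check_package(dist_name: str, minimum: Optional[str] = None) -> str:
    """Report whether a distribution is installed and meets a minimum version."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return f"{dist_name}: not installed"
    if minimum and version_tuple(installed) < version_tuple(minimum):
        return f"{dist_name}: {installed} is older than required {minimum}"
    return f"{dist_name}: {installed} OK"


for name, minimum in [("ai-dynamo", None), ("vllm", "0.16.0")]:
    print(check_package(name, minimum))
```

This only inspects installed package metadata. It does not verify the CUDA toolkit or driver version, which still need to be checked on the host.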

Step 3: Configure the Topology

yaml

cluster:
  backend: vllm
  model: meta-llama/Llama-3.1-70B-Instruct
  dtype: fp8

prefill:
  workers: 2
  gpus_per_worker: 1
  kv_transfer: nixl

decode:
  workers: 4
  gpus_per_worker: 1
  max_num_seqs: 256

router:
  port: 8000
  strategy: least-kv-pressure
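
Before provisioning, it can help to sanity-check what the topology implies. The sketch below mirrors the worker counts from the config above as a plain Python dict (no YAML parsing, so nothing beyond the standard library is needed) and derives the GPU count and the decode-to-prefill worker ratio. The helper names are illustrative.

```python
# Values copied from the prefill/decode sections of the config above.
topology = {
    "prefill": {"workers": 2, "gpus_per_worker": 1},
    "decode": {"workers": 4, "gpus_per_worker": 1},
}


def gpus_needed(pool: dict) -> int:
    """GPUs consumed by one worker pool."""
    return pool["workers"] * pool["gpus_per_worker"]


def decode_per_prefill(topo: dict) -> float:
    """Number of decode workers for each prefill worker."""
    return topo["decode"]["workers"] / topo["prefill"]["workers"]


total_gpus = gpus_needed(topology["prefill"]) + gpus_needed(topology["decode"])
print(f"GPUs required: {total_gpus}")
print(f"Decode workers per prefill worker: {decode_per_prefill(topology):.1f}")
```

For this config the result is 6 GPUs at two decode workers per prefill worker.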

Step 4: Start Prefill Workers

bash

python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode prefill \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
  --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'

Step 5: Start Decode Workers

bash

python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
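
When launching many workers from a script, building the argument list in Python avoids shell-quoting mistakes in the JSON flags. The sketch below generates the same commands shown in Steps 4 and 5, using only the flags that appear there; `worker_command` is an illustrative helper, not part of Dynamo.

```python
import json
import shlex

MODEL = "meta-llama/Llama-3.1-70B-Instruct"
KV_TRANSFER = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
KV_EVENTS = {
    "publisher": "zmq",
    "topic": "kv-events",
    "endpoint": "tcp://*:20081",
    "enable_kv_cache_events": True,
}


def compact_json(value: dict) -> str:
    """Serialize without spaces so the flag value matches the commands above."""
    return json.dumps(value, separators=(",", ":"))


def worker_command(mode: str) -> list:
    """Build the argv for a prefill or decode worker."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"unknown disaggregation mode: {mode}")
    argv = [
        "python3", "-m", "dynamo.vllm",
        "--model", MODEL,
        "--disaggregation-mode", mode,
        "--kv-transfer-config", compact_json(KV_TRANSFER),
    ]
    if mode == "prefill":
        # Only the prefill command above publishes KV events.
        argv += ["--kv-events-config", compact_json(KV_EVENTS)]
    return argv


for mode in ("prefill", "decode"):
    print(shlex.join(worker_command(mode)))
```

The returned list can be passed directly to `subprocess.Popen` on each node; `shlex.join` is used here only to print a copy-pasteable command line.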

For a deep dive into NIXL's architecture and multi-node setup, see NVIDIA NIXL: KV Cache Transfers for Disaggregated Inference.

Step 6: Start the Router

bash

python3 -m dynamo.frontend --port 8000

# Test the endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct",
       "prompt": "Explain disaggregated inference in one sentence.",
       "max_tokens": 100}'

Benchmarks: Disaggregated vs Monolithic Inference

The numbers below use Llama 3.1 70B FP8 on H100 SXM5 bare-metal on Spheron, consistent with the methodology from the vLLM vs TensorRT-LLM vs SGLang benchmarks post.

Configuration	Concurrency	Throughput (tok/s)	TTFT (ms)	Cost/1M tokens
vLLM monolithic (1x H100)	32	~820	~320	~$2.80
Dynamo disaggregated (2+2)	32	~3,200	~85	~$1.90
Dynamo disaggregated (4+8)	128	~5,200	~90	~$1.60

TTFT drops sharply in the disaggregated setup because prefill nodes are never interrupted by decode work. Once prefill completes and the KV cache transfers to a decode worker, the decode worker starts generating tokens immediately without any queued prefill work blocking it.

Methodology note: These are representative estimates based on NVIDIA-published Dynamo benchmarks and vLLM baseline numbers, not direct Spheron-measured results. Always run your own benchmarks with your specific model and request distribution before committing to a topology.
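
When you do run your own benchmark, the two headline numbers in the table can be computed from per-request timestamps. The sketch below assumes a load-test client that records when each request was sent, when its first token arrived, and when it finished; the `RequestTiming` structure and the sample values are made up for illustration.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps in seconds recorded by a hypothetical load-test client."""
    sent_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def summarize(timings: list) -> dict:
    """Compute TTFT statistics and aggregate output throughput."""
    ttfts_ms = sorted((t.first_token_at - t.sent_at) * 1000 for t in timings)
    wall_clock = max(t.finished_at for t in timings) - min(t.sent_at for t in timings)
    total_tokens = sum(t.output_tokens for t in timings)
    return {
        "ttft_median_ms": statistics.median(ttfts_ms),
        "ttft_max_ms": ttfts_ms[-1],
        "throughput_tok_s": total_tokens / wall_clock,
    }


# Made-up timings for three overlapping requests.
sample = [
    RequestTiming(0.00, 0.08, 2.00, 100),
    RequestTiming(0.10, 0.19, 2.10, 100),
    RequestTiming(0.20, 0.30, 2.50, 100),
]
print(summarize(sample))
```

Throughput is measured over the wall-clock window of the whole run rather than per request, so it reflects concurrency the same way the table does. When comparing topologies, also divide by GPU count, since the rows above use different numbers of GPUs.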

Dynamo vs Manual Disaggregation with SGLang and TensorRT-LLM

SGLang RadixAttention vs Dynamo KV Routing

SGLang's RadixAttention handles repeated prefixes on a single node: when multiple requests share a common prefix (system prompt, RAG context, few-shot examples), SGLang computes attention once and caches it. Dynamo routes across nodes but does not focus on prefix sharing.

Dynamo wins at scale with multi-node deployments. SGLang wins for single-node setups where shared prefixes are common. They solve adjacent problems. See the SGLang and vLLM comparison for single-node benchmarks.

For a framework-agnostic setup guide covering vLLM and SGLang disaggregation without Dynamo, see Prefill-Decode Disaggregation on GPU Cloud.

TensorRT-LLM Disaggregation vs Dynamo

TensorRT-LLM's Executor API has a disaggregated mode, but it requires ahead-of-time engine compilation. The compile step takes 20-30 minutes for a 70B model on H100 and must be repeated when you update the model or change batch size configurations.

This guide runs Dynamo with vLLM as the backend, which loads models directly without compilation. For teams that update models frequently or want faster iteration, Dynamo's flexibility matters more than TRT-LLM's maximum throughput ceiling.

Approach	Best For	Limitation
Dynamo + vLLM	Multi-node disaggregated serving	Requires NIXL-capable interconnect
SGLang (single-node)	Shared-prefix workloads, simplicity	No cross-node KV routing
TRT-LLM disaggregated	Maximum throughput, fixed workload	Slow compilation, less flexible
Manual routing (custom)	Specialized architectures	High maintenance cost

Cost Analysis: Disaggregated Serving vs Monolithic

Mixing H100 SXM5 for prefill with A100 80G for decode cuts hourly cost compared to all-H100 monolithic setups, while matching or exceeding throughput.

Setup	GPUs	Hourly Cost	Cost/1M tokens (est.)
Monolithic vLLM (4x H100 SXM5)	4x H100 SXM5 @ $2.40/hr	$9.60/hr	~$1.80
Dynamo: 2 prefill H100 + 4 decode A100	2x H100 + 4x A100 80G	$9.00/hr	~$1.10
Dynamo: 2 prefill H100 + 4 decode H100 PCIe	2x H100 SXM5 + 4x H100 PCIe	$12.84/hr	~$0.85

Pricing fluctuates based on GPU availability. The prices above are as of 25 Mar 2026 and may have changed. Check current GPU pricing for live rates.

Actual savings depend on your prompt-to-output ratio. Long prompts relative to output favor a larger prefill pool. Short prompts with long outputs favor more decode capacity.
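
The hourly figures in the table are simple sums of the per-GPU rates, and cost per million tokens follows from hourly cost and sustained throughput. The sketch below reproduces the hourly column from the listed rates and gives a helper for the second calculation; the 1,000 tok/s input at the end is a placeholder to show the formula, not a measurement.

```python
# Per-GPU hourly rates as listed in this article (25 Mar 2026).
HOURLY_RATE = {"H100 SXM5": 2.40, "H100 PCIe": 2.01, "A100 80G": 1.05}


def cluster_hourly_cost(gpus: dict) -> float:
    """Sum of rate x count for every GPU type in the cluster."""
    return round(sum(HOURLY_RATE[name] * count for name, count in gpus.items()), 2)


def cost_per_million_tokens(hourly_cost: float, tokens_per_second: float) -> float:
    """Dollars per 1M generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


monolithic = cluster_hourly_cost({"H100 SXM5": 4})
mixed = cluster_hourly_cost({"H100 SXM5": 2, "A100 80G": 4})
all_h100 = cluster_hourly_cost({"H100 SXM5": 2, "H100 PCIe": 4})
print(monolithic, mixed, all_h100)  # 9.6 9.0 12.84

# Placeholder throughput: substitute the figure from your own benchmark.
print(f"${cost_per_million_tokens(mixed, 1000):.2f} per 1M tokens")  # $2.50 per 1M tokens
```

Plug in throughput measured on your own workload; the per-token estimates in the table depend on throughput and utilization assumptions that will differ from yours.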

How Dynamo Cuts GPU Spend

In a monolithic setup, prefill and decode compete for the same GPU. Prefill work is bursty and compute-intensive, which means GPU utilization is uneven. With Dynamo, decode workers spend nearly all their time on token generation with minimal idle time waiting for CPU scheduling or competing with prefill.

This means you can run decode workers at higher average GPU utilization than monolithic setups. Budget decode workers (A100 at $1.05/hr) handle high-volume token generation; expensive prefill slots process prompts faster and free up sooner.

For comparison: H100 instances on AWS and Google Cloud carry a significant markup over bare-metal rates. A 6-GPU Dynamo cluster on Spheron (2x H100 SXM5 + 4x A100) costs $9.00/hr. Running the same GPU count on the major cloud providers typically costs several times more, and that gap compounds across the lifetime of a production deployment.

Production Checklist

Network topology - Ensure prefill and decode workers are on the same fabric. NVLink for single-node, InfiniBand or 400GbE for multi-node. NIXL falls back to TCP but with a significant latency penalty. See Spheron network configuration for port and firewall setup on bare-metal instances.
KV cache sizing - Each prefill worker transfers the full KV cache for every request. Size decode worker VRAM so that the KV cache for max_num_seqs concurrent sequences at your expected sequence length fits in the HBM left after model weights. Per token, the cache takes 2 x layers x KV heads x head dimension x bytes per element (set by kv_cache_dtype).
Prefill-to-decode ratio - Start with 1:2 (1 prefill per 2 decode workers) and profile. Heavy prompt workloads may need 1:1.
Autoscaling - Dynamo's router exposes queue depth metrics per worker type. Set up autoscaling rules separately for prefill and decode pools.
Monitoring - See the GPU monitoring guide for Prometheus and Grafana setup, including Dynamo-specific metrics.
Model warmup - Run 10-20 warmup requests before putting the router behind a load balancer. Cold KV transfers inflate TTFT until caches are warm.
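
The KV cache sizing item can be checked with a few lines of arithmetic. The sketch below assumes the Llama 3.1 70B attention shape (80 layers, 8 KV heads, head dimension 128) with an 8-bit KV cache, and a hypothetical 8 GiB of HBM left for KV after weights and runtime overhead; substitute the figures for your model and GPU.

```python
def kv_cache_bytes(num_seqs: int, seq_len: int, num_layers: int,
                   num_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """KV cache footprint for num_seqs sequences of seq_len tokens each."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_seqs * seq_len * per_token


def max_sequences(kv_budget_bytes: int, seq_len: int, num_layers: int,
                  num_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """How many full-length sequences fit in the KV budget."""
    per_seq = kv_cache_bytes(1, seq_len, num_layers, num_kv_heads,
                             head_dim, bytes_per_elem)
    return kv_budget_bytes // per_seq


# Assumed shape: Llama 3.1 70B, 8-bit KV cache (1 byte per element).
SHAPE = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=1)
KV_BUDGET = 8 * 2**30  # hypothetical HBM left for KV cache

per_seq = kv_cache_bytes(1, 8192, **SHAPE)
fits = max_sequences(KV_BUDGET, 8192, **SHAPE)
print(f"KV per 8K-token sequence: {per_seq / 2**30:.2f} GiB")
print(f"Full-length sequences that fit: {fits}")
```

Because vLLM allocates KV cache in pages as sequences grow, max_num_seqs acts as an upper cap rather than a reservation, so shorter sequences let more of them run concurrently than this worst-case count suggests.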

Disaggregated inference is the most significant architectural shift in LLM serving in 2026. Dynamo makes it accessible without building custom routing infrastructure. NIM provides a simpler single-container approach when disaggregated prefill/decode isn't required; see the full NIM deployment guide. For further cost reduction strategies, see the GPU cost optimization playbook.

Spheron provides bare-metal H100 and A100 GPU instances with the high-speed interconnects Dynamo's NIXL protocol requires. No Kubernetes overhead, no shared-tenancy latency. Spin up prefill and decode nodes separately and pay only for what you use.

Quick Setup Guide

Provision GPU instances on Spheron - Rent at least two H100 SXM5 or H100 PCIe instances on Spheron: one designated as a prefill worker and one as a decode worker. For production, use 2-4 prefill nodes and 4-8 decode nodes depending on your prompt-to-output length ratio.
Install NVIDIA Dynamo and vLLM - Pull the official Dynamo container image from NVIDIA NGC or install via pip. Ensure vLLM 0.16+ is installed as the backend engine. Verify CUDA 12.9+ (driver 575.51.03+) and NVLink or a high-bandwidth interconnect between nodes.
Configure the disaggregated serving topology - Set the DYNAMO_PREFILL_WORKERS and DYNAMO_DECODE_WORKERS environment variables or use the dynamo.yaml config file to define worker counts, GPU assignments, and the NIXL KV cache transfer path (NVLink within a node, InfiniBand RDMA across nodes).
Start the prefill worker pool - Launch prefill workers using python3 -m dynamo.vllm with --disaggregation-mode prefill and the target model. Prefill workers process incoming prompts in parallel and transfer KV caches to decode workers upon completion.
Start the decode worker pool - Launch decode workers using python3 -m dynamo.vllm with --disaggregation-mode decode. These receive KV caches from prefill workers and continue autoregressive generation. Set --max-num-seqs to tune decode batch size for your latency target.
Start the Dynamo router - Launch the Dynamo router process, which accepts OpenAI-compatible API requests, routes prefill work to prefill workers, tracks KV cache locations, and routes decode continuation to the correct decode worker. Test with a curl request to the router endpoint.
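
The router test in the last step can also be issued from Python using only the standard library. The sketch below builds the same request as the curl command in Step 6; reading `choices[0]["text"]` from the response assumes the router returns the standard OpenAI completions schema, and both function names are illustrative.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI completions response shape.
    return body["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8000",
    "meta-llama/Llama-3.1-70B-Instruct",
    "Explain disaggregated inference in one sentence.",
)
print(request.get_method(), request.full_url)
```

Calling `complete(request)` sends it once the router is listening on port 8000. Looping that call 10-20 times is one simple way to perform the warmup described in the production checklist.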

Frequently Asked Questions

What is NVIDIA Dynamo used for?
NVIDIA Dynamo is an open-source distributed inference framework that disaggregates LLM serving into separate prefill and decode worker pools. It routes requests intelligently, manages KV cache transfer between nodes, and can deliver up to 7x higher throughput compared to monolithic serving.

What is disaggregated inference?
Disaggregated inference splits the two phases of LLM generation, prefill (processing the input prompt) and decode (generating each output token), onto separate GPU pools. This lets you scale each phase independently and match the hardware to the workload characteristics of each phase.

Which GPUs work best for disaggregated inference with Dynamo?
Prefill nodes are compute-bound and benefit from high-FLOPs GPUs like H100 SXM5. Decode nodes are memory-bandwidth-bound and work well with H100 PCIe or A100 80G. You can mix GPU tiers to reduce cost without sacrificing throughput.

How much faster is disaggregated inference vs monolithic?
In NVIDIA's benchmarks, Dynamo with disaggregated serving delivers up to 7x higher throughput (measured on DeepSeek R1 on Blackwell). Time to first token (TTFT) drops significantly because dedicated prefill nodes are never interrupted by decode work.

Does NVIDIA Dynamo work with vLLM?
Yes. Dynamo's primary integration is with vLLM. Dynamo sits above vLLM as an orchestration and routing layer, dispatching prefill work to prefill workers and decode continuation to decode workers, with NIXL handling KV cache transfer between them.