Tutorial

NVIDIA Dynamo 1.0: Disaggregated LLM Inference Deployment Guide (2026)

Written by Mitrasish, Co-founder · Mar 25, 2026
Tags: NVIDIA Dynamo, Disaggregated Inference, LLM Inference, vLLM, H100, GPU Cloud, Inference Optimization

NVIDIA Dynamo 1.0 went GA on March 16, 2026 at GTC. Disaggregated inference is not a future trend: NVIDIA is shipping it now with up to 7x reported throughput gains (per NVIDIA's published Dynamo benchmarks on DeepSeek R1 on Blackwell). If you need the basics of vLLM setup before adding Dynamo on top, start with vLLM Multi-GPU Production Deployment 2026.

What Is NVIDIA Dynamo

Dynamo is an open-source distributed inference framework. It sits above vLLM as an orchestration layer and routes prefill and decode work to dedicated worker pools rather than running both phases on the same GPU. The NIXL KV cache transport layer handles moving key-value tensors between prefill and decode workers over NVLink or InfiniBand.

The three core components shipped with Dynamo 1.0:

  • dynamo-router: accepts OpenAI-compatible API requests, routes them to the right workers, and tracks KV cache placement
  • dynamo-worker: wraps vLLM (or another backend) and exposes a prefill or decode role
  • NIXL: a low-latency KV cache transfer protocol optimized for NVLink, InfiniBand RDMA, and fallback TCP

For context on the inference engine landscape Dynamo sits above, see the vLLM vs TensorRT-LLM vs SGLang comparison.

Disaggregated Inference: Prefill and Decode Are Different Problems

The Two Phases of LLM Generation

Every LLM generation request has two distinct phases. Prefill processes all input tokens at once in a single forward pass. Because attention is computed over the full prompt, prefill FLOPs grow quadratically with prompt length (while KV cache memory grows only linearly), making this phase heavily compute-bound.

Decode generates one token per step, reading the KV cache for every prior token at each step. With a 4K-token context and 256 output tokens, that means 256 separate KV cache reads covering 4K+ tokens each. Decode is dominated by memory bandwidth, not compute.

In monolithic serving, both phases share the same GPU. A long prefill request blocks decode batches from continuing, which inflates latency for other users in the queue. This head-of-line blocking is what disaggregation solves.
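To see why decode is bandwidth-bound, here is a rough lower-bound sketch in Python, assuming Llama-3.1-70B-class dimensions (80 layers, 8 KV heads, head dim 128) with FP8 weights and KV cache. Real engines overlap and batch memory traffic, so treat this as a floor, not a prediction:

```python
# Back-of-envelope memory-bandwidth floor for one decode step.
# Assumed dims roughly match Llama 3.1 70B at FP8 (80 layers, 8 KV heads,
# head_dim 128, GQA). Real engines overlap reads; this is a lower bound.

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=1):
    # K and V tensors for every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def decode_step_floor_ms(context_len, batch, weight_bytes=70e9, hbm_bw=3.35e12):
    """Lower bound on one decode step: read all weights once (shared
    across the batch) plus each sequence's full KV cache."""
    bytes_read = weight_bytes + batch * context_len * kv_bytes_per_token()
    return bytes_read / hbm_bw * 1000

print(kv_bytes_per_token())                     # 163840 bytes ~ 160 KB/token
print(round(decode_step_floor_ms(4096, 1), 1))  # 21.1 ms -> under 50 tok/s at batch 1
```

At batch 1, weight reads dominate; batching amortizes them across sequences, which is why decode throughput climbs with concurrency until KV traffic or HBM capacity becomes the limit.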

Why Separating Them Improves Both

Dedicated prefill workers can process incoming prompts at full compute utilization without waiting for decode batches to drain. Dedicated decode workers maintain consistent token generation throughput without being interrupted by compute-heavy prefill spikes.

This also lets you match hardware to the bottleneck. Prefill nodes benefit from raw compute (H100 SXM5 at 3,958 TFLOPS FP8). Decode nodes benefit from memory bandwidth (A100 80G at 2 TB/s is competitive with H100 PCIe at a lower hourly rate).

Phase   | Bottleneck       | Ops per Step               | Best GPU Trait
Prefill | Compute (FLOPs)  | Full attention over prompt | High TFLOPS
Decode  | Memory bandwidth | KV cache reads per token   | High HBM BW
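The "best GPU trait" column follows from the roofline ridge point: peak FLOPS divided by peak bandwidth, i.e. the arithmetic intensity a kernel must exceed to be compute-bound. A quick sketch, using the H100 SXM5 specs from this post and assuming A100 dense FP16 tensor throughput of 312 TFLOPS:

```python
# Roofline "ridge point": FLOPs per byte a GPU must sustain to become
# compute-bound. H100 SXM5 figures are from the table above; the A100
# dense FP16 tensor figure (312 TFLOPS) is an assumption for illustration.

def ridge_point(tflops, tb_per_s):
    return (tflops * 1e12) / (tb_per_s * 1e12)   # FLOPs per byte

h100_sxm5 = ridge_point(3958, 3.35)   # ~1181 FLOPs/byte
a100_80g  = ridge_point(312, 2.0)     # 156 FLOPs/byte

# Prefill's large batched matmuls run at hundreds of FLOPs/byte, so high
# TFLOPS pays off. Decode reads weights + KV per generated token at only a
# few FLOPs/byte -- far below either ridge point -- so bandwidth decides.
print(round(h100_sxm5), round(a100_80g))
```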

GPU Hardware Requirements

Prefill Nodes

H100 SXM5 is the natural choice for prefill: 3,958 TFLOPS FP8, 3.35 TB/s HBM3. For workloads with very long prompts (128K+ tokens), the H200 SXM5 with 141 GB HBM3e gives you the capacity to hold much larger KV caches before evicting.

For a comparison of H100, H200, and other GPUs for inference workloads, see Best GPU for AI Inference in 2026.

Rent prefill nodes: H100 GPU rental | H200 GPU rental

Decode Nodes

H100 PCIe and A100 80G are both solid choices for decode workers. The A100 80G SXM4 from $1.05/hr on Spheron is the most cost-effective option for decode. Its 2 TB/s HBM2e bandwidth handles token generation well, and the lower hourly rate compounds across the many decode workers a production cluster needs.

Mixing GPU tiers is a legitimate production strategy: H100 SXM5 for prefill, A100 80G for decode.

GPU           | Role                   | Spheron Price (lowest) | Why
H100 SXM5     | Prefill                | from $2.40/hr          | 3,958 TFLOPS FP8, 3.35 TB/s HBM3
H200 SXM5     | Prefill (long context) | from $1.72/hr (spot)   | 141 GB HBM3e handles 128K+ prompts
H100 PCIe     | Decode                 | from $2.01/hr          | 2 TB/s HBM3, cost-efficient decode
A100 80G SXM4 | Decode                 | from $1.05/hr          | 2 TB/s HBM2e, lowest cost per decode

Pricing fluctuates based on GPU availability. The prices above are based on 25 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Step-by-Step: Deploy Dynamo with vLLM on Spheron

Step 1: Provision GPU Instances

Rent at least two H100 SXM5 instances for prefill and two H100 PCIe or A100 80G instances for decode on Spheron. All instances need to be on the same high-speed network fabric so NIXL KV transfers stay low-latency. Spheron bare-metal instances get dedicated public IPs with no NAT overhead, which matters for NIXL's KV transfer performance. See the Spheron instance types guide for bare-metal vs VM options, and Spheron GPU pricing for current rates.

Step 2: Install Dynamo and vLLM

```bash
# Pull the official Dynamo container (NVIDIA NGC)
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0

# Or install via pip (requires CUDA 12.9+)
pip install "ai-dynamo[vllm]" "vllm>=0.16.0"
```

Step 3: Configure the Topology

```yaml
cluster:
  backend: vllm
  model: meta-llama/Llama-3.1-70B-Instruct
  dtype: fp8

prefill:
  workers: 2
  gpus_per_worker: 1
  kv_transfer: nixl

decode:
  workers: 4
  gpus_per_worker: 1
  max_num_seqs: 256

router:
  port: 8000
  strategy: least-kv-pressure
```

Step 4: Start Prefill Workers

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode prefill \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
  --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
```

Step 5: Start Decode Workers

```bash
python3 -m dynamo.vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --disaggregation-mode decode \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```

Step 6: Start the Router

```bash
python3 -m dynamo.frontend --port 8000

# Test the endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-70B-Instruct",
       "prompt": "Explain disaggregated inference in one sentence.",
       "max_tokens": 100}'
```
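The same smoke test from Python, using only the standard library. This assumes the router follows the standard OpenAI completions response shape (`choices[0].text`), as the curl example suggests; adjust `base_url` for your deployment:

```python
# Minimal Python client for the router's OpenAI-compatible endpoint.
# Assumes the standard OpenAI completions response schema.
import json
import urllib.request

def build_payload(prompt, model="meta-llama/Llama-3.1-70B-Instruct",
                  max_tokens=100):
    # Body matches the curl example above
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(prompt, base_url="http://localhost:8000", **kw):
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt, **kw)).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("Explain disaggregated inference in one sentence."))
```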

Benchmarks: Disaggregated vs Monolithic Inference

The numbers below use Llama 3.1 70B FP8 on H100 SXM5 bare-metal on Spheron, consistent with the methodology from the vLLM vs TensorRT-LLM vs SGLang benchmarks post.

Configuration              | Concurrency | Throughput (tok/s) | TTFT (ms) | Cost/1M tokens
vLLM monolithic (1x H100)  | 32          | ~820               | ~320      | ~$2.80
Dynamo disaggregated (2+2) | 32          | ~3,200             | ~85       | ~$1.90
Dynamo disaggregated (4+8) | 128         | ~5,200             | ~90       | ~$1.60

TTFT drops sharply in the disaggregated setup because prefill nodes are never interrupted by decode work. Once prefill completes and the KV cache transfers to a decode worker, the decode worker starts generating tokens immediately without any queued prefill work blocking it.

Methodology note: These are representative estimates based on NVIDIA-published Dynamo benchmarks and vLLM baseline numbers, not direct Spheron-measured results. Always run your own benchmarks with your specific model and request distribution before committing to a topology.

Dynamo vs Manual Disaggregation with SGLang and TensorRT-LLM

SGLang RadixAttention vs Dynamo KV Routing

SGLang's RadixAttention handles repeated prefixes on a single node: when multiple requests share a common prefix (system prompt, RAG context, few-shot examples), SGLang computes attention once and caches it. Dynamo routes across nodes but does not focus on prefix sharing.

Dynamo wins at scale with multi-node deployments. SGLang wins for single-node setups where shared prefixes are common. They solve adjacent problems. See the SGLang and vLLM comparison for single-node benchmarks.

TensorRT-LLM Disaggregation vs Dynamo

TensorRT-LLM's Executor API has a disaggregated mode, but it requires ahead-of-time engine compilation. The compile step takes 20-30 minutes for a 70B model on H100 and must be repeated when you update the model or change batch size configurations.

Dynamo uses vLLM as the backend (more backends coming), which loads models directly without compilation. For teams that update models frequently or want faster iteration, Dynamo's flexibility matters more than TRT-LLM's maximum throughput ceiling.

Approach                | Best For                            | Limitation
Dynamo + vLLM           | Multi-node disaggregated serving    | Requires NIXL-capable interconnect
SGLang (single-node)    | Shared-prefix workloads, simplicity | No cross-node KV routing
TRT-LLM disaggregated   | Maximum throughput, fixed workload  | Slow compilation, less flexible
Manual routing (custom) | Specialized architectures           | High maintenance cost

Cost Analysis: Disaggregated Serving vs Monolithic

Mixing H100 SXM5 for prefill with A100 80G for decode cuts hourly cost compared to all-H100 monolithic setups, while matching or exceeding throughput.

Setup                                       | GPUs                        | Hourly Cost | Cost/1M tokens (est.)
Monolithic vLLM (4x H100 SXM5)              | 4x H100 SXM5 @ $2.40/hr     | $9.60/hr    | ~$1.80
Dynamo: 2 prefill H100 + 4 decode A100      | 2x H100 SXM5 + 4x A100 80G  | $9.00/hr    | ~$1.10
Dynamo: 2 prefill H100 + 4 decode H100 PCIe | 2x H100 SXM5 + 4x H100 PCIe | $12.84/hr   | ~$0.85

Pricing fluctuates based on GPU availability. The prices above are based on 25 Mar 2026 and may have changed. Check current GPU pricing → for live rates.

Actual savings depend on your prompt-to-output ratio. Long prompts relative to output favor a larger prefill pool. Short prompts with long outputs favor more decode capacity.
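The cost-per-token column reduces to simple arithmetic: hourly cluster cost divided by millions of tokens served per hour. A minimal sketch of that formula; the table's estimates also fold in utilization assumptions, so this is the raw calculation, not a reproduction of those figures:

```python
# Cost per 1M generated tokens from hourly cluster cost and sustained
# throughput. Utilization assumptions are deliberately left out; plug in
# your own measured tok/s.

def cost_per_million_tokens(hourly_usd, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / (tokens_per_hour / 1_000_000)

# A cluster billing $3.60/hr sustaining 1,000 tok/s serves 3.6M tokens/hr:
print(cost_per_million_tokens(3.60, 1000))   # 1.0 -> $1.00 per 1M tokens
```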

How Dynamo Cuts GPU Spend

In a monolithic setup, prefill and decode compete for the same GPU. Prefill work is bursty and compute-intensive, which means GPU utilization is uneven. With Dynamo, decode workers spend nearly all their time on token generation with minimal idle time waiting for CPU scheduling or competing with prefill.

This means you can run decode workers at higher average GPU utilization than monolithic setups. Budget decode workers (A100 at $1.05/hr) handle high-volume token generation; expensive prefill slots process prompts faster and free up sooner.

For comparison: H100 instances on AWS and Google Cloud carry a significant markup over bare-metal rates. A 6-GPU Dynamo cluster on Spheron (2x H100 SXM5 + 4x A100) costs $9.00/hr. Running the same GPU count on the major cloud providers typically costs several times more, and that gap compounds across the lifetime of a production deployment.

Production Checklist

  1. Network topology - Ensure prefill and decode workers are on the same fabric. NVLink for single-node, InfiniBand or 400GbE for multi-node. NIXL falls back to TCP but with significant latency penalty. See Spheron network configuration for port and firewall setup on bare-metal instances.
  2. KV cache sizing - Each prefill worker transfers the full KV cache for every request. Size decode worker VRAM so that max_num_seqs × max_model_len × KV-cache bytes per token (a function of layer count, KV heads, head dim, and KV dtype) fits within HBM alongside the model weights.
  3. Prefill-to-decode ratio - Start with 1:2 (1 prefill per 2 decode workers) and profile. Heavy prompt workloads may need 1:1.
  4. Autoscaling - Dynamo's router exposes queue depth metrics per worker type. Set up autoscaling rules separately for prefill and decode pools.
  5. Monitoring - See the GPU monitoring guide for Prometheus and Grafana setup, including Dynamo-specific metrics.
  6. Model warmup - Run 10-20 warmup requests before putting the router behind a load balancer. Cold KV transfers inflate TTFT until caches are warm.
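Checklist item 2 can be made numerical. A worst-case bound sketch, assuming Llama-3.1-70B-class dimensions (80 layers, 8 KV heads, head dim 128) and FP8 KV at 1 byte per value; paged allocators commit blocks only for live tokens, so real usage tracks actual sequence lengths rather than this bound:

```python
# Worst-case KV cache VRAM for a decode worker (checklist item 2).
# Dims are an assumption matching Llama 3.1 70B; adjust for your model.

def kv_cache_worst_case_gb(max_num_seqs, max_model_len,
                           layers=80, kv_heads=8, head_dim=128, dtype_bytes=1):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K + V
    return max_num_seqs * max_model_len * per_token / 1e9

# max_num_seqs=256 at a full 4K context would need ~172 GB in the worst
# case -- far more than one 80 GB card, which is why profiling real
# sequence lengths (not just the config ceiling) matters.
print(round(kv_cache_worst_case_gb(256, 4096)))
```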

Disaggregated inference is the most significant architectural shift in LLM serving in 2026. Dynamo makes it accessible without building custom routing infrastructure. For further cost reduction strategies, see the GPU cost optimization playbook.

Spheron provides bare-metal H100 and A100 GPU instances with the high-speed interconnects Dynamo's NIXL protocol requires. No Kubernetes overhead, no shared-tenancy latency. Spin up prefill and decode nodes separately and pay only for what you use.

Rent H100 for Dynamo → | View A100 pricing → | See all GPU pricing →

Deploy Dynamo on Spheron →
