CPU-Free LLM Inference on GPU Cloud: How NVIDIA BlueField-4 DPUs and SmartNICs Cut the Serving-Stack Tax in 2026


At 200 concurrent requests on a single H100, the host CPU is often 40-60% saturated just managing tokenization, scheduler decisions, and NVMe KV I/O - before a single FLOP reaches the tensor cores. The GPU sits partially idle, waiting on CPU-bound coordination. That coordination overhead is what DPUs and SmartNICs are designed to absorb.

This post covers what the CPU actually does in an LLM serving stack, why it creates a throughput ceiling, how BlueField-4 DPUs and SmartNICs offload that work, what Blink and Lynx propose for fully CPU-free serving, and what this means for cost per token on today's GPU cloud hardware.

The Serving-Stack Tax: What the Host CPU Does in Every LLM Request

Every LLM inference request passes through a chain of CPU-executed steps before and after the GPU does any computation. Most teams instrument GPU utilization closely and treat CPU as background overhead. At low concurrency, that assumption is safe. At high concurrency, it breaks down.

Here is what the host CPU handles in a standard vLLM deployment:

| Step | Who runs it | Notes |
| --- | --- | --- |
| Request ingestion and HTTP parsing | CPU (Python asyncio) | FastAPI / Uvicorn layer |
| Tokenization | CPU (Hugging Face tokenizer) | Scales with prompt length |
| Sequence scheduler | CPU (vLLM scheduler loop) | Runs per batch step |
| KV cache block allocation | CPU (PagedAttention block manager) | Bitmap-style block table |
| NVMe KV tier movement | CPU (if using LMCache) | Triggers I/O interrupts per block |
| CUDA kernel launch | CPU (CUDA driver) | Overhead grows with small kernels |
| Response detokenization | CPU (tokenizer inverse) | Runs per output token |
| HTTP response streaming | CPU (SSE or chunked transfer) | Tied to token generation rate |

The GPU handles the matrix math: Q/K/V projection, attention, feed-forward, logit computation. Everything else on that list runs on CPU threads, often in Python, often holding the GIL.

At low concurrency (under 20 requests), these CPU steps finish in milliseconds and the GPU is the bottleneck. At 100-200 concurrent sessions with 32K+ token contexts, tokenization alone ties up multiple cores, the KV block manager runs its allocation loop continuously, and NVMe I/O interrupts accumulate. The GPU's SM utilization drops even though there are requests waiting, because the CPU cannot feed the scheduler fast enough.

This is the serving-stack tax: a fraction of your GPU inference budget that goes to CPU-side coordination you did not explicitly provision for. Estimates from measuring CPU-to-GPU ratio in serving workloads put it at 15-30% of effective GPU capacity at moderate-to-high concurrency.
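One way to reason about this tax is a simple no-overlap model: if each batch step needs some GPU time plus some CPU coordination time that the GPU cannot hide, the GPU is busy for only part of the wall clock. The sketch below is a back-of-the-envelope illustration, not a measurement method. The function names and the step times (20 ms of GPU work, 1 to 8 ms of exposed CPU work) are assumptions chosen for illustration.

```python
def gpu_busy_fraction(gpu_ms: float, exposed_cpu_ms: float) -> float:
    """Fraction of wall-clock time the GPU computes during one batch step.

    Assumes the CPU coordination time is fully exposed, i.e. it is not
    overlapped with GPU work. Real stacks overlap some of it.
    """
    if gpu_ms <= 0 or exposed_cpu_ms < 0:
        raise ValueError("gpu_ms must be > 0 and exposed_cpu_ms must be >= 0")
    return gpu_ms / (gpu_ms + exposed_cpu_ms)


def serving_stack_tax(gpu_ms: float, exposed_cpu_ms: float) -> float:
    """Share of GPU capacity lost to exposed CPU-side coordination."""
    return 1.0 - gpu_busy_fraction(gpu_ms, exposed_cpu_ms)


if __name__ == "__main__":
    # Illustrative step times only: 20 ms of GPU work per batch step.
    for cpu_ms in (1.0, 4.0, 8.0):
        tax = serving_stack_tax(20.0, cpu_ms)
        print(f"exposed CPU {cpu_ms:>4.1f} ms -> tax {tax:.1%}")
```

Under these assumed numbers, 4 to 8 ms of exposed CPU time per 20 ms GPU step lands in the 15-30% range quoted above.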

The GPU goodput engineering guide documents how enterprise clusters routinely sit at 5% average GPU SM utilization. CPU-side coordination overhead is one of the structural contributors. MIG packing and prefill-decode disaggregation address parts of the problem at the software layer. DPUs address it at the hardware layer.

DPUs and SmartNICs: The Infrastructure-Offload Processor Family

DPUs and SmartNICs are purpose-built co-processors for offloading infrastructure work from the host CPU. They sit between the network fabric and the host PCIe bus, intercepting packets, managing I/O, and running lightweight services without touching the host CPU at all.

NVIDIA BlueField-4 is the most relevant DPU for LLM inference. Announced in October 2025 and shipping in 2026, it integrates ConnectX-9 SuperNICs, a 64-core Arm Neoverse V2 CPU, 128 GB LPDDR5 local memory, a 512 GB onboard SSD, and a PCIe Gen6 host interface, delivering 800 Gb/s network throughput. Hardware engines for inline crypto, packet steering, and storage I/O complete the feature set. From the host's perspective, it looks like a PCIe device that exposes a network interface and storage fabric. From the inference stack's perspective, it can absorb NVMe I/O completion, manage KV block eviction, and handle RDMA transfers without any host CPU involvement.

AMD Pensando DSP (Data-Stream Processor) is AMD's DPU line. The two main products are the Salina 400 (DPU) and the Pollara 400 (Ultra Ethernet NIC). The Salina 400 pairs 16 Arm Neoverse-N1 cores with 232 P4 MPU engines and supports inline network processing, storage offload, and telemetry. P4 here refers to the programmable packet-processing language used by its hardware engines, not a product variant. AMD acquired Pensando in 2022 and has integrated it into the AMD EPYC ecosystem. For inference specifically, its relevance is primarily in network-level offload for high-throughput serving clusters, not KV cache management.

Intel IPU (Infrastructure Processing Unit) started as the Mount Evans ASIC and has evolved into Intel's broader IPU strategy. Intel IPUs target data center storage, networking, and virtualization offload. They are less commonly deployed in GPU inference clusters than BlueField-4 but represent the same class of hardware: a programmable co-processor that offloads infrastructure tasks from host CPUs.

DPU vs SmartNIC: A SmartNIC is a NIC with onboard compute, typically Arm cores and sometimes FPGA fabric. SmartNICs can run lightweight offload tasks like packet classification, firewall rules, and connection tracking. DPUs go further: they include a full Linux-capable CPU complex, large local DRAM, and hardware engines for storage I/O, cryptography, and packet processing. A BlueField-4 is better thought of as a server-in-a-NIC than as a fancy NIC. For LLM inference offload, the distinction matters because SmartNICs can handle networking and request dispatch, while DPUs can additionally absorb NVMe KV I/O scheduling, which is the higher-throughput, higher-interrupt-frequency task.

Two pieces of research define where CPU-free LLM serving is heading: Blink, the active LLM-specific architecture, and Lynx, an older non-LLM paper that Blink builds on and explicitly distinguishes itself from. Both are prototype-level work as of mid-2026, not production software you can install from a package manager.

Blink (arXiv 2604.07609, "Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC") eliminates the host CPU from the per-request path by splitting serving-stack work across two co-processors: the GPU takes on scheduling decisions, and the SmartNIC or DPU takes on networking and tokenization.

What Blink offloads to the GPU:

  • A persistent GPU scheduler kernel runs alongside inference kernels, handling batch formation, sequence priority, and KV block allocation at GPU-thread speed rather than in a Python event loop
  • The GPU monitors request queues and triggers matrix multiply kernels directly, removing the CPU-GPU round-trip from each decode step

What Blink offloads to the SmartNIC/DPU:

  • Network reception: the SmartNIC receives incoming requests over RDMA and places them in the GPU scheduler's queue without host CPU involvement
  • Tokenization: the SmartNIC handles prompt tokenization before requests reach the GPU, removing it from the CPU's responsibility entirely
  • RDMA coordination: output tokens are written from GPU back to the SmartNIC over RDMA, which then handles the TCP response to the client

The host CPU's role in Blink reduces to service startup, authentication setup, and error handling. Per-request scheduling, tokenization, and networking run on GPU and SmartNIC hardware.
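The hand-off described above, where a SmartNIC places requests somewhere a persistent GPU scheduler can poll them, is the kind of pattern usually built on a single-producer, single-consumer ring buffer. The sketch below models such a queue in plain Python purely to illustrate the data structure. It is not Blink's implementation, and the names `RequestRing`, `push`, and `pop` are hypothetical. A real system would keep the buffer in GPU-visible memory and rely on RDMA writes and memory fences rather than Python lists.

```python
from typing import List, Optional


class RequestRing:
    """Fixed-capacity single-producer/single-consumer ring buffer.

    Only the producer (the NIC side) advances `_tail`; only the consumer
    (the scheduler side) advances `_head`. With one writer per index,
    the SPSC case needs no lock.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        # One slot is kept empty so that full and empty are distinguishable.
        self._slots: List[Optional[bytes]] = [None] * (capacity + 1)
        self._head = 0  # next slot to read
        self._tail = 0  # next slot to write

    def push(self, request: bytes) -> bool:
        """Producer side: returns False when the ring is full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # full: the producer must back off
        self._slots[self._tail] = request
        self._tail = nxt  # publish only after the payload is written
        return True

    def pop(self) -> Optional[bytes]:
        """Consumer side: returns None when the ring is empty."""
        if self._head == self._tail:
            return None
        request = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return request


if __name__ == "__main__":
    ring = RequestRing(capacity=2)
    print(ring.push(b"req-1"), ring.push(b"req-2"), ring.push(b"req-3"))
    print(ring.pop(), ring.pop(), ring.pop())
```

The returned `False` on a full ring is where a real deployment would apply flow control toward clients instead of silently dropping requests.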

The throughput gain comes from two places: eliminating CPU scheduler launch latency (measurable at short sequences and high request rates, where the per-step round-trip overhead is significant) and eliminating CPU interrupt handling for network I/O at high concurrency. The practical constraint is that the GPU-resident scheduler consumes GPU memory for state and adds kernel overhead to each SM, making Blink best suited for workloads with many short, homogeneous requests.

Lynx is a 2020 ASPLOS paper ("Lynx: A SmartNIC-driven Accelerator-centric Architecture for Network Servers") that Blink cites as a conceptual precursor. It is not an LLM system. Lynx targets stateless, one-shot inference workloads: face verification and LeNet digit classification, where the full request is processed in a single accelerator pass with no autoregressive generation loop.

In Lynx's model, a SmartNIC receives requests, dispatches them to an accelerator via RDMA, and returns results, keeping the host CPU out of the per-request path. This works for one-shot workloads because there is no session state, no KV cache to coordinate, and no multi-step generation loop.

Blink distinguishes itself from Lynx precisely because LLM inference is stateful and autoregressive. Each decode step depends on the KV cache from prior steps; requests have variable and multi-turn lifetimes; and the scheduling logic must track in-flight sequences, allocated KV blocks, and preemption decisions. None of these exist in the Lynx workload model. Blink's GPU-resident scheduler and SmartNIC tokenization pipeline were built to handle what Lynx was never designed to support.

NIXL handles the RDMA transport layer in production disaggregated inference stacks, providing the same GPU-to-GPU and GPU-to-NVMe primitives that a Blink-style deployment would use for SmartNIC-to-GPU dispatch. NIXL's five backends (InfiniBand, RoCE, UCX, NVMe-oF, S3) cover the transport options such an architecture would need.

BlueField-4 + ICMSP: Offloading KV Cache I/O at Hardware Speed

The most concrete DPU application in production inference today is KV cache NVMe tiering with BlueField-4 via NVIDIA's ICMSP architecture. The ICMSP architecture guide covers the full hardware stack in detail; here is how the DPU changes the I/O picture specifically.

Three-Tier Storage Hierarchy

ICMSP organizes KV cache across three tiers based on block access recency:

GPU HBM (hot, 3.35 TB/s)
     | cuFile / GPU-direct
CPU DRAM (warm, ~63 GB/s via PCIe 5.0)
     | DPU-managed I/O
NVMe SSD (cold, ~7 GB/s, ~100µs latency)

The bandwidth numbers define what each tier can serve. HBM at 3.35 TB/s can feed KV data to attention at full decode speed. CPU DRAM at 63 GB/s is 50x slower but still fast enough for warm blocks being promoted on multi-turn conversation resume. NVMe at 7 GB/s handles cold blocks from older context that is unlikely to be needed in the current decode step.
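To make these bandwidth figures concrete, the sketch below estimates how long it takes to move one sequence's KV cache at each tier's bandwidth. The model shape (80 layers, 8 KV heads, head dimension 128, FP16) is an assumption matching a Llama-3-70B-style configuration and is not stated in this post. The bandwidths come from the diagram above. The estimate ignores access latency, protocol overhead, and contention.

```python
# Bandwidths in GB/s, taken from the three-tier diagram.
TIER_BANDWIDTH_GB_S = {
    "GPU HBM": 3350.0,
    "CPU DRAM (PCIe 5.0)": 63.0,
    "NVMe SSD": 7.0,
}


def kv_bytes_per_token(layers: int = 80, kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token: one K and one V vector per layer per KV head.

    Defaults are an assumed Llama-3-70B-style shape in FP16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def transfer_ms(num_bytes: int, gb_per_s: float) -> float:
    """Pure bandwidth-bound transfer time in milliseconds."""
    return num_bytes / (gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    context_tokens = 8192
    total = context_tokens * kv_bytes_per_token()
    print(f"KV for {context_tokens} tokens: {total / 2**30:.2f} GiB")
    for tier, bandwidth in TIER_BANDWIDTH_GB_S.items():
        print(f"{tier:<22} {transfer_ms(total, bandwidth):8.1f} ms")
```

Under these assumptions an 8K-token sequence carries about 2.5 GiB of KV data, which takes roughly 0.8 ms at HBM bandwidth, 43 ms over PCIe 5.0, and 383 ms from NVMe. This is why cold blocks need to be promoted ahead of the decode step that uses them.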

What BlueField-4 Adds vs Software-Only LMCache

Software-only tiering with LMCache implements the same three-tier hierarchy but manages it on the host CPU. Every NVMe read or write completion triggers a CPU interrupt. At high concurrency (50+ long-context requests per GPU), these interrupts accumulate. The CPU interrupt handler runs, updates the block table, and potentially triggers the next tier-movement decision. Under sustained load, interrupt latency adds tail latency spikes to requests that happen to need a cold block promoted from NVMe.

BlueField-4 absorbs this entirely:

  • NVMe I/O completions go to the DPU's interrupt handler, not the host CPU's
  • Eviction decisions (which blocks to demote from DRAM to NVMe) run on DPU ARM cores
  • The host CPU never sees NVMe traffic during inference
  • Block table updates happen on the DPU, with the host CPU's block manager receiving periodic sync updates rather than per-I/O notifications

The practical effect is smoother tail latency at high concurrency. p50 TTFT stays similar whether you use LMCache or ICMSP. p95 and p99 TTFT improve with ICMSP because the tail latency spikes from interrupt accumulation at high concurrency are eliminated.
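The eviction decision itself can be pictured as a recency-ordered table. The sketch below is a minimal LRU model of a warm tier (DRAM) that demotes its least recently used blocks to a cold tier (NVMe). It assumes a pure LRU policy: this post says only that ICMSP tiers blocks by access recency, so the actual policy may differ. The class and method names are hypothetical, and a dictionary stands in for the NVMe device.

```python
from collections import OrderedDict
from typing import Dict, List


class WarmTier:
    """Toy DRAM tier that demotes least-recently-used blocks to a cold tier."""

    def __init__(self, capacity_blocks: int) -> None:
        if capacity_blocks < 1:
            raise ValueError("capacity_blocks must be >= 1")
        self.capacity = capacity_blocks
        self.warm: "OrderedDict[str, bytes]" = OrderedDict()  # oldest first
        self.cold: Dict[str, bytes] = {}  # stand-in for NVMe

    def put(self, block_id: str, data: bytes) -> List[str]:
        """Insert or refresh a block; return the IDs demoted to the cold tier."""
        self.warm[block_id] = data
        self.warm.move_to_end(block_id)  # mark as most recently used
        demoted: List[str] = []
        while len(self.warm) > self.capacity:
            victim, payload = self.warm.popitem(last=False)  # oldest entry
            self.cold[victim] = payload
            demoted.append(victim)
        return demoted

    def get(self, block_id: str) -> bytes:
        """Read a block, promoting it from the cold tier when needed."""
        if block_id in self.warm:
            self.warm.move_to_end(block_id)
            return self.warm[block_id]
        data = self.cold.pop(block_id)  # raises KeyError for unknown blocks
        self.put(block_id, data)  # promotion may demote another block
        return data
```

In the software-only case this loop runs on host cores in response to I/O completions. The point of the DPU path described above is that the same bookkeeping moves to the DPU's cores.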

For H200 instances on Spheron, the 141 GB HBM3e capacity means more hot blocks stay on-GPU, which reduces the frequency of DRAM and NVMe fetches. This makes ICMSP's DPU offload most valuable on H100 nodes (80 GB HBM) handling long-context workloads where NVMe fetch frequency is higher.

For more on the full ICMSP architecture, the cuFile GPU-direct NVMe path, and the vLLM setup, see the ICMSP architecture guide. For KV memory math and PagedAttention background, see the KV cache optimization guide.

Cost-Per-Token and Goodput Impact: A Rough Model

Using live on-demand pricing as of 29 Jun 2026 (H100 SXM5: $4.06/hr, H200 SXM5: $3.70/hr, A100 80G SXM4: $1.69/hr), here is a rough model for how the serving-stack tax and DPU offload affect cost per million tokens.

The scenario is Llama 3 70B at batch size 16, 8K average context, with 60% prefix reuse (a realistic multi-turn chatbot workload):

| Scenario | Approx TTFT (p50) | Approx throughput | Approx $/M tokens (H100 SXM5) |
| --- | --- | --- | --- |
| No offload (monolithic vLLM) | 800 ms | ~900 tokens/sec | ~$1.25 |
| Software LMCache tiering | 650 ms | ~1,050 tokens/sec | ~$1.07 |
| DPU-ICMSP (projected from NVIDIA claims) | 580 ms | ~1,200 tokens/sec | ~$0.94 |

The DPU-ICMSP row is modeled from NVIDIA's ICMSP performance claims (5x improvement for long-context prefix-reuse workloads) scaled to a moderate prefix-reuse scenario, not measured on BlueField-4 hardware. Treat it as a directional estimate, not a benchmark.
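The $/M tokens column follows directly from the hourly price and the throughput. The sketch below reproduces that arithmetic, assuming a single GPU billed for the full hour at the quoted H100 SXM5 rate and running at the stated throughput the whole time. The function name is hypothetical.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per one million generated tokens for one fully utilized GPU."""
    if price_per_hour < 0 or tokens_per_sec <= 0:
        raise ValueError("price must be >= 0 and throughput must be > 0")
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    h100_price = 4.06  # $/hr, the on-demand rate quoted in this post
    scenarios = {
        "No offload (monolithic vLLM)": 900,
        "Software LMCache tiering": 1050,
        "DPU-ICMSP (projected)": 1200,
    }
    for name, tokens_per_sec in scenarios.items():
        cost = cost_per_million_tokens(h100_price, tokens_per_sec)
        print(f"{name:<30} ${cost:.2f} per M tokens")
```

Because cost scales inversely with throughput, the projected move from 900 to 1,200 tokens/sec is what produces the roughly 25% drop in cost per million tokens.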

For measured cost-per-token data across models and batch sizes, see the GPU cost-per-token benchmarks, which cover H100, H200, L40S, and A100 across Llama 7B to 70B workloads.

For inference at scale, H100 SXM5 instances on Spheron provide the baseline. The incremental improvement from software LMCache tiering is available today. DPU-ICMSP adds the next layer when BlueField-4 hardware ships.

Pricing fluctuates based on GPU availability. The prices above are as of 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Practical Guidance: Is DPU-Offloaded Serving Worth It Today?

The honest answer depends on your workload type and timeline.

| Workload type | Recommendation |
| --- | --- |
| Short-context chat, batch size < 32 | Software-only serving is fine. CPU overhead is minimal relative to GPU compute time. |
| Long-context (32K+), high concurrency (50+) | Enable LMCache NVMe tiering now. ICMSP DPU acceleration adds tail-latency improvement once hardware ships. |
| Multi-turn, 60%+ prefix reuse, latency-sensitive | Highest value from ICMSP. Software LMCache gives measurable improvement today. |
| Batch inference, offline | CPU overhead is less critical. Throughput-maximize with prefill-decode disaggregation first. |
| RL rollout workers | CPU is often the bottleneck for environment steps, not GPU inference. DPU does not help here. |

What to do now: The software foundation for ICMSP-compatible tiering is available on any Spheron bare-metal node today. Install LMCache with a local NVMe backend, enable vLLM prefix caching, and measure your KV hit rate at your target concurrency level. This establishes the baseline that BlueField-4 DPU nodes will hardware-accelerate when they ship in H2 2026.
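When sizing the LMCache local-disk backend, the setup guide in this post suggests capping it at 80% of NVMe capacity via `max_local_disk_size`. The sketch below derives that number from the mounted filesystem. The helper name and the `/mnt/nvme` path are hypothetical, and the assumption that the setting is expressed in GB should be checked against the LMCache documentation for your version.

```python
import shutil


def local_disk_budget_gb(mount_point: str, fraction: float = 0.8) -> float:
    """Return `fraction` of the filesystem's total capacity, in decimal GB.

    The 0.8 default mirrors the 80% rule of thumb from this post.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total_bytes = shutil.disk_usage(mount_point).total
    return round(total_bytes * fraction / 1e9, 1)


if __name__ == "__main__":
    # "." is used so the example runs anywhere; on a real node you would
    # point this at the NVMe mount, e.g. a hypothetical "/mnt/nvme".
    budget = local_disk_budget_gb(".")
    print(f"max_local_disk_size: {budget}")
```

The calculation uses total capacity rather than free space, so it overstates the budget if the device already holds other data.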

Prefill-decode disaggregation is currently the highest-leverage optimization for TTFT-sensitive workloads. It addresses a different bottleneck (prefill blocking decode) but is available today and has measured production benchmarks behind it. Do that first if you have not already.

For provisioning Spheron bare-metal nodes and verifying NVMe device availability, see Spheron's deployment documentation. Bare-metal H100, H200, and B200 instances include NVMe at no extra cost and arrive with InfiniBand drivers pre-installed.

Where This Fits in the Inference Stack in 2026

The inference engineering discipline has split into layers over the past two years. At the application layer: prompt engineering, sampling strategies, speculative decoding. At the framework layer: chunked prefill, prefix caching, continuous batching, disaggregated prefill/decode. At the hardware layer: GPU selection, memory bandwidth, NVMe tiering.

DPU offload sits at the hardware layer but operates on the serving-stack coordination overhead that frameworks currently absorb in software. It is not a replacement for the framework-layer optimizations; it is the next tier of hardware acceleration once framework-layer gains have been captured.

The AI memory wall and inference latency guide frames the core problem: adding GPUs does not reduce per-token latency when the bottleneck is memory bandwidth or serving-stack coordination overhead. DPUs and SmartNICs address the coordination overhead component of that equation, the part that neither more GPUs nor better memory bandwidth can fix, because it is CPU-side code running outside the GPU compute path.

The near-term trajectory looks like this: software tiering (LMCache, vLLM prefix caching) captures the first 10-15% of improvement available by reducing GPU idle time from CPU scheduling overhead. BlueField-4 DPU-equipped nodes capture the next layer when hardware ships in H2 2026, specifically for high-concurrency, long-context workloads where NVMe I/O interrupt accumulation becomes measurable. Blink-style CPU-free serving (GPU-resident scheduling combined with SmartNIC offload for tokenization and networking) is a research target for 2027 and beyond, pending production implementations.

For teams running GPU inference today, the software path is the actionable one. Measure your CPU serving-stack tax (run htop alongside nvidia-smi dmon during load), enable LMCache on NVMe if you are long-context and prefix-reuse-heavy, and watch the BlueField-4 hardware availability announcements for when to add the DPU acceleration layer.
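To turn an `nvidia-smi dmon` capture into a single number per GPU, a small parser is enough. The sketch below averages the `sm` column from the text output and locates columns by the header line rather than by position, since the column set varies between driver versions. The sample text in the usage block is illustrative, not captured from real hardware, and the function name is hypothetical.

```python
from statistics import mean
from typing import Dict, List


def mean_sm_utilization(dmon_text: str) -> Dict[int, float]:
    """Average the `sm` column of `nvidia-smi dmon` text output per GPU index."""
    columns: List[str] = []
    samples: Dict[int, List[float]] = {}
    for raw in dmon_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # The header row names the columns; the units row does not.
            if "gpu" in fields and "sm" in fields:
                columns = fields
            continue
        values = line.split()
        if not columns or len(values) != len(columns):
            continue
        row = dict(zip(columns, values))
        try:
            gpu, sm = int(row["gpu"]), float(row["sm"])
        except ValueError:
            continue  # skip placeholders such as "-"
        samples.setdefault(gpu, []).append(sm)
    return {gpu: mean(vals) for gpu, vals in samples.items()}


if __name__ == "__main__":
    sample = """# gpu     sm    mem
# Idx      %      %
    0     80     40
    0     60     30
"""
    print(mean_sm_utilization(sample))
```

A capture can be produced with something like `nvidia-smi dmon -s u -c 30 > dmon.txt` while the load test is running, then fed to the function as a string.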


DPU-offloaded serving is the hardware-layer answer to the CPU tax that limits GPU inference goodput. The software foundations - bare-metal instances with local NVMe, InfiniBand connectivity, and vLLM prefix caching - are available today. Spheron bare-metal H100, H200, and B200 nodes include NVMe at no extra cost, so you can validate the LMCache tiering layer now and add BlueField-4 DPU acceleration when the hardware ships.



Quick Setup Guide

  1. Measure your current CPU serving-stack tax

    On any Spheron GPU instance, run your inference workload under load and collect: GPU SM utilization (nvidia-smi dmon -s u), CPU utilization per core (htop), and time-to-first-token p50/p95 from your serving framework's /metrics endpoint. The gap between GPU SM utilization and 100% during active inference is largely attributable to CPU-side coordination overhead - scheduling, tokenization, and I/O interrupts.

  2. Reduce CPU tokenization overhead with async preprocessing

    Pre-tokenize requests in a dedicated async worker pool before they enter the GPU scheduler. In vLLM, set --disable-log-requests and pre-warm the tokenizer with a background thread pool sized to your vCPU count. This decouples tokenization latency from the GPU scheduling path and is the software-only first step toward the CPU-free architecture DPUs enable at hardware level.

  3. Enable GPU-direct NVMe KV tiering (ICMSP-compatible software layer)

    Install LMCache on your Spheron bare-metal node and configure a local-disk backend pointing to the NVMe mount. Set max_local_disk_size to 80% of NVMe capacity. This implements the same three-tier KV tiering that ICMSP hardware-accelerates with BlueField-4, and is available today on H100/H200/B200 Spheron nodes. Monitor cache hit rate via the /metrics endpoint.

  4. Verify the RDMA transport on InfiniBand-connected instances

    On Spheron InfiniBand-connected instances, run ibv_devinfo to confirm RDMA devices are visible. Verify that ibstat shows active ports, and run ib_read_bw to measure available bandwidth. This is the RDMA transport layer a Blink-style CPU-free serving architecture uses: the SmartNIC handles network reception and tokenization over RDMA, then coordinates GPU dispatch without any host-CPU involvement in the per-request path.
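The pre-tokenization step in this guide (step 2) can be sketched as a worker pool behind an asyncio front end. The example below is a minimal illustration, not vLLM's internal design: `PreTokenizer` and `toy_encode` are hypothetical names, and the tokenizer is a stand-in word hasher so the code runs without downloading a model. In practice you would pass a real tokenizer's encode function. Threads only help when that call releases the GIL; otherwise a process pool is the usual substitute.

```python
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer: maps each whitespace-separated word to an integer."""
    return [sum(word.encode()) % 50_000 for word in text.split()]


class PreTokenizer:
    """Runs tokenization in a worker pool so the event loop stays responsive."""

    def __init__(self, encode: Callable[[str], List[int]],
                 workers: Optional[int] = None) -> None:
        self._encode = encode
        self._pool = ThreadPoolExecutor(max_workers=workers or os.cpu_count() or 1)

    async def encode(self, prompt: str) -> List[int]:
        loop = asyncio.get_running_loop()
        # Hand the blocking call to the pool and await its result.
        return await loop.run_in_executor(self._pool, self._encode, prompt)

    async def encode_many(self, prompts: Sequence[str]) -> List[List[int]]:
        # gather preserves input order, so results line up with prompts.
        return await asyncio.gather(*(self.encode(p) for p in prompts))

    def close(self) -> None:
        self._pool.shutdown(wait=True)


async def main() -> None:
    pre = PreTokenizer(toy_encode, workers=4)
    try:
        ids = await pre.encode_many(["hello world", "cpu free inference"])
        print([len(x) for x in ids])
    finally:
        pre.close()


if __name__ == "__main__":
    asyncio.run(main())
```

The design choice here is to keep the request handler as a coroutine and isolate the blocking call behind `run_in_executor`, so slow prompts do not stall unrelated requests on the same event loop.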


Frequently Asked Questions

What is CPU-free LLM inference?

CPU-free LLM inference offloads the host-CPU work in a serving stack - tokenization, request scheduling, KV cache routing, and networking - to a DPU or SmartNIC co-processor. The GPU is freed from waiting on CPU-bound coordination tasks, improving goodput and reducing cost per token on long-running clusters.

What is the difference between a DPU and a SmartNIC?

A DPU (Data Processing Unit) is an SoC that combines a multi-core CPU, networking silicon, and programmable hardware engines on a single chip. NVIDIA BlueField-4, AMD Pensando, and Intel IPU are the main DPU families. A SmartNIC is a NIC with onboard compute - typically Arm cores and sometimes FPGA fabric - that can run lightweight offload tasks. DPUs are more capable (full Linux-capable CPU complex); SmartNICs are lower-power and narrower in scope. Both can offload networking and scheduling work from the host CPU.

What are Blink and Lynx?

Blink (arXiv 2604.07609) is a CPU-free LLM inference architecture that delegates the serving stack to both the GPU and a SmartNIC or DPU: a persistent GPU kernel handles scheduling and batch decisions while the SmartNIC handles network reception, tokenization, and RDMA coordination, cutting the host CPU out of the per-request path entirely. Lynx is a separate 2020 ASPLOS paper ('Lynx: A SmartNIC-driven Accelerator-centric Architecture for Network Servers') that is not an LLM system - it targets stateless, one-shot workloads like face verification and LeNet inference. Blink cites Lynx as a conceptual precursor and explicitly distinguishes itself by adding support for autoregressive, stateful LLM generation.

How does NVIDIA ICMSP use the BlueField-4 DPU?

NVIDIA ICMSP (Inference Context Memory Storage Platform) uses the BlueField-4 DPU as an I/O controller for three-tier KV cache storage. Instead of triggering CPU interrupts for every NVMe block read or write, the DPU absorbs I/O completions and tier-movement decisions off the host CPU entirely. This eliminates the tail latency spikes that software-only tiering systems like LMCache experience at high concurrency.

When will BlueField-4 nodes be available, and what can I do today?

BlueField-4 DPU-equipped nodes are on NVIDIA's H2 2026 hardware roadmap. Today, Spheron bare-metal H100, H200, and B200 instances include local NVMe SSDs at no extra cost, which means the software tiering pattern that ICMSP will hardware-accelerate can be validated right now using LMCache or vLLM prefix caching. The DPU acceleration layer is the next step.
