A voice agent that feels real-time has less than 200ms between when the user stops talking and when they hear the first word back. Most guides stop at the model layer: pick a fast ASR, run a 7B LLM, stream TTS output. But the network layer is equally load-bearing. If you're still routing voice over HTTP/SSE, you're leaving 50-100ms on the table and losing barge-in support entirely.
This post covers the network plumbing that sits between your GPU inference stack and the end user: WebRTC transport, SFU selection, token streaming over data channels, jitter buffer configuration, and barge-in handling. For the model-layer side (VRAM sizing, ASR/LLM/TTS GPU allocation, and latency budgets per stage), see the Voice AI GPU Infrastructure guide. For the TTS model specifically, NeuTTS Air covers the 320x real-time synthesis option built on Spheron.
Why HTTP Streaming Fails for Real-Time Voice Agents
HTTP/SSE works fine for chat UIs. For voice, it breaks in three specific ways.
Head-of-line blocking. TCP retransmits stalled packets before delivering later ones. On a lossy mobile connection, one dropped packet causes a 20-80ms stall in your token stream. In a chat UI, that's a brief freeze. In a voice agent, it causes audible clipping or a dead silence gap that sounds like a dropped call.
One-way transport. SSE is server-to-client only. When a user starts speaking while the agent is talking (barge-in), your application needs a separate WebSocket or HTTP POST to send the interrupt signal back. That's a second connection, a second TLS handshake on first use, and additional latency on each barge-in event.
Per-request connection overhead. Each SSE stream starts with an HTTP request and TLS negotiation. For a voice session with frequent turn-taking, the cumulative handshake cost adds up. A WebRTC session pays the ICE + DTLS setup cost once, then handles all bidirectional traffic over a single multiplexed UDP connection.
| Property | HTTP/SSE | WebRTC |
|---|---|---|
| Direction | One-way (server to client) | Full-duplex |
| Transport | TCP | UDP + DTLS |
| Latency jitter | 5-50ms (TCP retransmit) | 2-15ms (UDP) |
| Barge-in support | Manual reconnect | Built-in via data channel signal |
| Packet loss handling | Retransmit (blocks stream) | Concealment (Opus PLC) |
| Connection setup | Per-request TLS | Once per session (ICE + DTLS) |
For sub-200ms end-to-end, WebRTC is not optional.
WebRTC Architecture: SFU, Data Channels, and the GPU Inference Path
The full-stack architecture for a WebRTC voice agent looks like this:
User Mic
  -> WebRTC (Opus audio)
  -> SFU (LiveKit or MediaSoup)
  -> VAD (Silero or WebRTC built-in)
  -> ASR (FasterWhisper)
  -> LLM (vLLM)
  -> TTS (NeuTTS Air or Kokoro)
  -> WebRTC data channel
  -> SFU
  -> User Speaker

Data channels carry three types of traffic:
- LLM token stream (server to client, unreliable mode for lower latency)
- Interrupt/barge-in signals (client to server, reliable mode)
- Session metadata (turn state, VAD confidence, timestamps)
Audio (mic and speaker) travels over the WebRTC media track. The data channel is a separate logical stream multiplexed on the same UDP port.
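For illustration, here is how those three traffic types might map onto LiveKit data-channel publishes. A minimal sketch assuming a connected rtc.Room; the topic names ("llm_tokens", "control", "meta") are chosen for this example, not a LiveKit convention:

import json

async def send_token(room, delta: str):
    # Token stream: unreliable mode, occasional drops tolerated for lower latency
    await room.local_participant.publish_data(
        delta.encode(), reliable=False, topic="llm_tokens"
    )

async def send_interrupt(room):
    # Barge-in signal: must arrive, so use reliable mode
    await room.local_participant.publish_data(
        b'{"type": "interrupt"}', reliable=True, topic="control"
    )

async def send_metadata(room, state: dict):
    # Turn state, VAD confidence, timestamps
    await room.local_participant.publish_data(
        json.dumps(state).encode(), reliable=True, topic="meta"
    )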
SFU Options: LiveKit vs MediaSoup
The SFU (Selective Forwarding Unit) is the server that manages WebRTC sessions, routes audio between participants, and bridges between your GPU inference backend and client connections.
| Feature | LiveKit | MediaSoup |
|---|---|---|
| Language | Go + TypeScript SDK | Node.js |
| Agents SDK | Yes (Python, Node.js) | Manual integration |
| Cloud option | LiveKit Cloud | Self-hosted only |
| TURN built-in | Yes (livekit-server) | Requires coturn |
| Voice agent support | First-class (Agents framework) | DIY |
| Scaling | Built-in room distribution | Manual horizontal scaling |
LiveKit is the better default for most teams. The Agents framework handles the plumbing between WebRTC sessions and Python inference code, built-in TURN removes a dependency to manage, and the Python SDK integrates directly with vLLM and Pipecat.
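A minimal LiveKit Agents worker skeleton looks like the following. Sketch only, assuming livekit-agents is installed and the LIVEKIT_URL / API key environment variables are set:

from livekit import agents
from livekit.agents import JobContext, WorkerOptions

async def entrypoint(ctx: JobContext):
    # Called once per dispatched room: connect, then wire the room's audio
    # tracks and data channels into your inference pipeline.
    await ctx.connect()

if __name__ == "__main__":
    # The worker registers with the LiveKit server and waits for job dispatches.
    agents.cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))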
The GPU Inference Path
Token streaming from vLLM to a LiveKit data channel uses the AsyncEngineClient streaming API. Each decoded token gets forwarded to the client immediately rather than waiting for a full sentence:
import uuid

from livekit.agents import JobContext

async def stream_tokens_to_channel(prompt: str, ctx: JobContext):
    # llm_engine (vLLM async engine) and sampling_params are created at startup.
    request_id = str(uuid.uuid4())
    prev = ""
    async for output in llm_engine.generate(prompt, sampling_params, request_id):
        # vLLM yields cumulative text; forward only the newly decoded delta.
        full_text = output.outputs[0].text
        delta = full_text[len(prev):]
        prev = full_text
        if delta:
            await ctx.room.local_participant.publish_data(
                delta.encode(), reliable=False
            )

The reliable=False flag uses the data channel in unreliable mode (UDP semantics). A dropped token packet causes a brief gap in the TTS buffer but avoids the TCP retransmit stall. TTS models handle occasional token gaps better than a stalled stream.
For vLLM setup, configuration options, and production deployment patterns, see the vLLM production deployment guide.
Reducing TTFT Below 150ms: Prefill Optimization, KV Cache, and Speculative Decoding
Three techniques compound to push TTFT below 150ms on GPU hardware.
1. KV cache warm pools
Pre-fill the system prompt and conversation prefix on model load. Subsequent requests skip prefill for the static portion entirely. For a typical voice agent prompt (short system instructions plus 2-3 conversation turns), this cuts 40-60% of TTFT. The static system prompt represents most of the prefill compute on short voice turns, so caching it pays off immediately.
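In vLLM, the closest built-in mechanism is automatic prefix caching, which reuses KV blocks for any shared prompt prefix. A minimal sketch; the model name and prompt are illustrative:

from vllm import LLM, SamplingParams

# Automatic prefix caching: KV blocks for a shared prompt prefix (the static
# system prompt) are computed once and reused by later requests.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_prefix_caching=True,
)

SYSTEM_PROMPT = "You are a concise voice assistant."

# Any request starting with SYSTEM_PROMPT skips prefill for that prefix.
outputs = llm.generate(
    SYSTEM_PROMPT + "\nUser: remind me what's on my calendar.",
    SamplingParams(max_tokens=64),
)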
See the KV cache optimization guide for implementation details, VRAM sizing, and cache eviction strategies.
2. Speculative decoding
A small draft model (0.5B-7B) generates candidate tokens that the target model (70B) accepts or rejects in parallel. For conversational text patterns (short outputs, common phrasing), speculative decoding typically achieves 2-3x decode speedup. The cost is the draft model's VRAM footprint: 4-8 GB at INT4 for a 7B draft.
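In vLLM, draft-model speculation is an engine configuration option. The exact argument shape has shifted across releases (older versions took flat speculative_model / num_speculative_tokens engine args; newer ones take a speculative_config dict), so treat this sketch, with an illustrative 1B draft, as version-dependent:

from vllm import LLM

# Draft-model speculative decoding: the 1B draft proposes candidate tokens,
# the 70B target verifies them in parallel each step.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # illustrative draft
        "num_speculative_tokens": 5,
    },
)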
The speculative decoding production guide covers draft model selection, acceptance rate tuning, and latency benchmarks.
3. Chunked prefill
vLLM's --enable-chunked-prefill flag prevents long prefill requests from blocking short decode requests. At 30+ concurrent voice sessions, a single user with a long conversation history would otherwise stall every other session's TTFT while their prefill runs. Chunked prefill interleaves prefill and decode steps, smoothing out the p99 latency spikes.
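The same option is available programmatically if you embed the engine rather than running vllm serve. A sketch constructing the async engine used in the streaming examples above:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Chunked prefill caps per-step prefill work so decode steps (and therefore
# inter-token latency) are not starved by one long conversation history.
llm_engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="meta-llama/Llama-3.3-70B-Instruct",
        enable_chunked_prefill=True,
        max_num_batched_tokens=2048,  # prefill chunk budget per scheduler step
    )
)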
For disaggregating prefill from decode at larger scale, see prefill-decode disaggregation.
| Technique | TTFT reduction | Tradeoff |
|---|---|---|
| KV cache warm pool | 40-60% | Memory overhead (static prompt cache) |
| Speculative decoding (7B draft) | 30-50% decode speedup | +7B VRAM (4-8 GB at INT4) |
| Chunked prefill | Prevents p99 spikes | Minor throughput reduction |
| GPU bare-metal (no virtualization) | 10-20% vs cloud VMs | Must use dedicated provider |
Token Streaming: WebRTC Data Channels vs SSE
Both approaches work. The tradeoffs matter at the latency margins voice agents operate in.
SSE is simpler to implement. One HTTP connection, works through corporate proxies and firewalls, always reliable (TCP retransmit ensures delivery), and can go directly from inference server to client without SFU involvement.
Data channels share the existing WebRTC connection with zero additional handshake cost. They can be configured as unreliable (ordered: false, maxRetransmits: 0) for lowest-latency delivery, multiplexed with audio on the same UDP port, and support the bidirectional communication needed for barge-in on the same connection.
For voice agents, data channels in unreliable mode are the right choice. A dropped token packet causes a brief gap in the TTS phoneme buffer, which the TTS model handles with interpolation at sentence boundaries. A TCP retransmit stall causes an audible silence spike that breaks the natural turn-taking feel.
Note: unreliable data channels (ordered: false, maxRetransmits: 0) are supported across all evergreen browsers. If you have specific minimum-version requirements for enterprise deployments, verify on caniuse.com.
# Server: publish token stream over unreliable data channel
await room.local_participant.publish_data(
    token.encode("utf-8"),
    reliable=False,  # UDP-like: lower latency, tolerate drops
    topic="llm_tokens",
)

Jitter Buffer Tuning, Packet Loss Recovery, and Barge-In Handling
Jitter buffer
WebRTC's adaptive jitter buffer targets 50-150ms by default. For voice agents with GPU inference, that default is too conservative. Reduce JitterBufferTarget to 40ms and JitterBufferMaxPackets to 50. This cuts perceived latency at the cost of more frequent audio glitches on high-jitter connections. Benchmark against your network's p99 before locking this in.
Packet loss recovery
Opus codec's built-in packet loss concealment (PLC) handles up to 10-15% loss transparently. Above that, enable Opus FEC via useinbandfec=1 in the SDP offer. FEC increases bandwidth by 15-20% but provides graceful degradation on poor connections.
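If your signaling layer doesn't expose a codec-options API, useinbandfec=1 can be added by munging the SDP before it is sent. A sketch only; browser and SFU stacks usually offer a cleaner knob, so treat this as a fallback:

import re

def enable_opus_fec(sdp: str) -> str:
    # Find the Opus payload type from the rtpmap line, then make sure the
    # matching fmtp line advertises in-band FEC.
    m = re.search(r"a=rtpmap:(\d+) opus/48000", sdp)
    if not m:
        return sdp
    pt = m.group(1)
    hit = re.search(rf"a=fmtp:{pt} \S+", sdp)
    if hit is None:
        # No fmtp line for Opus yet: add one right after the rtpmap line.
        return sdp.replace(m.group(0), m.group(0) + f"\r\na=fmtp:{pt} useinbandfec=1")
    if "useinbandfec=1" in hit.group(0):
        return sdp
    return sdp.replace(hit.group(0), hit.group(0) + ";useinbandfec=1")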
For the token data channel in unreliable mode, dropped packets mean the client interpolates or skips tokens at TTS sentence boundaries. The TTS model handles gaps better than stalls.
Barge-in implementation
Two-part implementation:
Client side: Silero VAD (~1-2MB model, runs in-browser via ONNX) detects speech during agent output and sends {"type": "interrupt"} over a reliable data channel. WebRTC's built-in VAD is an alternative but has lower accuracy on noisy audio.
Server side: The vLLM inference loop checks an asyncio event flag between decode steps. On interrupt, abort generation, cancel TTS output, restart ASR pipeline:
import asyncio
import json
import uuid

from livekit import rtc

async def create_session_handler():
    # Per-session event: each voice session gets its own instance so one
    # client's barge-in cannot interrupt other active sessions.
    interrupt_event = asyncio.Event()

    async def handle_data_message(msg: rtc.DataPacket):
        try:
            data = json.loads(msg.data)
        except json.JSONDecodeError:
            return
        if data.get("type") == "interrupt":
            interrupt_event.set()

    async def stream_with_barge_in(prompt: str):
        # llm_engine and params (SamplingParams) as in the streaming example above.
        req_id = str(uuid.uuid4())
        interrupt_event.clear()
        prev = ""
        async for output in llm_engine.generate(prompt, params, req_id):
            # Check the flag between decode steps; abort generation on barge-in.
            if interrupt_event.is_set():
                await llm_engine.abort(req_id)
                break
            full_text = output.outputs[0].text
            delta = full_text[len(prev):]
            prev = full_text
            if delta:
                yield delta

    return handle_data_message, stream_with_barge_in

The interrupt signal round-trip is under 20ms on a well-placed TURN server. The agent stops speaking within one additional decode step (typically 1-3ms on H100) after the interrupt arrives.
Deploying LiveKit Agents and Pipecat on GPU Cloud
LiveKit Agents on Spheron
The deployment pattern is two containers sharing a Docker network: a vLLM server and a LiveKit agent worker. The agent worker connects to a LiveKit server (self-hosted or LiveKit Cloud), receives audio from client sessions, and calls the local vLLM endpoint for inference.
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --quantization fp8
      --tensor-parallel-size 2
      --enable-chunked-prefill
      --port 8000
    ports:
      - "8000:8000"
  agent:
    image: your-org/livekit-agent:latest
    environment:
      - LIVEKIT_URL=${LIVEKIT_URL}
      - LIVEKIT_API_KEY=${LIVEKIT_API_KEY}
      - LIVEKIT_API_SECRET=${LIVEKIT_API_SECRET}
      - LLM_ENDPOINT=http://vllm:8000/v1
    depends_on:
      - vllm

Rent H100 on Spheron for the inference tier. The bare-metal provisioning means no hypervisor overhead on the GPU path, which contributes 10-20% lower TTFT compared to equivalent virtualized instances.
For the TTS component, NeuTTS Air co-locates on the same RTX 5090 node used for ASR (under 2GB VRAM for the TTS model), leaving 28+ GB free for FasterWhisper and any assistant tasks.
Pipecat Pipeline
Pipecat handles the frame-by-frame pipeline plumbing: mic frames in, audio frames out, with ASR, LLM, and TTS as pipeline stages:
pipeline = Pipeline([
    transport.input(),
    stt,  # FasterWhisper via WebRTC audio frames
    llm,  # vLLM OpenAI-compatible endpoint
    tts,  # NeuTTS Air or Kokoro HTTP API
    transport.output(),
])

Pipecat supports LiveKit as one of its WebRTC transport options. The LiveKitTransport handles ICE negotiation, audio encoding/decoding, and data channel management so your pipeline code stays focused on the AI logic.
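Running the pipeline is a task plus a runner. A sketch assuming recent Pipecat module paths; check the Pipecat docs for your installed version:

import asyncio

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def main():
    # PipelineTask wraps the pipeline; PipelineRunner drives frames through it
    # until the transport disconnects or the task is cancelled.
    task = PipelineTask(pipeline)
    runner = PipelineRunner()
    await runner.run(task)

asyncio.run(main())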
Network Topology: GPU Node Placement, TURN Servers, and Edge POPs
Network topology has a larger impact on end-to-end latency than most teams expect. Two placement decisions matter most.
GPU nodes and SFU should be co-located. Put your Spheron GPU instances in the same region as your LiveKit SFU. The SFU-to-GPU hop should be under 5ms. A cross-region SFU-to-GPU path (e.g., SFU in us-east, GPU in eu-west) adds 80-120ms round-trip and immediately breaks the sub-200ms budget.
TURN servers should be near users, not GPU nodes. TURN relays WebRTC traffic when direct peer-to-peer fails (common in corporate networks). A TURN server in the same region as end users adds under 20ms per hop. Placing TURN near the GPU instead of the user doubles the effective path length.
For global deployments: deploy SFU + TURN regionally (us-east, eu-west, ap-southeast), route GPU inference requests to the nearest available Spheron region.
[User] <-> [TURN (edge, near user)] <-> [SFU] <-> [GPU node (Spheron, co-located with SFU)]

Latency budget breakdown for a well-architected deployment:
| Hop | Target latency | Notes |
|---|---|---|
| User to TURN | <20ms | Edge TURN placement |
| TURN to SFU | <5ms | Same region |
| SFU to GPU node | <5ms | Same datacenter or co-located |
| GPU inference (TTFT) | <150ms | H100 SXM5, 7B-70B model |
| GPU to SFU to User (audio) | <15ms | Opus encode + WebRTC |
| Total (first audio) | <195ms | End-to-end |
For the GPU-to-SFU hop specifically, see the GPU networking guide for network interface configuration and bandwidth considerations on multi-GPU nodes.
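Before locking a topology in, it's worth sanity-checking those hop budgets with a direct measurement. A rough sketch that approximates RTT by timing TCP connect handshakes; the hostname is illustrative:

import asyncio
import time

async def tcp_rtt_ms(host: str, port: int, samples: int = 10) -> float:
    # Approximate RTT by timing TCP connect handshakes to the target.
    rtts = []
    for _ in range(samples):
        t0 = time.perf_counter()
        _, writer = await asyncio.open_connection(host, port)
        rtts.append((time.perf_counter() - t0) * 1000)
        writer.close()
        await writer.wait_closed()
    return sorted(rtts)[len(rtts) // 2]  # median, in ms

# The SFU-to-GPU hop should come back under 5ms if co-location is right.
# print(asyncio.run(tcp_rtt_ms("vllm.internal", 8000)))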
Spheron Reference Architecture and Cost Per Concurrent Call
Reference architecture for 40-100 concurrent voice sessions:
- Inference tier: 2x H100 SXM5 80GB on Spheron (bare-metal, on-demand), running vLLM with Llama 3.3 70B at FP8 + chunked prefill. Handles 40-80 concurrent sessions at TTFT under 150ms.
- ASR tier: 1x RTX 5090 32GB on Spheron, running FasterWhisper Large v3 Turbo at INT8.
- TTS tier: Shares the RTX 5090 with ASR. NeuTTS Air uses under 2GB VRAM, leaving ample headroom.
- SFU: Self-hosted LiveKit on 2x 8-core VMs (not GPU), co-located with inference tier.
- TURN: Hosted TURN via Metered.ca or Cloudflare TURN (per-GB pricing, not per-server).
Cost breakdown (pricing as of 29 Apr 2026):
| Component | GPU/Instance | Subtotal ($/hr) | Spot ($/hr) |
|---|---|---|---|
| LLM inference (2x H100 SXM5)¹ | 2x H100_NVL | ~$4.12 | N/A |
| ASR + TTS (RTX 5090) | 1x RTX 5090 | ~$0.86 | N/A |
| SFU (2x VM) | Non-GPU | ~$0.40 | N/A |
| TURN (Metered) | Per GB | ~$0.40/GB audio | N/A |
| Total GPU + SFU | | ~$5.38/hr | N/A |
¹ H100 SXM5 is the architectural reference used throughout this post. Pricing uses H100_NVL ($2.06/hr per GPU on Spheron), the closest available SKU. Spot pricing is not available for H100 on Spheron.
At 40 concurrent voice sessions, cost per session-hour is approximately $0.13 on-demand. Compare to an AWS p3.2xlarge (V100) equivalent stack, which runs 2-3x higher for equivalent voice agent throughput based on the GPU cloud pricing comparison.
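The per-session arithmetic, using the on-demand numbers above (TURN excluded since it bills per GB rather than per hour):

llm_tier = 2 * 2.06   # 2x H100_NVL, $/hr
asr_tts = 0.86        # RTX 5090 (ASR + TTS), $/hr
sfu = 0.40            # 2x 8-core VMs, $/hr

hourly = llm_tier + asr_tts + sfu   # $5.38/hr
print(round(hourly / 40, 3))        # 0.134 -> ~$0.13 per session-hour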
For higher session density or lower TTFT on 70B models, step up to bare-metal B200 instances (192GB HBM3e, $7.43/hr per GPU on-demand or $1.71/hr spot). The B200's memory bandwidth is roughly 2x the H100 SXM5, which translates directly to lower decode latency at high concurrency. For the full cost-per-session-hour economics across model sizes, see inference cost economics.
Pricing fluctuates based on GPU availability. The prices above are based on 29 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Wrapping Up
WebRTC is not an optional optimization for sub-200ms voice agents. It is the transport layer that makes barge-in, full-duplex audio, and consistent low jitter possible at the same time. HTTP/SSE solves for the easy cases; voice agents are not easy cases.
The stack outlined here (LiveKit SFU, vLLM token streaming over data channels, Silero VAD for barge-in, FasterWhisper + NeuTTS Air for ASR/TTS) runs on hardware you can spin up in minutes on Spheron and tear down just as fast. No reserved capacity commitments, no minimum contract.
Start with the LiveKit Agents quickstart and Pipecat's WebRTC examples to get a basic pipeline running, then layer in the TTFT optimizations (KV cache warm pools, chunked prefill) as you hit latency targets. Full Spheron documentation covers instance provisioning and network configuration.
Running a voice agent pipeline on Spheron gives you sub-millisecond GPU scheduling latency alongside WebRTC-ready networking, no long-term reservations needed. Start with an on-demand H100 for development and scale to dedicated multi-GPU nodes as call volume grows.
Rent H100 → | Rent B200 → | View all GPU pricing → | Get started on Spheron →
