A voice agent that feels real-time has less than 200ms between when the user stops talking and when they hear the first word back. Most guides stop at the model layer: pick a fast ASR, run a 7B LLM, stream TTS output. But the network layer is equally load-bearing. If you're still routing voice over HTTP/SSE, you're leaving 50-100ms on the table and losing barge-in support entirely.
This post covers the network plumbing that sits between your GPU inference stack and the end user: WebRTC transport, SFU selection, token streaming over data channels, jitter buffer configuration, and barge-in handling. For the model-layer side (VRAM sizing, ASR/LLM/TTS GPU allocation, and latency budgets per stage), see the Voice AI GPU Infrastructure guide. For the TTS model specifically, NeuTTS Air covers the 320x real-time synthesis option built on Spheron.
Why HTTP Streaming Fails for Real-Time Voice Agents
HTTP/SSE works fine for chat UIs. For voice, it breaks in three specific ways.
Head-of-line blocking. TCP retransmits stalled packets before delivering later ones. On a lossy mobile connection, one dropped packet causes a 20-80ms stall in your token stream. In a chat UI, that's a brief freeze. In a voice agent, it causes audible clipping or a dead silence gap that sounds like a dropped call.
One-way transport. SSE is server-to-client only. When a user starts speaking while the agent is talking (barge-in), your application needs a separate WebSocket or HTTP POST to send the interrupt signal back. That's a second connection, a second TLS handshake on first use, and additional latency on each barge-in event.
Per-request connection overhead. Each SSE stream starts with an HTTP request and TLS negotiation. For a voice session with frequent turn-taking, the cumulative handshake cost adds up. A WebRTC session pays the ICE + DTLS setup cost once, then handles all bidirectional traffic over a single multiplexed UDP connection.
| Property | HTTP/SSE | WebRTC |
|---|---|---|
| Direction | One-way (server to client) | Full-duplex |
| Transport | TCP | UDP + DTLS |
| Latency jitter | 5-50ms (TCP retransmit) | 2-15ms (UDP) |
| Barge-in support | Manual reconnect | Built-in via data channel signal |
| Packet loss handling | Retransmit (blocks stream) | Concealment (Opus PLC) |
| Connection setup | Per-request TLS | Once per session (ICE + DTLS) |
For sub-200ms end-to-end, WebRTC is not optional.
WebRTC Architecture: SFU, Data Channels, and the GPU Inference Path
The full-stack architecture for a WebRTC voice agent looks like this:
User Mic
  -> WebRTC (Opus audio)
  -> SFU (LiveKit or MediaSoup)
  -> VAD (Silero or WebRTC built-in)
  -> ASR (FasterWhisper)
  -> LLM (vLLM)
  -> TTS (NeuTTS Air or Kokoro)
  -> WebRTC data channel
  -> SFU
  -> User Speaker

Data channels carry three types of traffic:
- LLM token stream (server to client, unreliable mode for lower latency)
- Interrupt/barge-in signals (client to server, reliable mode)
- Session metadata (turn state, VAD confidence, timestamps)
Audio (mic and speaker) travels over the WebRTC media track. The data channel is a separate logical stream multiplexed on the same UDP port.
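For illustration, here is how those three traffic types might map onto LiveKit data-channel publishes. A minimal sketch assuming a connected rtc.Room; the topic names ("llm_tokens", "control", "meta") are chosen for this example, not a LiveKit convention:

import json

async def send_token(room, delta: str):
    # Token stream: unreliable mode, occasional drops tolerated for lower latency
    await room.local_participant.publish_data(
        delta.encode(), reliable=False, topic="llm_tokens"
    )

async def send_interrupt(room):
    # Barge-in signal: must arrive, so use reliable mode
    await room.local_participant.publish_data(
        b'{"type": "interrupt"}', reliable=True, topic="control"
    )

async def send_metadata(room, state: dict):
    # Turn state, VAD confidence, timestamps
    await room.local_participant.publish_data(
        json.dumps(state).encode(), reliable=True, topic="meta"
    )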
SFU Options: LiveKit vs MediaSoup
The SFU (Selective Forwarding Unit) is the server that manages WebRTC sessions, routes audio between participants, and bridges between your GPU inference backend and client connections.
| Feature | LiveKit | MediaSoup |
|---|---|---|
| Language | Go + TypeScript SDK | Node.js |
| Agents SDK | Yes (Python, Node.js) | Manual integration |
| Cloud option | LiveKit Cloud | Self-hosted only |
| TURN built-in | Yes (livekit-server) | Requires coturn |
| Voice agent support | First-class (Agents framework) | DIY |
| Scaling | Built-in room distribution | Manual horizontal scaling |
LiveKit is the better default for most teams. The Agents framework handles the plumbing between WebRTC sessions and Python inference code, built-in TURN removes a dependency to manage, and the Python SDK integrates directly with vLLM and Pipecat.
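A minimal LiveKit Agents worker skeleton looks like the following. Sketch only, assuming livekit-agents is installed and the LIVEKIT_URL / API key environment variables are set:

from livekit import agents
from livekit.agents import JobContext, WorkerOptions

async def entrypoint(ctx: JobContext):
    # Called once per dispatched room: connect, then wire the room's audio
    # tracks and data channels into your inference pipeline.
    await ctx.connect()

if __name__ == "__main__":
    # The worker registers with the LiveKit server and waits for job dispatches.
    agents.cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))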
The GPU Inference Path
Token streaming from vLLM to a LiveKit data channel uses the AsyncEngineClient streaming API. Each decoded token gets forwarded to the client immediately rather than waiting for a full sentence:
import uuid

from livekit.agents import JobContext

async def stream_tokens_to_channel(prompt: str, ctx: JobContext):
    # llm_engine (vLLM async engine) and sampling_params are created at startup.
    request_id = str(uuid.uuid4())
    prev = ""
    async for output in llm_engine.generate(prompt, sampling_params, request_id):
        # vLLM yields cumulative text; forward only the newly decoded delta.
        full_text = output.outputs[0].text
        delta = full_text[len(prev):]
        prev = full_text
        if delta:
            await ctx.room.local_participant.publish_data(
                delta.encode(), reliable=False
            )

The reliable=False flag uses the data channel in unreliable mode (UDP semantics). A dropped token packet causes a brief gap in the TTS buffer but avoids the TCP retransmit stall. TTS models handle occasional token gaps better than a stalled stream.
For vLLM setup, configuration options, and production deployment patterns, see the vLLM production deployment guide.
Reducing TTFT Below 150ms: Prefill Optimization, KV Cache, and Speculative Decoding
Three techniques compound to push TTFT below 150ms on GPU hardware.
1. KV cache warm pools
Pre-fill the system prompt and conversation prefix on model load. Subsequent requests skip prefill for the static portion entirely. For a typical voice agent prompt (short system instructions plus 2-3 conversation turns), this cuts 40-60% of TTFT. The static system prompt represents most of the prefill compute on short voice turns, so caching it pays off immediately.
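In vLLM, the closest built-in mechanism is automatic prefix caching, which reuses KV blocks for any shared prompt prefix. A minimal sketch; the model name and prompt are illustrative:

from vllm import LLM, SamplingParams

# Automatic prefix caching: KV blocks for a shared prompt prefix (the static
# system prompt) are computed once and reused by later requests.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_prefix_caching=True,
)

SYSTEM_PROMPT = "You are a concise voice assistant."

# Any request starting with SYSTEM_PROMPT skips prefill for that prefix.
outputs = llm.generate(
    SYSTEM_PROMPT + "\nUser: remind me what's on my calendar.",
    SamplingParams(max_tokens=64),
)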
See the KV cache optimization guide for implementation details, VRAM sizing, and cache eviction strategies.
2. Speculative decoding
A small draft model (0.5B-7B) generates candidate tokens that the target model (70B) accepts or rejects in parallel. For conversational text patterns (short outputs, common phrasing), speculative decoding typically achieves 2-3x decode speedup. The cost is the draft model's VRAM footprint: 4-8 GB at INT4 for a 7B draft.
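In vLLM, draft-model speculation is an engine configuration option. The exact argument shape has shifted across releases (older versions took flat speculative_model / num_speculative_tokens engine args; newer ones take a speculative_config dict), so treat this sketch, with an illustrative 1B draft, as version-dependent:

from vllm import LLM

# Draft-model speculative decoding: the 1B draft proposes candidate tokens,
# the 70B target verifies them in parallel each step.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # illustrative draft
        "num_speculative_tokens": 5,
    },
)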
The speculative decoding production guide covers draft model selection, acceptance rate tuning, and latency benchmarks.
3. Chunked prefill
vLLM's --enable-chunked-prefill flag prevents long prefill requests from blocking short decode requests. At 30+ concurrent voice sessions, a single user with a long conversation history would otherwise stall every other session's TTFT while their prefill runs. Chunked prefill interleaves prefill and decode steps, smoothing out the p99 latency spikes.
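The same option is available programmatically if you embed the engine rather than running vllm serve. A sketch constructing the async engine used in the streaming examples above:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Chunked prefill caps per-step prefill work so decode steps (and therefore
# inter-token latency) are not starved by one long conversation history.
llm_engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="meta-llama/Llama-3.3-70B-Instruct",
        enable_chunked_prefill=True,
        max_num_batched_tokens=2048,  # prefill chunk budget per scheduler step
    )
)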
For disaggregating prefill from decode at larger scale, see prefill-decode disaggregation.
| Technique | TTFT reduction | Tradeoff |
|---|---|---|
| KV cache warm pool | 40-60% | Memory overhead (static prompt cache) |
| Speculative decoding (7B draft) | 30-50% decode speedup | +7B VRAM (4-8 GB at INT4) |
| Chunked prefill | Prevents p99 spikes | Minor throughput reduction |
| GPU bare-metal (no virtualization) | 10-20% vs cloud VMs | Must use dedicated provider |
Token Streaming: WebRTC Data Channels vs SSE
Both approaches work. The tradeoffs matter at the latency margins voice agents operate in.
SSE is simpler to implement. One HTTP connection, works through corporate proxies and firewalls, always reliable (TCP retransmit ensures delivery), and can go directly from inference server to client without SFU involvement.
Data channels share the existing WebRTC connection with zero additional handshake cost. They can be configured as unreliable (ordered: false, maxRetransmits: 0) for lowest-latency delivery, multiplexed with audio on the same UDP port, and support the bidirectional communication needed for barge-in on the same connection.
For voice agents, data channels in unreliable mode are the right choice. A dropped token packet causes a brief gap in the TTS phoneme buffer, which the TTS model handles with interpolation at sentence boundaries. A TCP retransmit stall causes an audible silence spike that breaks the natural turn-taking feel.
Note: unreliable data channels (ordered: false, maxRetransmits: 0) are supported across all evergreen browsers. If you have specific minimum-version requirements for enterprise deployments, verify on caniuse.com.
# Server: publish token stream over unreliable data channel
await room.local_participant.publish_data(
    token.encode("utf-8"),
    reliable=False,  # UDP-like: lower latency, tolerate drops
    topic="llm_tokens",
)

Jitter Buffer Tuning, Packet Loss Recovery, and Barge-In Handling
Jitter buffer
WebRTC's adaptive jitter buffer targets 50-150ms by default. For voice agents with GPU inference, that default is too conservative. Reduce JitterBufferTarget to 40ms and JitterBufferMaxPackets to 50. This cuts perceived latency at the cost of more frequent audio glitches on high-jitter connections. Benchmark against your network's p99 before locking this in.
Packet loss recovery
Opus codec's built-in packet loss concealment (PLC) handles up to 10-15% loss transparently. Above that, enable Opus FEC via useinbandfec=1 in the SDP offer. FEC increases bandwidth by 15-20% but provides graceful degradation on poor connections.
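If your signaling layer doesn't expose a codec-options API, useinbandfec=1 can be added by munging the SDP before it is sent. A sketch only; browser and SFU stacks usually offer a cleaner knob, so treat this as a fallback:

import re

def enable_opus_fec(sdp: str) -> str:
    # Find the Opus payload type from the rtpmap line, then make sure the
    # matching fmtp line advertises in-band FEC.
    m = re.search(r"a=rtpmap:(\d+) opus/48000", sdp)
    if not m:
        return sdp
    pt = m.group(1)
    hit = re.search(rf"a=fmtp:{pt} \S+", sdp)
    if hit is None:
        # No fmtp line for Opus yet: add one right after the rtpmap line.
        return sdp.replace(m.group(0), m.group(0) + f"\r\na=fmtp:{pt} useinbandfec=1")
    if "useinbandfec=1" in hit.group(0):
        return sdp
    return sdp.replace(hit.group(0), hit.group(0) + ";useinbandfec=1")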
For the token data channel in unreliable mode, dropped packets mean the client interpolates or skips tokens at TTS sentence boundaries. The TTS model handles gaps better than stalls.
Barge-in implementation
Two-part implementation:
Client side: Silero VAD (~1-2MB model, runs in-browser via ONNX) detects speech during agent output and sends {"type": "interrupt"} over a reliable data channel. WebRTC's built-in VAD is an alternative but has lower accuracy on noisy audio.
Server side: The vLLM inference loop checks an asyncio event flag between decode steps. On interrupt, abort generation, cancel TTS output, restart ASR pipeline:
import asyncio
import json
import uuid

from livekit import rtc

async def create_session_handler():
    # Per-session event: each voice session gets its own instance so one
    # client's barge-in cannot interrupt other active sessions.
    interrupt_event = asyncio.Event()

    async def handle_data_message(msg: rtc.DataPacket):
        try:
            data = json.loads(msg.data)
        except json.JSONDecodeError:
            return
        if data.get("type") == "interrupt":
            interrupt_event.set()

    async def stream_with_barge_in(prompt: str):
        # llm_engine and params (SamplingParams) as in the streaming example above.
        req_id = str(uuid.uuid4())
        interrupt_event.clear()
        prev = ""
        async for output in llm_engine.generate(prompt, params, req_id):
            # Check the flag between decode steps; abort generation on barge-in.
            if interrupt_event.is_set():
                await llm_engine.abort(req_id)
                break
            full_text = output.outputs[0].text
            delta = full_text[len(prev):]
            prev = full_text
            if delta:
                yield delta

    return handle_data_message, stream_with_barge_in

The interrupt signal round-trip is under 20ms on a well-placed TURN server. The agent stops speaking within one additional decode step (typically 1-3ms on H100) after the interrupt arrives.
Deploying LiveKit Agents and Pipecat on GPU Cloud
LiveKit Agents on Spheron
The deployment pattern is two containers sharing a Docker network: a vLLM server and a LiveKit agent worker. The agent worker connects to a LiveKit server (self-hosted or LiveKit Cloud), receives audio from client sessions, and calls the local vLLM endpoint for inference.
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --quantization fp8
      --tensor-parallel-size 2
      --enable-chunked-prefill
      --port 8000
    ports:
      - "8000:8000"
  agent:
    image: your-org/livekit-agent:latest
    environment:
      - LIVEKIT_URL=${LIVEKIT_URL}
      - LIVEKIT_API_KEY=${LIVEKIT_API_KEY}
      - LIVEKIT_API_SECRET=${LIVEKIT_API_SECRET}
      - LLM_ENDPOINT=http://vllm:8000/v1
    depends_on:
      - vllm

Rent H100 on Spheron for the inference tier. The bare-metal provisioning means no hypervisor overhead on the GPU path, which contributes 10-20% lower TTFT compared to equivalent virtualized instances.
For the TTS component, NeuTTS Air co-locates on the same RTX 5090 node used for ASR (under 2GB VRAM for the TTS model), leaving 28+ GB free for FasterWhisper and any assistant tasks.
Pipecat Pipeline
Pipecat handles the frame-by-frame pipeline plumbing: mic frames in, audio frames out, with ASR, LLM, and TTS as pipeline stages:
pipeline = Pipeline([
    transport.input(),
    stt,  # FasterWhisper via WebRTC audio frames
    llm,  # vLLM OpenAI-compatible endpoint
    tts,  # NeuTTS Air or Kokoro HTTP API
    transport.output(),
])

Pipecat supports LiveKit as one of its WebRTC transport options. The LiveKitTransport handles ICE negotiation, audio encoding/decoding, and data channel management so your pipeline code stays focused on the AI logic.
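Running the pipeline is a task plus a runner. A sketch assuming recent Pipecat module paths; check the Pipecat docs for your installed version:

import asyncio

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def main():
    # PipelineTask wraps the pipeline; PipelineRunner drives frames through it
    # until the transport disconnects or the task is cancelled.
    task = PipelineTask(pipeline)
    runner = PipelineRunner()
    await runner.run(task)

asyncio.run(main())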
Network Topology: GPU Node Placement, TURN Servers, and Edge POPs
Network topology has a larger impact on end-to-end latency than most teams expect. Two placement decisions matter most.
GPU nodes and SFU should be co-located. Put your Spheron GPU instances in the same region as your LiveKit SFU. The SFU-to-GPU hop should be under 5ms. A cross-region SFU-to-GPU path (e.g., SFU in us-east, GPU in eu-west) adds 80-120ms round-trip and immediately breaks the sub-200ms budget.
TURN servers should be near users, not GPU nodes. TURN relays WebRTC traffic when direct peer-to-peer fails (common in corporate networks). A TURN server in the same region as end users adds under 20ms per hop. Placing TURN near the GPU instead of the user doubles the effective path length.
For global deployments: deploy SFU + TURN regionally (us-east, eu-west, ap-southeast), route GPU inference requests to the nearest available Spheron region.
[User] <-> [TURN (edge, near user)] <-> [SFU] <-> [GPU node (Spheron, co-located with SFU)]

Latency budget breakdown for a well-architected deployment:
| Hop | Target latency | Notes |
|---|---|---|
| User to TURN | <20ms | Edge TURN placement |
| TURN to SFU | <5ms | Same region |
| SFU to GPU node | <5ms | Same datacenter or co-located |
| GPU inference (TTFT) | <150ms | H100 SXM5, 7B-70B model |
| GPU to SFU to User (audio) | <15ms | Opus encode + WebRTC |
| Total (first audio) | <195ms | End-to-end |
For the GPU-to-SFU hop specifically, see the GPU networking guide for network interface configuration and bandwidth considerations on multi-GPU nodes.
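Before locking a topology in, it's worth sanity-checking those hop budgets with a direct measurement. A rough sketch that approximates RTT by timing TCP connect handshakes; the hostname is illustrative:

import asyncio
import time

async def tcp_rtt_ms(host: str, port: int, samples: int = 10) -> float:
    # Approximate RTT by timing TCP connect handshakes to the target.
    rtts = []
    for _ in range(samples):
        t0 = time.perf_counter()
        _, writer = await asyncio.open_connection(host, port)
        rtts.append((time.perf_counter() - t0) * 1000)
        writer.close()
        await writer.wait_closed()
    return sorted(rtts)[len(rtts) // 2]  # median, in ms

# The SFU-to-GPU hop should come back under 5ms if co-location is right.
# print(asyncio.run(tcp_rtt_ms("vllm.internal", 8000)))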
Spheron Reference Architecture and Cost Per Concurrent Call
Reference architecture for 40-100 concurrent voice sessions:
- Inference tier: 2x H100 SXM5 80GB on Spheron (bare-metal, on-demand), running vLLM with Llama 3.3 70B at FP8 + chunked prefill. Handles 40-80 concurrent sessions at TTFT under 150ms.
- ASR tier: 1x RTX 5090 32GB on Spheron, running FasterWhisper Large v3 Turbo at INT8.
- TTS tier: Shares the RTX 5090 with ASR. NeuTTS Air uses under 2GB VRAM, leaving ample headroom.
- SFU: Self-hosted LiveKit on 2x 8-core VMs (not GPU), co-located with inference tier.
- TURN: Hosted TURN via Metered.ca or Cloudflare TURN (per-GB pricing, not per-server).
Cost breakdown (pricing as of 29 Apr 2026):
| Component | GPU/Instance | Subtotal ($/hr) | Spot ($/hr) |
|---|---|---|---|
| LLM inference (2x H100 SXM5)¹ | 2x H100_NVL | ~$4.12 | N/A |
| ASR + TTS (RTX 5090) | 1x RTX 5090 | ~$0.86 | N/A |
| SFU (2x VM) | Non-GPU | ~$0.40 | N/A |
| TURN (Metered) | Per GB | ~$0.40/GB audio | N/A |
| Total GPU + SFU | | ~$5.38/hr | N/A |
¹ H100 SXM5 is the architectural reference used throughout this post. Pricing uses H100_NVL ($2.06/hr per GPU on Spheron), the closest available SKU. Spot pricing is not available for H100 on Spheron.
At 40 concurrent voice sessions, cost per session-hour is approximately $0.13 on-demand. Compare to an AWS p3.2xlarge (V100) equivalent stack, which runs 2-3x higher for equivalent voice agent throughput based on the GPU cloud pricing comparison.
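The per-session arithmetic, using the on-demand numbers above (TURN excluded since it bills per GB rather than per hour):

llm_tier = 2 * 2.06   # 2x H100_NVL, $/hr
asr_tts = 0.86        # RTX 5090 (ASR + TTS), $/hr
sfu = 0.40            # 2x 8-core VMs, $/hr

hourly = llm_tier + asr_tts + sfu   # $5.38/hr
print(round(hourly / 40, 3))        # 0.134 -> ~$0.13 per session-hour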
For higher session density or lower TTFT on 70B models, step up to bare-metal B200 instances (192GB HBM3e, $7.43/hr per GPU on-demand or $1.71/hr spot). The B200's memory bandwidth is roughly 2x the H100 SXM5, which translates directly to lower decode latency at high concurrency. For the full cost-per-session-hour economics across model sizes, see inference cost economics.
Pricing fluctuates based on GPU availability. The prices above are based on 29 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Wrapping Up
WebRTC is not an optional optimization for sub-200ms voice agents. It is the transport layer that makes barge-in, full-duplex audio, and consistent low jitter possible at the same time. HTTP/SSE solves for the easy cases; voice agents are not easy cases.
The stack outlined here (LiveKit SFU, vLLM token streaming over data channels, Silero VAD for barge-in, FasterWhisper + NeuTTS Air for ASR/TTS) runs on hardware you can spin up in minutes on Spheron and tear down just as fast. No reserved capacity commitments, no minimum contract.
Start with the LiveKit Agents quickstart and Pipecat's WebRTC examples to get a basic pipeline running, then layer in the TTFT optimizations (KV cache warm pools, chunked prefill) as you hit latency targets. Full Spheron documentation covers instance provisioning and network configuration.
Running a voice agent pipeline on Spheron gives you sub-millisecond GPU scheduling latency alongside WebRTC-ready networking, no long-term reservations needed. Start with an on-demand H100 for development and scale to dedicated multi-GPU nodes as call volume grows.
Rent H100 → | Rent B200 → | View all GPU pricing → | Get started on Spheron →
