The 300ms wall is real. Voice agents that respond in under 300ms feel natural; at 500ms they feel delayed; above 700ms they feel broken. A cascaded ASR+LLM+TTS pipeline, even when optimized, stacks three serial latency contributors: speech recognition (30-80ms), LLM first token (150-300ms), and TTS first audio chunk (50-100ms). That is a minimum of 230ms before you account for audio transport, tokenization overhead, and sentence boundary detection. At production load, p95 typically lands between 400-800ms.
Unified speech-to-speech models take a different approach. Audio goes in, audio comes out, with no text serialization step between them. One tokenization pass, one model, one output stream. For the GPU requirements of a cascaded voice pipeline and how each stage competes for VRAM, see the Voice AI GPU Infrastructure guide. For the ASR layer specifically, the Whisper production deployment guide covers faster-whisper, streaming chunking, and speaker diarization. This post covers the alternative: unified S2S models that skip the text bottleneck entirely.
What Unified Speech-to-Speech Models Are
A unified S2S model takes raw audio frames as input and generates raw audio frames as output. No transcript, no text tokens, no separate text-to-speech synthesis step. The model processes acoustic features directly and generates audio through a learned audio codec.
A cascaded pipeline has two serialization points: the ASR-to-LLM boundary (speech to text) and the LLM-to-TTS boundary (text to audio). Each serialization point forces the preceding stage to complete before the next can begin. Text must be fully decoded before TTS synthesis starts. Both conversions add latency and introduce potential for quality degradation.
Unified S2S eliminates both boundaries. The model learns a joint representation over speech and language, so it can start generating audio output while still processing input, rather than waiting for a transcript. This also enables full-duplex operation: the model can listen and speak simultaneously, which is how human conversation actually works.
The distinction between full-duplex and half-duplex S2S matters in production. Full-duplex models (Moshi, Hertz-dev) can handle barge-in natively because they are always processing input. Half-duplex S2S models wait for the speaker to finish before generating output, which still eliminates the text bottleneck but does not support interruption without a VAD-triggered hard stop.
The Latency Budget: Why Sub-300ms Requires End-to-End Audio Tokenization
The cascaded pipeline latency breakdown:
| Stage | Best case | Typical | Notes |
|---|---|---|---|
| ASR (faster-whisper, INT8) | 20ms | 40-80ms | Per 3-5s chunk; greedy decode |
| LLM time-to-first-token | 80ms | 150-300ms | 7B model, H100 |
| TTS first audio chunk | 40ms | 50-100ms | Kokoro/NeuTTS Air |
| Audio transport overhead | 10ms | 20-60ms | WebSocket, no WebRTC |
| Total | 150ms | 260-540ms | Single concurrent session |
At production concurrency (8+ sessions on a single GPU), TTFT for the LLM climbs as KV cache competition increases. The p95 for a fully optimized cascaded pipeline on an H100 with 8 sessions sits around 500-700ms, not 300ms.
S2S models collapse the table differently. With Moshi on an H100:
| Stage | Time |
|---|---|
| Audio tokenization (Mimi codec) | 15ms |
| Joint speech LM forward pass | 80-120ms |
| Audio de-tokenization | 10ms |
| First audio chunk out | 105-145ms |
The depth transformer that handles acoustic token prediction runs in parallel with the primary text stream, not sequentially after it. This interleaved generation is why S2S time-to-first-audio (TTFA) can land under 200ms on well-provisioned hardware.
The 2026 S2S Model Landscape
Moshi (Kyutai)
Moshi is a 7B joint speech language model with a dual-stream architecture: an inner monologue stream that generates text (not sent to the user) and an audio stream that generates speech. The text stream functions as chain-of-thought reasoning, improving response coherence without adding a separate LLM inference call.
The audio codec is Mimi, operating at 12.5 frames per second, with each frame carrying one semantic token plus 7 acoustic residual codebook tokens. A small depth transformer predicts the residual codebooks within each frame step, so all 8 codebook streams are emitted frame by frame alongside the text stream rather than after the full response. This is the architectural reason Moshi's TTFA is fast relative to a cascaded pipeline.
- VRAM: 16-20GB FP16 (bfloat16 in practice)
- License: CC BY 4.0
- Full-duplex: Yes, natively handles simultaneous listen and speak
- Language: English (primary); limited multilingual
Sesame CSM-1B (Sesame AI Labs)
Sesame CSM (Conversational Speech Model) is a 1B-parameter model built on a Llama-style backbone with an audio decoder. It takes text and optional audio context as input and generates expressive speech output. It is not a full S2S model in the strict sense: it still needs text from an LLM. Think of it as TTS with S2S-like expressiveness, trained on conversational audio to match prosody and emotion to context rather than just text instructions.
CSM-1B's strength is context-aware voice synthesis. You can pass several turns of audio history and it conditions its output prosody on the conversation style, not just the text. This produces notably more natural-sounding voice responses than conventional TTS models in conversational settings.
- VRAM: 6-8GB (1B parameters at FP16)
- License: Apache 2.0
- Full-duplex: No, requires a separate LLM
- Language: English; limited multilingual
Hertz-dev (Standard Intelligence)
Hertz-dev is a transformer-based real-time audio model (~8.5B parameters) trained on duplex conversational audio data. It processes raw audio input and generates audio output directly, making it a true full-duplex S2S model. Standard Intelligence released Hertz-dev in November 2024 and benchmarks across diverse hardware and concurrency configurations are still limited as of April 2026. The architecture is similar in intent to Moshi but trained on a different corpus with a different tokenization approach.
Treat Hertz-dev as research-grade for production: the model architecture is sound and inference demos work, but systematic benchmarks across concurrency levels and hardware configurations are not yet widely published. VRAM estimates (8-12GB) are derived from parameter count, not confirmed production benchmarks.
- VRAM: 8-12GB (estimate)
- License: Apache 2.0
- Full-duplex: Yes
- Language: English
Other Models Worth Knowing
GLM-Voice (Tsinghua): 9B parameters, text-guided voice generation, strong multilingual support including code-switching between Chinese and English. VRAM: ~18GB FP16. Better choice than Moshi for multilingual S2S use cases.
Mini-Omni-2: Lightweight omni model (~1.5B), low VRAM (~4GB), trades voice quality for footprint. Usable on an RTX 4090 for dev and testing; not recommended for production voice agents due to quality limitations.
Model Comparison Table
| Model | Params | VRAM FP16 | VRAM INT8 | Full-Duplex | License |
|---|---|---|---|---|---|
| Moshi | 7B | 16-20GB | 8-10GB | Yes | CC BY 4.0 |
| Sesame CSM-1B | 1B | 6-8GB | 3-4GB | No (needs LLM) | Apache 2.0 |
| Hertz-dev | ~8.5B | 8-12GB (est.) | 5-7GB (est.) | Yes | Apache 2.0 |
| GLM-Voice | 9B | ~18GB | ~9GB | Partial | Apache 2.0 |
| Mini-Omni-2 | ~1.5B | ~4GB | ~2GB | Partial | Apache 2.0 |
GPU VRAM Requirements and Concurrency Math
The concurrency formula is:
```text
max_sessions = floor((total_vram_gb - 2) / per_session_vram_gb)
```
The 2GB overhead covers CUDA context, OS, and framework buffers. Audio ring buffers for active sessions grow with session duration; monitor VRAM during load testing, not just during model load.
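A quick worked example of the formula, using the 16-20GB per-session range for Moshi from the model table above (these are estimates, not measured allocations):

```python
import math

def max_sessions(total_vram_gb: float, per_session_vram_gb: float, overhead_gb: float = 2.0) -> int:
    """Reserve fixed CUDA/OS/framework overhead, then divide the remaining VRAM per session."""
    return math.floor((total_vram_gb - overhead_gb) / per_session_vram_gb)

# Moshi on an 80GB H100 SXM5: 3 sessions at the heavy 20GB estimate, 4 at 16GB
print(max_sessions(80, 20))  # -> 3
print(max_sessions(80, 16))  # -> 4
```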
| GPU | Total VRAM | Moshi sessions | CSM-1B sessions | Hertz-dev sessions |
|---|---|---|---|---|
| RTX 4090 | 24GB | 1 | 2-3 | 1 |
| L40S | 48GB | 2-3 | 5-7 | 3-4 |
| H100 SXM5 | 80GB | 3-4 | 9-11 | 5-6 |
| H200 SXM5 | 141GB | 7-8 | 17-19 | 11-12 |
The RTX 4090 handles a single Moshi session with no headroom for failover. Production deployments should start with an L40S on Spheron for cost-effective concurrency, or an H100 SXM5 instance for higher-throughput workloads. For general VRAM sizing methodology across model sizes, see the GPU memory requirements for LLM inference guide.
VRAM fragmentation is a real concern for long-session S2S workloads. Moshi's Mimi codec maintains an audio ring buffer per active session. Each buffer grows up to max_session_tokens * bytes_per_token. Set a hard max_session_tokens limit (2,000 tokens is a reasonable starting point, roughly 160 seconds at 12.5 tok/s) and reset session state on natural conversation turns. Use nvidia-smi dmon -s m to watch device memory growth during load testing.
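One way to enforce the cap is a per-session token counter that flags when a reset is due. A minimal sketch; the reset itself depends on your serving wrapper (with the moshi package, tearing down and re-entering the streaming context has the same effect):

```python
MIMI_TOKENS_PER_SECOND = 12.5
MAX_SESSION_TOKENS = 2_000  # roughly 160 seconds of audio


class SessionTokenBudget:
    """Tracks audio tokens consumed by one session and signals when to reset state."""

    def __init__(self, max_tokens: int = MAX_SESSION_TOKENS):
        self.max_tokens = max_tokens
        self.used = 0.0

    def add_chunk(self, chunk_seconds: float = 0.08) -> bool:
        """Account for one 80ms chunk; returns True once a state reset is due."""
        self.used += chunk_seconds * MIMI_TOKENS_PER_SECOND
        return self.used >= self.max_tokens

    def reset_on_turn_boundary(self):
        self.used = 0.0
```

Calling reset_on_turn_boundary() at natural conversation turns keeps the hard cap from triggering mid-utterance.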
Streaming Inference Architecture
Audio Chunk Sizing and RTC Alignment
WebRTC uses 20ms packet intervals. For S2S models operating on audio frames, 80ms chunks (1,280 samples at 16kHz) align well with WebRTC frame boundaries (4 packets), keep per-chunk inference overhead low, and match Moshi's Mimi codec internal frame size. Going below 40ms increases overhead without meaningful TTFA improvement. Going above 160ms increases perceived input lag.
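Because WebRTC delivers 20ms packets and the model wants 80ms frames, the transport layer typically batches four packets per inference step. A minimal accumulator, assuming 16kHz mono int16 PCM:

```python
SAMPLE_RATE = 16_000
CHUNK_MS = 80
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 1,280 samples
CHUNK_BYTES = CHUNK_SAMPLES * 2                 # int16 -> 2,560 bytes


class ChunkAccumulator:
    """Batches 20ms WebRTC packets into 80ms model-sized chunks."""

    def __init__(self):
        self.buffer = b""

    def push(self, packet: bytes) -> list[bytes]:
        self.buffer += packet
        chunks = []
        while len(self.buffer) >= CHUNK_BYTES:
            chunks.append(self.buffer[:CHUNK_BYTES])
            self.buffer = self.buffer[CHUNK_BYTES:]
        return chunks
```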
VAD Integration
Attach a VAD (Voice Activity Detection) model to detect end-of-speech. Silero-VAD and webrtcvad are both reliable choices. Key settings for voice agents:
- End-of-speech silence threshold: 200-250ms for fast-paced conversation, 350-400ms for customer service where longer pauses are normal mid-turn
- Pre-speech buffer: retain the 300-400ms before VAD triggers to capture the start of utterances that ramp slowly
- Barge-in buffer: keep the last 3-4 audio chunks in a rolling buffer so the model can be interrupted without losing the last partial output
For full-duplex models (Moshi, Hertz-dev), VAD end-of-speech detection is less critical because the model handles overlap natively. For CSM-1B, which requires a separate LLM, the VAD turn boundary is the handoff point to the text pipeline.
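A minimal end-of-speech gate with Silero-VAD is sketched below, assuming the silero-vad pip package and 16kHz float32 audio. The 300ms silence threshold is the knob described above; Silero scores fixed 512-sample (32ms) windows:

```python
import torch
from silero_vad import load_silero_vad

vad_model = load_silero_vad()
WINDOW = 512                    # Silero scores 512-sample (32ms) windows at 16kHz
SILENCE_MS_TO_END_TURN = 300.0  # 200-250ms for fast chat, 350-400ms for support calls


class EndOfSpeechDetector:
    """Accumulates silence duration and signals when the user's turn looks finished."""

    def __init__(self):
        self.silence_ms = 0.0

    def push_chunk(self, chunk_16k: torch.Tensor) -> bool:
        # chunk_16k: float32 PCM in [-1, 1], e.g. one 80ms (1,280-sample) chunk
        for i in range(0, chunk_16k.numel() - WINDOW + 1, WINDOW):
            prob = vad_model(chunk_16k[i:i + WINDOW], 16000).item()
            self.silence_ms = 0.0 if prob > 0.5 else self.silence_ms + 32.0
        return self.silence_ms >= SILENCE_MS_TO_END_TURN
```

Keep one detector (and ideally one VAD model instance) per session, since both hold internal state between calls.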
Token Interleaving (Moshi-Specific)
Moshi's depth transformer generates the 8 Mimi codebook streams per frame: one semantic codebook plus 7 acoustic residual codebooks, emitted alongside the inner monologue text tokens produced by the main transformer. The acoustic streams feed the Mimi decoder in real time. This interleaved generation means audio output starts before the full semantic sequence is complete, lowering TTFA relative to generating the text sequence first and then synthesizing.
In practical terms: run Moshi inside its streaming context and consume output tokens as they arrive rather than accumulating the full response; the WebSocket handler below does exactly this.
WebSocket vs WebRTC Transport
WebSocket (via FastAPI or aiohttp) is simpler to deploy and sufficient for sub-200ms transport on co-located or low-latency cloud connections. WebRTC adds complexity but enables p2p paths and DTLS encryption for end-user facing deployments.
For initial deployments, use WebSocket. WebRTC is warranted when transport latency itself becomes a bottleneck, typically when client-to-server distance exceeds 50ms RTT.
Minimal FastAPI WebSocket handler for Moshi streaming:
```python
from fastapi import FastAPI, WebSocket
import torch

from moshi.models import loaders

app = FastAPI()
device = "cuda"
moshi_weight = "./moshi-weights"


@app.on_event("startup")
async def load_model():
    app.state.moshi = loaders.get_moshi_lm(moshi_weight, device=device)
    app.state.moshi.eval()


@app.websocket("/ws/voice")
async def voice_session(ws: WebSocket):
    await ws.accept()
    model = app.state.moshi  # shared model weights; never call model.step() directly on it
    chunk_size = 1280  # 80ms at 16kHz
    buffer = b""
    # model.streaming() creates an isolated session with its own KV cache and
    # Mimi audio ring buffer per connection. Without this, concurrent clients
    # calling step() on the same shared object interleave writes into shared
    # buffers, corrupting audio output for all sessions.
    with model.streaming(batch_size=1) as session:
        with torch.no_grad():
            async for incoming in ws.iter_bytes():
                buffer += incoming
                while len(buffer) >= chunk_size * 2:  # int16 = 2 bytes per sample
                    pcm_bytes = buffer[: chunk_size * 2]
                    buffer = buffer[chunk_size * 2 :]
                    # copy into a writable bytearray so torch.frombuffer does not
                    # complain about the read-only bytes object
                    audio_tensor = torch.frombuffer(bytearray(pcm_bytes), dtype=torch.int16).float()
                    audio_tensor = audio_tensor / 32768.0  # normalize to [-1, 1]
                    output_audio = session.step(audio_tensor.unsqueeze(0).to(device))
                    if output_audio is not None:
                        pcm_out = (output_audio.squeeze(0).cpu().clamp(-1.0, 1.0) * 32767).short()
                        await ws.send_bytes(pcm_out.numpy().tobytes())
```
Production Deployment on Spheron GPU Cloud
Provision Your Instance
Log in to app.spheron.ai, select your GPU tier (L40S or H100 SXM5 for production S2S), choose on-demand or spot, and deploy. SSH setup and key management are covered in the Spheron documentation. Allocate at least 100GB SSD for model weights and audio buffers.
Install Moshi
```bash
pip install moshi torch torchaudio --index-url https://download.pytorch.org/whl/cu121
huggingface-cli download kyutai/moshiko-pytorch-bf16 --local-dir ./moshi-weights
```
Moshi loads in bfloat16 by default. The full model checkpoint is approximately 14GB on disk. Load time on an H100 SXM5 with NVMe storage is roughly 25-35 seconds.
For production, extend the minimal handler above with:
- Session ID tracking (one moshi.step() state per client)
- Silero-VAD pre-processing to gate audio chunks to the model
- Health check endpoint for load balancer probing
- Prometheus metrics for TTFA p50/p95 and VRAM utilization (this and the health check are sketched after the list)
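The last two items might look like the sketch below, assuming the prometheus-client package; in practice these routes would be added to the same FastAPI app as the WebSocket handler rather than a fresh one:

```python
from fastapi import FastAPI
from prometheus_client import Histogram, make_asgi_app

app = FastAPI()

# TTFA histogram with buckets around the latencies that matter for voice
TTFA_SECONDS = Histogram(
    "s2s_time_to_first_audio_seconds",
    "Time from end of user speech to first generated audio chunk",
    buckets=(0.1, 0.15, 0.2, 0.3, 0.5, 1.0),
)


@app.get("/healthz")
async def healthz():
    # Load balancer probe: report whether the model finished loading at startup
    return {"status": "ok", "model_loaded": hasattr(app.state, "moshi")}


# Prometheus scrape endpoint; p50/p95 are derived from the histogram buckets
app.mount("/metrics", make_asgi_app())

# Inside the WebSocket loop, record TTFA around the first audio-producing step:
#   start = time.perf_counter()
#   ... session.step(...) returns its first non-None output ...
#   TTFA_SECONDS.observe(time.perf_counter() - start)
```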
Deploy Sesame CSM-1B
Sesame CSM requires a separate LLM for response generation. The CSM model handles only the audio synthesis step.
```bash
git clone https://github.com/SesameAILabs/csm
cd csm && pip install -r requirements.txt
```
Single-turn inference:
```python
import torchaudio

from generator import load_csm_1b, Segment

model = load_csm_1b(device="cuda")

# Generate speech from LLM response text
audio = model.generate(
    text="Your LLM response text here.",
    speaker=0,
    context=[],  # pass prior audio segments for context-aware prosody
    max_audio_length_ms=10000,
)
torchaudio.save("response.wav", audio.unsqueeze(0).cpu(), model.sample_rate)
```
For a production LLM serving setup to feed text to CSM, see the LLM deployment guide.
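Continuing the snippet above, the context-aware prosody described earlier comes from passing prior turns as Segment objects. The field names below follow the repository's example and should be checked against the version you install:

```python
# Condition prosody on a prior user turn: each Segment pairs a transcript with its audio
prev_audio, sr = torchaudio.load("previous_user_turn.wav")
prev_audio = torchaudio.functional.resample(
    prev_audio.squeeze(0), orig_freq=sr, new_freq=model.sample_rate
)

context = [
    Segment(text="Hey, can you check my order status?", speaker=1, audio=prev_audio),
]

audio = model.generate(
    text="Sure, give me one second while I pull that up.",
    speaker=0,
    context=context,  # prosody now follows the conversational style of the prior turn
    max_audio_length_ms=10_000,
)
torchaudio.save("response_in_context.wav", audio.unsqueeze(0).cpu(), model.sample_rate)
```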
Deploy Hertz-dev
Hertz-dev is not distributed as a HuggingFace transformers checkpoint. Checkpoints are downloaded automatically by the official inference scripts from https://ckpt.si.inc/hertz-dev/. Clone the official repository and run inference from there:
```bash
git clone https://github.com/Standard-Intelligence/hertz-dev
cd hertz-dev
pip install -r requirements.txt
# Checkpoints download automatically to ./ckpt on first run
python demo.py
```
The demo.py script handles checkpoint fetching and provides a working real-time audio inference loop. For server deployments, adapt the audio I/O in demo.py to read from a WebSocket instead of a local microphone.
Published inference benchmarks for Hertz-dev are limited as of April 2026. Run your own TTFA benchmarks at your target concurrency before committing to a GPU tier.
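Until published numbers exist, a rough TTFA probe over a WebSocket interface like the one defined earlier is enough to compare GPU tiers. This sketch assumes the websockets package and the /ws/voice endpoint from the Moshi handler; it times the gap between the end of a sent utterance and the first audio bytes returned, and should be run with several parallel connections to match your target concurrency:

```python
import asyncio
import time

import numpy as np
import websockets

CHUNK_BYTES = 1280 * 2  # 80ms of 16kHz int16 PCM


async def measure_ttfa(url: str = "ws://localhost:8000/ws/voice", utterance_s: float = 2.0) -> float:
    async with websockets.connect(url) as ws:
        # A synthetic utterance (silence) is enough for a latency probe
        pcm = np.zeros(int(16_000 * utterance_s), dtype=np.int16).tobytes()
        for i in range(0, len(pcm), CHUNK_BYTES):
            await ws.send(pcm[i:i + CHUNK_BYTES])
        sent_at = time.perf_counter()
        await ws.recv()  # first audio bytes back from the model
        return (time.perf_counter() - sent_at) * 1000.0


if __name__ == "__main__":
    print(f"TTFA: {asyncio.run(measure_ttfa()):.0f} ms")
```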
Spheron Pricing and Cost-Per-Session Analysis
Live pricing as of 28 Apr 2026, per GPU per hour:
| GPU | VRAM | On-demand $/hr | Spot $/hr | Moshi sessions | OD cost/session | Spot cost/session |
|---|---|---|---|---|---|---|
| RTX 4090 | 24GB | $0.79 | N/A | 1 | $0.79 | N/A |
| L40S | 48GB | $0.72 | $0.32 | 2-3 | $0.24-0.36 | $0.11-0.16 |
| A100 80G SXM4 | 80GB | $1.64 | $0.45 | 3 | $0.55 | $0.15 |
| H100 SXM5 | 80GB | $2.90 | $0.80 | 3-4 | $0.73-0.97 | $0.20-0.27 |
| H200 SXM5 | 141GB | $3.96 | $1.19 | 7 | $0.57 | $0.17 |
| RTX PRO 6000 | 96GB | $1.70 | $0.59 | 4 | $0.43 | $0.15 |
The L40S offers the best on-demand per-session economics for Moshi: 3 concurrent sessions at $0.24/session/hr on-demand, or $0.11/session/hr at spot. The A100 80G SXM4 and RTX PRO 6000 are both strong spot options at $0.15/session/hr for interruption-tolerant workloads.
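The per-session columns are just the hourly rate divided by concurrent sessions; a quick check against the L40S row (prices change, so plug in live rates):

```python
def cost_per_session(hourly_usd: float, sessions: int) -> float:
    return hourly_usd / sessions

print(round(cost_per_session(0.72, 3), 2))  # L40S on-demand, 3 Moshi sessions -> 0.24
print(round(cost_per_session(0.32, 3), 2))  # L40S spot, 3 Moshi sessions      -> 0.11
```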
On-demand vs spot for voice agents: Use on-demand for always-on agents where an interruption mid-call is unacceptable. Use spot for batch voice processing (automated calls, voice-overs, synthetic data generation) where jobs can checkpoint and restart.
Pricing fluctuates based on GPU availability. The prices above are based on 28 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Benchmarks: TTFA, Interruptions, and Full-Duplex Behavior
| Model | TTFA p50 (ms) | TTFA p95 (ms) | Full-duplex | Interruption recovery |
|---|---|---|---|---|
| Moshi (H100 SXM5, 1 session) | ~200ms | ~250ms | Yes | ~50ms (state reset) |
| Sesame CSM-1B + LLM (H100) | ~150ms (TTS only) | ~200ms | No | N/A |
| Hertz-dev (~8.5B, H100) | ~300ms (est.) | ~400ms (est.) | Yes | Limited public data |
| Cascaded ASR+LLM+TTS (H100) | ~300ms | ~500-700ms | No | 200-400ms |
Moshi's p95 degrades at higher concurrency. At 3-4 sessions on an H100 SXM5, expect p95 TTFA to rise to 350-500ms as KV cache competition increases. At 7 sessions on an H200, the pattern is similar to Moshi at 4 sessions on H100. Always benchmark at your target concurrency, not at single-session idle.
Cascaded pipeline TTFA from the Voice AI GPU Infrastructure guide sits at 400-800ms for typical production loads. Moshi at 1-2 sessions on an H100 comfortably beats this. The advantage narrows as concurrency grows.
Sesame CSM-1B's 150ms TTFA is for the synthesis step only. Add LLM TTFT (150-300ms) and ASR time (30-80ms) and the total is comparable to a cascaded pipeline. CSM's advantage is expressiveness and prosody quality, not raw latency.
Pitfalls in Production S2S Deployment
Barge-In and Prosody Collapse
Interrupting Moshi mid-generation by sending new audio while the model is still outputting can leave the session in an inconsistent audio token state. The audio buffer may contain partial codebook sequences that the decoder cannot cleanly stop. Implement a graceful session reset on barge-in: discard the last partial output, flush the audio buffer, and restart the model's generation state. The 3-4 chunk rolling buffer handles this.
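A sketch of that reset path, with the rolling buffer of recent output chunks; reset_state() is a placeholder for whatever hook your serving wrapper exposes (with the moshi package, tearing down and re-entering the model.streaming() context has the same effect):

```python
from collections import deque


class BargeInHandler:
    """Keeps the last few output chunks and resets generation state on interruption."""

    def __init__(self, session, buffer_chunks: int = 4):
        self.session = session  # streaming session object from model.streaming()
        self.recent_output = deque(maxlen=buffer_chunks)

    def on_output_chunk(self, chunk: bytes):
        self.recent_output.append(chunk)

    def on_barge_in(self):
        # 1. Discard the partial output still buffered for playback
        self.recent_output.clear()
        # 2. Flush audio state and restart generation so the decoder does not
        #    resume from a half-emitted codebook sequence (placeholder hook)
        self.session.reset_state()
```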
VRAM Fragmentation
KV cache and audio ring buffers grow unbounded in long sessions. A 10-minute voice session at 12.5 tokens/s generates 7,500 audio tokens. Set max_session_tokens=2000 (roughly 160 seconds of audio) and implement session rotation or summarization at the limit. Watch nvidia-smi dmon -s m under sustained load to detect fragmentation before it causes OOM errors.
Multilingual Gaps
Moshi is predominantly English. CSM-1B has limited multilingual coverage. If you need multilingual S2S (Chinese, Spanish, German), GLM-Voice is the better architecture choice. Do not deploy Moshi for multilingual production workloads expecting Whisper-level language coverage.
Prosody Degradation on Long Turns
Both Moshi and Hertz-dev show prosody drift in responses longer than 30 seconds. The model's attention window struggles to maintain consistent intonation across a very long generation. Split long responses into segments at natural sentence boundaries and re-initialize generation context between segments.
VAD False-Positive Barge-In
Aggressive VAD thresholds (silence duration under 200ms) in noisy environments cause false end-of-speech triggers mid-sentence. The model cuts off and resets, producing garbled responses from the user's perspective. Tune silence duration to 300-400ms for environments with background noise. For office or call center environments, 250ms is usually acceptable.
When to Stay on Cascaded Pipelines
S2S is not a replacement for every voice use case. Keep ASR+LLM+TTS when:
| Use case | Recommended approach | Reason |
|---|---|---|
| Multilingual (10+ languages) | Cascaded (Whisper + LLM + TTS) | Moshi/Hertz-dev do not match Whisper language coverage |
| 70B+ LLM required | Cascaded | No S2S model uses a 70B backbone |
| RAG or tool use in the loop | Cascaded | S2S models cannot call external tools mid-generation |
| Fine-tuned voice persona | Cascaded with XTTS v2 or CSM | Voice cloning needs TTS, not full S2S |
| Sub-150ms TTFA target | Cascaded (NeuTTS Air + fast LLM) | Optimized TTS paired with a small, fast LLM beats S2S at low concurrency |
The Whisper production ASR guide covers the ASR layer for cascaded pipelines in detail. For the TTS layer, see the NeuTTS Air deployment guide and the open-source TTS GPU comparison.
Spheron GPU Cloud gives voice agent teams bare-metal H100 and L40S instances at a fraction of hyperscaler cost, with per-minute billing and no reserved commitments required for on-demand workloads.
