The 300ms wall is real. Voice agents that respond in under 300ms feel natural; at 500ms they feel delayed; above 700ms they feel broken. A cascaded ASR+LLM+TTS pipeline, even when optimized, stacks three serial latency contributors: speech recognition (30-80ms), LLM first token (150-300ms), and TTS first audio chunk (50-100ms). That is a minimum of 230ms before you account for audio transport, tokenization overhead, and sentence boundary detection. At production load, p95 typically lands between 400-800ms.
Unified speech-to-speech models take a different approach. Audio goes in, audio comes out, with no text serialization step between them. One tokenization pass, one model, one output stream. For the GPU requirements of a cascaded voice pipeline and how each stage competes for VRAM, see the Voice AI GPU Infrastructure guide. For the ASR layer specifically, the Whisper production deployment guide covers faster-whisper, streaming chunking, and speaker diarization. This post covers the alternative: unified S2S models that skip the text bottleneck entirely.
What Unified Speech-to-Speech Models Are
A unified S2S model takes raw audio frames as input and generates raw audio frames as output. No transcript, no text tokens, no separate text-to-speech synthesis step. The model processes acoustic features directly and generates audio through a learned audio codec.
A cascaded pipeline has two serialization points: the ASR-to-LLM boundary (speech to text) and the LLM-to-TTS boundary (text to audio). Each serialization point forces the preceding stage to complete before the next can begin. Text must be fully decoded before TTS synthesis starts. Both conversions add latency and introduce potential for quality degradation.
Unified S2S eliminates both boundaries. The model learns a joint representation over speech and language, so it can start generating audio output while still processing input, rather than waiting for a transcript. This also enables full-duplex operation: the model can listen and speak simultaneously, which is how human conversation actually works.
The distinction between full-duplex and half-duplex S2S matters in production. Full-duplex models (Moshi, Hertz-dev) can handle barge-in natively because they are always processing input. Half-duplex S2S models wait for the speaker to finish before generating output, which still eliminates the text bottleneck but does not support interruption without a VAD-triggered hard stop.
The Latency Budget: Why Sub-300ms Requires End-to-End Audio Tokenization
The cascaded pipeline latency breakdown:
| Stage | Best case | Typical | Notes |
|---|---|---|---|
| ASR (faster-whisper, INT8) | 20ms | 40-80ms | Per 3-5s chunk; greedy decode |
| LLM time-to-first-token | 80ms | 150-300ms | 7B model, H100 |
| TTS first audio chunk | 40ms | 50-100ms | Kokoro/NeuTTS Air |
| Audio transport overhead | 10ms | 20-60ms | WebSocket, no WebRTC |
| Total | 150ms | 260-540ms | Single concurrent session |
At production concurrency (8+ sessions on a single GPU), TTFT for the LLM climbs as KV cache competition increases. The p95 for a fully optimized cascaded pipeline on an H100 with 8 sessions sits around 500-700ms, not 300ms.
S2S models collapse the table differently. With Moshi on an H100:
| Stage | Time |
|---|---|
| Audio tokenization (Mimi codec) | 15ms |
| Joint speech LM forward pass | 80-120ms |
| Audio de-tokenization | 10ms |
| First audio chunk out | 105-145ms |
The depth transformer that handles acoustic token prediction runs in parallel with the primary text stream, not sequentially after it. This interleaved generation is why S2S time-to-first-audio (TTFA) can land under 200ms on well-provisioned hardware.
The 2026 S2S Model Landscape
Moshi (Kyutai)
Moshi is a 7B joint speech language model with a dual-stream architecture: an inner monologue stream that generates text (not sent to the user) and an audio stream that generates speech. The text stream functions as chain-of-thought reasoning, improving response coherence without adding a separate LLM inference call.
The audio codec is Mimi, operating at 12.5 frames per second, with each frame carrying one semantic token plus 7 acoustic residual codebook tokens. A small depth transformer predicts the residual codebooks within each frame step, so all 8 codebook streams are emitted frame by frame alongside the text stream rather than after the full response. This is the architectural reason Moshi's TTFA is fast relative to a cascaded pipeline.
- VRAM: 16-20GB FP16 (bfloat16 in practice)
- License: CC BY 4.0
- Full-duplex: Yes, natively handles simultaneous listen and speak
- Language: English (primary); limited multilingual
Sesame CSM-1B (Sesame AI Labs)
Sesame CSM (Conversational Speech Model) is a 1B-parameter model built on a Llama-style backbone with an audio decoder. It takes text and optional audio context as input and generates expressive speech output. It is not a full S2S model in the strict sense: it still needs text from an LLM. Think of it as TTS with S2S-like expressiveness, trained on conversational audio to match prosody and emotion to context rather than just text instructions.
CSM-1B's strength is context-aware voice synthesis. You can pass several turns of audio history and it conditions its output prosody on the conversation style, not just the text. This produces notably more natural-sounding voice responses than conventional TTS models in conversational settings.
- VRAM: 6-8GB (1B parameters at FP16)
- License: Apache 2.0
- Full-duplex: No, requires a separate LLM
- Language: English; limited multilingual
Hertz-dev (Standard Intelligence)
Hertz-dev is a transformer-based real-time audio model (~8.5B parameters) trained on duplex conversational audio data. It processes raw audio input and generates audio output directly, making it a true full-duplex S2S model. Standard Intelligence released Hertz-dev in November 2024 and benchmarks across diverse hardware and concurrency configurations are still limited as of April 2026. The architecture is similar in intent to Moshi but trained on a different corpus with a different tokenization approach.
Treat Hertz-dev as research-grade for production: the model architecture is sound and inference demos work, but systematic benchmarks across concurrency levels and hardware configurations are not yet widely published. VRAM estimates (8-12GB) are derived from parameter count, not confirmed production benchmarks.
- VRAM: 8-12GB (estimate)
- License: Apache 2.0
- Full-duplex: Yes
- Language: English
Other Models Worth Knowing
GLM-Voice (Tsinghua): 9B parameters, text-guided voice generation, strong multilingual support including code-switching between Chinese and English. VRAM: ~18GB FP16. Better choice than Moshi for multilingual S2S use cases.
Mini-Omni-2: Lightweight omni model (~1.5B), low VRAM (~4GB), trades voice quality for footprint. Usable on an RTX 4090 for dev and testing; not recommended for production voice agents due to quality limitations.
Model Comparison Table
| Model | Params | VRAM FP16 | VRAM INT8 | Full-Duplex | License |
|---|---|---|---|---|---|
| Moshi | 7B | 16-20GB | 8-10GB | Yes | CC BY 4.0 |
| Sesame CSM-1B | 1B | 6-8GB | 3-4GB | No (needs LLM) | Apache 2.0 |
| Hertz-dev | ~8.5B | 8-12GB (est.) | 5-7GB (est.) | Yes | Apache 2.0 |
| GLM-Voice | 9B | ~18GB | ~9GB | Partial | Apache 2.0 |
| Mini-Omni-2 | ~1.5B | ~4GB | ~2GB | Partial | Apache 2.0 |
GPU VRAM Requirements and Concurrency Math
The concurrency formula is:
```text
max_sessions = floor((total_vram_gb - 2) / per_session_vram_gb)
```
The 2GB overhead covers CUDA context, OS, and framework buffers. Audio ring buffers for active sessions grow with session duration; monitor VRAM during load testing, not just during model load.
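A quick worked example of the formula, using the 16-20GB per-session range for Moshi from the model table above (these are estimates, not measured allocations):

```python
import math

def max_sessions(total_vram_gb: float, per_session_vram_gb: float, overhead_gb: float = 2.0) -> int:
    """Reserve fixed CUDA/OS/framework overhead, then divide the remaining VRAM per session."""
    return math.floor((total_vram_gb - overhead_gb) / per_session_vram_gb)

# Moshi on an 80GB H100 SXM5: 3 sessions at the heavy 20GB estimate, 4 at 16GB
print(max_sessions(80, 20))  # -> 3
print(max_sessions(80, 16))  # -> 4
```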
| GPU | Total VRAM | Moshi sessions | CSM-1B sessions | Hertz-dev sessions |
|---|---|---|---|---|
| RTX 4090 | 24GB | 1 | 2-3 | 1 |
| L40S | 48GB | 2-3 | 5-7 | 3-4 |
| H100 SXM5 | 80GB | 3-4 | 9-11 | 5-6 |
| H200 SXM5 | 141GB | 7-8 | 17-19 | 11-12 |
The RTX 4090 handles a single Moshi session with no headroom for failover. Production deployments should start with an L40S on Spheron for cost-effective concurrency, or an H100 SXM5 instance for higher-throughput workloads. For general VRAM sizing methodology across model sizes, see the GPU memory requirements for LLM inference guide.
VRAM fragmentation is a real concern for long-session S2S workloads. Moshi's Mimi codec maintains an audio ring buffer per active session. Each buffer grows up to max_session_tokens * bytes_per_token. Set a hard max_session_tokens limit (2,000 tokens is a reasonable starting point, roughly 160 seconds at 12.5 tok/s) and reset session state on natural conversation turns. Use nvidia-smi dmon -s m to watch device memory growth during load testing.
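One way to enforce the cap is a per-session token counter that flags when a reset is due. A minimal sketch; the reset itself depends on your serving wrapper (with the moshi package, tearing down and re-entering the streaming context has the same effect):

```python
MIMI_TOKENS_PER_SECOND = 12.5
MAX_SESSION_TOKENS = 2_000  # roughly 160 seconds of audio


class SessionTokenBudget:
    """Tracks audio tokens consumed by one session and signals when to reset state."""

    def __init__(self, max_tokens: int = MAX_SESSION_TOKENS):
        self.max_tokens = max_tokens
        self.used = 0.0

    def add_chunk(self, chunk_seconds: float = 0.08) -> bool:
        """Account for one 80ms chunk; returns True once a state reset is due."""
        self.used += chunk_seconds * MIMI_TOKENS_PER_SECOND
        return self.used >= self.max_tokens

    def reset_on_turn_boundary(self):
        self.used = 0.0
```

Calling reset_on_turn_boundary() at natural conversation turns keeps the hard cap from triggering mid-utterance.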
Streaming Inference Architecture
Audio Chunk Sizing and RTC Alignment
WebRTC uses 20ms packet intervals. For S2S models operating on audio frames, 80ms chunks (1,280 samples at 16kHz) align well with WebRTC frame boundaries (4 packets), keep per-chunk inference overhead low, and match Moshi's Mimi codec internal frame size. Going below 40ms increases overhead without meaningful TTFA improvement. Going above 160ms increases perceived input lag.
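Because WebRTC delivers 20ms packets and the model wants 80ms frames, the transport layer typically batches four packets per inference step. A minimal accumulator, assuming 16kHz mono int16 PCM:

```python
SAMPLE_RATE = 16_000
CHUNK_MS = 80
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 1,280 samples
CHUNK_BYTES = CHUNK_SAMPLES * 2                 # int16 -> 2,560 bytes


class ChunkAccumulator:
    """Batches 20ms WebRTC packets into 80ms model-sized chunks."""

    def __init__(self):
        self.buffer = b""

    def push(self, packet: bytes) -> list[bytes]:
        self.buffer += packet
        chunks = []
        while len(self.buffer) >= CHUNK_BYTES:
            chunks.append(self.buffer[:CHUNK_BYTES])
            self.buffer = self.buffer[CHUNK_BYTES:]
        return chunks
```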
VAD Integration
Attach a VAD (Voice Activity Detection) model to detect end-of-speech. Silero-VAD and webrtcvad are both reliable choices. Key settings for voice agents:
- End-of-speech silence threshold: 200-250ms for fast-paced conversation, 350-400ms for customer service where longer pauses are normal mid-turn
- Pre-speech buffer: retain the 300-400ms before VAD triggers to capture the start of utterances that ramp slowly
- Barge-in buffer: keep the last 3-4 audio chunks in a rolling buffer so the model can be interrupted without losing the last partial output
For full-duplex models (Moshi, Hertz-dev), VAD end-of-speech detection is less critical because the model handles overlap natively. For CSM-1B, which requires a separate LLM, the VAD turn boundary is the handoff point to the text pipeline.
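A minimal end-of-speech gate with Silero-VAD is sketched below, assuming the silero-vad pip package and 16kHz float32 audio. The 300ms silence threshold is the knob described above; Silero scores fixed 512-sample (32ms) windows:

```python
import torch
from silero_vad import load_silero_vad

vad_model = load_silero_vad()
WINDOW = 512                    # Silero scores 512-sample (32ms) windows at 16kHz
SILENCE_MS_TO_END_TURN = 300.0  # 200-250ms for fast chat, 350-400ms for support calls


class EndOfSpeechDetector:
    """Accumulates silence duration and signals when the user's turn looks finished."""

    def __init__(self):
        self.silence_ms = 0.0

    def push_chunk(self, chunk_16k: torch.Tensor) -> bool:
        # chunk_16k: float32 PCM in [-1, 1], e.g. one 80ms (1,280-sample) chunk
        for i in range(0, chunk_16k.numel() - WINDOW + 1, WINDOW):
            prob = vad_model(chunk_16k[i:i + WINDOW], 16000).item()
            self.silence_ms = 0.0 if prob > 0.5 else self.silence_ms + 32.0
        return self.silence_ms >= SILENCE_MS_TO_END_TURN
```

Keep one detector (and ideally one VAD model instance) per session, since both hold internal state between calls.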
Token Interleaving (Moshi-Specific)
Moshi's depth transformer generates the 8 Mimi codebook streams per frame: one semantic codebook plus 7 acoustic residual codebooks, emitted alongside the inner monologue text tokens produced by the main transformer. The acoustic streams feed the Mimi decoder in real time. This interleaved generation means audio output starts before the full semantic sequence is complete, lowering TTFA relative to generating the text sequence first and then synthesizing.
In practical terms: run Moshi inside its streaming context and consume output tokens as they arrive rather than accumulating the full response; the WebSocket handler below does exactly this.
WebSocket vs WebRTC Transport
WebSocket (via FastAPI or aiohttp) is simpler to deploy and sufficient for sub-200ms transport on co-located or low-latency cloud connections. WebRTC adds complexity but enables p2p paths and DTLS encryption for end-user facing deployments.
For initial deployments, use WebSocket. WebRTC is warranted when transport latency itself becomes a bottleneck, typically when client-to-server distance exceeds 50ms RTT.
Minimal FastAPI WebSocket handler for Moshi streaming:
```python
from fastapi import FastAPI, WebSocket
import torch

from moshi.models import loaders

app = FastAPI()
device = "cuda"
moshi_weight = "./moshi-weights"


@app.on_event("startup")
async def load_model():
    app.state.moshi = loaders.get_moshi_lm(moshi_weight, device=device)
    app.state.moshi.eval()


@app.websocket("/ws/voice")
async def voice_session(ws: WebSocket):
    await ws.accept()
    model = app.state.moshi  # shared model weights; never call model.step() directly on it
    chunk_size = 1280  # 80ms at 16kHz
    buffer = b""
    # model.streaming() creates an isolated session with its own KV cache and
    # Mimi audio ring buffer per connection. Without this, concurrent clients
    # calling step() on the same shared object interleave writes into shared
    # buffers, corrupting audio output for all sessions.
    with model.streaming(batch_size=1) as session:
        with torch.no_grad():
            async for incoming in ws.iter_bytes():
                buffer += incoming
                while len(buffer) >= chunk_size * 2:  # int16 = 2 bytes per sample
                    pcm_bytes = buffer[: chunk_size * 2]
                    buffer = buffer[chunk_size * 2 :]
                    # copy into a writable bytearray so torch.frombuffer does not
                    # complain about the read-only bytes object
                    audio_tensor = torch.frombuffer(bytearray(pcm_bytes), dtype=torch.int16).float()
                    audio_tensor = audio_tensor / 32768.0  # normalize to [-1, 1]
                    output_audio = session.step(audio_tensor.unsqueeze(0).to(device))
                    if output_audio is not None:
                        pcm_out = (output_audio.squeeze(0).cpu().clamp(-1.0, 1.0) * 32767).short()
                        await ws.send_bytes(pcm_out.numpy().tobytes())
```
Production Deployment on Spheron GPU Cloud
Provision Your Instance
Log in to app.spheron.ai, select your GPU tier (L40S or H100 SXM5 for production S2S), choose on-demand or spot, and deploy. SSH setup and key management are covered in the Spheron documentation. Allocate at least 100GB SSD for model weights and audio buffers.
Install Moshi
```bash
pip install moshi torch torchaudio --index-url https://download.pytorch.org/whl/cu121
huggingface-cli download kyutai/moshiko-pytorch-bf16 --local-dir ./moshi-weights
```
Moshi loads in bfloat16 by default. The full model checkpoint is approximately 14GB on disk. Load time on an H100 SXM5 with NVMe storage is roughly 25-35 seconds.
For production, extend the minimal handler above with:
- Session ID tracking (one moshi.step() state per client)
- Silero-VAD pre-processing to gate audio chunks to the model
- Health check endpoint for load balancer probing
- Prometheus metrics for TTFA p50/p95 and VRAM utilization (this and the health check are sketched after the list)
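The last two items might look like the sketch below, assuming the prometheus-client package; in practice these routes would be added to the same FastAPI app as the WebSocket handler rather than a fresh one:

```python
from fastapi import FastAPI
from prometheus_client import Histogram, make_asgi_app

app = FastAPI()

# TTFA histogram with buckets around the latencies that matter for voice
TTFA_SECONDS = Histogram(
    "s2s_time_to_first_audio_seconds",
    "Time from end of user speech to first generated audio chunk",
    buckets=(0.1, 0.15, 0.2, 0.3, 0.5, 1.0),
)


@app.get("/healthz")
async def healthz():
    # Load balancer probe: report whether the model finished loading at startup
    return {"status": "ok", "model_loaded": hasattr(app.state, "moshi")}


# Prometheus scrape endpoint; p50/p95 are derived from the histogram buckets
app.mount("/metrics", make_asgi_app())

# Inside the WebSocket loop, record TTFA around the first audio-producing step:
#   start = time.perf_counter()
#   ... session.step(...) returns its first non-None output ...
#   TTFA_SECONDS.observe(time.perf_counter() - start)
```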
Deploy Sesame CSM-1B
Sesame CSM requires a separate LLM for response generation. The CSM model handles only the audio synthesis step.
```bash
git clone https://github.com/SesameAILabs/csm
cd csm && pip install -r requirements.txt
```
Single-turn inference:
```python
import torchaudio

from generator import load_csm_1b, Segment

model = load_csm_1b(device="cuda")

# Generate speech from LLM response text
audio = model.generate(
    text="Your LLM response text here.",
    speaker=0,
    context=[],  # pass prior audio segments for context-aware prosody
    max_audio_length_ms=10000,
)
torchaudio.save("response.wav", audio.unsqueeze(0).cpu(), model.sample_rate)
```
For a production LLM serving setup to feed text to CSM, see the LLM deployment guide.
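Continuing the snippet above, the context-aware prosody described earlier comes from passing prior turns as Segment objects. The field names below follow the repository's example and should be checked against the version you install:

```python
# Condition prosody on a prior user turn: each Segment pairs a transcript with its audio
prev_audio, sr = torchaudio.load("previous_user_turn.wav")
prev_audio = torchaudio.functional.resample(
    prev_audio.squeeze(0), orig_freq=sr, new_freq=model.sample_rate
)

context = [
    Segment(text="Hey, can you check my order status?", speaker=1, audio=prev_audio),
]

audio = model.generate(
    text="Sure, give me one second while I pull that up.",
    speaker=0,
    context=context,  # prosody now follows the conversational style of the prior turn
    max_audio_length_ms=10_000,
)
torchaudio.save("response_in_context.wav", audio.unsqueeze(0).cpu(), model.sample_rate)
```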
Deploy Hertz-dev
Hertz-dev is not distributed as a HuggingFace transformers checkpoint. Checkpoints are downloaded automatically by the official inference scripts from https://ckpt.si.inc/hertz-dev/. Clone the official repository and run inference from there:
```bash
git clone https://github.com/Standard-Intelligence/hertz-dev
cd hertz-dev
pip install -r requirements.txt
# Checkpoints download automatically to ./ckpt on first run
python demo.py
```
The demo.py script handles checkpoint fetching and provides a working real-time audio inference loop. For server deployments, adapt the audio I/O in demo.py to read from a WebSocket instead of a local microphone.
Published inference benchmarks for Hertz-dev are limited as of April 2026. Run your own TTFA benchmarks at your target concurrency before committing to a GPU tier.
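Until published numbers exist, a rough TTFA probe over a WebSocket interface like the one defined earlier is enough to compare GPU tiers. This sketch assumes the websockets package and the /ws/voice endpoint from the Moshi handler; it times the gap between the end of a sent utterance and the first audio bytes returned, and should be run with several parallel connections to match your target concurrency:

```python
import asyncio
import time

import numpy as np
import websockets

CHUNK_BYTES = 1280 * 2  # 80ms of 16kHz int16 PCM


async def measure_ttfa(url: str = "ws://localhost:8000/ws/voice", utterance_s: float = 2.0) -> float:
    async with websockets.connect(url) as ws:
        # A synthetic utterance (silence) is enough for a latency probe
        pcm = np.zeros(int(16_000 * utterance_s), dtype=np.int16).tobytes()
        for i in range(0, len(pcm), CHUNK_BYTES):
            await ws.send(pcm[i:i + CHUNK_BYTES])
        sent_at = time.perf_counter()
        await ws.recv()  # first audio bytes back from the model
        return (time.perf_counter() - sent_at) * 1000.0


if __name__ == "__main__":
    print(f"TTFA: {asyncio.run(measure_ttfa()):.0f} ms")
```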
Spheron Pricing and Cost-Per-Session Analysis
Live pricing as of 28 Apr 2026, per GPU per hour:
| GPU | VRAM | On-demand $/hr | Spot $/hr | Moshi sessions | OD cost/session | Spot cost/session |
|---|---|---|---|---|---|---|
| RTX 4090 | 24GB | $0.79 | N/A | 1 | $0.79 | N/A |
| L40S | 48GB | $0.72 | $0.32 | 2-3 | $0.24-0.36 | $0.11-0.16 |
| A100 80G SXM4 | 80GB | $1.64 | $0.45 | 3 | $0.55 | $0.15 |
| H100 SXM5 | 80GB | $2.90 | $0.80 | 3-4 | $0.73-0.97 | $0.20-0.27 |
| H200 SXM5 | 141GB | $3.96 | $1.19 | 7 | $0.57 | $0.17 |
| RTX PRO 6000 | 96GB | $1.70 | $0.59 | 4 | $0.43 | $0.15 |
The L40S offers the best on-demand per-session economics for Moshi: 3 concurrent sessions at $0.24/session/hr on-demand, or $0.11/session/hr at spot. The A100 80G SXM4 and RTX PRO 6000 are both strong spot options at $0.15/session/hr for interruption-tolerant workloads.
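The per-session columns are just the hourly rate divided by concurrent sessions; a quick check against the L40S row (prices change, so plug in live rates):

```python
def cost_per_session(hourly_usd: float, sessions: int) -> float:
    return hourly_usd / sessions

print(round(cost_per_session(0.72, 3), 2))  # L40S on-demand, 3 Moshi sessions -> 0.24
print(round(cost_per_session(0.32, 3), 2))  # L40S spot, 3 Moshi sessions      -> 0.11
```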
On-demand vs spot for voice agents: Use on-demand for always-on agents where an interruption mid-call is unacceptable. Use spot for batch voice processing (automated calls, voice-overs, synthetic data generation) where jobs can checkpoint and restart.
Pricing fluctuates based on GPU availability. The prices above are based on 28 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Benchmarks: TTFA, Interruptions, and Full-Duplex Behavior
| Model | TTFA p50 (ms) | TTFA p95 (ms) | Full-duplex | Interruption recovery |
|---|---|---|---|---|
| Moshi (H100 SXM5, 1 session) | ~200ms | ~250ms | Yes | ~50ms (state reset) |
| Sesame CSM-1B + LLM (H100) | ~150ms (TTS only) | ~200ms | No | N/A |
| Hertz-dev (~8.5B, H100) | ~300ms (est.) | ~400ms (est.) | Yes | Limited public data |
| Cascaded ASR+LLM+TTS (H100) | ~300ms | ~500-700ms | No | 200-400ms |
Moshi's p95 degrades at higher concurrency. At 3-4 sessions on an H100 SXM5, expect p95 TTFA to rise to 350-500ms as KV cache competition increases. At 7 sessions on an H200, the pattern is similar to Moshi at 4 sessions on H100. Always benchmark at your target concurrency, not at single-session idle.
Cascaded pipeline TTFA from the Voice AI GPU Infrastructure guide sits at 400-800ms for typical production loads. Moshi at 1-2 sessions on an H100 comfortably beats this. The advantage narrows as concurrency grows.
Sesame CSM-1B's 150ms TTFA is for the synthesis step only. Add LLM TTFT (150-300ms) and ASR time (30-80ms) and the total is comparable to a cascaded pipeline. CSM's advantage is expressiveness and prosody quality, not raw latency.
Pitfalls in Production S2S Deployment
Barge-In and Prosody Collapse
Interrupting Moshi mid-generation by sending new audio while the model is still outputting can leave the session in an inconsistent audio token state. The audio buffer may contain partial codebook sequences that the decoder cannot cleanly stop. Implement a graceful session reset on barge-in: discard the last partial output, flush the audio buffer, and restart the model's generation state. The 3-4 chunk rolling buffer handles this.
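A sketch of that reset path, with the rolling buffer of recent output chunks; reset_state() is a placeholder for whatever hook your serving wrapper exposes (with the moshi package, tearing down and re-entering the model.streaming() context has the same effect):

```python
from collections import deque


class BargeInHandler:
    """Keeps the last few output chunks and resets generation state on interruption."""

    def __init__(self, session, buffer_chunks: int = 4):
        self.session = session  # streaming session object from model.streaming()
        self.recent_output = deque(maxlen=buffer_chunks)

    def on_output_chunk(self, chunk: bytes):
        self.recent_output.append(chunk)

    def on_barge_in(self):
        # 1. Discard the partial output still buffered for playback
        self.recent_output.clear()
        # 2. Flush audio state and restart generation so the decoder does not
        #    resume from a half-emitted codebook sequence (placeholder hook)
        self.session.reset_state()
```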
VRAM Fragmentation
KV cache and audio ring buffers grow unbounded in long sessions. A 10-minute voice session at 12.5 tokens/s generates 7,500 audio tokens. Set max_session_tokens=2000 (roughly 160 seconds of audio) and implement session rotation or summarization at the limit. Watch nvidia-smi dmon -s m under sustained load to detect fragmentation before it causes OOM errors.
Multilingual Gaps
Moshi is predominantly English. CSM-1B has limited multilingual coverage. If you need multilingual S2S (Chinese, Spanish, German), GLM-Voice is the better architecture choice. Do not deploy Moshi for multilingual production workloads expecting Whisper-level language coverage.
Prosody Degradation on Long Turns
Both Moshi and Hertz-dev show prosody drift in responses longer than 30 seconds. The model's attention window struggles to maintain consistent intonation across a very long generation. Split long responses into segments at natural sentence boundaries and re-initialize generation context between segments.
VAD False-Positive Barge-In
Aggressive VAD thresholds (silence duration under 200ms) in noisy environments cause false end-of-speech triggers mid-sentence. The model cuts off and resets, producing garbled responses from the user's perspective. Tune silence duration to 300-400ms for environments with background noise. For office or call center environments, 250ms is usually acceptable.
When to Stay on Cascaded Pipelines
S2S is not a replacement for every voice use case. Keep ASR+LLM+TTS when:
| Use case | Recommended approach | Reason |
|---|---|---|
| Multilingual (10+ languages) | Cascaded (Whisper + LLM + TTS) | Moshi/Hertz-dev do not match Whisper language coverage |
| 70B+ LLM required | Cascaded | No S2S model uses a 70B backbone |
| RAG or tool use in the loop | Cascaded | S2S models cannot call external tools mid-generation |
| Fine-tuned voice persona | Cascaded with XTTS v2 or CSM | Voice cloning needs TTS, not full S2S |
| Sub-150ms TTFA target | Cascaded (NeuTTS Air + fast LLM) | Optimized TTS paired with a small, fast LLM beats S2S at low concurrency |
The Whisper production ASR guide covers the ASR layer for cascaded pipelines in detail. For the TTS layer, see the NeuTTS Air deployment guide and the open-source TTS GPU comparison.
Spheron GPU Cloud gives voice agent teams bare-metal H100 and L40S instances at a fraction of hyperscaler cost, with per-minute billing and no reserved commitments required for on-demand workloads.
