Engineering

Voice AI GPU Infrastructure: GPU Requirements for Sub-200ms Real-Time Inference

Written by Mitrasish, Co-founder · Mar 11, 2026
Tags: GPU Cloud, Voice AI, ASR, Text-to-Speech, LLM Inference, Real-Time AI, AI Infrastructure, vLLM

A voice AI customer service bot needs to respond in under 500ms total, measured from the moment the user stops speaking to the moment they hear the first word of the response. That 500ms is split across three GPU-dependent stages: transcribing the speech, generating the response, and synthesizing the audio. If any one stage is starved of compute, the entire experience breaks.

This isn't just a TTS problem or an LLM problem. It's a pipeline problem. Whisper handles speech recognition, an LLM generates the response, and a TTS model converts text back to audio. Each stage competes for GPU memory, each has its own latency budget, and each interacts with the others in ways that aren't obvious until you're running at production load. The RAG pipeline case study showed how co-locating AI components on the same bare-metal server dropped p99 latency from 1.8s to 190ms; the same principle applies to voice pipelines.

This guide breaks down exactly what each stage needs, which GPUs to use, and how to build a production voice AI stack that consistently hits sub-500ms end-to-end latency. The GPU recommendations draw from Spheron's experience running voice AI workloads, including NeuTTS Air, which is live on the platform today.

The Three-Stage Voice AI Pipeline

Every real-time voice AI system is a pipeline with three distinct GPU stages. Getting even one of them wrong breaks the experience.

Stage 1: ASR - Speech to Text

Automatic Speech Recognition converts the user's speech into text. The dominant open-source choice is Whisper Large v3, though teams also use Whisper Medium for lower latency or Deepgram's hosted API to skip self-hosting ASR entirely.

Whisper Large v3 model weights are approximately 3 GB on disk (FP16 checkpoint; 1,550M parameters at 2 bytes/param = ~2.9 GB, plus metadata). In VRAM, FasterWhisper uses approximately 3-4 GB at FP16 (raw weights are ~2.9 GB but inference adds activation buffers, KV cache, and framework overhead), or around 2-3 GB with INT8 quantization. For a 5-second audio clip (typical for a voice agent turn), it runs in under 50ms on high-end GPUs like an RTX 4090 or H100 using greedy decoding (beam_size=1); with beam search decoding (beam_size=5, better quality), expect 80-150ms. ASR is the easiest pipeline stage to satisfy; you don't need an H100 here. If you need lower latency, Whisper Large v3 Turbo (released October 2024) delivers comparable accuracy with 8x fewer decoder layers (4 vs. 32), cutting roughly 48% of total parameters (~809M vs. 1,550M) and meaningfully faster inference times.

  • VRAM needed: ~3-4 GB (Whisper Large v3, FP16 with FasterWhisper including inference overhead; ~2-3 GB with INT8 quantization; base weights ~2.9 GB at FP16)
  • Latency target: < 50ms for a 5-second audio clip with greedy decoding (beam_size=1); 80-150ms with beam search (beam_size=5)
  • GPU recommendation: RTX 4090 or RTX 5090 is sufficient; over-provisioning with H100s wastes budget unless you're running hundreds of concurrent streams
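
The VRAM figures above can be reproduced with back-of-envelope arithmetic. A sketch; the 1 GiB overhead allowance is an assumption that roughly matches the FasterWhisper serving numbers quoted here:

```python
def whisper_vram_gib(params_millions: float, bytes_per_param: float,
                     overhead_gib: float = 1.0) -> tuple[float, float]:
    """Estimate raw weight size and serving VRAM for a Whisper checkpoint.

    overhead_gib is an assumed allowance for activation buffers, KV cache,
    and framework overhead; real usage varies with backend and audio length.
    """
    weights_gib = params_millions * 1e6 * bytes_per_param / 2**30
    return weights_gib, weights_gib + overhead_gib

# Whisper Large v3: 1,550M parameters
fp16_weights, fp16_serving = whisper_vram_gib(1550, 2.0)  # ~2.9 GiB weights
int8_weights, int8_serving = whisper_vram_gib(1550, 1.0)  # ~1.4 GiB weights
print(f"FP16: ~{fp16_weights:.1f} GiB weights, ~{fp16_serving:.1f} GiB serving")
```

The same arithmetic explains the Turbo variant's appeal: cutting ~48% of parameters cuts the weight footprint roughly in half before quantization even enters the picture.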

Stage 2: LLM - Response Generation

This is where the pipeline gets expensive. The LLM reads the transcribed text, understands context, and generates the response. It's typically the biggest latency bottleneck in any voice AI system.

The model size tradeoff is stark: 7B models are fast but less capable, 70B models are capable but often too slow for real-time voice without significant optimization. Most self-hosted voice agent deployments land on 7B-13B models to balance quality and latency. Platforms like Vapi and Retell default to frontier cloud models (GPT-4o, Claude) but support custom LLM endpoints for teams running their own open-source models. A well-tuned 8B model on a fast GPU can produce TTFT (time to first token) under 100ms.

Model generation note (as of 2026): The benchmarks in this section use Llama 3.x models (3.1 8B, 3.2 11B, 3.3 70B), which remain the most widely deployed open-source LLMs for voice AI due to mature tooling support across vLLM, llama.cpp, and Ollama.

Note: Llama 3.2 11B (meta-llama/Llama-3.2-11B-Vision-Instruct) is a vision-language model that accepts both text and image inputs; it supports text-only inference when no images are provided, making it suitable for voice pipelines, though teams prioritizing minimal VRAM should be aware that the vision encoder adds overhead compared to a text-only 11B model.

Llama 4 (Scout 17B-16E and Maverick 17B-128E), released by Meta on April 5, 2025, introduced a mixture-of-experts architecture with multimodal (text and image) capabilities; in MoE naming convention the "17B" refers to active parameters per forward pass, not total parameters (Scout has 109B total parameters across 16 experts; Maverick has 400B total across 128 experts). The Llama 3.x series continues to be the dominant choice for production voice agents as of early 2026, particularly for text-only inference where the simpler dense architecture maps more predictably to latency budgets. If your deployment prioritizes multimodal or longer-context scenarios, evaluate Llama 4 Scout or Maverick against your latency requirements.

  • VRAM needed: 5-40GB depending on model size and quantization
  • Latency target: < 300ms TTFT for a typical conversational response
  • GPU recommendation: H100 PCIe for 13B-70B models; RTX 5090 handles 7B with good latency
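
The wide 5-40GB VRAM range above comes straight from model size times bytes per parameter. A quick sketch of the weight-only footprint (KV cache and activations come on top):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def llm_weights_gib(params_billions: float, quant: str) -> float:
    """Weight-only footprint in GiB for a dense LLM.

    int4 here is the pure 4-bit floor; real formats like Q4_K_M spend
    extra bits on scales and mixed precision, so practical footprints
    land somewhat higher.
    """
    return params_billions * 1e9 * BYTES_PER_PARAM[quant] / 2**30

print(round(llm_weights_gib(8, "fp16"), 1))   # Llama 3.1 8B at FP16: ~14.9 GiB
print(round(llm_weights_gib(8, "int4"), 1))   # ~3.7 GiB; low end of the 5-40GB range
print(round(llm_weights_gib(70, "int4"), 1))  # ~32.6 GiB floor for a 70B at 4-bit
```

This is why 7B-13B models dominate self-hosted voice agents: an 8B at INT4 leaves most of a consumer GPU free for KV cache, while a 70B needs datacenter-class VRAM even quantized.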

Stage 3: TTS - Text to Speech

Text-to-speech converts the LLM's response into audio. The critical capability for voice AI is streaming synthesis: don't wait for the full LLM response before starting to speak. Begin generating audio as soon as you have the first sentence.

Popular open-source options in 2026:

  • Kokoro-82M: Extremely fast, minimal VRAM, multilingual (8 languages, 54 voices as of v1.0)
  • XTTS v2: Higher quality, voice cloning support, 17 languages, slower than Kokoro (note: Coqui AI shut down in January 2024; XTTS v2 receives no official updates but remains usable via community forks)
  • NeuTTS Air: 748M total parameters (~552M embedding + active, ~360M active-only) built on a Qwen2-based backbone, 320x real-time on RTX 4090, instant 3-second voice cloning (see the full NeuTTS Air deployment guide)
  • StyleTTS2: High naturalness scores, open-source, good for custom voice work
  • ElevenLabs (hosted): Zero infrastructure overhead, but adds network latency and per-character cost
[User speaks] → [ASR: Whisper] → [LLM: Llama 8B] → [TTS: Kokoro] → [User hears response]
                  30-80ms          150-300ms TTFT    50-100ms to first chunk
  • VRAM needed: 0.3-6GB depending on model
  • Latency target: < 100ms to first audio chunk
  • GPU recommendation: RTX 4090 or RTX 5090; almost any modern GPU handles TTS if the LLM fits

End-to-End Latency Budget

Here's how a real 500ms budget breaks down across the pipeline:

| Stage | Target Latency | GPU Bottleneck? | Optimization Lever |
|---|---|---|---|
| Network (inbound audio) | 10-30ms | No | CDN / edge deployment |
| ASR (Whisper Large v3) | 30-80ms | Moderate | GPU speed + model size |
| LLM (TTFT) | 150-300ms | Yes (primary bottleneck) | GPU speed + model size |
| TTS (first audio chunk) | 50-100ms | Yes | Streaming + sentence buffering |
| Network (outbound audio) | 10-30ms | No | CDN / edge deployment |
| Total | ~250-540ms | | |

The LLM stage is the hardest to optimize without sacrificing response quality. GPU speed directly translates to lower TTFT: a faster GPU means fewer milliseconds before the first token exits. For real-world latency numbers from a production bare-metal deployment, the RAG pipeline case study documents exactly how infrastructure choices affect p99 latency at scale.

Note that these are targets, not guarantees. Real-world numbers depend on model size, context length, concurrent load, and hardware. Your latency will degrade under high concurrency as the GPU becomes compute-bound.
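
The budget totals above can be sanity-checked in a few lines (stage ranges copied from the table in this section):

```python
STAGES_MS = {
    "network_in": (10, 30),
    "asr": (30, 80),
    "llm_ttft": (150, 300),
    "tts_first_chunk": (50, 100),
    "network_out": (10, 30),
}

best = sum(lo for lo, _ in STAGES_MS.values())
worst = sum(hi for _, hi in STAGES_MS.values())
print(f"best case: {best}ms, worst case: {worst}ms")  # 250ms / 540ms
# The worst case already overshoots a 500ms budget, which is why the LLM
# stage, the largest single line item, is the first optimization target.
```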

GPU Recommendations by Pipeline Stage

For ASR (Whisper Large v3)

| GPU | Latency (5s audio clip) | On-Demand (/hr) | Spot (/hr) |
|---|---|---|---|
| RTX 4090 | ~38ms | $0.58 | N/A |
| RTX 5090 | ~23ms | $0.76 | N/A |
| H100 PCIe | ~25ms | $2.01 | $0.80 |

The RTX 4090 is the right call for ASR-only deployments. The H100 costs 3.6x more but doesn't improve ASR latency enough to justify it at low concurrency. If you're running hundreds of simultaneous ASR streams, the H100's higher throughput starts to matter, but that's a throughput problem, not a per-request latency problem.

For LLM Response Generation

| GPU | Model Size | Output Tokens/sec (single-request) | TTFT (500-token context) |
|---|---|---|---|
| RTX 5090 | Llama 3.1 8B | ~110 tok/s | ~85ms |
| H100 PCIe | Llama 3.1 8B | ~110-115 tok/s | ~70ms |
| H100 PCIe | Llama 3.2 11B | ~95 tok/s | ~100ms |
| H100 PCIe | Llama 3.3 70B (INT4) | ~30 tok/s | ~290ms |

For sub-200ms TTFT on a voice agent with typical context lengths (500-1,000 tokens), stay with 7B-11B models. The 70B at ~290ms TTFT already exceeds what many 500ms total budgets can absorb when combined with ASR and TTS.

Note: Llama 3.3 70B at FP16 requires ~140GB VRAM and does not fit on a single H100 PCIe 80GB; the benchmark above uses INT4 quantization (~38-43GB for Q4_K_M, depending on context length and framework overhead), which makes it feasible on one H100 but at reduced output quality.

Output tok/s figures above are single-request (batch=1) numbers; in multi-request throughput mode these GPUs serve orders of magnitude more tokens per second. RTX 5090 figures are based on early 2025 hardware benchmarks; actual results may vary as driver and framework support matures. The RTX 5090's GDDR7 memory bandwidth (~1,792 GB/s) is roughly 10% below the H100 PCIe's HBM2e bandwidth (~2,000 GB/s), so the H100 PCIe typically has a modest edge on memory-bandwidth-bound decode workloads at equivalent model sizes.

The RTX 5090 handles 8B models well and costs significantly less than an H100 PCIe. For teams starting out before scaling, it's a cost-effective first GPU for voice AI. For 11B-class models (such as Llama 3.2 11B) and above, or for teams that need headroom for concurrent users, the H100 PCIe is the right foundation. For a full model-by-model breakdown of VRAM requirements, see the GPU requirements cheat sheet 2026.

For TTS (Kokoro, XTTS v2, NeuTTS Air)

| GPU | Model | Real-Time Factor (RTF) | On-Demand (/hr) | Spot (/hr) |
|---|---|---|---|---|
| RTX 4090 | Kokoro-82M | ~0.04 | $0.58 | N/A |
| RTX 5090 | Kokoro-82M | ~0.025 | $0.76 | N/A |
| RTX 4090 | XTTS v2 | ~0.30 | $0.58 | N/A |
| RTX 4090 | NeuTTS Air | ~0.003 | $0.58 | N/A |

Prices as of 11 Mar 2026. GPU pricing fluctuates over time; check Spheron pricing for live rates. Spot instances are interruptible and subject to availability.

RTF (real-time factor) = compute time per second of generated audio. RTF of 0.04 means you generate 1 second of audio in 40ms (25x faster than real-time). Any RTF < 1.0 supports real-time streaming; RTF < 0.1 gives you meaningful latency headroom.

Kokoro and NeuTTS Air dominate for low-latency voice. XTTS v2's RTF of ~0.3 is real-time capable, but leaves less headroom and takes longer to emit the first audio chunk. NeuTTS Air's numbers are covered in detail in the NeuTTS Air deployment guide: per Neuphonic's benchmarks, 16,194 tokens/sec on an RTX 4090 (320x real-time) means a single GPU can handle hundreds of concurrent TTS streams.
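
The RTF comparisons above reduce to two derived numbers: compute time per second of audio, and the idealized concurrency ceiling. A sketch (1/RTF is a compute-bound upper bound; real stream counts are also limited by VRAM and batching):

```python
def rtf_stats(rtf: float) -> tuple[float, float]:
    """Return (ms of compute per 1s of generated audio, real-time speedup)."""
    return rtf * 1000.0, 1.0 / rtf

for name, rtf in [("Kokoro-82M", 0.04), ("XTTS v2", 0.30), ("NeuTTS Air", 0.003)]:
    ms, speedup = rtf_stats(rtf)
    print(f"{name}: {ms:.0f}ms per 1s of audio ({speedup:.0f}x real-time)")
```

At RTF 0.30, XTTS v2 spends 300ms of compute per second of audio; that still streams in real time, but the first-chunk latency budget gets tight compared to Kokoro's 40ms.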

VRAM Requirements for the Full Pipeline

When you co-locate all three stages on a single GPU (simpler operationally, and often better latency due to zero inter-component network overhead):

| Pipeline Config | Total VRAM Required | Fits on... |
|---|---|---|
| Whisper Medium + Llama 3.1 8B (INT4) + Kokoro | ~10GB | ✅ RTX 4090 (24GB), RTX 5090 (32GB) |
| Whisper Large v3 + Llama 3.2 11B (FP16) + XTTS v2 | ~33-36GB | ✅ H100 80GB (recommended); RTX 5090 is at or over its 32GB limit |
| Whisper Large v3 + Llama 3.3 70B (FP16) + NeuTTS Air | ~145-148GB | ✅ 2x H100 80GB (a single H200's 141GB falls just short at FP16) |

VRAM math for the 11B config: Whisper Large v3 (~5-8 GB at FP16, including inference overhead) + Llama 3.2 11B at FP16 (~22 GB) + XTTS v2 (~6 GB) = ~33-36 GB. The RTX 5090 tops out at 32GB, so this config exceeds its limit; an H100 PCIe 80GB is the right choice here. For the 70B config, you're looking at Llama 70B at FP16 (~140GB) + ASR + TTS (~5-8GB) = ~145-148GB, which requires two H100s with NVLink; a single H200's 141GB falls just short of the FP16 footprint.
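
The co-location arithmetic is just a sum over components. A sketch using the figures from this post (the midpoints are assumptions; real usage shifts with quantization and context length):

```python
# Per-component VRAM (GiB) for the 11B pipeline config described above.
COMPONENTS_GIB = {
    "whisper_large_v3_fp16": 6.5,  # midpoint of the ~5-8 GB serving range
    "llama_3_2_11b_fp16": 22.0,
    "xtts_v2": 6.0,
}

total = sum(COMPONENTS_GIB.values())
print(f"co-located 11B pipeline: ~{total:.1f} GiB")
# ~34.5 GiB: past a 32GB RTX 5090, comfortable on an H100 80GB with
# ~45 GiB left over for KV cache growth across long conversations.
```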

Single GPU is simpler operationally but may not hit latency targets if too much is packed in. Two-GPU setups that split the pipeline (ASR + TTS on one GPU, LLM on the other) often improve latency under concurrent load. For a deeper dive into VRAM planning across model sizes, see the GPU memory requirements guide.

Practical recommendation: start with a single H100 PCIe 80GB. It handles a full 11B-class pipeline (e.g., Llama 3.2 11B) with room for KV cache growth as conversation context accumulates. Add GPUs for scale, not bigger models.

Streaming Configuration for Sub-200ms TTS

The single biggest latency optimization in voice AI is not GPU selection; it's streaming architecture.

The problem with batch TTS: If you wait for the full LLM response before starting TTS synthesis, you add 150-300ms to every interaction before any audio can begin. For a 500ms total budget, that's most of your allocation spent waiting.

The solution: LLM-to-TTS token streaming. Pipe LLM tokens to TTS as they're generated, but don't send individual tokens; TTS models need complete sentences for natural prosody. Buffer incoming tokens, detect sentence boundaries (period, exclamation mark, question mark), and send complete sentences to TTS immediately.

python
import re

async def stream_llm_to_tts(client, messages, tts_queue):
    # Enforce the unbounded-queue invariant up front. The finally block uses
    # put_nowait to deliver the sentinel (avoiding cancellation-interrupted
    # awaits), which raises asyncio.QueueFull if the queue has a maxsize > 0
    # and is currently full. An uncaught QueueFull here means the sentinel is
    # never delivered and any consumer doing `await tts_queue.get()` blocks
    # indefinitely. Asserting maxsize == 0 surfaces the misuse immediately
    # rather than leaving the caller with a silent deadlock.
    assert tts_queue.maxsize == 0, (
        "tts_queue must be an unbounded asyncio.Queue() (maxsize=0). "
        "A bounded queue can silently drop the sentinel and hang the TTS consumer."
    )
    sentence_buffer = ""

    try:
        async with await client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=messages,
            stream=True,
        ) as response:
            async for chunk in response:
                if not chunk.choices:
                    continue
                token = chunk.choices[0].delta.content
                if not token:
                    continue

                sentence_buffer += token

                # Detect sentence boundaries in the accumulated buffer, not just the current token.
                # Checking only token.rstrip()[-1:] fails when punctuation appears mid-token
                # (e.g. ". He" or "world! The"). Instead, search the full buffer and split at
                # the boundary so multi-sentence tokens are handled correctly.
                #
                # The regex matches punctuation ([.!?]) followed by whitespace or a closing
                # quote ([\s"\']). Numeric ranges like "3.5 GB" are safe because the digit
                # after the decimal is not whitespace. This approach may incorrectly split on
                # abbreviations and honorifics (e.g. "Mr. Smith" or "Dr. Jones"). For
                # production use, prefer nltk.sent_tokenize or pysbd which handle abbreviations
                # correctly.
                match = re.search(r'[.!?][\s"\']', sentence_buffer)
                while match:
                    # Use match.end() so closing quote characters (e.g. `."` or `!'`) are
                    # included in the flushed sentence rather than left as a stray leading
                    # character at the start of the next buffer segment.
                    split_at = match.end()
                    completed = sentence_buffer[:split_at].strip()
                    if completed:
                        await tts_queue.put(completed)
                    sentence_buffer = sentence_buffer[split_at:].lstrip()
                    match = re.search(r'[.!?][\s"\']', sentence_buffer)

            # Flush any remaining text at end of response
            if sentence_buffer.strip():
                await tts_queue.put(sentence_buffer.strip())

    finally:
        # Signal completion so the TTS consumer can exit its receive loop.
        # Without this sentinel, any consumer doing `await tts_queue.get()` in a loop
        # will block indefinitely after the last sentence is processed.
        # Using finally guarantees the sentinel is delivered even if an exception
        # occurs during streaming (e.g. network drop, API error).
        #
        # Use put_nowait instead of `await put()` to protect against cancellation:
        # in Python 3.8+, asyncio.CancelledError is a BaseException subclass and
        # propagates through finally blocks. If the coroutine is cancelled while
        # `await tts_queue.put(None)` is pending, the CancelledError interrupts the
        # put, the sentinel is never delivered, and the TTS consumer blocks indefinitely.
        # put_nowait is safe here because the assertion above guarantees the
        # queue is unbounded (maxsize == 0), so it can never raise QueueFull.
        tts_queue.put_nowait(None)

The TTS consumer picks up complete sentences from the queue and synthesizes them as they arrive. Audio playback begins before the LLM finishes generating the full response, exactly the pattern that drops perceived latency under 200ms.

The sentence-boundary approach balances two competing needs: you want to start audio as early as possible (shorter buffer = lower latency), but you need enough context for the TTS model to generate natural prosody (too short = robotic, unnatural cadence). A single complete sentence is the right buffer unit for most TTS models.
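
The producer above only fills the queue; a matching consumer drains it until the None sentinel arrives. A minimal sketch, where `synthesize_and_play` is a hypothetical stand-in for your actual TTS client call (e.g. an HTTP request to a Kokoro or NeuTTS Air server):

```python
import asyncio

async def tts_consumer(tts_queue: asyncio.Queue, synthesize_and_play):
    """Drain complete sentences from the queue until the None sentinel.

    synthesize_and_play is a placeholder for a real TTS client call.
    """
    while True:
        sentence = await tts_queue.get()
        if sentence is None:  # sentinel from the producer's finally block
            break
        await synthesize_and_play(sentence)

# Example wiring with a stub synthesizer:
async def main():
    q = asyncio.Queue()  # unbounded, as the producer's assertion requires
    spoken = []

    async def fake_tts(sentence):
        spoken.append(sentence)

    for s in ["Hello there.", "How can I help?"]:
        q.put_nowait(s)
    q.put_nowait(None)

    await tts_consumer(q, fake_tts)
    return spoken

print(asyncio.run(main()))  # ['Hello there.', 'How can I help?']
```

In production, run producer and consumer concurrently (e.g. `asyncio.gather(stream_llm_to_tts(...), tts_consumer(...))`) so synthesis starts while the LLM is still generating.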

Pipecat is one of the most mature open-source frameworks for orchestrating this pipeline (LiveKit Agents is another strong alternative). It handles streaming architecture, sentence buffering, audio transport, and WebRTC integration out of the box. If you're building a voice agent from scratch, evaluate Pipecat before rolling a custom pipeline coordination layer.

Multi-Tenant Capacity Planning

How many concurrent voice sessions can a single GPU handle?

| GPU | Pipeline Config | Max Concurrent Sessions | Notes |
|---|---|---|---|
| RTX 5090 | Llama 3.1 8B (INT4) + Kokoro | ~6 | VRAM headroom for KV cache |
| H100 PCIe | Llama 3.2 11B (FP16) + XTTS v2 | ~8 | Balanced quality and scale |
| H100 PCIe | Llama 3.1 8B (INT4) + Kokoro | ~14 | High throughput configuration |
| 2x H100 | Llama 3.3 70B (FP16) + NeuTTS Air | ~10 | Enterprise scale, highest quality |

These estimates assume typical voice conversation context lengths (500-2,000 tokens of accumulated history). Concurrent capacity depends heavily on:

  • KV cache growth: longer conversations consume more VRAM per active session
  • Turn length: longer user turns increase ASR and LLM compute time per turn
  • Traffic pattern: bursty load (all sessions simultaneously active) hits GPU compute limits faster than staggered arrival

Production deployments typically run at 50-70% of theoretical max to maintain latency SLAs under burst conditions. For architecture patterns to handle capacity spikes and failover, see the production GPU cloud architecture guide.
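
Translating the 50-70% guidance into a scheduling limit is a one-liner; a sketch (the utilization target is the assumption you tune against your latency SLA):

```python
def provisioned_capacity(theoretical_max: int, target_utilization: float = 0.6) -> int:
    """Sessions to actually schedule per GPU, leaving burst headroom.

    target_utilization of 0.5-0.7 matches the production guidance above.
    """
    return int(theoretical_max * target_utilization)

# H100 running the 8B INT4 + Kokoro config (~14 theoretical sessions):
print(provisioned_capacity(14))        # schedule 8 sessions at 60% utilization
print(provisioned_capacity(14, 0.5))   # 7 at the conservative end
```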

Case Study: NeuTTS Air on Spheron

NeuTTS Air is the clearest production proof point for the infrastructure described in this guide.

NeuTTS Air is a TTS model developed by Neuphonic, with 748M total parameters (~552M embedding + active, ~360M active-only) built on a Qwen2-based backbone, running on Spheron's RTX 4090 instances. According to Neuphonic's benchmarks, on this hardware it generates audio at 16,194 tokens per second (prefill throughput), 320x faster than real-time playback speed (the codec runs at 50 tokens/sec for real-time audio). Note: these benchmarks cover the Speech Language Model only and exclude the NeuCodec audio decoder; end-to-end throughput including the codec will be somewhat lower, though still well above real-time. A single RTX 4090 can serve hundreds of concurrent TTS streams simultaneously.

The deployment runs the NeuTTS Air language model backbone alongside the NeuCodec audio decoder on a single GPU. Total VRAM footprint is approximately 2-3GB depending on quantization settings, leaving over 20GB of RTX 4090 VRAM available for co-locating Whisper and a 7B LLM on the same instance. No model sharding, no separate inference server, no per-character API cost.

For voice agent teams specifically, NeuTTS Air's zero-shot voice cloning is production-ready: provide a 3-second reference audio clip at startup, and every subsequent synthesis request generates audio in that speaker's voice at full inference speed. Speaker similarity reaches 85-90% from a 3-second reference, and exceeds 95% from 15 seconds of clean audio.

The NeuTTS Air guide includes step-by-step Spheron deployment instructions, the full startup script, and a Gradio interface for testing. Use it as the TTS layer in the pipeline described in this post.

Deploying Your Voice AI Stack on Spheron

Recommended starting configuration

For a voice agent serving up to 8 concurrent users with production-grade response quality:

  • GPU: 1x H100 PCIe 80GB on Spheron
  • LLM: vLLM serving Llama 3.2 11B Instruct with streaming enabled
  • TTS: Kokoro-82M (fastest latency) or NeuTTS Air (if voice cloning is required)
  • ASR: Whisper Large v3 self-hosted, or Deepgram hosted API to skip the operational overhead
yaml
services:
  llm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3.2-11B-Vision-Instruct
      --enable-chunked-prefill
      --max-model-len 4096
      --gpu-memory-utilization 0.6
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              device_ids: ['0']
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  tts:
    image: your-kokoro-server:latest
    ports:
      - "8001:8001"
    environment:
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              device_ids: ['0']

  asr:
    image: your-whisper-server:latest
    ports:
      - "8002:8002"
    environment:
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              device_ids: ['0']

Enable --enable-chunked-prefill and set --max-model-len 4096 in your vLLM flags to cap context length and protect VRAM from accumulation across long conversations. For deeper production architecture patterns (health checks, failover, checkpoint recovery), see the production GPU cloud architecture guide.

Deploy on Spheron by selecting your GPU instance, configuring storage (50GB minimum for model weights), and launching. Bare-metal hardware means no virtualization overhead, no noisy-neighbor throttling, and no enterprise contracts.


NeuTTS Air and other production voice AI systems run on Spheron's bare-metal GPUs today. If you're building a voice agent, get an H100 or RTX 5090 running in minutes, with no contracts and no quotas.

Explore GPU options →
