What GPU do I need for production Whisper ASR?

An RTX 4090 (24GB) handles 30+ concurrent Large v3 streams with faster-whisper at INT8. An L40S (48GB) doubles that to 60+ and adds headroom for batch jobs. An H100 (80GB) handles 150+ streams per card and is the building block for transcription at scale: 1,000+ concurrent sessions spread across several cards, or large batch workloads. For a voice agent pipeline running a single stream, any GPU with 6+ GB of VRAM is sufficient.

What is the difference between faster-whisper and WhisperX?

faster-whisper is a reimplementation of OpenAI Whisper using CTranslate2, which runs up to 4x faster than the original at the same accuracy. WhisperX builds on faster-whisper and adds word-level timestamp alignment (via wav2vec2 forced alignment) and speaker diarization (via pyannote-audio). Use faster-whisper for raw transcription speed; use WhisperX when you need word-level timing or speaker labels.

How much does it cost to transcribe 100 hours of audio on Spheron?

faster-whisper on an L40S processes audio at roughly 25-30x real time, so 100 hours of audio finishes in about 4 hours or less. At an L40S spot price of $0.32/hr, the full job costs approximately $1.28 (4 hours × $0.32/hr). On an RTX 4090 at $0.79/hr on-demand, the same job costs around $3.16. Either way, this compares to $36 for 100 hours using OpenAI's Whisper API at $0.006/minute.

Can I do real-time streaming transcription with Whisper?

Yes, with chunking. Whisper is not natively a streaming model - it runs on fixed-length audio windows. For real-time streaming, split the incoming audio into 3-5 second chunks with 0.5-1 second overlaps, run inference on each chunk, and merge hypotheses using the overlap to handle cross-boundary words. For lower latency, Whisper Large v3 Turbo (4 decoder layers vs 32) cuts per-chunk latency nearly in half compared to Large v3.

How do I add speaker diarization to my ASR output?

Use WhisperX with pyannote-audio 3.x. Install both packages, accept pyannote's terms on HuggingFace, and set your HF_TOKEN environment variable. Then run three calls in order: result = model.transcribe(audio); aligned = whisperx.align(result['segments'], alignment_model, metadata, audio, device); diarized = whisperx.assign_word_speakers(diarize_model(audio), aligned). This adds a 'speaker' field to each word segment. pyannote can run on CPU, but far too slowly for production; use an NVIDIA GPU with CUDA.

Deploy Whisper v4 and Production ASR on GPU Cloud: Self-Host Speech Recognition for Voice Agents, Meetings, and Call Centers (2026 Guide)

Most deployment guides for Whisper still target CPU or delegate to the OpenAI API. On CPU, a 3-second audio chunk takes 300-600ms to transcribe depending on model size. On an RTX 4090 on-demand instance running faster-whisper at INT8, the same chunk takes under 15ms. That delta is the difference between a voice agent that feels natural and one that clearly lags. This guide covers model selection, GPU sizing, streaming chunking, speaker diarization with pyannote, and batch transcription cost math. For the full pipeline context, see the voice AI GPU infrastructure guide and the TTS deployment guide for the synthesis layer.

ASR Model Comparison: Whisper v4, Large v3, Canary, and Parakeet

Model	Parameters	VRAM (FP16)	Languages	WER (LibriSpeech)	License	Best For
Whisper Large v3	1,550M	~3GB	99	2.7% (test-clean)	Apache 2.0	Multilingual, accuracy-critical production
Whisper Large v3 Turbo	809M	~1.6GB	99	3.0% (test-clean)	Apache 2.0	Latency-sensitive streaming, voice agents
NVIDIA Canary-1B	1,000M	~2GB	4	2.89% (test-other)	CC BY-NC 4.0	Low WER on NVIDIA hardware, research
NVIDIA Parakeet-TDT-1.1B	1,100M	~2.2GB	1 (English)	1.39% (test-clean)	CC-BY-4.0	High-throughput English-only batch jobs

A note on "Whisper v4": The post title uses "Whisper v4" because that's the phrase people search for when looking for the latest Whisper-compatible production guide. As of April 2026, OpenAI has not released an official v4 checkpoint. The two current stable releases are Large v3 (November 2023) and Large v3 Turbo (October 2024). This guide covers both as the production-stable options. When a new major checkpoint ships, the faster-whisper deployment pattern here applies directly.

For most production deployments, pick one of two:

Whisper Large v3 Turbo for voice agents where first-word latency matters. At 809M parameters and 4 decoder layers (versus 32 in Large v3), inference is meaningfully faster per chunk with only a minor accuracy drop on clean English audio.
Whisper Large v3 for meeting transcription, call center archives, or workloads where accuracy and language coverage matter more than raw speed. 99-language support and stronger performance on accented speech make it the safer default for production multilingual systems.

Canary-1B shows excellent WER on benchmarks but carries a CC BY-NC 4.0 license and supports only four languages (English, German, Spanish, French). Parakeet is CC-BY-4.0 (commercial use allowed with attribution) and excellent for English-only batch at scale, but you give up everything outside English and the NVIDIA hardware dependency makes it less portable.

GPU Sizing for ASR Workloads

GPU	VRAM	On-demand	Spot	Streams (Large v3 INT8)	Streams (Turbo INT8)	Best Workload
RTX 4090 PCIe	24GB	$0.79/hr	N/A	30+	50+	Cost-efficient batch, low-concurrency streaming
L40S PCIe	48GB	$0.72/hr	$0.32/hr	60+	100+	Production streaming ASR, mixed workloads
H100 PCIe	80GB	$2.01/hr	N/A	150+	250+	High-concurrency call centers (1,000+ sessions)
A100 PCIe 80GB	80GB	$1.07/hr	N/A	120+	200+	Batch at scale, multi-model serving

Pricing fluctuates based on GPU availability. The prices above are as of 25 Apr 2026 and may have changed. Check current GPU pricing for live rates.

For streaming ASR (voice agents, real-time call transcription): The L40S on Spheron is the sweet spot. At $0.72/hr on-demand with spot at $0.32/hr, it handles 60+ concurrent Large v3 INT8 streams across its 48GB VRAM. The A100 80GB instance roughly doubles that capacity (120+ streams) at $1.07/hr and works well when co-locating Whisper with an LLM and TTS on the same machine.

For batch transcription (meeting recordings, podcast archives, call center logs): L40S spot at $0.32/hr is the cheapest path. No spot availability exists for RTX 4090 currently, but at $0.79/hr on-demand the economics still hold at batch scale. Scale horizontally with multiple instances to hit wall-clock time targets.

For call center scale (1,000+ concurrent live sessions): You need H100 GPU rental or multi-GPU A100 setups. A single H100 PCIe handles 150+ concurrent Large v3 INT8 streams at $2.01/hr, which works out to about $0.013/hr per concurrent session. At that density, 1,000 concurrent sessions needs roughly seven H100s (1,000 / 150, rounded up).
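
The sizing arithmetic above can be sketched in a few lines. This is a minimal illustration, not a capacity planner: the function names are made up for this example, and the stream counts are the table's approximate figures rather than measured guarantees.

```python
import math

def gpus_needed(concurrent_sessions: int, streams_per_gpu: int) -> int:
    """Round up: a partially used GPU still has to be rented."""
    return math.ceil(concurrent_sessions / streams_per_gpu)

def cost_per_session_hour(gpu_hourly_usd: float, streams_per_gpu: int) -> float:
    """Hourly GPU price spread across every stream the card can hold."""
    return gpu_hourly_usd / streams_per_gpu

# Figures from the sizing table: H100 PCIe, Large v3 INT8, $2.01/hr.
h100_count = gpus_needed(1000, 150)
h100_session_cost = cost_per_session_hour(2.01, 150)
print(f"{h100_count} x H100, ${h100_session_cost:.4f}/hr per session")
```

Real fleets usually add headroom on top of the rounded-up count so that one failed instance does not drop live calls.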

Deploying faster-whisper with CTranslate2

faster-whisper reimplements Whisper using CTranslate2, a runtime optimized for transformer inference on CPU and GPU. The practical result: up to 4x faster inference than the original Whisper implementation at the same accuracy, with INT8 quantization support that cuts VRAM usage by 30-40%.

Step 1: Install

bash

pip install faster-whisper

Step 2: Load and run with INT8

faster-whisper downloads the pre-converted CTranslate2 model from HuggingFace on first load:

python

from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="int8",
    download_root="./models"
)

segments, info = model.transcribe(
    "audio.wav",
    beam_size=5,
    language="en"
)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

For faster startup in production, pre-download the CTranslate2 weights so the first load does not wait on a network fetch:

bash

huggingface-cli download Systran/faster-whisper-large-v3 --local-dir ./models/whisper-large-v3-ct2

Then load with model = WhisperModel("./models/whisper-large-v3-ct2", ...) to skip the download at startup.

Step 3: Enable VAD

The vad_filter=True option runs Silero VAD before Whisper. Segments without detected speech are skipped entirely, which removes most hallucinations on audio with silence gaps and speeds up batch throughput by skipping empty sections:

python

segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=500,
        speech_pad_ms=400
    ),
    beam_size=5
)

Step 4: Serve as an API

The faster-whisper-server Docker image wraps faster-whisper in an OpenAI-compatible REST API:

bash

docker run --gpus all \
  -p 8000:8000 \
  -e WHISPER__MODEL=large-v3 \
  -e WHISPER__DEVICE=cuda \
  -e WHISPER__COMPUTE_TYPE=int8 \
  fedirz/faster-whisper-server:latest-cuda

This exposes /v1/audio/transcriptions matching the OpenAI Whisper API schema. Drop-in replacement for applications already using the OpenAI SDK for transcription.

WhisperX: Word-Level Alignment and Speaker Diarization

WhisperX adds two capabilities on top of faster-whisper: word-level timestamp alignment via wav2vec2 forced alignment, and speaker diarization via pyannote-audio 3.x. If you need to know which word started at which millisecond (subtitle generation, indexing, search), or which speaker said what in a multi-person recording, WhisperX handles both.

Install:

bash

pip install whisperx

pyannote-audio HuggingFace token requirement (the most common setup failure):

pyannote-audio 3.x requires accepting usage terms for two gated models on HuggingFace. Before running diarization, navigate to both of these pages and click "Accept":

https://huggingface.co/pyannote/speaker-diarization-3.1
https://huggingface.co/pyannote/segmentation-3.0

Then set your token in the environment:

bash

export HF_TOKEN=your_token_here

Skipping this step results in a cryptic 401 Unauthorized error when loading pyannote models, not a clear message about terms acceptance. It accounts for a large share of WhisperX diarization failures on fresh setups.

The three-stage pipeline:

python

import whisperx
import os

device = "cuda"
HF_TOKEN = os.environ["HF_TOKEN"]
audio_file = "meeting.wav"

# Stage 1: Transcribe with faster-whisper backend
model = whisperx.load_model("large-v3", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# Stage 2: Word-level alignment via wav2vec2
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device
)
result = whisperx.align(
    result["segments"],
    align_model,
    metadata,
    audio,
    device,
    return_char_alignments=False
)

# Stage 3: Speaker diarization via pyannote
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN,
    device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=10)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Output has per-word timestamps and speaker labels
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{speaker}] {segment['text']}")

VRAM budget: Whisper Large v3 at float16 uses roughly 3GB. The wav2vec2 alignment model adds about 0.5GB. pyannote speaker-diarization-3.1 needs another 1.5GB. Total is 5-6GB across all three stages. On a 24GB RTX 4090, that leaves ~18GB for KV cache and batch overhead. If you hit memory pressure, reduce batch_size from 16 to 8 first. For very large audio files, run diarization in a separate process after transcription completes so the two models' peak VRAM usage stays sequential rather than simultaneous.
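
For meeting notes, the per-segment output of the pipeline above is often easier to read once consecutive segments from the same speaker are merged into turns. The following is a minimal sketch of that post-processing step. It assumes segments are dictionaries with "start", "end", "text", and an optional "speaker" key, as in the loop above; the helper name and the sample data are hypothetical.

```python
def merge_speaker_turns(segments: list[dict]) -> list[dict]:
    """Collapse consecutive segments from the same speaker into one turn."""
    turns: list[dict] = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns

# Hypothetical segments shaped like result["segments"] after speaker assignment.
sample = [
    {"start": 0.0, "end": 2.1, "text": " Let's get started.", "speaker": "SPEAKER_00"},
    {"start": 2.1, "end": 4.0, "text": " First item is the budget.", "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 5.0, "text": " Sounds good.", "speaker": "SPEAKER_01"},
    {"start": 5.1, "end": 6.0, "text": " Any questions?"},  # no speaker assigned
]
turns = merge_speaker_turns(sample)
for turn in turns:
    print(f"[{turn['start']:.1f}-{turn['end']:.1f}] {turn['speaker']}: {turn['text']}")
```

Segments without a speaker label fall back to "UNKNOWN", matching the behavior of the print loop in the pipeline.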

Real-Time Streaming Transcription

Whisper is not a native streaming model. It processes fixed-length audio windows through an encoder-decoder architecture: the encoder ingests the full chunk, the decoder generates tokens one by one. You cannot feed an open WebSocket stream directly to Whisper and get continuous output. The autoregressive decoder is the bottleneck here. For a deeper look at why this pattern is memory-bandwidth-bound, see the AI memory wall and inference latency guide.

The production approach is chunked streaming with overlap.

Chunking strategy:

Buffer incoming audio into 4-second windows with 0.5-second overlaps on both leading and trailing edges. Run faster-whisper on each chunk with beam_size=1 for minimum latency. Strip the overlap regions from each hypothesis before emitting output to avoid duplicating words at chunk boundaries:

python

from faster_whisper import WhisperModel
import numpy as np

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 4 * SAMPLE_RATE    # 4 seconds
OVERLAP_SAMPLES = int(0.5 * SAMPLE_RATE)  # 0.5 seconds

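def transcribe_chunk(audio_chunk: np.ndarray) -> str:
    """Transcribe one 4-second mono chunk and keep only its non-overlap middle."""
    segments, _ = model.transcribe(
        audio_chunk,
        beam_size=1,
        condition_on_previous_text=False,
        word_timestamps=True,  # filter per word: one segment often spans the whole chunk
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=600)
    )
    overlap_sec = OVERLAP_SAMPLES / SAMPLE_RATE
    chunk_sec = CHUNK_SAMPLES / SAMPLE_RATE
    # Keep words that start inside the middle of the window. Neighbouring
    # chunks emit the words that fall in the leading and trailing overlaps.
    kept_words = [
        word.word
        for seg in segments
        for word in (seg.words or [])
        if overlap_sec <= word.start < chunk_sec - overlap_sec
    ]
    # faster-whisper words carry their own leading space, so join without a separator.
    return "".join(kept_words).strip()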

condition_on_previous_text=False is the most important flag for streaming. The default is True, which feeds the previous chunk's transcript as context for the next. In a streaming pipeline this causes hallucination drift: Whisper starts extending the previous transcript even when the speaker said something entirely different. Always disable it for chunked streaming.
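
The buffering side of the 4-second window with 0.5-second overlaps can be sketched as follows. This is an illustration under one assumption not stated above: the window advances by the chunk length minus both overlaps (3 seconds), so the middle sections that each chunk emits tile the stream without gaps. The names STRIDE_SAMPLES and chunk_windows are hypothetical.

```python
from typing import Iterator

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 4 * SAMPLE_RATE
OVERLAP_SAMPLES = int(0.5 * SAMPLE_RATE)
# Each window contributes only its middle section, so advance by that much.
STRIDE_SAMPLES = CHUNK_SAMPLES - 2 * OVERLAP_SAMPLES

def chunk_windows(total_samples: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) sample indices of full overlapping windows."""
    start = 0
    while start + CHUNK_SAMPLES <= total_samples:
        yield start, start + CHUNK_SAMPLES
        start += STRIDE_SAMPLES

windows = list(chunk_windows(10 * SAMPLE_RATE))  # 10 seconds of buffered audio
# The region of each window whose words are actually emitted.
kept = [(s + OVERLAP_SAMPLES, e - OVERLAP_SAMPLES) for s, e in windows]
for (s, e), (ks, ke) in zip(windows, kept):
    print(f"window {s / SAMPLE_RATE:.1f}-{e / SAMPLE_RATE:.1f}s, "
          f"emit {ks / SAMPLE_RATE:.1f}-{ke / SAMPLE_RATE:.1f}s")
```

With this stride, the very first 0.5 seconds of a stream and the tail after the last full window are not covered by any emitted region, so a real pipeline would keep the leading edge of the first chunk and flush the remainder when the utterance ends.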

End-of-utterance detection with Silero VAD: When VAD detects silence exceeding 600ms, finalize the current hypothesis and reset the buffer. This gives clean utterance boundaries without a fixed timer and works better than silence thresholds computed from raw audio amplitude.
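
The end-of-utterance rule above amounts to a small state machine over per-frame speech decisions. The sketch below assumes a VAD that emits one boolean per fixed-length frame; the 30ms frame length, the class name, and the sample frame sequence are assumptions for illustration, not part of Silero's API.

```python
class UtteranceEndpointer:
    """Signals end of utterance once enough consecutive silence follows speech."""

    def __init__(self, frame_ms: int = 30, min_silence_ms: int = 600) -> None:
        self.frame_ms = frame_ms
        self.min_silence_ms = min_silence_ms
        self.silence_ms = 0
        self.heard_speech = False

    def update(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True when the utterance should be finalized."""
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0
            return False
        if not self.heard_speech:
            return False  # leading silence before any speech: nothing to finalize
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.min_silence_ms:
            # Reset so the next utterance starts from a clean state.
            self.heard_speech = False
            self.silence_ms = 0
            return True
        return False

endpointer = UtteranceEndpointer()
# Hypothetical VAD output: 10 speech frames followed by 25 silent frames.
frames = [True] * 10 + [False] * 25
fired_at = [i for i, flag in enumerate(frames) if endpointer.update(flag)]
print(fired_at)
```

When update returns True, the caller would finalize the current hypothesis and clear the audio buffer, as described above.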

Whisper Large v3 Turbo for latency: Turbo's 4-layer decoder replaces Large v3's 32 layers. For streaming, this cuts per-chunk inference from roughly 40ms to 22ms on an RTX 4090 at INT8. The accuracy difference on clean English audio is small (3.0% vs 2.7% WER), and since you're already disabling condition_on_previous_text, the deeper decoder provides no inter-chunk benefit anyway.

For English-only pipelines requiring the absolute lowest latency, NVIDIA Parakeet is worth evaluating. Parakeet-TDT pairs a FastConformer encoder with a lightweight transducer (TDT) decoder instead of a deep autoregressive Transformer decoder, which removes most of the token-by-token decoding cost.

Batch Transcription Economics: 100 Hours Under $4

The math is straightforward. faster-whisper on a modern GPU processes audio at approximately 25-30x real time at INT8:

100 hours of audio = 360,000 seconds
At 25x real time, compute time = 14,400 seconds = 4 hours
L40S spot at $0.32/hr: 4 hours × $0.32 = $1.28 total
RTX 4090 on-demand at $0.79/hr: 4 hours × $0.79 = $3.16 total

Compare that to the OpenAI Whisper API:

OpenAI Whisper API: $0.006/minute
100 hours = 6,000 minutes
API cost: 6,000 × $0.006 = $36.00

Volume	L40S spot self-host	RTX 4090 on-demand	OpenAI Whisper API
100 hours/month	~$1.28	~$3.16	$36
1,000 hours/month	~$13	~$32	$360
10,000 hours/month	~$128	~$316	$3,600

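Both options scale linearly with volume: the API charges a flat per-minute rate, and self-hosting charges a flat GPU rate per compute hour. The ratio therefore stays at roughly 28x in favor of L40S spot ($36 vs $1.28 per 100 hours), while the absolute savings grow from about $35/month at 100 hours to about $3,470/month at 10,000 hours.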
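
The table can be reproduced with a short calculation. This is a sketch using the figures quoted above (25x real time, $0.32/hr and $0.79/hr GPU rates, $0.006/minute API rate); the function names are illustrative, and the real-time factor is an approximate figure that will vary with audio content and settings.

```python
def self_host_cost(audio_hours: float, realtime_factor: float, gpu_hourly_usd: float) -> float:
    """GPU cost: audio duration divided by processing speed, times the hourly rate."""
    compute_hours = audio_hours / realtime_factor
    return compute_hours * gpu_hourly_usd

def api_cost(audio_hours: float, usd_per_minute: float = 0.006) -> float:
    """Per-minute API pricing applied to the full audio duration."""
    return audio_hours * 60 * usd_per_minute

for hours in (100, 1_000, 10_000):
    spot = self_host_cost(hours, 25, 0.32)       # L40S spot
    on_demand = self_host_cost(hours, 25, 0.79)  # RTX 4090 on-demand
    api = api_cost(hours)
    print(f"{hours:>6} h  spot ${spot:,.2f}  on-demand ${on_demand:,.2f}  API ${api:,.2f}")
```

Swapping in a measured real-time factor for your own audio is the quickest way to check whether these numbers hold for a given workload.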

Running parallel workers:

python

import concurrent.futures
import threading
from pathlib import Path
from faster_whisper import WhisperModel

_thread_local = threading.local()

def _get_model() -> WhisperModel:
    if not hasattr(_thread_local, "model"):
        _thread_local.model = WhisperModel("large-v3", device="cuda", compute_type="int8")
    return _thread_local.model

def transcribe_file(audio_path: Path) -> str:
    # One model per thread, created on first use and reused across all files that thread handles.
    model = _get_model()
    segments, _ = model.transcribe(str(audio_path), vad_filter=True)
    transcript = " ".join(seg.text.strip() for seg in segments)
    out_path = audio_path.with_suffix(".txt")
    out_path.write_text(transcript)
    return str(audio_path)

audio_files = list(Path("/audio").glob("*.wav"))

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for result in executor.map(transcribe_file, audio_files):
        print(f"Done: {result}")

For very large archives, use spot GPU instances across multiple machines and split the file list. Four L40S spot instances in parallel finish 100 hours in about an hour at $1.28 combined.
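
One simple way to split the file list across machines is a deterministic round-robin shard, so each instance can compute its own share without a coordinator. The sketch below assumes every instance sees the same file listing and is told its own index; shard_files and the sample paths are hypothetical.

```python
from pathlib import Path

def shard_files(files: list[Path], num_shards: int, shard_index: int) -> list[Path]:
    """Round-robin split: instance i takes every num_shards-th file of the sorted list."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    # Sorting makes the split identical on every machine regardless of listing order.
    return sorted(files)[shard_index::num_shards]

# Hypothetical archive of ten recordings split across four instances.
files = [Path(f"/audio/call_{i:03d}.wav") for i in range(10)]
shards = [shard_files(files, 4, i) for i in range(4)]
for i, shard in enumerate(shards):
    print(f"instance {i}: {[p.name for p in shard]}")
```

Round-robin balances file counts, not durations. If recording lengths vary widely, sorting by file size before sharding gives a more even wall-clock split.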

Integrating ASR with a Voice Agent Stack

faster-whisper handles the ASR layer. The full real-time pipeline looks like this:

Audio In (microphone / phone)
    ↓
LiveKit (WebRTC transport + audio capture)
    ↓
faster-whisper / WhisperX (ASR: 15-50ms per chunk)
    ↓
Llama 3.1 8B / Qwen 2.5 7B (LLM response: 150-300ms TTFT)
    ↓
Kokoro-82M / Fish Speech (TTS synthesis: streaming, first chunk 50-100ms)
    ↓
LiveKit (audio delivery)
    ↓
Audio Out (speaker)

Latency targets: ASR should account for 30-80ms of the 500ms total budget. The LLM is typically the bottleneck at 150-300ms TTFT. TTS first-chunk latency with Kokoro runs 50-100ms. Total end-to-end stays under 500ms if you stream LLM tokens into TTS as complete sentences arrive.
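
Streaming LLM tokens into TTS as complete sentences arrive can be sketched with a small buffer that emits a sentence as soon as its terminating punctuation is followed by whitespace. This is a simplified illustration: the function name is hypothetical, the token list stands in for a real LLM stream, and the naive rule will split on abbreviations such as "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# Sentence-ending punctuation followed by whitespace. Decimals like "3.5" do not match.
_SENTENCE_END = re.compile(r"[.!?]\s")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield each sentence as soon as it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends

# Hypothetical token stream from an LLM.
tokens = ["Sure", ",", " I", " can", " help", ".", " What", " time",
          " works", "?", " Let", " me", " check"]
sentences = list(sentences_from_tokens(tokens))
print(sentences)
```

Each yielded sentence can be handed to the TTS engine immediately, so synthesis of the first sentence overlaps with generation of the second.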

For the full stack:

Voice AI GPU infrastructure guide covers latency budgets, VRAM allocation across stages, and GPU recommendations for ASR + LLM + TTS co-location.
TTS deployment guide covers Kokoro, Fish Speech, and Hume TADA deployment in detail.
NeuTTS Air guide covers the 320x real-time TTS option for ultra-low-latency synthesis with 3-second voice cloning.

NeuTTS Air's 748M-parameter model uses under 2GB VRAM, which means you can co-locate it alongside faster-whisper (3-4GB at INT8) and a 7B LLM (5-6GB at INT4) on a single 24GB RTX 4090 for the complete pipeline.
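
As a quick sanity check on that co-location claim, the upper-end estimates quoted above can be summed against the card's capacity. The helper name is illustrative, and the figures are the approximate ones from this guide, not measurements.

```python
def vram_headroom_gb(total_gb: float, components: dict[str, float]) -> float:
    """VRAM left over after loading every model in the stack."""
    return total_gb - sum(components.values())

# Upper-end estimates from the paragraph above, in GB.
stack = {
    "faster-whisper (INT8)": 4.0,
    "7B LLM (INT4)": 6.0,
    "NeuTTS Air": 2.0,
}
headroom = vram_headroom_gb(24.0, stack)
print(f"Models: {sum(stack.values()):.0f}GB, headroom: {headroom:.0f}GB")
```

The remaining headroom still has to cover the LLM's KV cache, activation buffers, and the CUDA context, so it is not all free for extra concurrency.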

On-Demand vs Spot: When to Use Each for ASR

Workload	Recommended Instance	Estimated Monthly Cost
Always-on voice agent, 50 concurrent streams	L40S PCIe on-demand	~$518/month
Batch call center archive, 1,000 hrs/month	L40S spot	~$13/month
Meeting transcription SaaS, 1,000 hrs/month	RTX 4090 on-demand	~$32/month
High-concurrency real-time, 200 concurrent streams	H100 PCIe on-demand	~$1,447/month

Use on-demand when: You're running always-on voice agents or real-time transcription services with SLAs. On-demand instances don't get preempted, which matters when active sessions break if the machine goes away mid-call.

Use spot when: You're processing batch jobs: meeting recordings, call center archives, podcast transcription, any workload that can checkpoint and restart. L40S at $0.32/hr spot is less than half the on-demand rate and is the right call for any job that doesn't require guaranteed uptime.
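
For a batch job, "checkpoint and restart" can be as simple as treating each finished transcript as the checkpoint. The sketch below assumes transcripts are written next to the audio as .txt files, as in the parallel worker example; both helper names are hypothetical.

```python
import os
import tempfile
from pathlib import Path

def pending_files(audio_dir: Path) -> list[Path]:
    """Audio files that do not yet have a finished transcript next to them."""
    return sorted(
        p for p in audio_dir.glob("*.wav") if not p.with_suffix(".txt").exists()
    )

def write_transcript_atomically(audio_path: Path, transcript: str) -> Path:
    """Write to a temp file, then rename, so preemption never leaves a partial .txt."""
    out_path = audio_path.with_suffix(".txt")
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(transcript)
    os.replace(tmp_name, out_path)  # atomic when source and target share a filesystem
    return out_path
```

After a spot preemption, a replacement instance calls pending_files on the same directory and picks up only the work that was not completed. At most one in-flight file per worker is redone.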

Troubleshooting: Hallucinations, Long-Audio Drift, and Low-Resource Languages

Hallucinations on silent audio: Whisper's default behavior is to generate something even on segments with no speech. Use no_speech_threshold=0.6 (the library default, shown explicitly here) together with vad_filter=True:

python

segments, info = model.transcribe(
    audio,
    no_speech_threshold=0.6,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)

A segment is treated as silent and dropped when its no-speech probability exceeds 0.6 and its average log probability is also below log_prob_threshold. Combined with VAD pre-filtering, this suppresses most hallucinations on real-world audio with silence gaps.

Long-audio drift with condition_on_previous_text: The default condition_on_previous_text=True feeds each segment's transcript back as a prefix for the next. On long audio with consistent content this usually helps. On batch jobs with variable audio quality, it can cause Whisper to repeat the same phrase in a loop. Set condition_on_previous_text=False for batch jobs over 10 minutes.

Repetition loops: Set compression_ratio_threshold=2.4 and log_prob_threshold=-1.0 (both match the library defaults) and disable condition_on_previous_text:

python

segments, info = model.transcribe(
    audio,
    compression_ratio_threshold=2.4,
    log_prob_threshold=-1.0,
    condition_on_previous_text=False
)

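The compression ratio threshold catches segments where the output text is highly repetitive: repeated text compresses well, so a high ratio signals a loop. The log probability threshold catches segments where Whisper's average token confidence is low, which correlates strongly with hallucination artifacts. A segment that fails either check is re-decoded at a higher temperature instead of being accepted as-is.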
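
To see why the compression ratio flags loops, the metric can be computed directly: it is the UTF-8 byte length of the text divided by its zlib-compressed length. The sketch below uses two made-up strings; the exact ratios depend on the text, so only the comparison against the 2.4 threshold matters.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

THRESHOLD = 2.4

# Hypothetical outputs: a typical hallucination loop and an ordinary sentence.
looped = "Thank you for watching. " * 20
normal = "The quarterly report shows revenue grew in Europe while costs stayed flat."

for label, text in (("looped", looped), ("normal", normal)):
    ratio = compression_ratio(text)
    verdict = "retry" if ratio > THRESHOLD else "accept"
    print(f"{label}: ratio {ratio:.2f} -> {verdict}")
```

Short, ordinary sentences barely compress at all, which is why the default threshold rarely produces false positives on normal speech.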

Low-resource languages: Use Whisper Large v3, not Turbo. Turbo's reduced decoder layer count (4 vs 32) hurts most on low-resource languages, where the model needs more computation to produce accurate output. For multilingual workloads spanning more than 10 languages, stay on Large v3.

CUDA OOM during WhisperX batch jobs: On a 24GB RTX 4090, running WhisperX with batch_size=16 alongside pyannote in the same process can exhaust VRAM. Drop batch_size to 8 first. If the problem persists, drop to 4 or run diarization in a separate process after transcription completes. The two models' peak VRAM usage becomes sequential instead of simultaneous, which removes the memory spike entirely.

Whisper and faster-whisper give you production-grade ASR at a fraction of API costs. Whether you're building a voice agent, transcribing meeting recordings, or running a call center pipeline, the GPU cloud math consistently favors self-hosting at any volume above a few hundred hours per month.