Most deployment guides for Whisper still target CPU or delegate to the OpenAI API. On CPU, a 3-second audio chunk takes 300-600ms to transcribe depending on model size. On an RTX 4090 on-demand instance running faster-whisper at INT8, the same chunk takes under 15ms. That delta is the difference between a voice agent that feels natural and one that clearly lags. This guide covers model selection, GPU sizing, streaming chunking, speaker diarization with pyannote, and batch transcription cost math. For the full pipeline context, see the voice AI GPU infrastructure guide and the TTS deployment guide for the synthesis layer.
ASR Model Comparison: Whisper v4, Large v3, Canary, and Parakeet
| Model | Parameters | VRAM (FP16) | Languages | WER (LibriSpeech) | License | Best For |
|---|---|---|---|---|---|---|
| Whisper Large v3 | 1,550M | ~3GB | 99 | 2.7% (test-clean) | Apache 2.0 | Multilingual, accuracy-critical production |
| Whisper Large v3 Turbo | 809M | ~1.6GB | 99 | 3.0% (test-clean) | Apache 2.0 | Latency-sensitive streaming, voice agents |
| NVIDIA Canary-1B | 1,000M | ~2GB | 4 | 2.89% (test-other) | CC BY-NC 4.0 | Low WER on NVIDIA hardware, research |
| NVIDIA Parakeet-TDT-1.1B | 1,100M | ~2.2GB | 1 (English) | 1.39% (test-clean) | CC-BY-4.0 | High-throughput English-only batch jobs |
A note on "Whisper v4": The post title uses "Whisper v4" because that's the phrase people search for when looking for the latest Whisper-compatible production guide. As of April 2026, OpenAI has not released an official v4 checkpoint. The two current stable releases are Large v3 (November 2023) and Large v3 Turbo (October 2024). This guide covers both as the production-stable options. When a new major checkpoint ships, the faster-whisper deployment pattern here should carry over with little change.
For most production deployments, pick one of two:
- Whisper Large v3 Turbo for voice agents where first-word latency matters. At 809M parameters and 4 decoder layers (versus 32 in Large v3), inference is meaningfully faster per chunk with only a minor accuracy drop on clean English audio.
- Whisper Large v3 for meeting transcription, call center archives, or workloads where accuracy and language coverage matter more than raw speed. 99-language support and stronger performance on accented speech make it the safer default for production multilingual systems.
Canary-1B shows excellent WER on benchmarks but carries a CC BY-NC 4.0 license, which rules out commercial use, and supports only four languages (English, German, Spanish, French). Parakeet is CC-BY-4.0 (commercial use allowed with attribution) and excellent for English-only batch at scale, but you give up everything outside English, and the NVIDIA hardware dependency makes it less portable.
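The selection logic above can be condensed into a small helper. This is only a sketch of this guide's recommendations: the function name, its flags, and the returned labels are illustrative and not part of any library. Canary-1B is left out because its non-commercial license excludes most production use.

```python
def pick_asr_model(english_only: bool, latency_sensitive: bool, batch: bool = False) -> str:
    """Return the model this guide recommends for a workload (hypothetical helper)."""
    if english_only and batch:
        # English-only, high-throughput batch jobs: Parakeet (CC-BY-4.0)
        return "parakeet-tdt-1.1b"
    if latency_sensitive:
        # Voice agents and streaming: 4 decoder layers instead of 32
        return "whisper-large-v3-turbo"
    # Multilingual or accuracy-critical default
    return "whisper-large-v3"


print(pick_asr_model(english_only=False, latency_sensitive=True))
```

Real deployments usually weigh more factors (accent coverage, license review, hardware), so treat the result as a starting point.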
GPU Sizing for ASR Workloads
| GPU | VRAM | On-demand | Spot | Streams (Large v3 INT8) | Streams (Turbo INT8) | Best Workload |
|---|---|---|---|---|---|---|
| RTX 4090 PCIe | 24GB | $0.79/hr | N/A | 30+ | 50+ | Cost-efficient batch, low-concurrency streaming |
| L40S PCIe | 48GB | $0.72/hr | $0.32/hr | 60+ | 100+ | Production streaming ASR, mixed workloads |
| H100 PCIe | 80GB | $2.01/hr | N/A | 150+ | 250+ | High-concurrency call centers (1,000+ sessions) |
| A100 PCIe 80GB | 80GB | $1.07/hr | N/A | 120+ | 200+ | Batch at scale, multi-model serving |
Pricing fluctuates based on GPU availability. The prices above are based on 25 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
For streaming ASR (voice agents, real-time call transcription): The L40S on Spheron is the sweet spot. At $0.72/hr on-demand with spot at $0.32/hr, it handles 60+ concurrent Large v3 INT8 streams across its 48GB VRAM. The A100 instance roughly doubles that stream capacity (120+) at $1.07/hr and works well when co-locating Whisper with an LLM and TTS on the same machine.
For batch transcription (meeting recordings, podcast archives, call center logs): L40S spot at $0.32/hr is the cheapest path. No spot availability exists for RTX 4090 currently, but at $0.79/hr on-demand the economics still hold at batch scale. Scale horizontally with multiple instances to hit wall-clock time targets.
For call center scale (1,000+ concurrent live sessions): You need H100 GPU rental or multi-GPU A100 setups. A single H100 PCIe handles 150+ concurrent Large v3 INT8 streams at $2.01/hr, which works out to about $0.013/hr per concurrent session.
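The per-session figure above is simple division, and the same arithmetic gives the GPU count for a target concurrency. The sketch below uses the table's numbers (150 streams per H100 at $2.01/hr); the function name is illustrative, and it assumes streams pack evenly across cards with no failover headroom.

```python
import math


def fleet_cost(sessions: int, streams_per_gpu: int, usd_per_gpu_hour: float):
    """Return (gpu_count, fleet_usd_per_hour, usd_per_session_hour)."""
    gpus = math.ceil(sessions / streams_per_gpu)  # round up to whole GPUs
    hourly = gpus * usd_per_gpu_hour
    return gpus, hourly, hourly / sessions


# 1,000 concurrent sessions on H100 PCIe at 150 streams per card
gpus, hourly, per_session = fleet_cost(1000, 150, 2.01)
print(f"{gpus} GPUs, ${hourly:.2f}/hr, ${per_session:.4f}/hr per session")
```

Because the GPU count rounds up, the per-session cost at 1,000 sessions comes out slightly above the fully loaded single-card figure.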
Deploying faster-whisper with CTranslate2
faster-whisper reimplements Whisper using CTranslate2, a runtime optimized for transformer inference on CPU and GPU. The practical result: up to 4x faster inference than the original Whisper implementation at comparable accuracy, with INT8 quantization support that cuts VRAM usage by 30-40%. For a dedicated guide covering the CTranslate2 backend in depth, INT8 quantization modes, model size selection, and a FastAPI streaming server pattern, see the faster-whisper production deployment guide.
Step 1: Install
pip install faster-whisper
Step 2: Load and run with INT8
faster-whisper downloads pre-converted CTranslate2 weights from the Hugging Face Hub on first load:
from faster_whisper import WhisperModel

# Load Large v3 on the GPU with INT8 weights; files are cached under ./models
model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="int8",
    download_root="./models"
)
# transcribe() returns a lazy generator; decoding runs as you iterate over it
segments, info = model.transcribe(
    "audio.wav",
    beam_size=5,
    language="en"
)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
For faster startup in production, pre-download the pre-converted CTranslate2 weights:
huggingface-cli download Systran/faster-whisper-large-v3 --local-dir ./models/whisper-large-v3-ct2
Then load with model = WhisperModel("./models/whisper-large-v3-ct2", ...) so startup does not wait on a network download.
Step 3: Enable VAD
The vad_filter=True option runs Silero VAD before Whisper. Segments without detected speech are skipped entirely, which removes most hallucinations on audio with silence gaps and speeds up batch throughput by skipping empty sections:
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=500,
        speech_pad_ms=400
    ),
    beam_size=5
)
Step 4: Serve as an API
The faster-whisper-server Docker image wraps faster-whisper in an OpenAI-compatible REST API:
docker run --gpus all \
  -p 8000:8000 \
  -e WHISPER__MODEL=large-v3 \
  -e WHISPER__DEVICE=cuda \
  -e WHISPER__COMPUTE_TYPE=int8 \
  fedirz/faster-whisper-server:latest-cuda
This exposes /v1/audio/transcriptions matching the OpenAI Whisper API schema. Drop-in replacement for applications already using the OpenAI SDK for transcription.
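To show what "matching the OpenAI schema" means on the wire, here is a minimal sketch that builds the multipart POST with only the standard library. The `file` and `model` form fields follow the OpenAI transcription API; the base URL, the model value, and the helper name are assumptions for illustration, and the server may expect a different model identifier.

```python
import urllib.request
import uuid


def build_transcription_request(base_url: str, audio_bytes: bytes,
                                filename: str = "audio.wav",
                                model: str = "large-v3") -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions (sketch)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field: model name
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n".encode(),
        # Form field: the audio file itself
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'.encode(),
        audio_bytes,
        b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


req = build_transcription_request("http://localhost:8000", b"RIFF....WAVE")
print(req.get_method(), req.full_url)
```

Sending it with `urllib.request.urlopen(req)` against a running server should return JSON with a `text` field, as the OpenAI API does. In practice, pointing the OpenAI SDK's base URL at the server is less code.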
WhisperX: Word-Level Alignment and Speaker Diarization
WhisperX adds two capabilities on top of faster-whisper: word-level timestamp alignment via wav2vec2 forced alignment, and speaker diarization via pyannote-audio 3.x. If you need to know which word started at which millisecond (subtitle generation, indexing, search), or which speaker said what in a multi-person recording, WhisperX handles both.
Install:
pip install whisperx
pyannote-audio HuggingFace token requirement (the most common setup failure):
pyannote-audio 3.x requires accepting usage terms for two gated models on HuggingFace. Before running diarization, navigate to both of these pages and click "Accept":
https://huggingface.co/pyannote/speaker-diarization-3.1
https://huggingface.co/pyannote/segmentation-3.0
Then set your token in the environment:
export HF_TOKEN=your_token_here
Skipping this step results in a cryptic 401 Unauthorized error when loading pyannote models, not a clear message about terms acceptance. It accounts for a large share of WhisperX diarization failures on fresh setups.
The three-stage pipeline:
import whisperx
import os

device = "cuda"
HF_TOKEN = os.environ["HF_TOKEN"]
audio_file = "meeting.wav"

# Stage 1: Transcribe with faster-whisper backend
model = whisperx.load_model("large-v3", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# Stage 2: Word-level alignment via wav2vec2
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device
)
result = whisperx.align(
    result["segments"],
    align_model,
    metadata,
    audio,
    device,
    return_char_alignments=False
)

# Stage 3: Speaker diarization via pyannote
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN,
    device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=10)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Output has per-word timestamps and speaker labels
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{speaker}] {segment['text']}")
VRAM budget: Whisper Large v3 at float16 uses roughly 3GB. The wav2vec2 alignment model adds about 0.5GB. pyannote speaker-diarization-3.1 needs another 1.5GB. Total is 5-6GB across all three stages. On a 24GB RTX 4090, that leaves ~18GB for KV cache and batch overhead. If you hit memory pressure, reduce batch_size from 16 to 8 first. For very large audio files, run diarization in a separate process after transcription completes so the two models' peak VRAM usage stays sequential rather than simultaneous.
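The pipeline above prints one line per segment, so a long monologue shows up as many consecutive lines with the same speaker. A common post-processing step is merging adjacent segments into speaker turns. The sketch below assumes each segment is a dict with `speaker`, `start`, `end`, and `text` keys, as in the loop above; the helper name and the sample data are made up for illustration.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Collapse consecutive segments from the same speaker into one turn."""
    turns: List[Dict] = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns


# Hypothetical segments shaped like the diarized output
sample = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.1, "text": " Let's get started."},
    {"speaker": "SPEAKER_00", "start": 2.3, "end": 4.0, "text": " First item is the budget."},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 5.0, "text": " Sounds good."},
]
for turn in merge_speaker_turns(sample):
    print(f"[{turn['speaker']} {turn['start']:.1f}-{turn['end']:.1f}s] {turn['text']}")
```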
Real-Time Streaming Transcription
Whisper is not a native streaming model. It processes fixed-length audio windows through an encoder-decoder architecture: the encoder ingests the full chunk, the decoder generates tokens one by one. You cannot feed an open WebSocket stream directly to Whisper and get continuous output. The autoregressive decoder is the bottleneck here. For a deeper look at why this pattern is memory-bandwidth-bound, see the AI memory wall and inference latency guide.
The production approach is chunked streaming with overlap.
Chunking strategy:
Buffer incoming audio into 4-second windows with 0.5-second overlaps on both leading and trailing edges. Run faster-whisper on each chunk with beam_size=1 for minimum latency. Strip the overlap regions from each hypothesis before emitting output to avoid duplicating words at chunk boundaries:
from faster_whisper import WhisperModel
import numpy as np

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")
SAMPLE_RATE = 16000
CHUNK_SAMPLES = 4 * SAMPLE_RATE  # 4 seconds
OVERLAP_SAMPLES = int(0.5 * SAMPLE_RATE)  # 0.5 seconds

def transcribe_chunk(audio_chunk: np.ndarray) -> str:
    # Greedy decoding, no cross-chunk context, VAD pre-filter
    segments, _ = model.transcribe(
        audio_chunk,
        beam_size=1,
        condition_on_previous_text=False,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=600)
    )
    overlap_sec = OVERLAP_SAMPLES / SAMPLE_RATE
    chunk_sec = CHUNK_SAMPLES / SAMPLE_RATE
    # Keep only segments that start inside the stable middle of the chunk
    return " ".join(
        seg.text.strip() for seg in segments
        if seg.start >= overlap_sec and seg.start < chunk_sec - overlap_sec
    )
condition_on_previous_text=False is the most important flag for streaming. The default is True, which feeds the previous chunk's transcript as context for the next. In a streaming pipeline this causes hallucination drift: Whisper starts extending the previous transcript even when the speaker said something entirely different. Always disable it for chunked streaming.
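The windowing arithmetic behind the chunking strategy is easy to get wrong by one overlap. The sketch below computes sample ranges for each window and the "stable middle" to keep, so that kept regions tile the stream with no gaps or duplicates. It is pure index math with no model calls; the function name is illustrative, and keeping the leading edge of the first window and the trailing edge of the last is a design choice of this sketch, not something the snippet above does.

```python
from typing import Iterator, Tuple

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 4 * SAMPLE_RATE      # 4-second window
OVERLAP_SAMPLES = SAMPLE_RATE // 2   # 0.5 s on each edge


def chunk_windows(total_samples: int,
                  chunk: int = CHUNK_SAMPLES,
                  overlap: int = OVERLAP_SAMPLES) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (start, end, keep_from, keep_to) sample indices per window."""
    hop = chunk - 2 * overlap  # stable middles of consecutive windows touch
    if hop <= 0:
        raise ValueError("overlap must be less than half the chunk length")
    start = 0
    while start < total_samples:
        end = min(start + chunk, total_samples)
        is_last = end == total_samples
        # First window keeps its leading edge, last window its trailing edge
        keep_from = start if start == 0 else start + overlap
        keep_to = end if is_last else end - overlap
        yield start, end, keep_from, keep_to
        if is_last:
            break
        start += hop


for start, end, keep_from, keep_to in chunk_windows(10 * SAMPLE_RATE):
    print(f"window {start / SAMPLE_RATE:.1f}-{end / SAMPLE_RATE:.1f}s, "
          f"keep {keep_from / SAMPLE_RATE:.1f}-{keep_to / SAMPLE_RATE:.1f}s")
```

With a 4-second window and 0.5-second overlaps, each new window advances by 3 seconds of audio, which is the amount of new text emitted per inference call.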
End-of-utterance detection with Silero VAD: When VAD detects silence exceeding 600ms, finalize the current hypothesis and reset the buffer. This gives clean utterance boundaries without a fixed timer and works better than silence thresholds computed from raw audio amplitude.
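The finalize-on-silence rule can be sketched as a small state machine over per-frame speech decisions. The boolean flags below stand in for VAD output (for example, Silero's speech probability thresholded per frame); the 30ms frame size and the function name are assumptions of this sketch, not values from the guide.

```python
from typing import Iterable, List, Tuple


def split_utterances(speech_flags: Iterable[bool],
                     frame_ms: int = 30,
                     min_silence_ms: int = 600) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) per utterance, closing one once
    trailing silence reaches min_silence_ms."""
    utterances: List[Tuple[int, int]] = []
    start = None          # start of the open utterance, if any
    last_speech_end = 0   # end time of the most recent speech frame
    silence_ms = 0        # trailing silence inside the open utterance
    for i, is_speech in enumerate(speech_flags):
        t0, t1 = i * frame_ms, (i + 1) * frame_ms
        if is_speech:
            if start is None:
                start = t0
            last_speech_end = t1
            silence_ms = 0
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                utterances.append((start, last_speech_end))  # finalize
                start, silence_ms = None, 0                   # reset buffer
    if start is not None:
        utterances.append((start, last_speech_end))  # flush at stream end
    return utterances


flags = [True] * 10 + [False] * 5 + [True] * 5 + [False] * 25 + [True] * 3
print(split_utterances(flags))
```

The 150ms pause in the sample does not split the first utterance; only the longer silence does, which is the behavior you want for natural pauses mid-sentence.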
Whisper Large v3 Turbo for latency: Turbo's 4-layer decoder replaces Large v3's 32 layers. For streaming, this cuts per-chunk inference from roughly 40ms to 22ms on an RTX 4090 at INT8. The accuracy difference on clean English audio is small (3.0% vs 2.7% WER), and since you're already disabling condition_on_previous_text, Large v3's deeper decoder cannot use context from earlier chunks anyway.
For English-only pipelines requiring the absolute lowest latency, NVIDIA Parakeet is worth evaluating. Parakeet-TDT pairs a FastConformer encoder with a lightweight transducer decoder instead of Whisper's large autoregressive transformer decoder, which removes most of the per-token decoding cost.
Batch Transcription Economics: 100 Hours Under $4
The math is straightforward. faster-whisper on a modern GPU processes audio at approximately 25-30x real time at INT8:
- 100 hours of audio = 360,000 seconds
- At 25x real time, compute time = 14,400 seconds = 4 hours
- L40S spot at $0.32/hr: 4 hours × $0.32 = $1.28 total
- RTX 4090 on-demand at $0.79/hr: 4 hours × $0.79 = $3.16 total
Compare that to the OpenAI Whisper API:
- OpenAI Whisper API: $0.006/minute
- 100 hours = 6,000 minutes
- API cost: 6,000 × $0.006 = $36.00
| Volume | L40S spot self-host | RTX 4090 on-demand | OpenAI Whisper API |
|---|---|---|---|
| 100 hours/month | ~$1.28 | ~$3.16 | $36 |
| 1,000 hours/month | ~$13 | ~$32 | $360 |
| 10,000 hours/month | ~$128 | ~$316 | $3,600 |
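The API price stays flat per minute regardless of volume, and self-hosted cost scales linearly with compute time at a fixed GPU rate. The ratio therefore stays roughly constant (about 28x cheaper on L40S spot, about 11x on RTX 4090 on-demand), while the absolute gap grows with volume: roughly $35 at 100 hours/month and roughly $3,470 at 10,000 hours/month on L40S spot.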
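The cost table can be reproduced with two one-line formulas, which also makes it easy to plug in your own GPU rate or measured throughput. This sketch uses the guide's figures (25x real time, $0.32/hr and $0.79/hr, $0.006/minute); the function names are illustrative, and it ignores instance startup time, storage, and egress.

```python
def self_host_cost(audio_hours: float, speedup: float, usd_per_gpu_hour: float) -> float:
    """GPU cost to transcribe audio_hours at `speedup` times real time."""
    return audio_hours / speedup * usd_per_gpu_hour


def api_cost(audio_hours: float, usd_per_minute: float = 0.006) -> float:
    """Per-minute API cost for the same audio."""
    return audio_hours * 60 * usd_per_minute


for hours in (100, 1_000, 10_000):
    l40s = self_host_cost(hours, 25, 0.32)
    rtx = self_host_cost(hours, 25, 0.79)
    api = api_cost(hours)
    print(f"{hours:>6} h: L40S spot ${l40s:,.2f} | RTX 4090 ${rtx:,.2f} | API ${api:,.2f}")
```

At 30x real time instead of 25x, the self-hosted figures drop by about a sixth, so the table is the conservative end of the stated range.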
Running parallel workers:
import concurrent.futures
import threading
from pathlib import Path
from faster_whisper import WhisperModel

_thread_local = threading.local()

def _get_model() -> WhisperModel:
    if not hasattr(_thread_local, "model"):
        _thread_local.model = WhisperModel("large-v3", device="cuda", compute_type="int8")
    return _thread_local.model

def transcribe_file(audio_path: Path) -> str:
    # One model per thread, created on first use and reused across all files that thread handles.
    model = _get_model()
    segments, _ = model.transcribe(str(audio_path), vad_filter=True)
    transcript = " ".join(seg.text.strip() for seg in segments)
    out_path = audio_path.with_suffix(".txt")
    out_path.write_text(transcript)
    return str(audio_path)

audio_files = list(Path("/audio").glob("*.wav"))
# max_workers=4 means four model copies resident on the GPU; size this to your VRAM
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for result in executor.map(transcribe_file, audio_files):
        print(f"Done: {result}")
For very large archives, use spot GPU instances across multiple machines and split the file list. Four L40S spot instances in parallel finish 100 hours in about an hour at $1.28 combined.
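Splitting the file list across machines can be as simple as a deterministic round-robin shard, where each instance is started with its own index. The helper name and the file names below are illustrative. Round-robin balances file counts, not audio duration, so archives with very uneven file lengths may want duration-aware assignment instead.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], num_shards: int, index: int) -> List[T]:
    """Return the slice of items that instance `index` of `num_shards` handles."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    # Every num_shards-th item, starting at this instance's offset
    return list(items[index::num_shards])


files = [f"call_{i:03d}.wav" for i in range(10)]  # sort the real list first
for i in range(4):
    print(i, shard(files, 4, i))
```

Sorting the file list before sharding matters: every instance must see the same order, or files can be skipped or transcribed twice.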
Integrating ASR with a Voice Agent Stack
faster-whisper handles the ASR layer. The full real-time pipeline looks like this:
Audio In (microphone / phone)
↓
LiveKit (WebRTC transport + audio capture)
↓
faster-whisper / WhisperX (ASR: 15-50ms per chunk)
↓
Llama 3.1 8B / Qwen 2.5 7B (LLM response: 150-300ms TTFT)
↓
Kokoro-82M / Fish Speech (TTS synthesis: streaming, first chunk 50-100ms)
↓
LiveKit (audio delivery)
↓
Audio Out (speaker)
Latency targets: ASR should account for 30-80ms of the 500ms total budget. The LLM is typically the bottleneck at 150-300ms TTFT. TTS first-chunk latency with Kokoro runs 50-100ms. Total end-to-end stays under 500ms if you stream LLM tokens into TTS as complete sentences arrive.
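Summing the per-stage ranges shows how much slack the 500ms budget actually has. The sketch below uses only the three stage ranges quoted above; the dictionary keys are illustrative, and network transport (the LiveKit hops in the diagram) is not included, which is an omission of this sketch.

```python
BUDGET_MS = 500

# (best case, worst case) per stage, from the latency targets above
STAGES_MS = {
    "asr": (30, 80),
    "llm_ttft": (150, 300),
    "tts_first_chunk": (50, 100),
}


def budget_range(stages):
    """Return (best_total_ms, worst_total_ms) across all stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst


best, worst = budget_range(STAGES_MS)
print(f"best {best}ms, worst {worst}ms, headroom at worst case {BUDGET_MS - worst}ms")
```

The worst case leaves little room once transport is added, so the LLM's time to first token is the stage most worth optimizing.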
For the full stack:
- Voice AI GPU infrastructure guide covers latency budgets, VRAM allocation across stages, and GPU recommendations for ASR + LLM + TTS co-location.
- TTS deployment guide covers Kokoro, Fish Speech, and Hume TADA deployment in detail.
- NeuTTS Air guide covers the 320x real-time TTS option for ultra-low-latency synthesis with 3-second voice cloning.
- For the complete WebRTC integration, including how to connect Whisper ASR output to an LLM over a WebRTC data channel and handle barge-in, see the WebRTC LLM streaming voice agent guide.
- If your target latency is under 300ms and you can tolerate English-only or limited multilingual support, unified speech-to-speech models like Moshi skip the ASR stage entirely, collapsing three pipeline stages into one forward pass.
NeuTTS Air's 748M-parameter model uses under 2GB VRAM, which means you can co-locate it alongside faster-whisper (3-4GB at INT8) and a 7B LLM (5-6GB at INT4) on a single 24GB RTX 4090 for the complete pipeline.
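A quick budget check makes the co-location claim concrete. The sketch below takes the upper end of each figure quoted above; the component names are illustrative, and the remaining headroom still has to cover the LLM's KV cache, activations, and CUDA context overhead, none of which are modeled here.

```python
GPU_VRAM_GB = 24.0

# Upper bounds from the paragraph above
COMPONENTS_GB = {
    "neutts_air_tts": 2.0,        # "under 2GB"
    "faster_whisper_int8": 4.0,   # "3-4GB at INT8"
    "llm_7b_int4": 6.0,           # "5-6GB at INT4"
}


def vram_headroom(total_gb: float, components: dict) -> float:
    """VRAM left after loading every component's weights."""
    return total_gb - sum(components.values())


used = sum(COMPONENTS_GB.values())
print(f"weights: {used:.0f}GB, headroom: {vram_headroom(GPU_VRAM_GB, COMPONENTS_GB):.0f}GB")
```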
On-Demand vs Spot: When to Use Each for ASR
| Workload | Recommended Instance | Estimated Monthly Cost |
|---|---|---|
| Always-on voice agent, 50 concurrent streams | L40S PCIe on-demand | ~$518/month |
| Batch call center archive, 1,000 hrs/month | L40S spot | ~$13 |
| Meeting transcription SaaS, 1,000 hrs/month | RTX 4090 on-demand | ~$32 |
| High-concurrency real-time, 200 concurrent streams | H100 PCIe on-demand | ~$1,447/month |
Use on-demand when: You're running always-on voice agents or real-time transcription services with SLAs. On-demand instances don't get preempted, which matters when active sessions break if the machine goes away mid-call.
Use spot when: You're processing batch jobs: meeting recordings, call center archives, podcast transcription, any workload that can checkpoint and restart. L40S at $0.32/hr spot is less than half the on-demand rate and is the right call for any job that doesn't require guaranteed uptime.
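"Can checkpoint and restart" is easy to satisfy for file-based batch jobs: treat each written transcript as the checkpoint and skip finished files on restart. The sketch below is one way to do that under stated assumptions: transcripts live next to the audio as `.txt` files, and `transcribe` is any callable you supply (for example, a wrapper around faster-whisper). Both helper names are hypothetical.

```python
from pathlib import Path
from typing import Callable, List


def pending_files(audio_dir: Path, pattern: str = "*.wav") -> List[Path]:
    """Audio files that do not yet have a finished .txt transcript."""
    return sorted(p for p in audio_dir.glob(pattern)
                  if not p.with_suffix(".txt").exists())


def run_resumable(audio_dir: Path, transcribe: Callable[[Path], str]) -> int:
    """Transcribe only unfinished files; safe to re-run after preemption."""
    done = 0
    for path in pending_files(audio_dir):
        text = transcribe(path)
        # Write to a temp name, then rename, so a preempted write
        # never leaves a half-finished .txt that looks complete
        tmp = path.with_name(path.stem + ".txt.tmp")
        tmp.write_text(text)
        tmp.replace(path.with_suffix(".txt"))
        done += 1
    return done
```

If the instance is reclaimed mid-job, relaunching the same command on a new spot instance picks up where the last one stopped, at the cost of redoing at most one file.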
Troubleshooting: Hallucinations, Long-Audio Drift, and Low-Resource Languages
Hallucinations on silent audio: Whisper's default behavior is to generate something even on segments with no speech. Combine no_speech_threshold=0.6 (the faster-whisper default, shown explicitly here) with vad_filter=True:
segments, info = model.transcribe(
    audio,
    no_speech_threshold=0.6,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)
Segments where Whisper's no-speech probability exceeds 0.6 and the average log probability is also below log_prob_threshold get discarded before reaching your output. Combined with VAD pre-filtering, this handles the large majority of hallucination cases on real-world audio.
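For intuition about when a segment is actually dropped, the check can be written as a pure function. This is a paraphrase of the rule as implemented in Whisper-style decoders, to the best of our reading, not the library's own code: a high no-speech probability alone is not enough if the decoder was still confident in the text it produced. The function name is illustrative.

```python
def should_skip_segment(no_speech_prob: float,
                        avg_logprob: float,
                        no_speech_threshold: float = 0.6,
                        log_prob_threshold: float = -1.0) -> bool:
    """True if the segment would be treated as silence and dropped."""
    skip = no_speech_prob > no_speech_threshold
    if avg_logprob > log_prob_threshold:
        # Decoder is confident in its text: keep the segment anyway
        skip = False
    return skip


print(should_skip_segment(no_speech_prob=0.9, avg_logprob=-1.5))  # silence-like, low confidence
print(should_skip_segment(no_speech_prob=0.9, avg_logprob=-0.3))  # confident text, kept
```

This is why raising no_speech_threshold alone sometimes has less effect than expected, and why the VAD pre-filter is the more reliable first line of defense.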
Long-audio drift with condition_on_previous_text: The default condition_on_previous_text=True feeds each segment's transcript back as a prefix for the next. On long audio with consistent content this usually helps. On batch jobs with variable audio quality, it can cause Whisper to repeat the same phrase in a loop. Set condition_on_previous_text=False for batch jobs over 10 minutes.
Repetition loops: Keep compression_ratio_threshold=2.4 and log_prob_threshold=-1.0 (both are the library defaults, shown explicitly here) and disable conditioning on previous text:
segments, info = model.transcribe(
    audio,
    compression_ratio_threshold=2.4,
    log_prob_threshold=-1.0,
    condition_on_previous_text=False
)
The compression ratio threshold catches segments where output text is highly repetitive (high ratio means repeated tokens). The log probability threshold catches segments where Whisper has low overall confidence, which correlates strongly with hallucination artifacts. When either check fails, the segment is re-decoded at a higher temperature rather than accepted as-is.
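The compression ratio itself is simple to compute: UTF-8 byte length divided by zlib-compressed length. The sketch below mirrors that definition so you can score transcripts offline, for example to audit an existing archive for loops. The sample strings are made up.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size over zlib-compressed size; repetition drives it up."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


looped = "thank you for watching. " * 20
normal = ("The quarterly revenue grew faster than the board expected, "
          "mostly due to new enterprise contracts.")

for label, text in (("looped", looped), ("normal", normal)):
    ratio = compression_ratio(text)
    print(f"{label}: {ratio:.2f} -> {'suspect' if ratio > 2.4 else 'ok'}")
```

Short, ordinary sentences barely compress, while a looped phrase compresses many times over, which is why a fixed threshold around 2.4 separates the two well.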
Low-resource languages: Use Whisper Large v3, not Turbo. Turbo's reduced decoder layer count (4 vs 32) hurts most on low-resource language pairs where the model needs more computation to produce accurate output. For multilingual workloads spanning more than 10 languages, stay on Large v3.
CUDA OOM during WhisperX batch jobs: On a 24GB RTX 4090, running WhisperX with batch_size=16 alongside pyannote in the same process can exhaust VRAM. Drop batch_size to 8 first. If the problem persists, drop to 4 or run diarization in a separate process after transcription completes. The two models' peak VRAM usage becomes sequential instead of simultaneous, which removes the memory spike entirely.
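The "drop batch_size to 8, then 4" advice can be automated with a small retry wrapper. The sketch below assumes the out-of-memory failure surfaces as a RuntimeError whose message contains "out of memory", which is how PyTorch and CTranslate2 typically report it; the wrapper name and the `job` callable are illustrative. In a real pipeline you would likely also free cached GPU memory between attempts.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_backoff(job: Callable[[int], T],
                     batch_sizes: Sequence[int] = (16, 8, 4)) -> T:
    """Call job(batch_size), stepping down the list on out-of-memory errors."""
    last_error = None
    for batch_size in batch_sizes:
        try:
            return job(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated failure: do not mask it
            last_error = err
    raise RuntimeError(f"out of memory at all batch sizes {list(batch_sizes)}") from last_error
```

Usage would look like `run_with_backoff(lambda bs: model.transcribe(audio, batch_size=bs))`, with `model` and `audio` as in the WhisperX pipeline earlier.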
Whisper and faster-whisper give you production-grade ASR at a fraction of API costs. Whether you're building a voice agent, transcribing meeting recordings, or running a call center pipeline, the GPU cloud math consistently favors self-hosting at any volume above a few hundred hours per month.
Quick Setup Guide
Step 1: Choose a model
Whisper Large v3 is the safest production choice as of April 2026 - 99 language support, Apache 2.0 license, 1,550M parameters, ~3GB VRAM at FP16. Whisper Large v3 Turbo cuts decoder layers from 32 to 4 for roughly 45% lower per-chunk latency (about 40ms to 22ms on an RTX 4090 at INT8) at minor accuracy cost - right choice for latency-sensitive streaming. NVIDIA Canary-1B (1B params, 4 languages, CC BY-NC 4.0) and Parakeet (English-only, CC-BY-4.0) are faster on NVIDIA hardware but lack Whisper's language breadth. Use Whisper Large v3 Turbo for voice agents, Large v3 for meeting transcription and accuracy-critical workloads, Parakeet for high-throughput English-only batch jobs.
Step 2: Provision a GPU instance
Go to app.spheron.ai. For streaming ASR (single or low-concurrency streams), an RTX 4090 PCIe ($0.79/hr on-demand) or L40S ($0.72/hr on-demand, $0.32/hr spot) is sufficient. For batch transcription, use L40S spot instances and run as many parallel workers as your job size warrants. For 60+ concurrent live streams, an L40S on-demand gives headroom without overpaying for H100 VRAM you won't fill. Select Ubuntu 22.04, 50GB+ storage, SSH in.
Step 3: Install faster-whisper and enable VAD
pip install faster-whisper (Silero VAD ships with the package, so no separate install is needed). Optionally pre-download the pre-converted CTranslate2 weights from Systran/faster-whisper-large-v3 on HuggingFace for faster startup. Load with: from faster_whisper import WhisperModel; model = WhisperModel('large-v3', device='cuda', compute_type='int8'). Enable VAD with: segments, info = model.transcribe('audio.wav', vad_filter=True, vad_parameters=dict(min_silence_duration_ms=500)). VAD cuts hallucinations on silent audio sections and speeds up batch throughput.
Step 4: Add word-level alignment and diarization with WhisperX
pip install whisperx. Set HF_TOKEN=your_token (required for pyannote - accept terms at pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 on HuggingFace first). Run: import whisperx; model = whisperx.load_model('large-v3', 'cuda', compute_type='float16'); result = model.transcribe(audio, batch_size=16); align_model, metadata = whisperx.load_align_model(language_code=result['language'], device='cuda'); result = whisperx.align(result['segments'], align_model, metadata, audio, 'cuda'); diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device='cuda'); diarize_segments = diarize_model(audio); result = whisperx.assign_word_speakers(diarize_segments, result). Output includes per-word timestamps and speaker labels.
Step 5: Set up chunked streaming
Buffer incoming audio into 4-second chunks with 0.5-second overlaps on both sides. Feed each chunk through faster-whisper with beam_size=1 and condition_on_previous_text=False to avoid hypothesis drift across chunks. Merge hypotheses by stripping the first and last 0.5 seconds of each chunk's output (the overlap region) and concatenating the stable middle portions. For end-of-utterance detection, use the Silero VAD output: when VAD detects silence exceeding 600ms, finalize the current hypothesis and reset the buffer.
Step 6: Run batch transcription
For large batch jobs, use Spheron L40S spot instances at $0.32/hr or RTX 4090 on-demand at $0.79/hr. Launch with: docker run --gpus all -v /audio:/audio fedirz/faster-whisper-server:latest-cuda. Queue audio files with a simple Python worker that pulls from a directory or S3 bucket and writes transcripts to JSON. faster-whisper processes at 25-30x real time on a modern GPU, so 100 hours of audio finishes in about 4 hours or less. At L40S spot pricing, total cost is approximately $1.28 (4 hours × $0.32/hr). Use multiple spot instances in parallel to cut wall-clock time proportionally.
Frequently Asked Questions
What GPU do I need to run Whisper in production?
An RTX 4090 (24GB) handles 30+ concurrent streams with faster-whisper at INT8. An L40S (48GB) doubles that to 60+ concurrent streams and adds headroom for batch jobs. H100 (80GB) is the right choice for transcription at scale: at 150+ streams per card, several H100s cover 1,000+ concurrent sessions or large batch workloads. For a voice agent pipeline running a single stream, any GPU with 6+ GB VRAM is sufficient.
What is the difference between faster-whisper and WhisperX?
faster-whisper is a reimplementation of OpenAI Whisper using CTranslate2, which runs up to 4x faster than the original at comparable accuracy. WhisperX builds on faster-whisper and adds word-level timestamp alignment (via wav2vec2 forced alignment) and speaker diarization (via pyannote-audio). Use faster-whisper for raw transcription speed; use WhisperX when you need word-level timing or speaker labels.
How much does it cost to transcribe 100 hours of audio?
At spot prices on an L40S, transcribing 100 hours of audio costs around $1.28. faster-whisper processes audio at roughly 25-30x real time, so the job finishes in about 4 hours or less (4 hours × $0.32/hr = $1.28). On-demand pricing on an RTX 4090 brings this to around $3.16. Either way, this compares to $36 for 100 hours using OpenAI's Whisper API at $0.006/minute.
Can Whisper do real-time streaming transcription?
Yes, with chunking. Whisper is not natively a streaming model - it runs on fixed-length audio windows. For real-time streaming, split the incoming audio into 3-5 second chunks with 0.5-1 second overlaps, run inference on each chunk, and merge hypotheses using the overlap to handle cross-boundary words. For lower latency, Whisper Large v3 Turbo (4 decoder layers vs 32) cuts per-chunk latency nearly in half compared to Large v3.
How do I add speaker diarization to Whisper output?
Use WhisperX with pyannote-audio 3.x. Install both packages, accept pyannote's terms on HuggingFace, and set your HF_TOKEN environment variable. Run: result = model.transcribe(audio), aligned = whisperx.align(result['segments'], alignment_model, metadata, audio, device), diarized = whisperx.assign_word_speakers(diarize_model(audio), aligned). This adds a 'speaker' field to each word segment. pyannote runs on CPU but is slow there; use an NVIDIA GPU with CUDA for production workloads.
