Self-Host Faster-Whisper on GPU Cloud: Production Deployment Guide for Real-Time ASR (2026)

Written by Mitrasish, Co-founder | May 21, 2026

The reference Whisper implementation is slow by design. It runs PyTorch ops in FP32, allocates a full attention matrix per decode step, and has no batched inference path. faster-whisper replaces the backend with CTranslate2, a C++ inference engine, and gets 4x lower latency and roughly half the VRAM on the same model size. For production ASR, that difference matters. This guide covers model selection, GPU sizing on Spheron, a FastAPI reference server, real-time WebSocket streaming with VAD, and the cost-per-minute math versus hosted APIs. For the upstream voice pipeline context, see the speech-to-speech deployment guide for how the ASR layer fits into unified S2S architectures.

What Is faster-whisper: CTranslate2 Backend and Quantization

faster-whisper is a Python library that wraps CTranslate2, a C++ inference engine for transformer models. The original OpenAI Whisper runs through PyTorch with FP32 ops. CTranslate2 replaces that with a custom kernel path optimized for Intel MKL and NVIDIA cuBLAS, adds INT8 and FP16 quantization, and implements a fused attention operator that cuts memory allocation overhead per decode step.

The practical differences for production deployment:

| Metric | Whisper Reference | faster-whisper (INT8) | faster-whisper (FP16) |
| --- | --- | --- | --- |
| Inference speed (large-v3) | 1x baseline | 4-5x faster | 2-3x faster |
| VRAM (large-v3) | ~6GB (FP32) | ~3GB | ~3GB |
| WER delta vs. reference | 0% (baseline) | < 0.2% (negligible) | < 0.1% |
| PyTorch required at runtime | Yes | No | No |
| Batched inference | No | Yes | Yes |
| Beam search optimization | Standard | Fused ops | Fused ops |

INT8 quantization halves VRAM with under 0.2% WER regression on clean audio. The quantization applies to the encoder and decoder weights, not the audio features themselves, so the accuracy impact is minimal on studio-quality or phone-quality audio. On noisy audio or heavy accents, INT8 and FP16 are essentially indistinguishable.
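
To make the weight-quantization idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a generic illustration, not CTranslate2's actual scheme (which may use a different scaling granularity), and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Assumption: a single scale per tensor, derived from the largest magnitude.
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"max error {max_error:.4f}")
```

Each weight is stored in one byte instead of two (FP16) or four (FP32), and the rounding error per weight is bounded by half the scale, which is why the accuracy cost stays small.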

CTranslate2 does not require PyTorch at inference time. The model weights are stored in a custom binary format (.bin with a config.json). This matters for Docker image size: a PyTorch-free CTranslate2 image is 2-3GB smaller than a full PyTorch GPU image.

One other advantage: CTranslate2 has a native batched inference path. The reference Whisper decodes one audio window at a time. faster-whisper exposes batched decoding through its BatchedInferencePipeline wrapper, which splits audio into chunks and decodes several of them in a single batch call, improving GPU utilization on large transcription jobs.

Model Size Selection: VRAM and Latency Tradeoffs

| Model | Parameters | VRAM (INT8) | VRAM (FP16) | RTF on L40S | WER (LibriSpeech test-clean) | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| tiny | 39M | ~75MB | ~150MB | 150x | 5.7% | Edge/CPU, prototyping |
| base | 74M | ~145MB | ~290MB | 100x | 4.2% | Low-latency English, simple domains |
| small | 244M | ~500MB | ~950MB | 60x | 3.0% | Budget GPU, background tasks |
| medium | 769M | ~1.5GB | ~3GB | 35x | 2.6% | Balanced accuracy/speed |
| large-v2 | 1,550M | ~3GB | ~6GB | 15x | 2.7% | High accuracy, legacy checkpoints |
| large-v3 | 1,550M | ~3GB | ~6GB | 15x | 2.7% | Accuracy-critical production |
| large-v3 Turbo | 809M | ~1.6GB | ~3.2GB | 30x | 3.0% | Voice agents, real-time streaming |
| distil-large-v3 | 756M | ~1.5GB | ~3GB | 90x | 3.0% | High-throughput English batch |

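RTF (real-time factor) is used here as a speed-up factor: how many hours of audio the model processes per hour of GPU time. (Some sources define RTF as the inverse, processing time divided by audio duration, where lower is better.) RTF 30x means 30 hours of audio per GPU-hour, or about 2 minutes of processing per 1 hour of audio.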
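
The conversion from RTF to processing time is a single division. A small sketch using RTF values from the table above (the helper name is hypothetical):

```python
def processing_minutes(audio_hours: float, rtf: float) -> float:
    """GPU minutes needed to transcribe `audio_hours` at a given speed-up factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_hours * 60.0 / rtf


for name, rtf in [("large-v3", 15), ("large-v3 Turbo", 30), ("distil-large-v3", 90)]:
    minutes = processing_minutes(1.0, rtf)
    print(f"{name}: {minutes:.1f} GPU-minutes per audio-hour")
```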

For voice agents and streaming: large-v3 Turbo is the best starting point. It has 4 decoder layers instead of large-v3's 32, which cuts per-chunk inference time nearly in half with only a 0.3-point WER increase (3.0% vs. 2.7%). At INT8, it fits in 1.6GB VRAM, so an L40S can hold 25+ model instances before VRAM fills.

For batch transcription: large-v3 at INT8 gives the best accuracy for multilingual audio. The WER on accented speech and noisy recordings is noticeably better than Turbo. Use large-v3 when you have batch jobs and latency is not the constraint.

For English-only high-throughput workloads: Systran/faster-distil-whisper-large-v3 is a distilled variant that stays close to large-v3 WER on English (3.0% vs. 2.7%) while running at 90x RTF, about 6x faster than large-v3. Load it with: WhisperModel("Systran/faster-distil-whisper-large-v3", device="cuda", compute_type="int8").

GPU Selection: L40S vs. H100 for Faster-Whisper

| GPU | VRAM | On-demand ($/hr) | RTF (large-v3 INT8) | Concurrent Streams (large-v3 INT8) | Best Workload |
| --- | --- | --- | --- | --- | --- |
| RTX 4090 PCIe (market reference, not on Spheron) | 24GB | ~$0.58/hr | 25x | 50+ | Dev, single-stream, small batch |
| L40S on Spheron | 48GB | $0.75/hr | 35x | 100+ | Production streaming, mixed batch |
| H100 PCIe | 80GB | $2.09/hr | 60x | 200+ | High-concurrency call centers |
| H100 SXM5 on Spheron | 80GB | $2.64/hr (spot: $1.66/hr) | 70x | 250+ | 500+ sessions, co-located multi-model |

Pricing fluctuates based on GPU availability. The prices above are as of 21 May 2026 and may have changed. Check current GPU pricing for live rates.

The L40S at $0.75/hr is the production sweet spot. Its 48GB VRAM fits 25+ large-v3-turbo INT8 model instances simultaneously (each at 1.6GB), with VRAM headroom for KV caches and audio buffers during concurrent inference. The RTF of 35x means 100 hours of audio finishes in roughly 2.9 hours of GPU time.

The RTX 4090 is cheaper per GPU-hour on some providers (it is not currently listed on Spheron) and has half the VRAM (24GB). For single-stream voice agents or dev environments, it works fine. For production with 20+ concurrent streams, VRAM becomes the constraint.

The H100 SXM5 makes sense when you need 500+ simultaneous sessions or want to co-locate faster-whisper with an LLM or TTS model on the same instance. Its NVLink fabric also improves multi-GPU tensor parallelism if you need to run multiple large-v3 instances at batch size > 1.

Decision rule: Start with L40S. If you hit VRAM limits at your target concurrency, scale to H100. If you are running a single stream with no production SLA, RTX 4090 saves money.

Production Setup on Spheron: Container, Deps, and Model Caching

Provision an L40S instance

Log in to app.spheron.ai, select a GPU instance, and choose L40S PCIe (48GB VRAM). See the Spheron getting started guide for account setup and SSH key configuration. Select Ubuntu 22.04, allocate at least 50GB of disk (model weights for large-v3 at INT8 are about 3GB; Turbo is 1.6GB), and open port 8000 in the instance firewall rules. SSH in and confirm CUDA availability:

bash
nvidia-smi
# Should show L40S with 48564 MiB VRAM

Install dependencies

bash
pip install faster-whisper fastapi uvicorn websockets python-multipart

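faster-whisper bundles the Silero VAD model, so vad_filter=True works without extra packages. Install the standalone silero-vad package only if you also want to run VAD yourself, outside faster-whisper (for example to gate audio before it reaches the model):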

bash
pip install silero-vad

Pre-cache the model

On first load, faster-whisper downloads the pre-converted CTranslate2 weights from HuggingFace. In production, pre-cache the model during container startup rather than at first request:

python
from faster_whisper import WhisperModel

# Pre-cache at startup
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

The cache lives at ~/.cache/huggingface/hub. Mount this as a persistent volume in Docker to avoid re-downloading on container restarts.

Dockerfile example

dockerfile
# Use a fully qualified CUDA tag; recent CTranslate2 releases expect CUDA 12 with cuDNN 9
FROM nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3-pip && \
    pip install faster-whisper fastapi uvicorn websockets \
                python-multipart silero-vad

WORKDIR /app
COPY server.py .

# Pre-cache model at build time
RUN python3 -c "from faster_whisper import WhisperModel; \
    WhisperModel('large-v3', device='cpu', compute_type='int8')"

CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

Note the device='cpu' in the build step: Docker build does not have GPU access, so the model is downloaded during build and CUDA inference is initialized at container start. For automating Docker setup on Spheron instances via startup scripts, see the Spheron startup scripts guide.

Run it with:

bash
docker run --gpus all \
  -p 8000:8000 \
  -v model-cache:/root/.cache/huggingface \
  your-faster-whisper-image

FastAPI + faster-whisper Reference Server

This server exposes two endpoints: POST /transcribe for batch file upload and WebSocket /ws/stream for real-time audio. The model loads once at startup and is shared across all requests.

python
import asyncio
import io
import json
import threading
from contextlib import asynccontextmanager
from typing import Optional

import numpy as np
import uvicorn
from fastapi import FastAPI, File, UploadFile, WebSocket, WebSocketDisconnect
from faster_whisper import WhisperModel
from pydantic import BaseModel

# Global model instance. The lock serializes transcription so only one request drives the GPU at a time.
whisper_model: Optional[WhisperModel] = None
_model_lock = threading.Lock()


@asynccontextmanager
async def lifespan(app: FastAPI):
    global whisper_model
    whisper_model = WhisperModel(
        "large-v3",
        device="cuda",
        compute_type="int8",
    )
    yield
    whisper_model = None


app = FastAPI(lifespan=lifespan)


class WordTimestamp(BaseModel):
    word: str
    start: float
    end: float
    probability: float


class Segment(BaseModel):
    id: int
    start: float
    end: float
    text: str
    words: list[WordTimestamp]


class TranscribeResponse(BaseModel):
    segments: list[Segment]
    language: str
    language_probability: float
    duration: float


@app.post("/transcribe", response_model=TranscribeResponse)
async def transcribe(audio: UploadFile = File(...)):
    audio_bytes = await audio.read()

    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, _transcribe_sync, audio_bytes)
    return result


def _transcribe_sync(audio_bytes: bytes) -> TranscribeResponse:
    with _model_lock:
        segments_raw, info = whisper_model.transcribe(
            io.BytesIO(audio_bytes),
            beam_size=5,
            word_timestamps=True,
            vad_filter=True,
            vad_parameters={"min_silence_duration_ms": 500},
        )
        segments_raw = list(segments_raw)

    segments = []
    for i, seg in enumerate(segments_raw):
        words = [
            WordTimestamp(
                word=w.word,
                start=w.start,
                end=w.end,
                probability=w.probability,
            )
            for w in (seg.words or [])
        ]
        segments.append(
            Segment(
                id=i,
                start=seg.start,
                end=seg.end,
                text=seg.text.strip(),
                words=words,
            )
        )

    return TranscribeResponse(
        segments=segments,
        language=info.language,
        language_probability=info.language_probability,
        duration=info.duration,
    )


def _transcribe_stream_sync(audio_array: np.ndarray) -> tuple[str, str]:
    with _model_lock:
        segments_raw, info = whisper_model.transcribe(
            audio_array,
            beam_size=1,
            condition_on_previous_text=False,
            vad_filter=True,
            vad_parameters={"min_silence_duration_ms": 500},
        )
        text = " ".join(seg.text.strip() for seg in segments_raw)
        language = info.language
    return text, language


@app.websocket("/ws/stream")
async def stream_audio(websocket: WebSocket):
    await websocket.accept()
    buffer = bytearray()
    CHUNK_SIZE = 16000 * 4 * 2  # 4 seconds at 16kHz, 16-bit mono (2 bytes per sample)

    try:
        while True:
            data = await websocket.receive_bytes()
            buffer.extend(data)

            while len(buffer) >= CHUNK_SIZE:
                chunk = bytes(buffer[:CHUNK_SIZE])
                buffer = buffer[CHUNK_SIZE:]

                audio_array = np.frombuffer(chunk, dtype=np.int16).astype(np.float32)
                audio_array /= 32768.0

                loop = asyncio.get_running_loop()
                text, language = await loop.run_in_executor(
                    None, _transcribe_stream_sync, audio_array
                )

                if text:
                    await websocket.send_text(
                        json.dumps(
                            {
                                "text": text,
                                "language": language,
                            }
                        )
                    )

    except WebSocketDisconnect:
        if buffer:
            # Flush the remaining audio on disconnect so final words are not dropped.
            # Zero-pad to the nearest full chunk so the model sees valid frame boundaries.
            remainder = bytes(buffer)
            pad_len = CHUNK_SIZE - (len(remainder) % CHUNK_SIZE)
            if pad_len < CHUNK_SIZE:
                remainder += b"\x00" * pad_len
            audio_array = np.frombuffer(remainder, dtype=np.int16).astype(np.float32)
            audio_array /= 32768.0
            loop = asyncio.get_running_loop()
            text, language = await loop.run_in_executor(
                None, _transcribe_stream_sync, audio_array
            )
            if text:
                try:
                    await websocket.send_text(
                        json.dumps({"text": text, "language": language})
                    )
                except Exception:
                    pass


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Test the batch endpoint:

bash
curl -X POST http://localhost:8000/transcribe \
  -F "audio=@test.wav"

Test the WebSocket streaming endpoint with a Python client:

python
import asyncio
import websockets

async def stream_test():
    uri = "ws://localhost:8000/ws/stream"
    async with websockets.connect(uri) as ws:
        with open("test.wav", "rb") as f:
            audio = f.read()[44:]  # skip the canonical 44-byte WAV header (assumes 16 kHz, 16-bit mono PCM)
        chunk_size = 16000 * 4 * 2  # 4s at 16kHz 16-bit
        for i in range(0, len(audio), chunk_size):
            await ws.send(audio[i:i+chunk_size])
            try:
                msg = await asyncio.wait_for(ws.recv(), timeout=2.0)
                print(msg)
            except asyncio.TimeoutError:
                pass

asyncio.run(stream_test())
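
Slicing off 44 bytes only works for canonical WAV files. A slightly more defensive variant of the loading step, sketched with the standard-library wave module, parses the header and checks that the audio matches what the server expects (16 kHz, 16-bit, mono). The helper name is hypothetical.

```python
import wave


def read_pcm16_mono_16k(source) -> bytes:
    """Return raw PCM frames from a WAV file path or binary file-like object."""
    with wave.open(source, "rb") as wav:
        fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        # The streaming endpoint interprets bytes as 16-bit mono at 16 kHz.
        if fmt != (1, 2, 16000):
            raise ValueError(f"expected mono 16-bit 16 kHz audio, got {fmt}")
        return wav.readframes(wav.getnframes())
```

In the test client, `audio = read_pcm16_mono_16k("test.wav")` would replace the `f.read()[44:]` line; files in other formats need resampling first.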

Real-Time vs. Batch Transcription: Streaming, VAD, and Chunking

Chunking strategy

Whisper is not a streaming model by design. It processes fixed-length audio windows (up to 30 seconds). For real-time streaming, the standard pattern is to break incoming audio into 3-5 second chunks with 0.5-second overlaps and merge the outputs:

python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 4
OVERLAP_SECONDS = 0.5
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS
OVERLAP_SAMPLES = int(SAMPLE_RATE * OVERLAP_SECONDS)

def chunk_audio(audio_array: np.ndarray):
    # Advance by less than a full chunk so consecutive chunks share OVERLAP_SAMPLES
    step = CHUNK_SAMPLES - OVERLAP_SAMPLES
    for start in range(0, len(audio_array), step):
        yield audio_array[start:start + CHUNK_SAMPLES]

Discard the overlap region when merging segment outputs. Words within the first and last 0.5 seconds of each chunk's output are candidates for revision in the next chunk, so treat them as preliminary until the next chunk confirms or replaces them.
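
One way to implement the merge is to let each chunk "own" the audio up to the midpoint of its overlap with the next chunk, and keep only the words whose midpoint falls in the owned span. The sketch below assumes word timestamps relative to the start of each chunk; the Word type and merge_chunks helper are hypothetical, not part of faster-whisper.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds, relative to the start of its chunk
    end: float


def merge_chunks(chunks: list[list[Word]], chunk_seconds: float = 4.0,
                 overlap_seconds: float = 0.5) -> list[Word]:
    """Merge per-chunk word lists into one timeline, dropping overlap duplicates."""
    step = chunk_seconds - overlap_seconds
    half = overlap_seconds / 2
    merged = []
    for index, words in enumerate(chunks):
        offset = index * step  # absolute start time of this chunk
        own_start = 0.0 if index == 0 else offset + half
        is_last = index == len(chunks) - 1
        own_end = float("inf") if is_last else offset + chunk_seconds - half
        for w in words:
            midpoint = offset + (w.start + w.end) / 2
            if own_start <= midpoint < own_end:
                merged.append(Word(w.text, offset + w.start, offset + w.end))
    return merged


first = [Word("hello", 0.2, 0.6), Word("world", 3.5, 3.9)]
second = [Word("world", 0.0, 0.4), Word("again", 0.6, 1.0)]
print(" ".join(w.text for w in merge_chunks([first, second])))
```

Here "world" is heard by both chunks, but only the first chunk's copy is kept because its midpoint lies before the ownership boundary at 3.75 seconds.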

VAD integration

Silero VAD is integrated natively into faster-whisper via the vad_filter=True parameter. When enabled, VAD detects speech activity within the chunk and skips silent regions before running the Whisper decoder. This reduces hallucinations on silent chunks (Whisper tends to generate noise or repeat context on pure silence without VAD) and speeds up batch jobs where audio has significant silence.

Key VAD parameters for streaming:

python
segments, info = model.transcribe(
    audio_chunk,
    beam_size=1,
    condition_on_previous_text=False,  # critical for streaming
    vad_filter=True,
    vad_parameters={
        "min_silence_duration_ms": 500,
        "speech_pad_ms": 100,
        "threshold": 0.5,
    },
)
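
To see what these three parameters control, here is a simplified pure-Python illustration that turns per-frame speech probabilities into padded speech spans. It is not Silero's implementation (which applies additional rules); the function name and the default 32 ms frame length are assumptions.

```python
def speech_spans(probs, frame_ms=32, threshold=0.5,
                 min_silence_duration_ms=500, speech_pad_ms=100):
    """Return (start_ms, end_ms) spans of detected speech, padded on both sides."""
    spans = []
    start = None      # start time of the span currently being built
    silence_ms = 0    # length of the current run of silent frames
    for i, prob in enumerate(probs):
        t = i * frame_ms
        if prob >= threshold:
            if start is None:
                start = t
            silence_ms = 0
        elif start is not None:
            silence_ms += frame_ms
            # Close the span only once the pause is long enough.
            if silence_ms >= min_silence_duration_ms:
                end = t + frame_ms - silence_ms
                spans.append((max(0, start - speech_pad_ms), end + speech_pad_ms))
                start, silence_ms = None, 0
    if start is not None:
        end = len(probs) * frame_ms - silence_ms
        spans.append((max(0, start - speech_pad_ms), end + speech_pad_ms))
    return spans


probs = [0.9] * 5 + [0.1] * 3 + [0.9] * 2 + [0.1] * 6
print(speech_spans(probs, frame_ms=100))
print(speech_spans(probs, frame_ms=100, min_silence_duration_ms=200))
```

A larger min_silence_duration_ms bridges short pauses into one span, while a smaller value splits the same audio into more, shorter spans.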

Latency comparison: batch vs. streaming

| Mode | Model | Latency per chunk | VRAM | GPU |
| --- | --- | --- | --- | --- |
| Streaming | large-v3 Turbo (INT8) | 50-80ms per 4s chunk | 1.6GB | L40S |
| Streaming | large-v3 (INT8) | 120-180ms per 4s chunk | 3GB | L40S |
| Batch | large-v3 (INT8) | 15-25ms per 4s chunk | 3GB | L40S |
| Batch | distil-large-v3 (INT8) | 5-10ms per 4s chunk | 1.5GB | L40S |

Batch mode is faster per chunk because it benefits from higher GPU utilization across multiple audio files. Streaming mode processes one chunk at a time and uses a fraction of the GPU, so raw throughput is lower but latency per chunk is bounded. For a guide on setting TTFT and ITL targets based on your use case, see the LLM inference latency budget guide.
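
Because the reference server serializes inference behind a lock, the per-chunk latencies above also bound how many live streams one model instance can keep up with: every stream produces one chunk per chunk interval, and all of them must be decoded within that interval. A back-of-the-envelope sketch under that assumption (no batching, no queueing slack; helper names are hypothetical):

```python
def max_realtime_streams(chunk_seconds: float, inference_ms: float) -> int:
    """Streams one serialized model instance can serve without falling behind."""
    return int(chunk_seconds * 1000 // inference_ms)


def worst_case_delay_ms(chunk_seconds: float, inference_ms: float) -> float:
    """Delay for a word spoken at the very start of a chunk."""
    return chunk_seconds * 1000 + inference_ms


print("Turbo streams per instance:", max_realtime_streams(4, 80))
print("large-v3 streams per instance:", max_realtime_streams(4, 180))
print("Turbo worst-case delay (ms):", worst_case_delay_ms(4, 80))
```

The second function is a reminder that chunk length, not inference time, dominates perceived latency: shorter chunks lower the delay but leave the model less context per call.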

Performance Tuning: Beam Size, Timestamps, and Language Detection

| Parameter | Streaming Setting | Batch Setting | Impact |
| --- | --- | --- | --- |
| beam_size | 1 | 5 | Higher beam improves WER ~0.3-0.5%; increases latency linearly |
| condition_on_previous_text | False | True | Must be False for streaming to prevent hypothesis drift |
| word_timestamps | False | True | Adds ~5-10% latency; required for subtitle/diarization alignment |
| language | Explicit if known | Explicit if known | Language detection adds ~50ms per segment |
| vad_filter | True | True | Reduces hallucinations on silence; small throughput boost |
| temperature | 0.0 | 0.0 | Non-zero adds random sampling; not recommended in production |

condition_on_previous_text=False for streaming is critical. At its default value of True, faster-whisper passes the previous chunk's transcript as a conditioning prefix for the next chunk. For streaming, where chunks are independent windows of real-time audio, this causes the model to condition on potentially incorrect previous hypotheses. A bad chunk can cause all subsequent chunks to hallucinate or drift semantically. Always set this to False for real-time streaming. For batch transcription of a single file, True is correct and improves coherence between segments.

Language detection cost: Auto-detection runs a forward pass through the encoder to identify the language, which costs ~50ms on large-v3. If you know the input language, set language="en" (or the appropriate Whisper language code, such as "de" or "ja") to skip detection entirely. For multilingual pipelines where callers might speak different languages, auto-detection is worth the 50ms per segment.

Word-level timestamps: Enable with word_timestamps=True. This adds an alignment step that derives word boundaries from the decoder's cross-attention weights (a different mechanism from the wav2vec2 forced alignment that WhisperX uses) and adds 5-10% latency overhead per segment. Use it when you need caption sync, subtitle generation, or alignment input for a speaker diarization model.
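
For the subtitle use case, word timestamps can be grouped into caption cues. A minimal sketch that formats (text, start, end) tuples as SRT; the tuple layout, the helper names, and the seven-words-per-cue limit are assumptions for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words: int = 7) -> str:
    """Group (text, start, end) word tuples into numbered SRT cues."""
    cues = []
    for number, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        text = " ".join(w[0].strip() for w in group)
        span = f"{srt_timestamp(group[0][1])} --> {srt_timestamp(group[-1][2])}"
        cues.append(f"{number}\n{span}\n{text}\n")
    return "\n".join(cues)


print(words_to_srt([("Hello", 0.0, 0.4), ("world", 0.5, 0.9)]))
```

With the reference server, the tuples could be built from each entry in a segment's words list (word, start, end).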

Throughput Comparison: faster-whisper vs. Whisper v4 vs. vLLM Whisper vs. WhisperX

| Framework | Backend | RTF on L40S (large-v3 INT8) | VRAM | Streaming Support | Notes |
| --- | --- | --- | --- | --- | --- |
| faster-whisper | CTranslate2 | 35x | 3GB | Yes (chunking) | Production default; no PyTorch required |
| Whisper reference | PyTorch (FP32) | 8x | ~6GB | No | Slow; only for compatibility testing |
| vLLM Whisper | vLLM (FP16) | 20-25x | ~6GB | No | Higher memory; better for batched LLM co-serving |
| WhisperX | CTranslate2 + wav2vec2 | 20-25x | 3GB + ~1GB | Yes (chunking) | Adds diarization overhead; builds on faster-whisper |
| distil-large-v3 | CTranslate2 | 90x | 1.5GB | Yes (chunking) | English-only; 6x faster than large-v3 |

For the newer reference model and WhisperX diarization pipeline, see the Whisper v4 deployment guide which covers pyannote speaker diarization, word-level timestamp alignment, and batch transcription at scale.

vLLM Whisper is architecturally separate from the CTranslate2 faster-whisper path. vLLM treats Whisper as a standard encoder-decoder model and serves it through its continuous batching infrastructure. Note that the encoder runs as a one-shot forward pass; only the decoder benefits from continuous batching, so the batching dynamics differ from decoder-only LLMs. The advantage is unified serving for mixed LLM + ASR workloads: if you are already running vLLM for text inference, you can add Whisper to the same server without a separate process. The tradeoff is higher VRAM (FP16 by default) and lower RTF than CTranslate2 INT8. Do not conflate vLLM Whisper with faster-whisper; they share the model weights but use entirely different inference paths.

WhisperX builds on faster-whisper and adds a wav2vec2 alignment step for word-level timestamps plus a pyannote-audio diarization pipeline. It is the right choice when you need speaker labels per word. The RTF is lower than raw faster-whisper because the alignment and diarization steps add overhead. For transcription-only workloads without diarization, faster-whisper directly is faster.

Cost-Per-Minute Math: Spheron L40S vs. Hosted ASR APIs

Math walkthrough

An L40S on Spheron at $0.75/hr with large-v3 INT8 runs at ~35x RTF. That means:

  • 1 GPU-hour processes 35 hours of audio
  • 100 hours of audio requires 100/35 = 2.86 GPU-hours
  • Cost: 2.86 × $0.75 = $2.14 for 100 hours of audio

With large-v3 Turbo at ~30x RTF:

  • 100 hours requires 100/30 = 3.33 GPU-hours
  • Cost: 3.33 × $0.75 = $2.50 for 100 hours of audio
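
The same arithmetic as a small reusable function, using the prices and RTF values quoted in this guide (the helper name is hypothetical):

```python
def cost_for_audio(audio_hours: float, gpu_price_per_hour: float, rtf: float) -> float:
    """GPU cost in dollars to transcribe `audio_hours` at a given speed-up factor."""
    gpu_hours = audio_hours / rtf
    return gpu_hours * gpu_price_per_hour


scenarios = [
    ("L40S on-demand, large-v3 INT8", 0.75, 35),
    ("L40S on-demand, large-v3 Turbo", 0.75, 30),
    ("H100 SXM5 spot, large-v3 INT8", 1.66, 70),
]
for label, price, rtf in scenarios:
    print(f"{label}: ${cost_for_audio(100, price, rtf):.2f} per 100 audio-hours")
```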

Provider comparison

| Provider | Pricing | 100hr Audio Cost | Notes |
| --- | --- | --- | --- |
| Spheron L40S on-demand (large-v3 INT8) | $0.75/hr GPU | ~$2.14 | 35x RTF; self-hosted; no per-minute charge |
| Spheron H100 SXM5 spot (large-v3 INT8) | $1.66/hr GPU | ~$2.37 | 70x RTF; faster wall-clock time |
| OpenAI Whisper API | $0.006/min | $36.00 | Managed; no GPU to maintain |
| AssemblyAI (Nano) | $0.002/min | $12.00 | Managed; lower accuracy on accents |
| Deepgram (Nova-2) | $0.0043/min | $25.80 | Managed; fast; limited language support |

Breakeven analysis

At $0.006/min, the OpenAI Whisper API charges $0.36 per hour of audio. Self-hosting on a Spheron L40S costs about $0.0214 per hour of audio ($0.75 per GPU-hour ÷ 35x RTF), roughly a 17x difference. The practical breakeven question is about minimum volume: you need enough consistent audio to justify provisioning and managing a GPU instance. At roughly 15-20 hours of audio per month, the cost savings offset the overhead of managing the instance, provided the instance runs only while there is audio to process; an instance left running around the clock is billed for every hour regardless of load. Above that volume, self-hosting consistently wins. If you use the GPU for other workloads alongside ASR, the effective cost per ASR audio-hour drops further.
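
Where the breakeven lands depends heavily on whether the GPU is billed only while it transcribes or around the clock. A rough sketch of both cases, assuming a 730-hour month and the prices quoted above (helper names are hypothetical):

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def api_cost(audio_hours: float, price_per_minute: float) -> float:
    """Hosted API bill for a given volume of audio."""
    return audio_hours * 60 * price_per_minute


def on_demand_cost(audio_hours: float, gpu_price_per_hour: float, rtf: float) -> float:
    """Self-hosted bill when the GPU is billed only while transcribing."""
    return audio_hours / rtf * gpu_price_per_hour


def always_on_breakeven_hours(gpu_price_per_hour: float, price_per_minute: float) -> float:
    """Monthly audio volume at which an always-on GPU matches the API bill."""
    return gpu_price_per_hour * HOURS_PER_MONTH / (price_per_minute * 60)


print(f"API, 20 audio-hours: ${api_cost(20, 0.006):.2f}")
print(f"On-demand L40S, 20 audio-hours: ${on_demand_cost(20, 0.75, 35):.2f}")
print(f"Always-on breakeven: {always_on_breakeven_hours(0.75, 0.006):.0f} audio-hours/month")
```

For an instance that is started per job, raw GPU cost is lower than the API at almost any volume, so the threshold is mostly about operational overhead. An instance kept running all month needs far more audio before it pays for itself.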

The managed APIs make sense when you have low and unpredictable volume, no DevOps capacity to manage GPU instances, or need a per-minute SLA without managing uptime yourself. Above 20 hours/month of consistent volume, the math consistently favors self-hosting on Spheron.


faster-whisper delivers production-grade ASR at a fraction of API costs. For voice agents, meeting transcription, or call center pipelines, the math on self-hosting consistently favors GPU cloud above a few dozen hours of audio per month. For the downstream voice agent layer, see the speech-to-speech deployment guide for integrating Moshi, CSM, and Hertz-dev.

Quick Setup Guide

  1. Select a model size and quantization mode

    Start with large-v3 at INT8 for most production use cases. Load with: from faster_whisper import WhisperModel; model = WhisperModel('large-v3', device='cuda', compute_type='int8'). INT8 uses ~3GB VRAM for large-v3 vs ~6GB at FP16 - it cuts memory in half with no meaningful WER regression on clean audio. For latency-sensitive voice agents, switch to large-v3 Turbo: WhisperModel('deepdml/faster-whisper-large-v3-turbo-ct2', device='cuda', compute_type='int8'). For very high throughput or English-only workloads, distil-large-v3 (Systran/faster-distil-whisper-large-v3) processes 6x faster than large-v3 at INT8.

  2. Provision an L40S instance on Spheron

    Go to app.spheron.ai and deploy a GPU instance. Select L40S PCIe (48GB VRAM, $0.75/hr on-demand) for production workloads. Select Ubuntu 22.04, allocate 50GB+ disk for model storage, and open port 8000 in the firewall rules (a single uvicorn process serves both the batch and WebSocket endpoints). SSH in and verify CUDA is available: nvidia-smi should show the L40S with 48GB. If only running a single stream or testing, an RTX 4090 PCIe is sufficient and cheaper - refer to the GPU sizing section for the full decision matrix.

  3. Install dependencies and download model

    pip install faster-whisper fastapi uvicorn websockets python-multipart silero-vad (python-multipart is required for the file-upload endpoint). Models are downloaded automatically on first load from HuggingFace, but pre-caching is recommended for production: python -c "from faster_whisper import WhisperModel; WhisperModel('large-v3', device='cuda', compute_type='int8')". Cache is stored in ~/.cache/huggingface/hub. To use a custom model path: WhisperModel('/path/to/model', device='cuda', compute_type='int8'). For Docker deployments, bake the model into the image or mount a persistent volume at /root/.cache/huggingface to avoid re-downloading on container restart.

  4. Deploy the FastAPI transcription server

    Use the reference FastAPI server in this post. It exposes two endpoints: POST /transcribe for batch file transcription and a WebSocket at /ws/stream for real-time audio. A single uvicorn process serves both endpoints. Start with: uvicorn server:app --host 0.0.0.0 --port 8000 --workers 1. Test with: curl -X POST http://localhost:8000/transcribe -F 'audio=@test.wav' which returns a JSON object with segments, word timestamps, language, and duration.

  5. Configure VAD, beam size, and chunking for your workload

    For real-time streaming: beam_size=1, condition_on_previous_text=False, vad_filter=True, vad_parameters={'min_silence_duration_ms': 500}. For batch accuracy: beam_size=5, condition_on_previous_text=True, word_timestamps=True, language='en' (or omit for auto-detection). Language auto-detection costs ~50ms per segment on large-v3 - set the language explicitly if you know it. Word-level timestamps add ~5-10% latency overhead but are essential for subtitle generation, karaoke sync, or diarization alignment.

Frequently Asked Questions

What is faster-whisper and how does it differ from OpenAI Whisper?

faster-whisper is a reimplementation of OpenAI Whisper using the CTranslate2 inference engine. It supports INT8 and FP16 quantization, runs 4x faster than the reference PyTorch implementation, and uses roughly 50% less VRAM at INT8 on the same model size. The transcription output closely matches Whisper (under 0.2% WER difference at INT8); the difference is in the inference backend. For production use, faster-whisper is almost always preferred over the reference implementation for its lower latency and memory footprint.

Which model size should I use?

For voice agents and real-time streaming: large-v3 Turbo (809M params, ~1.6GB VRAM at INT8) gives the best latency-accuracy tradeoff. For batch transcription where accuracy matters: large-v3 (1,550M params, ~3GB at INT8). For low-resource environments or edge deployments: medium (769M params, ~1.5GB) or small (244M params, ~500MB). distil-large-v3 is a distilled variant that runs 6x faster than large-v3 with only a small accuracy drop, which makes it worth testing for high-throughput English batch workloads.

Which GPU do I need?

An L40S (48GB VRAM, $0.75/hr on Spheron) is the production sweet spot for faster-whisper. It handles 100+ concurrent large-v3-turbo INT8 streams and processes batch audio at 30-40x real time. For a single stream or dev environment, even a 6GB consumer GPU works. For very high-concurrency deployments (500+ simultaneous sessions), an H100 SXM5 (80GB VRAM) gives the memory headroom to co-locate models and run batched decoding at scale.

How do I run real-time streaming transcription?

Chunk incoming audio into 3-5 second windows with 0.5-second overlaps, run inference on each chunk with beam_size=1 and condition_on_previous_text=False to prevent hypothesis drift, then merge outputs by discarding the overlap regions. Use the Silero VAD integration (vad_filter=True) to skip silent segments and reduce hallucinations. For WebSocket-based streaming, send each finalized chunk result as a JSON message, including word-level timestamps if you enable them.

How much does self-hosting cost compared to hosted ASR APIs?

At self-host rates on Spheron, faster-whisper on an L40S ($0.75/hr on-demand) processes audio at 30-40x real time, meaning 100 hours of audio finishes in roughly 2.5-3.5 hours. Total cost: $1.88-$2.63 for 100 hours. OpenAI's Whisper API charges $0.006 per minute, putting 100 hours (6,000 minutes) at $36. Self-hosting breaks even at approximately 15-20 hours of audio per month; above that threshold, Spheron is cheaper.
