Three things happened in quick succession. Kokoro-82M went viral because it matched or beat much larger TTS models while running on consumer hardware. Fish Speech 1.5 ranked first among open-source models on TTS-Arena as of early 2025 (FishAudio's newer S1 model has since claimed the #1 spot on TTS-Arena V2). Hume released TADA in March 2026, a model built specifically to eliminate hallucinations in long-form synthesis. In the same period, ElevenLabs raised their API prices and PlayHT shut down after Meta acquired it in July 2025.
Developers started asking the obvious question: if open-source TTS is this good, why keep paying per character?
This guide answers that practically. Which GPU you need, what a full deployment looks like, what it costs, and how to wire TTS into a production voice agent. For context on the full voice AI pipeline (ASR + LLM + TTS), see the voice AI GPU infrastructure guide.
The 2026 Open-Source TTS Models Worth Deploying
Kokoro-82M has 82M parameters and an Apache 2.0 license. The v1.0 release (January 27, 2025) ships 54 voices across 8 languages. Model weights are under 1GB at FP16, though total GPU memory during inference (including CUDA kernels and buffers) runs 2-3GB. It hits an RTF of about 0.03 on an A100. The key advantage is footprint: you can pack many instances onto a single GPU. A community-maintained Docker image (ghcr.io/remsky/kokoro-fastapi-gpu) exposes an OpenAI-compatible API with zero configuration.
Fish Speech 1.5 has an unconfirmed parameter count (estimated ~500M, but no official figure has been published) and ranked first among open-source models on TTS-Arena as of early 2025. Note that FishAudio has since released S1, a newer model that holds the #1 position on TTS-Arena V2. Fish Speech 1.5 remains the relevant self-hostable option covered here. It supports 13 languages including Chinese, Japanese, and Korean, with emotion and style control via conditioning parameters. VRAM requirement is 12GB minimum, with 24GB recommended for production workloads. Voice cloning from reference audio is built in, no fine-tuning required. License is CC BY-NC-SA 4.0, which means non-commercial use only. Commercial use requires a separate agreement from FishAudio.
Hume TADA (Text-Acoustic Dual Alignment) was released in March 2026 by Hume AI. The headline claim is zero hallucinations on the LibriTTS-R test set in long-form synthesis: the model stops and signals rather than inventing words when context is ambiguous. It pairs this with expressive synthesis and emotional alignment. VRAM is approximately 2.5GB for the 1B model and 9GB for the 3B model with bf16, though independent benchmarks are limited as of April 2026. Weights are available for self-hosting for research and commercial customers.
NVIDIA PersonaPlex-7B is a 7B parameter real-time speech-to-speech conversational model requiring 16GB VRAM minimum, with 24GB+ recommended for smooth real-time performance. It is designed for full-duplex conversations with simultaneous listening and speaking, not a traditional TTS pipeline. Licensed under NVIDIA Open Model License (weights) with MIT license (code). Include it if your application needs live conversational voice interaction; for batch or streaming TTS use cases, Kokoro or Fish Speech are more appropriate.
Model comparison:
| Model | Parameters | VRAM | RTF (A100) | Languages | License | Best For |
|---|---|---|---|---|---|---|
| Kokoro-82M v1.0 | 82M | ~1GB weights (2-3GB total) | ~0.03 | 8 | Apache 2.0 | High-throughput English TTS |
| Fish Speech 1.5 | unconfirmed | ~12GB min | ~0.20 | 13 | CC BY-NC-SA 4.0 | Multilingual, style control |
| Hume TADA | 1B / 3B | ~2.5GB (1B) / ~9GB (3B) | ~0.25 est. | English (multi planned) | Commercial | Expressive voice agents |
| PersonaPlex-7B | 7B | 16GB min / 24GB+ rec. | ~0.50 | English | NVIDIA OML / MIT | Full-duplex conversational voice |
RTF figures are estimates based on model architecture and available community benchmarks. Run your own benchmarks with your audio length distribution before capacity planning.
GPU Requirements and Real-Time Factors
RTF (real-time factor) is generation time divided by output audio duration. An RTF above 1.0 means the model cannot keep up with real-time playback. Anything below 0.1 means the GPU is mostly idle when serving a single stream, so you can pack in more concurrent users.
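The stream-packing arithmetic follows directly from that definition. A rough sketch (the 0.8 headroom factor is an assumption to cover scheduling overhead, not a benchmarked value):

```python
import math

def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: synthesis time divided by output audio duration."""
    return generation_seconds / audio_seconds

def max_concurrent_streams(model_rtf: float, headroom: float = 0.8) -> int:
    """Rough ceiling on real-time streams one GPU can serve.

    Each stream consumes roughly `model_rtf` of the GPU per second of
    playback, so capacity is ~1/RTF, derated to leave room for
    scheduling overhead and bursty text lengths.
    """
    return math.floor(headroom / model_rtf)

# Kokoro on A100: generating 10s of audio in 0.3s of GPU time
print(round(rtf(0.3, 10.0), 4))     # 0.03
print(max_concurrent_streams(0.03)) # 26
```

The per-GPU stream counts in the table below run higher than this naive 1/RTF estimate because batched inference amortizes per-request overhead across streams.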
RTF by GPU:
| GPU | Spheron Price | Kokoro-82M RTF | Fish Speech RTF | Concurrent Kokoro streams | Concurrent Fish Speech streams |
|---|---|---|---|---|---|
| L40S PCIe | $1.80/hr | ~0.08 | ~0.30 | ~30 | ~8 |
| A100 PCIe 80GB | $1.04/hr | ~0.03 | ~0.20 | ~50 | ~12 |
| H100 PCIe 80GB | $2.63/hr | ~0.02 | ~0.12 | ~80 | ~20 |
Pricing fluctuates based on GPU availability. The prices above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Community benchmarks for Kokoro show RTF of ~0.04-0.06 on RTX 4090, which is comparable to the L40S PCIe figures above. Spheron does not currently list RTX 4090 in the GPU catalog. L40S PCIe ($1.80/hr) is the closest available alternative at a similar price point and performs comparably for inference-only workloads.
Step-by-Step: Deploy Kokoro-82M on Spheron GPU Cloud
Provision Your Instance
- Go to app.spheron.ai
- Select A100 PCIe 80GB ($1.04/hr): sufficient for 50+ concurrent Kokoro streams
- Choose Ubuntu 22.04 with at least 50GB storage
- SSH into the instance once it is running
Deploy with Docker
# Pull the community-maintained FastAPI image (GPU variant)
docker pull ghcr.io/remsky/kokoro-fastapi-gpu:latest
# Run with GPU access and expose the API port
docker run -d \
--name kokoro \
--gpus all \
-p 8880:8880 \
-e KOKORO_WORKERS=4 \
ghcr.io/remsky/kokoro-fastapi-gpu:latest

Check the server is ready:
curl http://localhost:8880/health

Generate Audio
The server exposes an OpenAI-compatible /v1/audio/speech endpoint. You can point any OpenAI TTS client at it by changing the base URL:
curl http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro",
"input": "The GPU cloud is the fastest path from prototype to production.",
"voice": "af_bella",
"response_format": "wav"
}' \
--output output.wav

Available Voices
Kokoro v1.0 ships 54 voices across 8 languages. Key English voices:
| Voice ID | Style | Notes |
|---|---|---|
| af_bella | Female, warm | Default, most tested |
| af_sarah | Female, clear | Good for customer service |
| am_adam | Male, neutral | Good for narration |
| am_michael | Male, authoritative | Good for enterprise apps |
| bf_emma | Female, British | UK English accent |
| bm_george | Male, British | UK English accent |
Full voice list: curl http://localhost:8880/v1/voices
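Wrapping the endpoint in a small helper keeps application code clean. A stdlib-only sketch against the /v1/audio/speech endpoint shown above (base URL and voice IDs assume the Docker deployment from this guide):

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str = "af_bella",
                         response_format: str = "wav") -> dict:
    """Payload for the OpenAI-compatible /v1/audio/speech endpoint."""
    return {"model": "kokoro", "input": text,
            "voice": voice, "response_format": response_format}

def synthesize(text: str, base_url: str = "http://localhost:8880",
               **kwargs) -> bytes:
    """POST to the TTS server and return raw audio bytes."""
    payload = json.dumps(build_speech_request(text, **kwargs)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

# audio = synthesize("Deployment complete.", voice="am_adam")
# open("output.wav", "wb").write(audio)
```

Because the API is OpenAI-compatible, the official openai client also works by pointing its base URL at the server; the helper above just avoids the extra dependency.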
Streaming Configuration
For voice agents, enable sentence-level streaming to reduce time-to-first-audio:
docker run -d \
--name kokoro \
--gpus all \
-p 8880:8880 \
-e KOKORO_WORKERS=4 \
-e KOKORO_STREAM=true \
-e KOKORO_CHUNK_SIZE=50 \
ghcr.io/remsky/kokoro-fastapi-gpu:latest

With streaming enabled, the server begins emitting audio chunks as soon as the first 50 tokens generate. For a voice agent, this means the user hears the start of the response while the GPU is still processing the tail end.
Step-by-Step: Deploy Fish Speech
Instance Requirements
Fish Speech 1.5 needs 12GB VRAM minimum per model instance, with 24GB recommended for production. An A100 PCIe 80GB can run up to 6 instances in parallel at minimum VRAM. H100 PCIe improves throughput by roughly 2x for high-concurrency serving. Start with A100 PCIe unless you are targeting sub-100ms latency at scale.
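The instance-count arithmetic, as a sketch (the 2GB system reserve for CUDA context and fragmentation is an assumption; actual per-instance overhead varies with batch size):

```python
def instances_per_gpu(total_vram_gb: float, per_instance_gb: float,
                      reserve_gb: float = 2.0) -> int:
    """How many model copies fit on one GPU after reserving VRAM
    for CUDA context and fragmentation."""
    return int((total_vram_gb - reserve_gb) // per_instance_gb)

print(instances_per_gpu(80, 12))  # 6 Fish Speech instances on an A100 80GB
```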
Installation
# Clone the repository
git clone https://github.com/fishaudio/fish-speech
cd fish-speech
# Install with CUDA-version-specific extras (use cu129, cu128, cu126, or cpu based on your CUDA version)
pip install -e '.[cu126]'
# Download model weights (~1.5GB)
pip install huggingface_hub
huggingface-cli download fishaudio/fish-speech-1.5 \
--local-dir checkpoints/fish-speech-1.5

Start the Inference Server
# Start the web UI and API server on all interfaces
python tools/run_webui.py \
--listen 0.0.0.0:7860 \
--checkpoint-path checkpoints/fish-speech-1.5

The API is available at /api/v1/tts. For production deployments, run behind Nginx with rate limiting. Do not expose port 7860 directly; use an SSH tunnel for testing.
Generate Speech with Language and Emotion Control
import requests
response = requests.post(
"http://localhost:7860/api/v1/tts",
json={
"text": "Your GPU deployment is ready.",
"language": "en",
"speaker": None, # None uses default speaker
"emotion": "neutral", # Options: neutral, happy, sad, angry, fearful, disgusted, surprised
"format": "wav",
"streaming": False
}
)
response.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(response.content)

Voice Cloning
Fish Speech clones voices from a reference audio clip. No fine-tuning required:
import requests
with open("reference.wav", "rb") as ref:
response = requests.post(
"http://localhost:7860/api/v1/tts",
data={
"text": "Cloned voice generation test.",
"language": "en",
"format": "wav"
},
files={
"reference_audio": ref,
"reference_text": (None, "Transcript of the reference audio clip.")
}
)
response.raise_for_status()
with open("cloned.wav", "wb") as f:
    f.write(response.content)

Reference audio recommendations: 5-15 seconds, clean recording, minimal background noise, consistent energy. Shorter clips work but speaker similarity drops below 85%.
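Checking a reference clip against the 5-15 second recommendation before sending it is cheap. A stdlib sketch for WAV files (other formats would need a decoder such as ffmpeg):

```python
import wave

def reference_clip_ok(path: str, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """True if a WAV file's duration falls in the recommended cloning range."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return min_s <= duration <= max_s
```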
License Note
Fish Speech 1.5 is licensed CC BY-NC-SA 4.0. This allows non-commercial use with attribution. For commercial applications, contact FishAudio for a commercial license before deploying to production.
Serving Architecture: Batch vs Real-Time Streaming
Batch Processing
Batch processing fits workloads where latency is not the constraint: audiobook generation, podcast production, pre-rendered game dialogue, content localization.
Architecture:
- Request queue (Redis or SQS)
- Worker pool pulling from queue
- Output written to object storage (S3-compatible)
- Webhook notification on completion
# Worker loop for batch processing
import redis
import requests
import json
import time
queue = redis.Redis(host='localhost', port=6379)
connection_retries = 0
MAX_RETRIES = 3
while True:
try:
_, job = queue.blpop('tts_queue')
connection_retries = 0 # Reset backoff on successful connection
try:
job_data = json.loads(job)
        except json.JSONDecodeError as e:
print(f"Malformed job payload, dead-lettering: {e}. Raw: {job!r}")
queue.rpush('tts_queue_failed', job)
continue
try:
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"model": "kokoro",
"input": job_data["text"],
"voice": job_data.get("voice", "af_bella"),
"response_format": "mp3"
}
)
response.raise_for_status()
# Write to S3 or local storage
upload_to_storage(response.content, job_data["output_key"])
notify_webhook(job_data["callback_url"], job_data["output_key"])
except Exception as e:
# Track retry count to avoid re-queuing bad jobs indefinitely
attempts = job_data.get("_attempts", 0) + 1
if attempts < MAX_RETRIES:
job_data["_attempts"] = attempts
print(f"Job failed (attempt {attempts}/{MAX_RETRIES}): {e}. Re-queuing.")
queue.rpush('tts_queue', json.dumps(job_data))
else:
# Move to dead-letter queue after max retries
print(f"Job failed permanently after {MAX_RETRIES} attempts: {e}. Moving to dead-letter queue.")
queue.rpush('tts_queue_failed', json.dumps(job_data))
except redis.exceptions.ConnectionError as e:
# Back off exponentially to avoid tight CPU spin on Redis outage
wait = min(2 ** connection_retries, 60)
print(f"Redis connection error: {e}. Retrying in {wait}s.")
time.sleep(wait)
        connection_retries += 1

Throughput target on A100 PCIe with Kokoro: approximately 5 million characters per hour for typical short-form text, about 3.6 billion characters per month at steady utilization.
Real-Time Streaming
Streaming is for interactive applications where the user hears audio before the full response is synthesized: voice agents, interactive demos, live narration.
The key technique is sentence-boundary detection. The LLM generates tokens, the application detects sentence boundaries (period, question mark, exclamation), and passes each complete sentence to TTS immediately. The user begins hearing audio ~200-400ms after the LLM generates the first sentence.
def stream_llm_to_tts(llm_stream, tts_client):
"""Pipeline LLM tokens into TTS with sentence-boundary buffering."""
buffer = ""
sentence_enders = {'.', '!', '?', ':', ';'}
for token in llm_stream:
buffer += token
# Flush on sentence boundary
if buffer and buffer[-1] in sentence_enders and len(buffer) > 10:
audio_chunk = tts_client.synthesize(buffer.strip())
yield audio_chunk
buffer = ""
# Flush remainder
if buffer.strip():
audio_chunk = tts_client.synthesize(buffer.strip())
        yield audio_chunk

For the full voice AI pipeline combining ASR, LLM, and TTS on a single GPU, see the voice AI GPU infrastructure guide.
Cost Analysis: Self-Hosted TTS vs ElevenLabs and PlayHT
The comparison below uses on-demand pricing as the baseline, not spot instances. This gives a conservative view of self-hosting costs. Spot instances are cheaper but interruptible, which is fine for batch workloads but not for real-time production serving.
Cost comparison (characters per month):
| Monthly Volume | ElevenLabs (Scale) | PlayHT (Pro) | Kokoro on A100 PCIe | Fish Speech on A100 PCIe |
|---|---|---|---|---|
| 1M chars | $180 | $49 | $748.80* | $748.80* |
| 10M chars | $1,800 | $490 | $748.80* | $748.80* |
| 50M chars | $9,000 | $2,450 | $748.80* | $748.80* |
| 100M chars | $18,000 | $4,900 | $748.80* | $748.80* (or 2x A100) |
*$748.80/month = $1.04/hr x 720 hours (dedicated A100 PCIe). One A100 running Kokoro handles approximately 3.6 billion characters per month at steady utilization.
Note: ElevenLabs Scale plan pricing is approximate and changes periodically. Verify current rates at elevenlabs.io before building a cost model.
Pricing fluctuates based on GPU availability. The Spheron GPU costs above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
At under 4M characters per month, API pricing is usually cheaper because you avoid the fixed GPU cost. Self-hosting becomes economical at 4-5M+ characters per month and substantially cheaper above 10M.
Fish Speech on A100 PCIe handles approximately 200-400 million characters per month at steady utilization (lower than Kokoro due to higher RTF). At 100M+ chars/month, one A100 covers it. Above that, add instances.
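The break-even point is simple arithmetic: the monthly volume at which a dedicated GPU's fixed cost equals the API's per-character charge. A sketch using the figures above:

```python
def breakeven_chars(gpu_monthly_usd: float, api_usd_per_million_chars: float) -> float:
    """Monthly characters above which a dedicated GPU beats per-character API pricing."""
    return gpu_monthly_usd / api_usd_per_million_chars * 1_000_000

# A100 PCIe at $1.04/hr = $748.80/month vs ElevenLabs at ~$180 per 1M chars
print(f"{breakeven_chars(748.80, 180.0):,.0f}")  # ~4.16M chars/month, matching the 4-5M figure above
```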
Production Scaling: Handling 1000+ Concurrent Requests
Single GPU Capacity
| GPU | Kokoro streams | Fish Speech streams | Notes |
|---|---|---|---|
| A100 PCIe 80GB | 50 | 12 | Good starting point |
| H100 PCIe 80GB | 80 | 20 | Lower latency under load |
Horizontal Scaling
Beyond one GPU, run independent TTS containers on multiple instances and load balance with Nginx:
upstream kokoro_backends {
least_conn;
server gpu-instance-1:8880;
server gpu-instance-2:8880;
server gpu-instance-3:8880;
server gpu-instance-4:8880;
}
server {
listen 80;
location /v1/audio/speech {
proxy_pass http://kokoro_backends;
proxy_read_timeout 30s;
proxy_send_timeout 30s;
}
location /health {
proxy_pass http://kokoro_backends;
}
}

Nginx least_conn distributes requests to the backend with the fewest active connections. This performs better than round-robin for TTS because request duration varies significantly by text length.
GPU Monitoring
Monitor GPU utilization with nvidia-smi during load testing:
# Real-time GPU stats, 2-second refresh
watch -n 2 nvidia-smi
# Log utilization to CSV for capacity planning
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.free \
--format=csv,noheader \
--loop=5 >> gpu_utilization.csv

Target 80-90% GPU utilization for cost efficiency. Below 70% means the instance is over-provisioned. Above 95% means requests are queuing.
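The CSV log can be summarized to check that target. A parsing sketch (column order follows the --query-gpu flags above; the row format assumes nvidia-smi's default unit-suffixed output):

```python
def mean_gpu_utilization(csv_lines) -> float:
    """Average GPU utilization from nvidia-smi CSV rows.

    Expects rows like:
    2026/04/09 12:00:01.000, NVIDIA A100, 87 %, 41 %, 32768 MiB, 49152 MiB
    where the third field is utilization.gpu.
    """
    values = []
    for line in csv_lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) >= 3 and fields[2].endswith("%"):
            values.append(float(fields[2].rstrip("% ")))
    return sum(values) / len(values) if values else 0.0

# with open("gpu_utilization.csv") as f:
#     print(f"mean GPU utilization: {mean_gpu_utilization(f):.1f}%")
```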
Building Voice AI Agents: Combining TTS with STT and LLM
The standard architecture is three stages on one GPU: ASR (Whisper), LLM response generation, TTS synthesis. The models stack well on a single H100 PCIe 80GB:
- H100 PCIe 80GB: Whisper Large v3 (~4GB) + a 13B LLM (~26GB FP16) + Kokoro (~2-3GB total) leaves 47GB+ headroom for KV cache
- A100 PCIe 80GB: Whisper Medium (~3GB) + a 7B LLM (~14GB FP16) + Kokoro (~2-3GB total) fits comfortably
With Kokoro, TTS is the cheapest stage in the pipeline by VRAM. Its ~2-3GB total GPU footprint leaves plenty of room for a larger LLM on the same instance. Fish Speech at 12GB minimum still fits alongside a 7B LLM and Whisper on an 80GB GPU, though with less headroom.
The sentence-streaming pattern keeps end-to-end latency under 500ms. As the LLM generates tokens, detect sentence boundaries and pass each complete sentence to TTS immediately. The user hears the first sentence while the LLM is still generating the rest of the response.
The voice AI GPU infrastructure guide has VRAM requirement breakdowns for each pipeline stage, latency budget analysis, and full GPU recommendations for ASR + LLM + TTS co-location.
For NeuTTS Air users who need faster throughput or voice cloning, see the NeuTTS Air deployment guide.
Which Model Should You Deploy?
Use Kokoro-82M if: you are building an English-first voice application, you need maximum throughput on minimal hardware, or you want an OpenAI-compatible API with zero configuration.
Use Fish Speech if: your application handles multiple languages, you need style or emotion control, or you want voice cloning without fine-tuning for a non-English audience.
Use Hume TADA if: you are building an emotionally expressive voice agent and can tolerate higher VRAM requirements and less community documentation as of April 2026.
Use PersonaPlex-7B if: you need full-duplex real-time conversational voice (simultaneous listening and speaking), not a traditional TTS pipeline. It is not suited for batch synthesis or standard streaming TTS use cases.
For the cost-sensitive production path: start with Kokoro on A100 PCIe. Add Fish Speech on the same instance if you need multilingual support. Upgrade to H100 PCIe when you cross 80 concurrent streams.
Open-source TTS on GPU cloud is genuinely cost-effective at scale. A dedicated A100 PCIe running Kokoro costs $1.04/hr and covers 50+ concurrent streams, at a fraction of per-character API pricing above 4-5M characters per month. Spheron has A100 and H100 instances on-demand with no minimums or long-term contracts.
Rent A100 → | Rent H100 → | View all GPU pricing → | Get started on Spheron →
