Three things happened in quick succession. Kokoro-82M went viral because it matched or beat much larger TTS models while running on consumer hardware. Fish Speech 1.5 ranked first among open-source models on TTS-Arena as of early 2025 (FishAudio's newer S1 model has since claimed the #1 spot on TTS-Arena V2). Hume released TADA in March 2026, a model built specifically to eliminate hallucinations in long-form synthesis. In the same period, ElevenLabs raised their API prices and PlayHT shut down after Meta acquired it in July 2025.
Developers started asking the obvious question: if open-source TTS is this good, why keep paying per character?
This guide answers that practically. Which GPU you need, what a full deployment looks like, what it costs, and how to wire TTS into a production voice agent. For context on the full voice AI pipeline (ASR + LLM + TTS), see the voice AI GPU infrastructure guide.
The 2026 Open-Source TTS Models Worth Deploying
Kokoro-82M has 82M parameters and an Apache 2.0 license. The v1.0 release (January 27, 2025) ships 54 voices across 8 languages. Model weights are under 1GB at FP16, though total GPU memory during inference (including CUDA kernels and buffers) runs 2-3GB. It hits an RTF of about 0.03 on an A100. The key advantage is footprint: you can pack many instances onto a single GPU. A community-maintained Docker image (ghcr.io/remsky/kokoro-fastapi-gpu) exposes an OpenAI-compatible API with zero configuration.
Fish Speech 1.5 has an unconfirmed parameter count (estimated ~500M, but no official figure has been published) and ranked first among open-source models on TTS-Arena as of early 2025. Note that FishAudio has since released S1, a newer model that holds the #1 position on TTS-Arena V2. Fish Speech 1.5 remains the relevant self-hostable option covered here. It supports 13 languages including Chinese, Japanese, and Korean, with emotion and style control via conditioning parameters. VRAM requirement is 12GB minimum, with 24GB recommended for production workloads. Voice cloning from reference audio is built in, no fine-tuning required. License is CC BY-NC-SA 4.0, which means non-commercial use only. Commercial use requires a separate agreement from FishAudio.
Hume TADA (Text-Acoustic Dual Alignment) was released in March 2026 by Hume AI. The headline claim is zero hallucinations on the LibriTTS-R test set in long-form synthesis: the model stops and signals rather than inventing words when context is ambiguous. It pairs this with expressive synthesis and emotional alignment. VRAM is approximately 2.5GB for the 1B model and 9GB for the 3B model with bf16, though independent benchmarks are limited as of April 2026. Weights are available for self-hosting for research and commercial customers.
NVIDIA PersonaPlex-7B is a 7B parameter real-time speech-to-speech conversational model requiring 16GB VRAM minimum, with 24GB+ recommended for smooth real-time performance. It is designed for full-duplex conversations with simultaneous listening and speaking, not a traditional TTS pipeline. Licensed under NVIDIA Open Model License (weights) with MIT license (code). Include it if your application needs live conversational voice interaction; for batch or streaming TTS use cases, Kokoro or Fish Speech are more appropriate.
Model comparison:
| Model | Parameters | VRAM | RTF (A100) | Languages | License | Best For |
|---|---|---|---|---|---|---|
| Kokoro-82M v1.0 | 82M | ~1GB weights (2-3GB total) | ~0.03 | 8 | Apache 2.0 | High-throughput English TTS |
| Fish Speech 1.5 | unconfirmed | ~12GB min | ~0.20 | 13 | CC BY-NC-SA 4.0 | Multilingual, style control |
| Hume TADA | 1B / 3B | ~2.5GB (1B) / ~9GB (3B) | ~0.25 est. | English (multi planned) | Commercial | Expressive voice agents |
| PersonaPlex-7B | 7B | 16GB min / 24GB+ rec. | ~0.50 | English | NVIDIA OML / MIT | Full-duplex conversational voice |
RTF figures are estimates based on model architecture and available community benchmarks. Run your own benchmarks with your audio length distribution before capacity planning.
GPU Requirements and Real-Time Factors
RTF (real-time factor) is generation time divided by output audio duration. An RTF above 1.0 means the model cannot keep up with real-time playback. Anything below 0.1 means the GPU is mostly idle when serving a single stream, so you can pack in more concurrent users.
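The stream-packing arithmetic follows directly from that definition. A rough sketch (the 0.8 headroom factor is an assumption to cover scheduling overhead, not a benchmarked value):

```python
import math

def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: synthesis time divided by output audio duration."""
    return generation_seconds / audio_seconds

def max_concurrent_streams(model_rtf: float, headroom: float = 0.8) -> int:
    """Rough ceiling on real-time streams one GPU can serve.

    Each stream consumes roughly `model_rtf` of the GPU per second of
    playback, so capacity is ~1/RTF, derated to leave room for
    scheduling overhead and bursty text lengths.
    """
    return math.floor(headroom / model_rtf)

# Kokoro on A100: generating 10s of audio in 0.3s of GPU time
print(round(rtf(0.3, 10.0), 4))     # 0.03
print(max_concurrent_streams(0.03)) # 26
```

The per-GPU stream counts in the table below run higher than this naive 1/RTF estimate because batched inference amortizes per-request overhead across streams.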
RTF by GPU:
| GPU | Spheron Price | Kokoro-82M RTF | Fish Speech RTF | Concurrent Kokoro streams | Concurrent Fish Speech streams |
|---|---|---|---|---|---|
| L40S PCIe | $1.80/hr | ~0.08 | ~0.30 | ~30 | ~8 |
| A100 PCIe 80GB | $1.04/hr | ~0.03 | ~0.20 | ~50 | ~12 |
| H100 PCIe 80GB | $2.63/hr | ~0.02 | ~0.12 | ~80 | ~20 |
Pricing fluctuates based on GPU availability. The prices above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Community benchmarks for Kokoro show RTF of ~0.04-0.06 on RTX 4090, which is comparable to the L40S PCIe figures above. Spheron does not currently list RTX 4090 in the GPU catalog. L40S PCIe ($1.80/hr) is the closest available alternative at a similar price point and performs comparably for inference-only workloads.
Step-by-Step: Deploy Kokoro-82M on Spheron GPU Cloud
Provision Your Instance
- Go to app.spheron.ai
- Select A100 PCIe 80GB ($1.04/hr): sufficient for 50+ concurrent Kokoro streams
- Choose Ubuntu 22.04 with at least 50GB storage
- SSH into the instance once it is running
Deploy with Docker
# Pull the community-maintained FastAPI image (GPU variant)
docker pull ghcr.io/remsky/kokoro-fastapi-gpu:latest
# Run with GPU access and expose the API port
docker run -d \
--name kokoro \
--gpus all \
-p 8880:8880 \
-e KOKORO_WORKERS=4 \
ghcr.io/remsky/kokoro-fastapi-gpu:latest

Check the server is ready:
curl http://localhost:8880/health

Generate Audio
The server exposes an OpenAI-compatible /v1/audio/speech endpoint. You can point any OpenAI TTS client at it by changing the base URL:
curl http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro",
"input": "The GPU cloud is the fastest path from prototype to production.",
"voice": "af_bella",
"response_format": "wav"
}' \
--output output.wav

Available Voices
Kokoro v1.0 ships 54 voices across 8 languages. Key English voices:
| Voice ID | Style | Notes |
|---|---|---|
| af_bella | Female, warm | Default, most tested |
| af_sarah | Female, clear | Good for customer service |
| am_adam | Male, neutral | Good for narration |
| am_michael | Male, authoritative | Good for enterprise apps |
| bf_emma | Female, British | UK English accent |
| bm_george | Male, British | UK English accent |
Full voice list: curl http://localhost:8880/v1/voices
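Wrapping the endpoint in a small helper keeps application code clean. A stdlib-only sketch against the /v1/audio/speech endpoint shown above (base URL and voice IDs assume the Docker deployment from this guide):

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str = "af_bella",
                         response_format: str = "wav") -> dict:
    """Payload for the OpenAI-compatible /v1/audio/speech endpoint."""
    return {"model": "kokoro", "input": text,
            "voice": voice, "response_format": response_format}

def synthesize(text: str, base_url: str = "http://localhost:8880",
               **kwargs) -> bytes:
    """POST to the TTS server and return raw audio bytes."""
    payload = json.dumps(build_speech_request(text, **kwargs)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

# audio = synthesize("Deployment complete.", voice="am_adam")
# open("output.wav", "wb").write(audio)
```

Because the API is OpenAI-compatible, the official openai client also works by pointing its base URL at the server; the helper above just avoids the extra dependency.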
Streaming Configuration
For voice agents, enable sentence-level streaming to reduce time-to-first-audio:
docker run -d \
--name kokoro \
--gpus all \
-p 8880:8880 \
-e KOKORO_WORKERS=4 \
-e KOKORO_STREAM=true \
-e KOKORO_CHUNK_SIZE=50 \
ghcr.io/remsky/kokoro-fastapi-gpu:latest

With streaming enabled, the server begins emitting audio chunks as soon as the first 50 tokens generate. For a voice agent, this means the user hears the start of the response while the GPU is still processing the tail end.
Step-by-Step: Deploy Fish Speech
Instance Requirements
Fish Speech 1.5 needs 12GB VRAM minimum per model instance, with 24GB recommended for production. An A100 PCIe 80GB can run up to 6 instances in parallel at minimum VRAM. H100 PCIe improves throughput by roughly 2x for high-concurrency serving. Start with A100 PCIe unless you are targeting sub-100ms latency at scale.
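The instance-count arithmetic, as a sketch (the 2GB system reserve for CUDA context and fragmentation is an assumption; actual per-instance overhead varies with batch size):

```python
def instances_per_gpu(total_vram_gb: float, per_instance_gb: float,
                      reserve_gb: float = 2.0) -> int:
    """How many model copies fit on one GPU after reserving VRAM
    for CUDA context and fragmentation."""
    return int((total_vram_gb - reserve_gb) // per_instance_gb)

print(instances_per_gpu(80, 12))  # 6 Fish Speech instances on an A100 80GB
```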
Installation
# Clone the repository
git clone https://github.com/fishaudio/fish-speech
cd fish-speech
# Install with CUDA-version-specific extras (use cu129, cu128, cu126, or cpu based on your CUDA version)
pip install -e '.[cu126]'
# Download model weights (~1.5GB)
pip install huggingface_hub
huggingface-cli download fishaudio/fish-speech-1.5 \
--local-dir checkpoints/fish-speech-1.5

Start the Inference Server
# Start the web UI and API server on all interfaces
python tools/run_webui.py \
--listen 0.0.0.0:7860 \
--checkpoint-path checkpoints/fish-speech-1.5

The API is available at /api/v1/tts. For production deployments, run behind Nginx with rate limiting. Do not expose port 7860 directly; use an SSH tunnel for testing.
Generate Speech with Language and Emotion Control
import requests
response = requests.post(
"http://localhost:7860/api/v1/tts",
json={
"text": "Your GPU deployment is ready.",
"language": "en",
"speaker": None, # None uses default speaker
"emotion": "neutral", # Options: neutral, happy, sad, angry, fearful, disgusted, surprised
"format": "wav",
"streaming": False
}
)
response.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(response.content)

Voice Cloning
Fish Speech clones voices from a reference audio clip. No fine-tuning required:
import requests
with open("reference.wav", "rb") as ref:
response = requests.post(
"http://localhost:7860/api/v1/tts",
data={
"text": "Cloned voice generation test.",
"language": "en",
"format": "wav"
},
files={
"reference_audio": ref,
"reference_text": (None, "Transcript of the reference audio clip.")
}
)
response.raise_for_status()
with open("cloned.wav", "wb") as f:
    f.write(response.content)

Reference audio recommendations: 5-15 seconds, clean recording, minimal background noise, consistent energy. Shorter clips work but speaker similarity drops below 85%.
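Checking a reference clip against the 5-15 second recommendation before sending it is cheap. A stdlib sketch for WAV files (other formats would need a decoder such as ffmpeg):

```python
import wave

def reference_clip_ok(path: str, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """True if a WAV file's duration falls in the recommended cloning range."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return min_s <= duration <= max_s
```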
License Note
Fish Speech 1.5 is licensed CC BY-NC-SA 4.0. This allows non-commercial use with attribution. For commercial applications, contact FishAudio for a commercial license before deploying to production.
Serving Architecture: Batch vs Real-Time Streaming
Batch Processing
Batch processing fits workloads where latency is not the constraint: audiobook generation, podcast production, pre-rendered game dialogue, content localization.
Architecture:
- Request queue (Redis or SQS)
- Worker pool pulling from queue
- Output written to object storage (S3-compatible)
- Webhook notification on completion
# Worker loop for batch processing
import redis
import requests
import json
import time
queue = redis.Redis(host='localhost', port=6379)
connection_retries = 0
MAX_RETRIES = 3
while True:
try:
_, job = queue.blpop('tts_queue')
connection_retries = 0 # Reset backoff on successful connection
try:
job_data = json.loads(job)
        except json.JSONDecodeError as e:
print(f"Malformed job payload, dead-lettering: {e}. Raw: {job!r}")
queue.rpush('tts_queue_failed', job)
continue
try:
response = requests.post(
"http://localhost:8880/v1/audio/speech",
json={
"model": "kokoro",
"input": job_data["text"],
"voice": job_data.get("voice", "af_bella"),
"response_format": "mp3"
}
)
response.raise_for_status()
# Write to S3 or local storage
upload_to_storage(response.content, job_data["output_key"])
notify_webhook(job_data["callback_url"], job_data["output_key"])
except Exception as e:
# Track retry count to avoid re-queuing bad jobs indefinitely
attempts = job_data.get("_attempts", 0) + 1
if attempts < MAX_RETRIES:
job_data["_attempts"] = attempts
print(f"Job failed (attempt {attempts}/{MAX_RETRIES}): {e}. Re-queuing.")
queue.rpush('tts_queue', json.dumps(job_data))
else:
# Move to dead-letter queue after max retries
print(f"Job failed permanently after {MAX_RETRIES} attempts: {e}. Moving to dead-letter queue.")
queue.rpush('tts_queue_failed', json.dumps(job_data))
except redis.exceptions.ConnectionError as e:
# Back off exponentially to avoid tight CPU spin on Redis outage
wait = min(2 ** connection_retries, 60)
print(f"Redis connection error: {e}. Retrying in {wait}s.")
time.sleep(wait)
        connection_retries += 1

Throughput target on A100 PCIe with Kokoro: approximately 5 million characters per hour for typical short-form text, about 3.6 billion characters per month at steady utilization.
Real-Time Streaming
Streaming is for interactive applications where the user hears audio before the full response is synthesized: voice agents, interactive demos, live narration.
The key technique is sentence-boundary detection. The LLM generates tokens, the application detects sentence boundaries (period, question mark, exclamation), and passes each complete sentence to TTS immediately. The user begins hearing audio ~200-400ms after the LLM generates the first sentence.
def stream_llm_to_tts(llm_stream, tts_client):
"""Pipeline LLM tokens into TTS with sentence-boundary buffering."""
buffer = ""
sentence_enders = {'.', '!', '?', ':', ';'}
for token in llm_stream:
buffer += token
# Flush on sentence boundary
if buffer and buffer[-1] in sentence_enders and len(buffer) > 10:
audio_chunk = tts_client.synthesize(buffer.strip())
yield audio_chunk
buffer = ""
# Flush remainder
if buffer.strip():
audio_chunk = tts_client.synthesize(buffer.strip())
        yield audio_chunk

For the full voice AI pipeline combining ASR, LLM, and TTS on a single GPU, see the voice AI GPU infrastructure guide.
Cost Analysis: Self-Hosted TTS vs ElevenLabs and PlayHT
The comparison below uses on-demand pricing as the baseline, not spot instances. This gives a conservative view of self-hosting costs. Spot instances are cheaper but interruptible, which is fine for batch workloads but not for real-time production serving.
Cost comparison (characters per month):
| Monthly Volume | ElevenLabs (Scale) | PlayHT (Pro) | Kokoro on A100 PCIe | Fish Speech on A100 PCIe |
|---|---|---|---|---|
| 1M chars | $180 | $49 | $748.80* | $748.80* |
| 10M chars | $1,800 | $490 | $748.80* | $748.80* |
| 50M chars | $9,000 | $2,450 | $748.80* | $748.80* |
| 100M chars | $18,000 | $4,900 | $748.80* | $748.80* (or 2x A100) |
*$748.80/month = $1.04/hr x 720 hours (dedicated A100 PCIe). One A100 running Kokoro handles approximately 3.6 billion characters per month at steady utilization.
Note: ElevenLabs Scale plan pricing is approximate and changes periodically. Verify current rates at elevenlabs.io before building a cost model.
Pricing fluctuates based on GPU availability. The Spheron GPU costs above are based on 09 Apr 2026 and may have changed. Check current GPU pricing for live rates.
At under 4M characters per month, API pricing is usually cheaper because you avoid the fixed GPU cost. Self-hosting becomes economical at 4-5M+ characters per month and substantially cheaper above 10M.
Fish Speech on A100 PCIe handles approximately 200-400 million characters per month at steady utilization (lower than Kokoro due to higher RTF). At 100M+ chars/month, one A100 covers it. Above that, add instances.
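The break-even point is simple arithmetic: the monthly volume at which a dedicated GPU's fixed cost equals the API's per-character charge. A sketch using the figures above:

```python
def breakeven_chars(gpu_monthly_usd: float, api_usd_per_million_chars: float) -> float:
    """Monthly characters above which a dedicated GPU beats per-character API pricing."""
    return gpu_monthly_usd / api_usd_per_million_chars * 1_000_000

# A100 PCIe at $1.04/hr = $748.80/month vs ElevenLabs at ~$180 per 1M chars
print(f"{breakeven_chars(748.80, 180.0):,.0f}")  # ~4.16M chars/month, matching the 4-5M figure above
```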
Production Scaling: Handling 1000+ Concurrent Requests
Single GPU Capacity
| GPU | Kokoro streams | Fish Speech streams | Notes |
|---|---|---|---|
| A100 PCIe 80GB | 50 | 12 | Good starting point |
| H100 PCIe 80GB | 80 | 20 | Lower latency under load |
Horizontal Scaling
Beyond one GPU, run independent TTS containers on multiple instances and load balance with Nginx:
upstream kokoro_backends {
least_conn;
server gpu-instance-1:8880;
server gpu-instance-2:8880;
server gpu-instance-3:8880;
server gpu-instance-4:8880;
}
server {
listen 80;
location /v1/audio/speech {
proxy_pass http://kokoro_backends;
proxy_read_timeout 30s;
proxy_send_timeout 30s;
}
location /health {
proxy_pass http://kokoro_backends;
}
}

Nginx least_conn distributes requests to the backend with the fewest active connections. This performs better than round-robin for TTS because request duration varies significantly by text length.
GPU Monitoring
Monitor GPU utilization with nvidia-smi during load testing:
# Real-time GPU stats, 2-second refresh
watch -n 2 nvidia-smi
# Log utilization to CSV for capacity planning
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.free \
--format=csv,noheader \
--loop=5 >> gpu_utilization.csv

Target 80-90% GPU utilization for cost efficiency. Below 70% means the instance is over-provisioned. Above 95% means requests are queuing.
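The CSV log can be summarized to check that target. A parsing sketch (column order follows the --query-gpu flags above; the row format assumes nvidia-smi's default unit-suffixed output):

```python
def mean_gpu_utilization(csv_lines) -> float:
    """Average GPU utilization from nvidia-smi CSV rows.

    Expects rows like:
    2026/04/09 12:00:01.000, NVIDIA A100, 87 %, 41 %, 32768 MiB, 49152 MiB
    where the third field is utilization.gpu.
    """
    values = []
    for line in csv_lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) >= 3 and fields[2].endswith("%"):
            values.append(float(fields[2].rstrip("% ")))
    return sum(values) / len(values) if values else 0.0

# with open("gpu_utilization.csv") as f:
#     print(f"mean GPU utilization: {mean_gpu_utilization(f):.1f}%")
```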
Building Voice AI Agents: Combining TTS with STT and LLM
The standard architecture is three stages on one GPU: ASR (Whisper), LLM response generation, TTS synthesis. The models stack well on a single H100 PCIe 80GB:
- H100 PCIe 80GB: Whisper Large v3 (~4GB) + a 13B LLM (~26GB FP16) + Kokoro (~2-3GB total) leaves 47GB+ headroom for KV cache
- A100 PCIe 80GB: Whisper Medium (~3GB) + a 7B LLM (~14GB FP16) + Kokoro (~2-3GB total) fits comfortably
With Kokoro, TTS is the cheapest stage in the pipeline by VRAM. Its ~2-3GB total GPU footprint leaves plenty of room for a larger LLM on the same instance. Fish Speech at 12GB minimum still fits alongside a 7B LLM and Whisper on an 80GB GPU, though with less headroom.
The sentence-streaming pattern keeps end-to-end latency under 500ms. As the LLM generates tokens, detect sentence boundaries and pass each complete sentence to TTS immediately. The user hears the first sentence while the LLM is still generating the rest of the response.
The voice AI GPU infrastructure guide has VRAM requirement breakdowns for each pipeline stage, latency budget analysis, and full GPU recommendations for ASR + LLM + TTS co-location.
For NeuTTS Air users who need faster throughput or voice cloning, see the NeuTTS Air deployment guide.
Which Model Should You Deploy?
Use Kokoro-82M if: you are building an English-first voice application, you need maximum throughput on minimal hardware, or you want an OpenAI-compatible API with zero configuration.
Use Fish Speech if: your application handles multiple languages, you need style or emotion control, or you want voice cloning without fine-tuning for a non-English audience.
Use Hume TADA if: you are building an emotionally expressive voice agent and can tolerate higher VRAM requirements and less community documentation as of April 2026.
Use PersonaPlex-7B if: you need full-duplex real-time conversational voice (simultaneous listening and speaking), not a traditional TTS pipeline. It is not suited for batch synthesis or standard streaming TTS use cases.
For the cost-sensitive production path: start with Kokoro on A100 PCIe. Add Fish Speech on the same instance if you need multilingual support. Upgrade to H100 PCIe when you cross 80 concurrent streams.
Open-source TTS on GPU cloud is genuinely cost-effective at scale. A dedicated A100 PCIe running Kokoro costs $1.04/hr and covers 50+ concurrent streams, at a fraction of per-character API pricing above 4-5M characters per month. Spheron has A100 and H100 instances on-demand with no minimums or long-term contracts.
Rent A100 → | Rent H100 → | View all GPU pricing → | Get started on Spheron →
