Tutorial

Self-Host AI Voice Cloning on GPU Cloud: XTTS-2, F5-TTS, and OpenVoice V2 Production Deployment Guide (2026)

Written by Mitrasish, Co-founder, May 31, 2026

At 10 million characters per month, ElevenLabs charges $1,800 to $3,000 depending on your plan tier. An H100 SXM5 on Spheron running XTTS-2 handles the same volume for roughly $8 in GPU compute at spot rates (at ~600 chars/sec, 10M chars takes about 4.6 GPU-hours at $1.73/hr). The math is not subtle. For teams generating voice output at scale, the per-character billing model stops working fast.

This post covers how to deploy three production-grade open-weight voice cloning models: XTTS-2, F5-TTS, and OpenVoice V2. It includes GPU sizing tables, per-1k-character cost benchmarks with live Spheron pricing, a complete FastAPI deployment walkthrough, and a break-even comparison against managed APIs. For context on the full voice AI pipeline (ASR + LLM + TTS co-location), see the voice AI GPU infrastructure guide. For non-cloning TTS models like Kokoro and Fish Speech, see the open-source TTS deployment guide.

Voice Cloning Is Not Generic TTS

Generic TTS takes text and returns audio in a fixed voice. Voice cloning takes text and returns audio that sounds like a specific person, derived from a reference recording. The difference is in three technical areas.

Speaker embeddings. Before any synthesis happens, the reference audio must be encoded into a high-dimensional vector that captures the speaker's timbre, pitch range, and speaking rhythm. This embedding is computed once per speaker and cached. Every synthesis request for that voice runs the decoder against the cached embedding rather than re-processing the reference clip. If you skip caching and recompute the embedding per request, you add 50-200ms of GPU time per request for no benefit.
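The compute-once pattern described above can be sketched as a small cache keyed by voice ID. This is a minimal illustration, not the XTTS-2 API: `EmbeddingCache` and `fake_encoder` are hypothetical names, and the encoder is a stand-in for the real GPU embedding call.

```python
from typing import Callable, Dict


class EmbeddingCache:
    """Compute-once cache: the encoder runs only on the first request per voice."""

    def __init__(self, compute: Callable[[str], object]):
        self._compute = compute              # expensive encoder (GPU call in production)
        self._store: Dict[str, object] = {}  # voice_id -> cached embedding
        self.misses = 0                      # how many times the encoder actually ran

    def get(self, voice_id: str, reference_path: str) -> object:
        if voice_id not in self._store:
            self.misses += 1
            self._store[voice_id] = self._compute(reference_path)
        return self._store[voice_id]


def fake_encoder(path: str) -> list:
    # Hypothetical stand-in for the real speaker encoder.
    return [float(len(path))]


cache = EmbeddingCache(fake_encoder)
first = cache.get("alice", "alice_reference.wav")
second = cache.get("alice", "alice_reference.wav")  # served from the cache
```

After the two calls above, `cache.misses` is 1: the second request reuses the stored embedding instead of re-encoding the reference clip.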

Reference audio pipeline. XTTS-2 requires at least 3 seconds of clean reference audio. The clip goes through preprocessing (resampling to 22kHz, channel downmix, normalization) before embedding computation. Noisy reference audio degrades speaker similarity directly. A 3-second clip recorded in a quiet room at 44kHz will produce better embeddings than a 10-second clip with background music. The model has no mechanism to separate signal from noise at the embedding stage.
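Before a clip reaches the embedding stage, it helps to check its basic properties. The sketch below uses only the standard-library `wave` module to report duration, channel count, and sample rate for a WAV file. The function name, the returned keys, and the exact 22050 Hz target are assumptions for illustration; it does not measure noise, which still needs a listening check or a separate tool.

```python
import wave

MIN_SECONDS = 3.0      # minimum reference length stated above
TARGET_RATE = 22050    # assumed exact value for the "22kHz" preprocessing target


def inspect_reference_clip(path: str) -> dict:
    """Read WAV header fields and flag which preprocessing steps the clip needs."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.getnframes()
    duration = frames / float(rate)
    return {
        "channels": channels,
        "sample_rate": rate,
        "duration_s": duration,
        "long_enough": duration >= MIN_SECONDS,
        "needs_downmix": channels > 1,           # stereo must be mixed to mono
        "needs_resample": rate != TARGET_RATE,   # e.g. a 44.1kHz recording
    }
```

Rejecting clips where `long_enough` is false at enrollment time is cheaper than discovering poor speaker similarity after the embedding has been cached.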

Inference latency profile. XTTS-2 uses autoregressive decoding: each audio token depends on all previous tokens. This means time to first audio (TTFA) scales with the length of the text being synthesized. The rough model is:

TTFA (ms) = embedding_lookup_time_ms + 1000 * (char_count / chars_per_second_on_GPU)

For XTTS-2 on an H100 SXM5 generating the first sentence of a longer passage, expect 200-400ms TTFA. F5-TTS uses flow-matching diffusion instead of autoregressive decoding, which eliminates this accumulation effect and gives more consistent TTFA regardless of passage length.
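The rough TTFA model can be written as a small helper. The function and parameter names are illustrative, and the throughput term is converted from seconds to milliseconds so both terms share a unit. It is a back-of-envelope estimate, not a benchmark.

```python
def estimate_ttfa_ms(char_count: int, chars_per_second: float,
                     embedding_lookup_ms: float = 0.0) -> float:
    """Estimate time to first audio for an autoregressive TTS model, in milliseconds."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    synthesis_ms = 1000.0 * char_count / chars_per_second
    return embedding_lookup_ms + synthesis_ms


# A 150-character first sentence at ~600 chars/sec with a cached embedding:
first_sentence_ms = estimate_ttfa_ms(150, 600)
```

At roughly 600 characters per second, a 150-character first sentence lands at 250ms, inside the 200-400ms range quoted above.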

Model Landscape 2026: XTTS-2, F5-TTS, and OpenVoice V2

XTTS-2 (Coqui TTS)

XTTS-2 is a GPT-2 style autoregressive decoder paired with a VQVAE audio codec. The model weights are 1.88GB at FP32. Total VRAM at inference, including CUDA buffers and the speaker conditioning machinery, runs 3-4GB per worker at FP16. It supports 17 languages and can clone a voice across languages: you can clone an English speaker and synthesize French output with that speaker's vocal characteristics.

The original Coqui AI company shut down in January 2024. The model is now maintained by the community and remains available as xtts_v2 on the Hugging Face Hub. The correct install package is coqui-tts (not the abandoned TTS package): pip install coqui-tts. The model ID for download is tts_models/multilingual/multi-dataset/xtts_v2.

License note: XTTS-2 is released under the Coqui Public Model License (CPML), which restricts use to non-commercial purposes. Commercial use requires a separate commercial license; review the license terms before deploying XTTS-2 in any revenue-generating product.

F5-TTS (SWivid)

F5-TTS replaces the autoregressive decoder with a flow-matching diffusion model. The E2E variant eliminates the text encoder entirely by feeding grapheme input directly, reducing hallucination risk on longer passages. Model size is approximately 500M parameters, with a VRAM footprint of 2-3GB at FP16.

The practical advantage of flow-matching over autoregressive decoding is error accumulation. Autoregressive models have a known failure mode on long passages: each token prediction depends on all previous tokens, and early errors propagate. F5-TTS avoids this by computing output in parallel denoising steps rather than token-by-token. This makes it noticeably better for paragraph-length synthesis where XTTS-2 sometimes drifts in pronunciation or prosody near the end of long segments.

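F5-TTS supports English and Chinese. The code is MIT licensed; the pretrained checkpoints carry their own license on the model card, so check it before commercial use. The canonical model on HuggingFace is SWivid/F5-TTS with checkpoint F5TTS_v1_Base.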

OpenVoice V2 (MyShell)

OpenVoice V2 separates base TTS from voice cloning. The base TTS layer (MeloTTS) handles text-to-speech in a neutral voice. A lightweight tone-color converter then shifts the base output to match the target speaker embedding. The converter processes audio in under 100ms regardless of audio length, since it operates on the output waveform rather than doing full autoregressive decoding for each synthesis call.

This architecture is the right choice when your application pre-registers a set of voices at onboarding time and then serves those voices repeatedly. You pay the embedding computation cost once per registered voice, and after that the per-request overhead is just the MeloTTS base synthesis plus the roughly 100ms converter pass.

OpenVoice V2 requires MeloTTS as a dependency: pip install MeloTTS alongside the OpenVoice install. Without MeloTTS, the base TTS stage is missing and the pipeline cannot produce audio. OpenVoice V2 is MIT licensed.

Comparison Table

| Model | Params | VRAM (FP16) | TTFA RTX 5090 | Languages | License | Best For |
|---|---|---|---|---|---|---|
| XTTS-2 | ~1.5B | 3-4GB | ~3.5s | 17 | CPML | Cross-lingual cloning, widest language coverage |
| F5-TTS (E2E) | ~500M | 2-3GB | ~2.8s | English, Chinese | MIT | Long-form English synthesis, lower hallucination |
| OpenVoice V2 | ~300M (converter) | ~500MB converter | ~100ms (converter only) | English | MIT | Pre-registered voice catalogues, minimal per-request cost |

Hardware Requirements and GPU Sizing

XTTS-2 at FP16 with CUDA buffers occupies roughly 3-4GB per worker. F5-TTS runs slightly lighter at 2-3GB per worker. OpenVoice V2's converter is ~500MB, but MeloTTS base TTS adds 1-2GB. Worker counts below assume a 2GB safety margin on total VRAM.

| GPU | VRAM | XTTS-2 Workers | F5-TTS Workers | Pricing | Recommended Use |
|---|---|---|---|---|---|
| RTX 5090 | 32GB GDDR7 | 6-8 | 8-10 | Per-hour rate varies | Dev/testing, small-scale cloning API |
| L40S PCIe | 48GB GDDR6 | 8-12 | 12-16 | ~$0.91/hr on-demand | Production, multi-model, embedding cache |
| H100 SXM5 | 80GB HBM3 | 16-20 | 20-26 | $3.90/hr on-demand | High-concurrency multi-tenant APIs |

For batched workloads where many synthesis requests queue against a worker pool, throughput scales with worker count. A single XTTS-2 worker processes approximately 300-600 characters per second depending on GPU tier. An 8-worker deployment on an L40S can sustain 2,400-4,800 characters per second throughput under load.

The memory bandwidth gap between GPU tiers matters for autoregressive decoding. XTTS-2's GPT-2 decoder reads model weights for each decode step. The H100 SXM5's HBM3 (3.35 TB/s) moves weights out of memory roughly 2x faster than the RTX 5090's GDDR7 and nearly 4x faster than the L40S's GDDR6 (~864 GB/s), which translates directly to higher tokens per second at the same batch size.

Spheron L40S instances strike a good balance for most production voice cloning workloads: enough VRAM to run 8-12 XTTS-2 workers or a mixed XTTS-2 plus F5-TTS deployment, at lower cost than an H100 SXM5 for single-model setups.

Container Setup and Inference Servers

FastAPI Wrapper for XTTS-2

The key optimization in a production XTTS-2 server is precomputing speaker embeddings at startup rather than per request. Each reference audio must go through preprocessing and encoding once. After that, the embedding can be cached in a dict and reused across all synthesis calls for that speaker.

python
from fastapi import FastAPI, HTTPException
from fastapi.responses import Response
from pydantic import BaseModel
from TTS.api import TTS
import io, numpy as np, soundfile as sf, threading, torch

app = FastAPI()
model = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
# Serialize inference calls: PyTorch autoregressive inference is not thread-safe.
# FastAPI runs sync endpoints in a thread pool, so concurrent requests would call
# the same model from different threads simultaneously without this lock.
_inference_lock = threading.Lock()

speaker_cache: dict[str, dict] = {}

@app.on_event("startup")
async def load_speaker_embeddings():
    import os, json
    embeddings_dir = "./speaker_embeddings"
    if os.path.exists(embeddings_dir):
        for fname in os.listdir(embeddings_dir):
            if fname.endswith(".json"):
                voice_id = fname[:-5]
                with open(f"{embeddings_dir}/{fname}") as f:
                    data = json.load(f)
                speaker_cache[voice_id] = {
                    "gpt_cond_latent": torch.tensor(data["gpt_cond_latent"]).to("cuda"),
                    "speaker_embedding": torch.tensor(data["speaker_embedding"]).to("cuda"),
                }

class SynthRequest(BaseModel):
    text: str
    voice_id: str
    language: str = "en"

@app.post("/synthesize")
def synthesize(req: SynthRequest):
    if req.voice_id not in speaker_cache:
        raise HTTPException(status_code=404, detail="voice_id not found")
    embeddings = speaker_cache[req.voice_id]
    with _inference_lock:
        wav = model.synthesizer.tts_model.inference(
            req.text,
            req.language,
            embeddings["gpt_cond_latent"],
            embeddings["speaker_embedding"],
        )
    samples = np.array(wav["wav"], dtype=np.float32)
    buf = io.BytesIO()
    sf.write(buf, samples, samplerate=24000, format="WAV")
    return Response(content=buf.getvalue(), media_type="audio/wav")

@app.get("/health")
async def health():
    return {"status": "ok", "voices": list(speaker_cache.keys())}

To pre-compute and serialize a speaker embedding from a reference clip:

python
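from TTS.api import TTS
import json, os

model = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
# get_conditioning_latents lives on the underlying XTTS model, not on the TTS wrapper.
gpt_cond_latent, speaker_embedding = model.synthesizer.tts_model.get_conditioning_latents(
    audio_path=["reference.wav"]
)
embedding_data = {
    "gpt_cond_latent": gpt_cond_latent.cpu().tolist(),
    "speaker_embedding": speaker_embedding.cpu().tolist(),
}
# Create the directory the server reads from at startup if it does not exist yet.
os.makedirs("speaker_embeddings", exist_ok=True)
with open("speaker_embeddings/my_voice.json", "w") as f:
    json.dump(embedding_data, f)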

F5-TTS REST API

F5-TTS ships with a Gradio-based inference server that exposes a REST-style endpoint. For integration into a production API, wrap the CLI inference path in FastAPI:

python
from fastapi import FastAPI, HTTPException
from fastapi.background import BackgroundTask
from fastapi.responses import FileResponse
from pydantic import BaseModel
from pathlib import Path
import subprocess, tempfile, os

app = FastAPI()

SAFE_VOICES_DIR = Path("./voices").resolve()

class F5Request(BaseModel):
    ref_audio_path: str
    gen_text: str
    cross_fade_duration: float = 0.15

@app.post("/synthesize")
def synthesize(req: F5Request):
    safe_path = (SAFE_VOICES_DIR / req.ref_audio_path).resolve()
    if not safe_path.is_relative_to(SAFE_VOICES_DIR):
        raise HTTPException(status_code=400, detail="Invalid ref_audio_path")
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        output_path = tmp.name
    try:
        subprocess.run([
            "python", "src/f5_tts/infer/infer_cli.py",
            "--model", "F5-TTS",
            "--ref_audio", str(safe_path),
            "--gen_text", req.gen_text,
            "--output_file", output_path,
            "--cross_fade_duration", str(req.cross_fade_duration),
        ], check=True)
    except Exception:
        os.unlink(output_path)
        raise
    return FileResponse(output_path, media_type="audio/wav",
                        background=BackgroundTask(os.unlink, output_path))

Set cross_fade_duration=0.15 for smoother joins when streaming sentence-by-sentence output. Lower values (0.05-0.10) give tighter sentence boundaries at the cost of occasional mild artifacts at the join point.

Triton Custom Backend

For sustained concurrency above ~200 simultaneous synthesis streams, the Python-level FastAPI approach becomes a bottleneck. XTTS-2's transformer encoder (which converts text to a conditioning representation) is stateless and can be exported to ONNX and served as a Triton model. The autoregressive decoder requires a custom Python backend in Triton since it maintains per-session state.

This architecture is significantly more complex than the FastAPI approach but is the right choice for multi-tenant APIs where the decoder state per session grows with conversation length. Below 200 concurrent streams, the FastAPI wrapper with async workers handles load adequately.

Production Deployment Walkthrough

The Docker Compose setup below runs XTTS-2 as a FastAPI server with a persistent speaker embeddings volume:

yaml
version: "3.9"
services:
  xtts_api:
    image: python:3.11-slim
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ./app:/app
      - ./speaker_embeddings:/app/speaker_embeddings
    working_dir: /app
    command: >
      bash -c "pip install coqui-tts fastapi uvicorn &&
               uvicorn server:app --host 0.0.0.0 --port 8880 --workers 4"
    ports:
      - "8880:8880"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

For Nginx in front:

nginx
upstream xtts_backend {
    server 127.0.0.1:8880;
    keepalive 64;
}

server {
    listen 80;
    location /synthesize {
        proxy_pass http://xtts_backend;
        proxy_read_timeout 30s;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
    location /health {
        proxy_pass http://xtts_backend;
    }
}

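For streaming output, XTTS-2 supports chunk-by-chunk generation through the inference_stream generator on the underlying model. Load the model once at server startup; streaming itself is requested per call:

python
model = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
# Streaming is driven per request by model.synthesizer.tts_model.inference_stream(...).
# Its stream_chunk_size argument controls how many decoder tokens are generated
# before each audio chunk is yielded (smaller values lower first-chunk latency).

Each chunk arrives as a tensor of float PCM samples at 24kHz. Move it to the CPU, convert it to bytes, and stream it over WebSocket to the client: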

python
import asyncio
from fastapi import WebSocket

@app.websocket("/stream")
async def stream_synthesis(ws: WebSocket):
    await ws.accept()
    try:
        data = await ws.receive_json()
        if data["voice_id"] not in speaker_cache:
            await ws.send_json({"error": "voice_id not found"})
            return
        embeddings = speaker_cache[data["voice_id"]]
        # inference_stream is a synchronous generator that blocks during GPU decoding.
        # Iterating it directly on the event loop thread would stall all other
        # connections for the full synthesis duration. Instead, push chunks from a
        # background thread into a queue and await each one here.
        queue: asyncio.Queue = asyncio.Queue()
        loop = asyncio.get_running_loop()

        def run_inference():
            try:
                with _inference_lock:
                    for chunk in model.synthesizer.tts_model.inference_stream(
                        data["text"], data["language"], embeddings["gpt_cond_latent"],
                        embeddings["speaker_embedding"]
                    ):
                        loop.call_soon_threadsafe(queue.put_nowait, chunk)
            finally:
                loop.call_soon_threadsafe(queue.put_nowait, None)  # always sent

        thread = threading.Thread(target=run_inference, daemon=True)
        thread.start()
        while True:
            chunk = await queue.get()
            if chunk is None:
                break
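            # inference_stream yields torch tensors on the GPU; torch tensors have no
            # .tobytes(), so copy to CPU and convert to a float32 NumPy array first.
            await ws.send_bytes(chunk.squeeze().cpu().numpy().astype(np.float32).tobytes())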
    finally:
        await ws.close()

For async batch jobs (dubbing pipelines, automated call recordings), POST the final WAV to a webhook URL instead of streaming:

python
import asyncio
import httpx
import ipaddress
import socket
import ssl
from urllib.parse import urlparse

_BLOCKED_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("169.254.0.0/16"),
    ipaddress.ip_network("::1/128"),
    ipaddress.ip_network("fc00::/7"),
    ipaddress.ip_network("fe80::/10"),
]

class _BlockedRangeError(ValueError):
    pass

def _is_blocked(addr: ipaddress.IPv4Address | ipaddress.IPv6Address) -> bool:
    # IPv4-mapped IPv6 addresses (e.g. ::ffff:10.0.0.1) have version == 6, so
    # they skip all IPv4 blocked ranges when checked with version matching alone.
    # Unwrap the mapped IPv4 address and test it against the blocked ranges too.
    effective = addr.ipv4_mapped if addr.ipv4_mapped is not None else addr
    return any(effective.version == net.version and effective in net for net in _BLOCKED_RANGES)

async def _validate_webhook_url(url: str) -> str:
    """Validates the webhook URL and returns the pre-resolved IP to prevent DNS rebinding."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("webhook_url must use https")
    if not parsed.hostname:
        raise ValueError("webhook_url has no hostname")
    try:
        addr = ipaddress.ip_address(parsed.hostname)
        if _is_blocked(addr):
            raise _BlockedRangeError("webhook_url resolves to a blocked IP range")
        return str(addr)
    except _BlockedRangeError:
        raise
    except ValueError:
        try:
            # Use asyncio.to_thread so the blocking DNS syscall does not stall
            # the event loop while waiting for the resolver to respond.
            results = await asyncio.to_thread(socket.getaddrinfo, parsed.hostname, None)
        except socket.gaierror as e:
            raise ValueError(f"Cannot resolve hostname: {parsed.hostname}") from e
        resolved_ip = None
        for _, _, _, _, sockaddr in results:
            resolved_addr = ipaddress.ip_address(sockaddr[0])
            if _is_blocked(resolved_addr):
                raise _BlockedRangeError("webhook_url resolves to a blocked IP range")
            if resolved_ip is None:
                resolved_ip = sockaddr[0]
        if resolved_ip is None:
            raise ValueError(f"No usable address resolved for {parsed.hostname}")
        return resolved_ip

async def deliver_to_webhook(webhook_url: str, audio_bytes: bytes):
    resolved_ip = await _validate_webhook_url(webhook_url)
    parsed = urlparse(webhook_url)
    # Connect directly to the pre-validated IP to prevent DNS rebinding: httpx
    # would otherwise re-resolve the hostname independently, letting an attacker
    # flip the DNS record between validation and the actual request.
    port = parsed.port or 443
    path = (parsed.path or "/") + ("?" + parsed.query if parsed.query else "")
    host = f"[{resolved_ip}]" if ":" in resolved_ip else resolved_ip
    ip_url = f"https://{host}:{port}{path}"
    # Override TLS SNI to the original hostname so certificate validation succeeds.
    # Without this, httpx sends the IP address as the SNI value, which won't match
    # the server's domain certificate and causes SSLCertVerificationError.
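    async with httpx.AsyncClient() as client:
        response = await client.post(
            ip_url, content=audio_bytes,
            headers={"Content-Type": "audio/wav", "Host": parsed.hostname},
            extensions={"sni_hostname": parsed.hostname.encode()},
        )
        # Surface 4xx/5xx responses instead of silently dropping a failed delivery.
        response.raise_for_status()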

Cost Benchmarks: Per 1,000 Characters on Spheron

Methodology: Characters per second is measured at batch size 1 (single concurrent synthesis stream), which represents typical voice agent workloads. GPU-hours per 1,000 characters derives from that throughput. Cost is (GPU_seconds / 3600) * hourly_rate.

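| GPU | Tier | Price | XTTS-2 chars/sec | XTTS-2 cost per 1k chars | F5-TTS chars/sec | F5-TTS cost per 1k chars |
|---|---|---|---|---|---|---|
| H100 SXM5 | On-demand | $3.90/hr | ~600 | ~$0.0018 | ~480 | ~$0.0023 |
| H100 SXM5 | Spot | $1.73/hr | ~600 | ~$0.0008 | ~480 | ~$0.0010 |
| H200 SXM5 | On-demand | $2.51/hr | ~700 | ~$0.0010 | ~560 | ~$0.0012 |
| H200 SXM5 | Spot | $1.40/hr | ~700 | ~$0.0006 | ~560 | ~$0.0007 |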

For RTX 5090 (~350 chars/sec XTTS-2) and L40S PCIe (~420 chars/sec XTTS-2), check current per-hour rates on their rental pages and apply the same formula: (1000 / chars_per_sec / 3600) * hourly_rate.
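The formula above is easy to apply to any GPU tier once its hourly rate is known. The helper below is a minimal sketch with an illustrative function name; the throughput and rate values passed in are the ones quoted in this section.

```python
def cost_per_1k_chars(chars_per_sec: float, hourly_rate_usd: float) -> float:
    """Cost in USD to synthesize 1,000 characters at a given throughput and hourly rate."""
    gpu_seconds = 1000.0 / chars_per_sec          # seconds of GPU time per 1k characters
    return gpu_seconds / 3600.0 * hourly_rate_usd


h100_on_demand = cost_per_1k_chars(600, 3.90)     # XTTS-2 on H100 SXM5 on-demand
h200_spot = cost_per_1k_chars(700, 1.40)          # XTTS-2 on H200 SXM5 spot
```

Rounded to four decimal places, these evaluate to 0.0018 and 0.0006, matching the corresponding XTTS-2 entries in the table.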

Pricing fluctuates based on GPU availability. The prices above are based on 31 May 2026 and may have changed. Check current GPU pricing for live rates.

The H200 SXM5 spot tier at $1.40/hr is the cheapest per-character option: ~$0.0006/1k chars for XTTS-2. For comparison, ElevenLabs charges $0.18-$0.30 per 1k chars. Self-hosting on Spheron is 100-500x cheaper per character once the GPU is allocated.

Comparison: Self-Hosted vs. ElevenLabs and Cartesia

ElevenLabs Pricing Model

ElevenLabs uses character-based billing. The Creator plan runs approximately $0.00022/character. The Scale plan (with higher concurrency limits) is roughly $0.000165/character on committed bundles, with overage at $0.00030/character. At 5 million characters per month:

  • Creator plan: ~$1,100/month
  • Scale plan committed: ~$825/month

These costs are predictable but scale linearly. There is no volume discount above the plan tier.

Break-Even Table

| Monthly volume | ElevenLabs (Creator) | ElevenLabs (Scale) | H100 SXM5 spot (burst) | H100 SXM5 (24/7 spot) |
|---|---|---|---|---|
| 500k chars | ~$110 | ~$83 | ~$0.40 | ~$1,246 |
| 1M chars | ~$220 | ~$165 | ~$0.80 | ~$1,246 |
| 5M chars | ~$1,100 | ~$825 | ~$4.00 | ~$1,246 |
| 10M chars | ~$2,200 | ~$1,650 | ~$8.00 | ~$1,246 |
| 50M chars | ~$11,000 | ~$8,250 | ~$40 | ~$1,246 |

"Burst" pricing assumes you only run the GPU when synthesizing (billed per second on Spheron). "24/7" assumes an always-on instance at 720 hrs/month.

For workloads that generate voice on demand rather than constantly, self-hosting wins on compute cost at any volume above roughly 100k characters per month. For workloads requiring an always-on voice API with sub-500ms cold start, the H100 SXM5 running 24/7 at spot rates (720 hrs at $1.73/hr, about $1,246/month) breaks even against ElevenLabs at around 5.7M characters per month on the Creator rate and about 7.5M on the Scale committed rate.
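The always-on break-even point follows from dividing the fixed monthly GPU cost by the managed API's per-character price. The sketch below uses the $1.73/hr spot rate, the 720-hour month, and the per-character prices quoted in this section; the function names are illustrative.

```python
def always_on_monthly_cost(hourly_rate_usd: float, hours_per_month: float = 720.0) -> float:
    """Fixed monthly cost of keeping one GPU instance running continuously."""
    return hourly_rate_usd * hours_per_month


def break_even_chars(monthly_gpu_cost_usd: float, api_price_per_char_usd: float) -> float:
    """Monthly character volume at which the managed API bill equals the GPU bill."""
    return monthly_gpu_cost_usd / api_price_per_char_usd


gpu_cost = always_on_monthly_cost(1.73)                   # H100 SXM5 spot, 24/7
creator_break_even = break_even_chars(gpu_cost, 0.00022)  # vs. Creator rate
scale_break_even = break_even_chars(gpu_cost, 0.000165)   # vs. Scale committed rate
```

With these inputs the break-even volume is a little under 5.7M characters per month against the Creator rate and roughly 7.5M against the Scale committed rate. Engineering time for deployment and monitoring is not included in either figure.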

H100 on Spheron is the right tier for teams running multi-tenant cloning APIs where the GPU needs to be warm at all times. For burst workloads, any tier works since you only pay for GPU time used.

When to Stay on Managed APIs

Managed voice APIs make sense when:

  • Volume is under 200k characters per month. Below this threshold, the infrastructure overhead of self-hosting (deployment, monitoring, model updates) costs more than the character billing savings.
  • You need Cartesia's Sonic model. Sonic is not available as an open-weight model. If your use case requires Sonic's specific quality characteristics, there is no self-hosted substitute.
  • You need guaranteed uptime SLAs. ElevenLabs and Cartesia offer contractual uptime guarantees. Self-hosted GPU infrastructure on cloud has inherent spot preemption risk.
  • Your team has no GPU infrastructure experience. A misconfigured XTTS-2 deployment that crashes under load costs more than managed API billing.

Voice cloning infrastructure must handle consent and audit trails at the application layer. No open-weight model enforces this for you.

At enrollment time (when a user registers their voice), store:

  1. A consent timestamp alongside the speaker_id
  2. The audio source type (user-recorded vs. uploaded third-party clip)
  3. A hash of the reference audio for audit purposes

python
import hashlib
from datetime import datetime, timezone

def enroll_speaker(speaker_id: str, audio_path: str, user_consent: bool,
                   source_type: str = "user_recorded"):
    if not user_consent:
        raise ValueError("consent required before enrollment")
    with open(audio_path, "rb") as f:
        audio_hash = hashlib.sha256(f.read()).hexdigest()
    enrollment_record = {
        "speaker_id": speaker_id,
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
        "enrolled_at": datetime.now(timezone.utc).isoformat(),
        # Audio source type, e.g. "user_recorded" or "third_party_upload"
        "source_type": source_type,
        "audio_hash": audio_hash,
        "consent": True,
    }
    # Persist to your database alongside the speaker embedding
    return enrollment_record

Watermarking

AudioSeal (Meta, MIT license) adds imperceptible per-sample watermarks to synthesized audio. The watermark survives MP3 re-encoding at 128kbps and moderate noise injection. SilentCipher (Sony) is an alternative with similar resistance to MP3 re-encoding and noise injection.

Apply AudioSeal at synthesis time:

python
from audioseal import AudioSeal

watermarker = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

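def watermark_audio(audio_tensor, sample_rate: int = 24000):
    # audio_tensor shape: (batch, channels, samples). The generator's forward pass
    # returns the watermarked audio tensor directly, not a tuple.
    return watermarker(audio_tensor, sample_rate=sample_rate)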

def verify_watermark(audio_tensor, sample_rate: int = 24000):
    result, _ = detector(audio_tensor, sample_rate=sample_rate)
    return result[:, 1, :].mean().item() > 0.5

Detection Patterns

Run a /verify endpoint in your API that accepts an uploaded audio file and returns whether the watermark is present. This gives you a mechanism to investigate complaints about generated audio.

The baseline policy: consent at enrollment, watermark at generation, detect on complaint. Do not pre-stamp user-recorded audio (only apply watermarks to model-generated output). Log every synthesis call with the speaker_id, text, and output audio hash for a complete audit trail.
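The per-call audit log described above can be as simple as one record per synthesis request. The sketch below is a minimal illustration using the standard library; the function name and field names are assumptions, and where the record is persisted is left to your own storage layer.

```python
import hashlib
import json
from datetime import datetime, timezone


def synthesis_audit_record(speaker_id: str, text: str, audio_bytes: bytes) -> dict:
    """Build one audit log entry for a synthesis call."""
    return {
        "speaker_id": speaker_id,
        "text": text,
        # Hash of the generated audio, so a disputed clip can be matched to a call.
        "audio_sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


record = synthesis_audit_record("voice_123", "Hello.", b"RIFF....WAVE")
log_line = json.dumps(record)  # one JSON object per line is easy to ship to a log store
```

If the watermarked output is what leaves your API, hash the watermarked bytes rather than the pre-watermark audio so the log matches what recipients actually hold.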

Integration with the Broader Voice AI Stack

Voice cloning TTS is the synthesis layer in a cascaded voice agent pipeline. The full architecture:

ASR (faster-whisper) -> LLM (7B-13B, vLLM) -> Voice Clone TTS (XTTS-2 / F5-TTS) -> WebRTC delivery

Latency budget per stage on an H100 SXM5 or L40S:

  • ASR: 30-80ms (faster-whisper large-v3-turbo INT8)
  • LLM TTFT: 150-250ms (7B model, batch size 1)
  • TTS TTFA: 200-400ms (XTTS-2 first chunk, sentence-streaming)
  • Total: 380-730ms end-to-end
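The total is just the sum of the per-stage lower and upper bounds. A short sketch that keeps the budget in one place, so a change to one stage updates the total (the dictionary and function names are illustrative):

```python
# (low_ms, high_ms) per stage, taken from the list above.
STAGES_MS = {
    "asr": (30, 80),
    "llm_ttft": (150, 250),
    "tts_ttfa": (200, 400),
}


def total_budget_ms(stages: dict) -> tuple:
    """Sum the best-case and worst-case latency across all pipeline stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


budget = total_budget_ms(STAGES_MS)
```

Network transit and audio playout buffering sit outside this budget and need to be added for a user-perceived figure.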

For the WebRTC transport layer and barge-in handling, see the WebRTC LLM streaming voice agent guide. For speech-to-speech models that skip the text bottleneck entirely, see the speech-to-speech deployment guide.

Running XTTS-2 on a bare-metal RTX 5090 on Spheron alongside faster-whisper and a 7B LLM is viable if the LLM fits within the remaining VRAM budget: 32GB total minus ~7-8GB for ASR + TTS leaves ~24GB for the LLM. At FP16 that covers a 7B-8B model (~14-16GB of weights) with room for KV cache; an 11B model needs ~22GB for weights alone and only fits with quantization. For the 13B LLM tier or multi-worker TTS setups, move to L40S or H100 SXM5.
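The VRAM budget arithmetic can be checked with a short helper. This is a rough sketch: it counts only weight memory at a given precision plus a fixed allowance for KV cache and activations, and the 4GB default allowance is an assumption, not a measured figure.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB (FP16 = 2 bytes per parameter)."""
    return params_billion * bytes_per_param


def fits_in_budget(params_billion: float, budget_gb: float,
                   kv_headroom_gb: float = 4.0, bytes_per_param: float = 2.0) -> bool:
    """True if weights plus an assumed KV-cache/activation allowance fit the budget."""
    return weights_gb(params_billion, bytes_per_param) + kv_headroom_gb <= budget_gb


llm_budget_gb = 32 - 8  # RTX 5090 total minus ~8GB for ASR + TTS
seven_b_ok = fits_in_budget(7, llm_budget_gb)
eleven_b_ok = fits_in_budget(11, llm_budget_gb)
```

Under these assumptions a 7B model at FP16 (14GB of weights) fits the ~24GB budget, while an 11B model at FP16 (22GB of weights) does not leave room for the assumed allowance.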

For teams building multi-tenant cloning APIs where hundreds of speaker embeddings need to be cached in VRAM simultaneously, the H100 SXM5's 80GB is the right minimum. Each cached embedding is relatively small (~1MB), but the model weights plus the embedding cache plus activation buffers add up quickly under concurrent load.


Voice cloning inference costs on Spheron run 100-500x lower than ElevenLabs per character at moderate volume, using the same open-weight models your team already knows. L40S and H100 SXM5 instances handle production-grade concurrent synthesis without the per-character billing that kills margins at scale.


Quick Setup Guide

  1. Choose a voice cloning model based on your latency and license requirements

    Compare XTTS-2 (3.5s TTFA on RTX 5090, CPML license, 17 languages, requires 3-second reference clip), F5-TTS (2.8s TTFA on RTX 5090, MIT license, flow-matching backbone, lower hallucination rate on long text), and OpenVoice V2 (sub-100ms tone conversion after embedding, MIT license, best for pre-registered voices at scale). Pick XTTS-2 for multilingual cross-lingual use cases. Pick F5-TTS for long-form English synthesis. Pick OpenVoice V2 when voice catalogues are known upfront.

  2. Provision a GPU instance on Spheron

    Go to app.spheron.ai and deploy a GPU instance. For single-model development, an RTX 5090 (32GB GDDR7) handles 6-8 concurrent XTTS-2 workers. For production multi-model setups, an L40S PCIe (48GB VRAM) fits 8-12 XTTS-2 workers plus an embedding cache. For high-concurrency APIs, an H100 SXM5 (80GB HBM3) handles 16-20 XTTS-2 workers. Select Ubuntu 22.04 and allocate at least 50GB disk for model weights and voice embedding caches.

  3. Install XTTS-2 via coqui-tts and run a basic clone

    pip install coqui-tts then: from TTS.api import TTS; tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2').to('cuda'); tts.tts_to_file(text='Hello.', speaker_wav='reference.wav', language='en', file_path='output.wav'). Weights download automatically on first run (~1.88GB). For a FastAPI server, load the model once at startup and pre-compute speaker embeddings by voice_id to avoid re-processing reference audio on every request.

  4. Deploy F5-TTS with the inference server

    git clone https://github.com/SWivid/F5-TTS and pip install -e '.[eval]'. Download weights: huggingface-cli download SWivid/F5-TTS --local-dir ckpts/F5TTS_v1_Base. Start the Gradio API server: python src/f5_tts/infer/infer_gradio.py --server_port 7860 --server_name 0.0.0.0. Wrap infer_cli.py in a FastAPI route for REST access. Set cross_fade_duration=0.15 for smoother sentence joins during streaming.

  5. Deploy OpenVoice V2 with MeloTTS and the tone-color converter

    git clone https://github.com/myshell-ai/OpenVoice and pip install -e . plus pip install MeloTTS (the required base TTS dependency). Download the V2 checkpoints as described in the repository README and extract them to checkpoints_v2/. Two-stage use: run MeloTTS for base output, then apply the tone-color converter: from openvoice.api import ToneColorConverter; converter = ToneColorConverter('checkpoints_v2/converter/config.json', device='cuda'); converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth'); converter.convert(audio_src_path='base.wav', src_se=base_se, tgt_se=target_se, output_path='out.wav'). Pre-compute tgt_se at enrollment time.

  6. Configure streaming output and webhook delivery

    For real-time streaming, chunk synthesis by sentence: generate 1-2 sentences at a time and stream PCM audio bytes over WebSocket or HTTP chunked transfer. XTTS-2 supports streaming via its inference_stream generator. F5-TTS returns WAV segments per sentence. Set output sample rate to 24kHz and encode as 16-bit PCM for WebRTC compatibility. For async batch jobs, POST the final WAV or Opus-encoded WebM to a caller callback URL after synthesis completes.
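The sentence-level chunking in step 6 can be sketched with a simple splitter. This is a minimal illustration: the regex splits on `.`, `!`, or `?` followed by whitespace, which mishandles abbreviations such as "Dr." and decimal-adjacent punctuation, so a production pipeline would likely use a proper sentence segmenter. The function name is hypothetical.

```python
import re

# Split after sentence-ending punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(text: str, sentences_per_chunk: int = 2) -> list:
    """Group text into chunks of N sentences for incremental synthesis."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text.strip()) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


chunks = sentence_chunks("Hi there. How are you? Fine!", sentences_per_chunk=2)
```

Each chunk is then passed to the TTS model in order, and the resulting audio is sent to the client as soon as it is ready rather than after the full passage completes.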


Frequently Asked Questions

XTTS-2 model weights are approximately 1.88GB at FP32 and load to about 3-4GB total VRAM per worker once CUDA buffers are included. A single RTX 5090 (32GB GDDR7) handles 20-30 concurrent voice cloning streams at this footprint. For higher concurrency (50-100 simultaneous requests), an L40S (48GB VRAM) running 8-12 XTTS-2 workers fits everything in VRAM with headroom. H100 SXM5 (80GB HBM3) is the right choice for multi-tenant APIs where you need to serve dozens of cached speaker embeddings simultaneously.

XTTS-2 (Coqui) uses a VQVAE codec and GPT-2 style decoder with a 3-second reference audio requirement; it produces high-quality cross-lingual cloning and is the most widely deployed option. F5-TTS (SWivid) replaces autoregressive decoding with a flow-matching diffusion backbone, eliminating error accumulation on long passages and making it better for paragraph-length synthesis; it needs about 2-3GB VRAM for the full model. OpenVoice V2 (MyShell) separates base TTS from a lightweight tone-color converter, letting the cloning step run in under 100ms with no reference audio beyond the initial embedding; it's the best choice when you pre-register voices and need minimal per-request overhead.

ElevenLabs Creator and Scale plans charge $0.00018-$0.00030 per character for voice cloning output depending on the plan tier. At 5 million characters per month, that runs $900-$1,500/month. An H100 SXM5 on Spheron at $1.73/hr spot handles around 600 characters per second with XTTS-2, putting 5 million characters at roughly 2.3 GPU-hours, which is about $4 in compute. Self-hosting breaks even somewhere between 500,000 and 1,000,000 characters per month depending on GPU tier and how much idle time you're paying for.

XTTS-2 and F5-TTS technically run on CPU but produce audio 30-50x slower than real time, which is only viable for offline batch jobs. OpenVoice V2's tone-color converter is lighter (runs at 3-5x real time on a modern CPU core) but the base TTS model still needs GPU for real-time use. For any production voice cloning endpoint that must respond in under 2 seconds, a GPU is required.

XTTS-2 supports 17 languages including cross-lingual synthesis (clone an English voice and speak in French). F5-TTS supports English and Chinese with strong quality in both. OpenVoice V2 supports English and targets cross-lingual at the tone-conversion layer rather than the base model. If your agent must clone one speaker voice across multiple languages in a single session, XTTS-2 is the most capable option. For English-only or English/Chinese bilingual use, F5-TTS is worth testing for its lower error accumulation on longer passages.
