In mid-2025, Suno and Udio both faced copyright infringement lawsuits from major record labels. The fallout changed how many music-tech teams think about hosted AI music services: if the platform can be sued into settlement or shutdown, what happens to your pipeline? Self-hosted open-source models became the safe answer.
This post covers the four models worth deploying in 2026: YuE 7B, ACE-Step 3.5B, MusicGen Stereo, and Stable Audio Open 1.5. You'll get exact VRAM numbers, working deployment code, cost-per-song math using live Spheron pricing, and a batch pipeline design for catalog-scale production. This is distinct from the voice AI GPU infrastructure guide, which covers TTS and ASR pipelines. Music generation runs differently: longer generation windows, different VRAM profiles, and batch-oriented rather than real-time workloads.
For teams adding voice narration alongside generative music, the open-source TTS deployment guide covers Kokoro, Fish Speech, and Hume TADA with similar production detail.
The 2026 Open-Source Music AI Models Worth Deploying
Four models are worth considering for production deployment. Each targets a different use case and VRAM tier.
YuE 7B (m-a-p/YuE-s1-7B-anneal-en-cot) is an autoregressive transformer trained by Multimodal Art Projection. It generates full songs with vocal melodies following a provided lyric sheet, at 44.1kHz stereo output. Minimum VRAM is 16GB at fp16. Track duration is 3-4 minutes. License is Apache 2.0, which means commercial use is permitted. This is the strongest model for anything requiring lyrics and full song structure.
ACE-Step 3.5B (ACE-Step/ACE-Step-v1-3.5B) uses a diffusion architecture rather than autoregressive generation. The practical benefit is faster iteration and more direct style steering through the prompt. Minimum VRAM is 8GB, making it the only model here that fits on a 12GB consumer GPU. Output duration is configurable. License is Apache 2.0. It is newer and has less community benchmarking than YuE, so treat performance estimates as preliminary.
MusicGen Stereo (facebook/musicgen-stereo-large) from Meta generates stereo background music from a text prompt. Output caps at around 30 seconds per call, though you can chain calls for longer tracks. VRAM requirement is 12GB at fp16. License is CC BY-NC 4.0, which restricts commercial use. Best for background music, loop libraries, and non-commercial production tools.
Stable Audio Open 1.5 (stabilityai/stable-audio-open-1.5) is optimized for sound design and textural audio rather than songs. The 47-second output limit makes it unsuitable for full tracks, but ideal for sample generation, SFX, and ambient textures. VRAM is 12GB. License is the Stability AI Community License, which allows commercial use for projects under a revenue threshold; verify the current terms on the model card before shipping.
| Model | Params | VRAM (fp16) | Max Duration | Sample Rate | License | Best For |
|---|---|---|---|---|---|---|
| YuE | 6B (named 7B in the model ID) | 16GB | 3-4 min | 44.1kHz | Apache 2.0 | Full songs with lyrics |
| ACE-Step | 3.5B | 8GB | configurable | 44.1kHz | Apache 2.0 | Fast iteration, style steering |
| MusicGen Stereo | 3.3B | 12GB | ~30 seconds | 32kHz | CC BY-NC 4.0 | Background music loops |
| Stable Audio Open 1.5 | ~1.1B | 12GB | 47 seconds | 44.1kHz | SA Community | Sound design, textures |
GPU Selection: Matching Hardware to Workload
Three tiers cover the range from hobbyist to catalog-scale production.
RTX 5090 (32GB, $0.86/hr on-demand) covers single-stream YuE and can run ACE-Step and MusicGen simultaneously on the same GPU. For developers building tools or exploring models before committing to production infrastructure, this is the lowest-cost entry point. See the Spheron getting-started guide for SSH setup and instance launch steps. RTX 5090 rental
L40S (48GB, $0.72/hr on-demand, $0.32/hr spot) is the studio tier. Two concurrent YuE workers fit in 48GB with comfortable headroom for CUDA buffers; a third fits only if per-worker memory stays tight, since 3 × 16GB consumes the full card. At $0.32/hr spot, this is the best cost-per-song option for teams generating 100-500 tracks per day. L40S GPU rental
B200 (192GB, $2.06/hr spot, on-demand not currently available) handles 11+ concurrent YuE workers or a multi-model serving setup where YuE, ACE-Step, and MusicGen all run from the same instance. At full worker utilization, the per-song cost drops below the L40S single-worker cost; the sizing sketch after the table shows the arithmetic. B200 GPU rental
| GPU | VRAM | YuE Workers | ACE-Step Workers | MusicGen Workers | On-Demand $/hr | Spot $/hr |
|---|---|---|---|---|---|---|
| RTX 5090 | 32GB | 1 | 4 | 2 | $0.86 | N/A |
| L40S | 48GB | 2-3 | 5-6 | 3-4 | $0.72 | $0.32 |
| H200 SXM | 141GB | 8+ | 16+ | 11+ | $5.58 | $1.19 |
| B200 | 192GB | 11+ | 22+ | 15+ | N/A | $2.06 |
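The worker counts in the table fall out of a simple division: usable VRAM over per-worker footprint. A rough sizing sketch, assuming a ~2GB per-GPU CUDA context allowance (an assumption, not a measured number — benchmark before committing):

```python
CUDA_OVERHEAD_GB = 2  # assumed context/fragmentation allowance per GPU
PER_WORKER_GB = {"YuE": 16, "ACE-Step": 8, "MusicGen": 12}  # fp16 minimums from the model table

def max_workers(gpu_vram_gb: int, model: str) -> int:
    # Floor-divide the VRAM left after CUDA overhead by one worker's footprint
    return (gpu_vram_gb - CUDA_OVERHEAD_GB) // PER_WORKER_GB[model]

for gpu, vram in [("RTX 5090", 32), ("L40S", 48), ("H200 SXM", 141), ("B200", 192)]:
    print(gpu, {m: max_workers(vram, m) for m in PER_WORKER_GB})
```

The results land within a worker of the table's counts; real limits move with batch size and sequence length, so treat these as starting points.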
Deployment Recipes
YuE 7B with transformers
git clone https://github.com/multimodal-art-projection/YuE
cd YuE
pip install -r requirements.txt
huggingface-cli download m-a-p/YuE-s1-7B-anneal-en-cot --local-dir checkpoints/yue-7b

Run inference with fp16 precision:
python infer.py \
--stage1_model checkpoints/yue-7b \
--genre "pop rock" \
--lyrics path/to/lyrics.txt \
--output_dir ./outputs \
--torch_dtype float16

For repeated batch requests, enable torch.compile to get a 15-20% throughput improvement:
TORCH_COMPILE=1 python infer.py --stage1_model checkpoints/yue-7b ...

ACE-Step with diffusers pipeline
from diffusers import ACEStepPipeline
import torch

pipe = ACEStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    prompt="upbeat electronic with piano, 120bpm",
    duration=30.0
).audios[0]

For batch generation, clear the CUDA cache between jobs to avoid OOM errors accumulating across runs:
import torch

with torch.no_grad():
    for i, prompt in enumerate(prompts):
        audio = pipe(prompt=prompt, duration=30.0).audios[0]
        save_audio(audio, f"output_{i}.wav")  # save_audio: your own WAV-writing helper
        torch.cuda.empty_cache()

MusicGen Stereo with transformers
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy

processor = AutoProcessor.from_pretrained("facebook/musicgen-stereo-large")
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-stereo-large"
).to("cuda")

inputs = processor(
    text=["ambient piano background, cinematic, slow"],
    padding=True,
    return_tensors="pt",
).to("cuda")

audio_values = model.generate(**inputs, max_new_tokens=1500)

scipy.io.wavfile.write(
    "output.wav",
    rate=model.config.audio_encoder.sampling_rate,
    data=audio_values[0].cpu().numpy().T
)

max_new_tokens=1500 produces roughly 30 seconds. Scale up to 3000 for ~60 seconds, though quality can degrade past 30 seconds with MusicGen.
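MusicGen's EnCodec codes run at 50 tokens per second of audio, so a target duration converts to max_new_tokens with a single multiplication:

```python
MUSICGEN_TOKENS_PER_SECOND = 50  # EnCodec frame rate used by MusicGen

def duration_to_tokens(seconds: float) -> int:
    # max_new_tokens for a desired clip length
    return int(seconds * MUSICGEN_TOKENS_PER_SECOND)

duration_to_tokens(30)  # 1500, as above
duration_to_tokens(60)  # 3000
```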
Sliding window for long-form generation
MusicGen caps cleanly at 30 seconds per call. For longer tracks, chain calls with a 10-second overlap: generate the first 30 seconds, then generate the next 30 using the last 10 seconds of the previous clip as a conditioning prefix. Trim the overlap and concatenate.
import math

import numpy as np
import soundfile as sf
import torch

SAMPLE_RATE = 32000
OVERLAP_SECONDS = 10
SEGMENT_TOKENS = 1500  # ~30 seconds of new audio at 50 tokens/sec
target_duration_seconds = 120  # 2-minute track; adjust as needed
prompt = "ambient piano background, cinematic, slow"  # any text prompt; reuses processor/model from above

# max_new_tokens counts tokens generated beyond the audio prompt, so each
# segment contributes ~30s of net-new audio once the 10s prompt region is trimmed
num_segments = max(1, math.ceil(target_duration_seconds / 30))

segments = []
conditioning = None
for i in range(num_segments):
    if conditioning is not None:
        inputs = processor(
            text=[prompt],
            audio=conditioning,
            sampling_rate=SAMPLE_RATE,
            padding=True,
            return_tensors="pt"
        ).to("cuda")
    else:
        inputs = processor(text=[prompt], padding=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=SEGMENT_TOKENS)
    audio = output[0].cpu().numpy().T  # (samples, channels)
    overlap_samples = OVERLAP_SECONDS * SAMPLE_RATE
    if i > 0:
        audio = audio[overlap_samples:]  # drop the re-decoded prompt region
    segments.append(audio)
    conditioning = output[0].cpu().numpy()[:, -overlap_samples:]  # last 10s, (channels, samples)

full_track = np.concatenate(segments)
sf.write("long_track.wav", full_track, SAMPLE_RATE)Batch Generation Pipeline for Catalog Production
Batch Generation Pipeline for Catalog Production

For generating hundreds or thousands of tracks, Ray gives you per-worker GPU isolation and queue-based dispatch.
import ray
from diffusers import ACEStepPipeline
import torch

ray.init()

@ray.remote(num_gpus=0.25)
class MusicWorker:
    def __init__(self, model_id):
        self.pipe = ACEStepPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16
        ).to("cuda")

    def generate(self, prompt, duration):
        with torch.no_grad():
            audio = self.pipe(prompt=prompt, duration=duration).audios[0]
        torch.cuda.empty_cache()
        return audio

# Deploy 4 workers sharing a single L40S (0.25 GPU each, 4 × ~8GB fits in 48GB VRAM)
workers = [MusicWorker.remote("ACE-Step/ACE-Step-v1-3.5B") for _ in range(4)]

# Dispatch prompts from a list
futures = [
    workers[i % len(workers)].generate.remote(prompt, 30.0)
    for i, prompt in enumerate(prompts)
]
results = ray.get(futures)

Throughput math: benchmark your actual generation speed first. Divide 3600 by seconds-per-track to get tracks-per-hour per worker, then multiply by worker count. With MusicGen Stereo at 30-second clips on an H200 at roughly 4-6x realtime, each clip takes 5-7.5 seconds of compute, or about 480-720 clips per hour per worker.
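The same math as a sketch, with the H200 estimate above plus one illustrative L40S figure (the 43 seconds-per-clip number is an assumption chosen to show how the 12-worker sizing in the next paragraph works out — substitute your own benchmark):

```python
def clips_per_hour(seconds_per_clip: float, workers: int = 1) -> float:
    # 3600 seconds per hour divided by wall-clock time per clip, times worker count
    return 3600 / seconds_per_clip * workers

clips_per_hour(7.5)             # ~480/hr/worker (4x realtime on 30s clips, H200)
clips_per_hour(5.0)             # ~720/hr/worker (6x realtime, H200)
clips_per_hour(43, workers=12)  # ~1,000/hr across 3 L40S instances, 4 workers each
```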
For 1,000 tracks/hour with MusicGen Stereo 30-second clips: 3 L40S instances with 4 workers each (12 workers) gets you there with margin. See the Ray Serve GPU cloud deployment guide for the full autoscaling config and serving layer setup.
Streaming Inference for DAW Integration
Autoregressive models like YuE generate the full audio before any output is available. You cannot stream mid-generation. ACE-Step's diffusion architecture allows chunked generation: the first 10-second segment can play while the next generates, making it better suited for interactive DAW workflows.
The practical architecture: run a FastAPI server on the GPU instance, and connect to it via OSC or a VST bridge in your DAW (Max for Live or a Reaper script both work). The generation buffer should be 30-60 seconds ahead of playback.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from diffusers import ACEStepPipeline
import torch
import io
import soundfile as sf

app = FastAPI()

pipe = ACEStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    torch_dtype=torch.float16
).to("cuda")

@app.post("/generate")
def generate(prompt: str, duration: float = 10.0):
    with torch.no_grad():
        audio = pipe(prompt=prompt, duration=duration).audios[0]
    torch.cuda.empty_cache()
    buf = io.BytesIO()
    sf.write(buf, audio.T, 44100, format="WAV")
    buf.seek(0)
    return StreamingResponse(buf, media_type="audio/wav")

For sub-5-second UI previews, use ACE-Step at 5-second durations. At 3.5B parameters on an L40S, generation runs at roughly 2-4x realtime, so a 5-second preview takes about 1.5-2.5 seconds of compute and arrives inside a 5-second budget even with request overhead.
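A client-side smoke test for the endpoint, assuming the server listens on port 8000 (the hostname is a placeholder; prompt and duration travel as query parameters because the handler declares them as plain function arguments):

```python
import requests

resp = requests.post(
    "http://gpu-host:8000/generate",  # placeholder hostname
    params={"prompt": "soft ambient pad, sparse", "duration": 5.0},
    timeout=60,
)
with open("preview.wav", "wb") as f:
    f.write(resp.content)
```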
Copyright-Clean Training Data Considerations
This section covers what the model licenses and papers say. It is not legal advice. Consult counsel before commercial deployment.
YuE was trained on original data curated by Multimodal Art Projection, per the paper. ACE-Step's training data composition is described in its ACL paper as licensed audio. Both use Apache 2.0 licenses, which allow commercial use of the model itself. The training data provenance is a separate question from the model license, and independent legal review is appropriate before shipping commercial products.
MusicGen was trained on Meta's internal licensed music library and is not commercially licensed for output use (CC BY-NC 4.0). Using MusicGen output in a commercial product violates the license regardless of self-hosting.
Stable Audio Open 1.5 was trained on Freesound data with CC0, CC BY, and CC BY-NC licenses. The Stability AI Community License permits commercial use under a revenue threshold. Check the current terms on the model card before shipping.
Self-hosting does not change the legal status of model output. The model's training data and license terms govern commercial viability. When in doubt, use YuE or ACE-Step (Apache 2.0) and get legal sign-off before building a commercial product around any AI-generated audio.
Cost Per Minute of Generated Audio
Live pricing from the Spheron API (fetched 26 Apr 2026):
- L40S PCIe: $0.72/hr on-demand, $0.32/hr spot
- H200 SXM5: $5.58/hr on-demand, $1.19/hr spot
- RTX 5090: $0.86/hr on-demand
YuE 7B generates a 3-minute track in roughly 5 minutes of GPU time on an L40S (estimated based on model architecture and community reports; actual throughput varies with sequence length and hardware). That gives:
- Spot: 5/60 hr * $0.32 = $0.027/song ($0.009/min of audio)
- On-demand: 5/60 hr * $0.72 = $0.060/song ($0.020/min of audio)
ACE-Step at 30-second output, roughly 2 minutes GPU time on L40S spot (conservative estimate; actual time varies with diffusion step count and may be faster):
- Spot: 2/60 hr * $0.32 = $0.011/song ($0.022/min of audio)
MusicGen Stereo at 30-second output, roughly 30 seconds GPU time on L40S spot:
- Spot: 0.5/60 hr * $0.32 = $0.003/song ($0.006/min of audio)
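The per-song arithmetic generalizes to a one-liner worth keeping around for your own benchmarked generation times:

```python
def cost_per_song(gpu_minutes: float, hourly_rate_usd: float) -> float:
    # Fraction of an hour consumed times the hourly rental rate
    return gpu_minutes / 60 * hourly_rate_usd

cost_per_song(5, 0.32)    # YuE on L40S spot      -> ~$0.027
cost_per_song(2, 0.32)    # ACE-Step on L40S spot -> ~$0.011
cost_per_song(0.5, 0.32)  # MusicGen on L40S spot -> ~$0.003
```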
Suno Pro at $96/year covers 500 songs per month (2,500 credits at 5 credits per song): $0.016/song at that plan tier, assuming you use the full allocation.
| Model | GPU | Mode | Generation Time | Cost Per Song | Cost Per Min Audio |
|---|---|---|---|---|---|
| YuE 7B | L40S | Spot | ~5 min | ~$0.027 | ~$0.009 |
| YuE 7B | L40S | On-demand | ~5 min | ~$0.060 | ~$0.020 |
| ACE-Step 3.5B | L40S | Spot | ~2 min | ~$0.011 | ~$0.022 |
| MusicGen Stereo | L40S | Spot | ~30 sec | ~$0.003 | ~$0.006 |
| Suno Pro (500/mo) | N/A | N/A | instant | ~$0.016 | ~$0.005 |
The math has a nuance. At low volumes (under 500 songs/month), Suno Pro at $0.016/song is cheaper per song than self-hosted YuE on L40S spot at $0.027/song. The case for self-hosting at lower volumes is not cost — it is commercial rights (YuE and ACE-Step are Apache 2.0; Suno restricts commercial use of generated audio) and pipeline control. Suno Pro caps at 500 songs per month, so any production pipeline that needs more has to move to a higher-cost plan. At that scale, self-hosting on spot instances gives unlimited generation, full output ownership, and marginal cost that does not increase with volume.
Pricing fluctuates based on GPU availability. The prices above are based on 26 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Scaling with Ray Serve and Autoscaling
The Ray worker pattern from the batch pipeline section pairs directly with Ray Serve's autoscaling config. The key fields are min_replicas, max_replicas, and target_num_ongoing_requests_per_replica:
from diffusers import ACEStepPipeline
import torch
import io
import soundfile as sf
from ray import serve
from starlette.responses import Response

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        "target_num_ongoing_requests_per_replica": 2,
    },
    ray_actor_options={"num_gpus": 1},
)
class MusicGenService:
    def __init__(self):
        self.pipe = ACEStepPipeline.from_pretrained(
            "ACE-Step/ACE-Step-v1-3.5B",
            torch_dtype=torch.float16
        ).to("cuda")

    async def __call__(self, request):
        data = await request.json()
        with torch.no_grad():
            audio = self.pipe(
                prompt=data["prompt"],
                duration=data.get("duration", 30.0)
            ).audios[0]
        buf = io.BytesIO()
        sf.write(buf, audio.T, 44100, format="WAV")
        buf.seek(0)
        return Response(content=buf.read(), media_type="audio/wav")

When queue depth rises above target_num_ongoing_requests_per_replica, Ray Serve adds replicas up to max_replicas. Spheron spot instances are ideal for the burst replicas: provision on demand, run until the queue clears, release. See the Ray Serve GPU cloud deployment guide for the full deployment config including HTTP ingress and model warm-up.
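Launching the deployment is two lines with Ray Serve's 2.x API; a minimal sketch, assuming it runs in the same file that defines MusicGenService:

```python
from ray import serve

serve.run(MusicGenService.bind())  # serves on http://127.0.0.1:8000/ by default

# Client side:
#   import requests
#   wav = requests.post("http://127.0.0.1:8000/",
#                       json={"prompt": "lofi beat", "duration": 30.0}).content
```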
For VRAM planning across concurrent workers, the same calculation logic applies here as for LLMs: model weights plus inference state plus activation buffers. The GPU memory requirements for LLMs guide has the formula and worked examples you can adapt for music model sizing.
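A minimal version of that formula for the models in this post. The weights term is exact at 2 bytes per parameter in fp16; the flat overhead allowances are assumptions chosen to reproduce the table's minimums, not measured numbers:

```python
def vram_per_worker_gb(params_billion: float, overhead_gb: float) -> float:
    # fp16 weights (2 bytes/param) + inference state + activation buffers
    return params_billion * 2 + overhead_gb

vram_per_worker_gb(6.0, overhead_gb=4)  # YuE: ~16GB (KV cache grows with track length)
vram_per_worker_gb(3.3, overhead_gb=5)  # MusicGen Stereo: ~12GB
vram_per_worker_gb(3.5, overhead_gb=1)  # ACE-Step: ~8GB (diffusion: no KV cache)
```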
Related Guides
- Deploy Open-Source TTS on GPU Cloud for voice narration pipelines alongside music generation.
- Voice AI GPU Infrastructure for full audio pipeline architecture covering ASR, LLM, and TTS.
- Ray Serve on GPU Cloud for production autoscaling setup.
- GPU Memory Requirements for LLMs for VRAM planning methodology applicable to music models.
Music-tech teams moving off Suno and Udio to self-hosted pipelines are running YuE and ACE-Step on L40S spot instances at a fraction of subscription costs. Rent an L40S → | Rent an H200 → | View all GPU pricing →
