In mid-2025, Suno and Udio both faced copyright infringement lawsuits from major record labels. The fallout changed how many music-tech teams think about hosted AI music services: if the platform can be sued into settlement or shutdown, what happens to your pipeline? Self-hosted open-source models became the safe answer.
This post covers the four models worth deploying in 2026: YuE 7B, ACE-Step 3.5B, MusicGen Stereo, and Stable Audio Open 1.5. You'll get exact VRAM numbers, working deployment code, cost-per-song math using live Spheron pricing, and a batch pipeline design for catalog-scale production. This is distinct from the voice AI GPU infrastructure guide, which covers TTS and ASR pipelines. Music generation runs differently: longer generation windows, different VRAM profiles, and batch-oriented rather than real-time workloads.
For teams adding voice narration alongside generative music, the open-source TTS deployment guide covers Kokoro, Fish Speech, and Hume TADA with similar production detail.
The 2026 Open-Source Music AI Models Worth Deploying
Four models are worth considering for production deployment. Each targets a different use case and VRAM tier.
YuE 7B (m-a-p/YuE-s1-7B-anneal-en-cot) is an autoregressive transformer trained by Multimodal Art Projection. It generates full songs with vocal melodies following a provided lyric sheet, at 44.1kHz stereo output. Minimum VRAM is 16GB at fp16. Track duration is 3-4 minutes. License is Apache 2.0, which means commercial use is permitted. This is the strongest model for anything requiring lyrics and full song structure.
ACE-Step 3.5B (ACE-Step/ACE-Step-v1-3.5B) uses a diffusion architecture rather than autoregressive generation. The practical benefit is faster iteration and more direct style steering through the prompt. Minimum VRAM is 8GB, making it the only model here that fits on a 12GB consumer GPU. Output duration is configurable. License is Apache 2.0. It is newer and has less community benchmarking than YuE, so treat performance estimates as preliminary.
MusicGen Stereo (facebook/musicgen-stereo-large) from Meta generates stereo background music from a text prompt. Output caps at around 30 seconds per call, though you can chain calls for longer tracks. VRAM requirement is 12GB at fp16. License is CC BY-NC 4.0, which restricts commercial use. Best for background music, loop libraries, and non-commercial production tools.
Stable Audio Open 1.5 (stabilityai/stable-audio-open-1.5) is optimized for sound design and textural audio rather than songs. The 47-second output limit makes it unsuitable for full tracks, but ideal for sample generation, SFX, and ambient textures. VRAM is 12GB. License is the Stability AI Community License, which allows commercial use for projects under a revenue threshold; verify the current terms on the model card before shipping.
| Model | Params | VRAM (fp16) | Max Duration | Sample Rate | License | Best For |
|---|---|---|---|---|---|---|
| YuE | 6B (named 7B in the model ID) | 16GB | 3-4 min | 44.1kHz | Apache 2.0 | Full songs with lyrics |
| ACE-Step | 3.5B | 8GB | configurable | 44.1kHz | Apache 2.0 | Fast iteration, style steering |
| MusicGen Stereo | 3.3B | 12GB | ~30 seconds | 32kHz | CC BY-NC 4.0 | Background music loops |
| Stable Audio Open 1.5 | ~1.1B | 12GB | 47 seconds | 44.1kHz | SA Community | Sound design, textures |
GPU Selection: Matching Hardware to Workload
Three tiers cover the range from hobbyist to catalog-scale production.
RTX 5090 (32GB, $0.86/hr on-demand) covers single-stream YuE and can run ACE-Step and MusicGen simultaneously on the same GPU. For developers building tools or exploring models before committing to production infrastructure, this is the lowest-cost entry point. See the Spheron getting-started guide for SSH setup and instance launch steps. RTX 5090 rental
L40S (48GB, $0.72/hr on-demand, $0.32/hr spot) is the studio tier. Two concurrent YuE workers fit in 48GB with comfortable headroom for CUDA buffers; a third fits only if per-worker memory stays tight, since 3 × 16GB consumes the full card. At $0.32/hr spot, this is the best cost-per-song option for teams generating 100-500 tracks per day. L40S GPU rental
B200 (192GB, $2.06/hr spot, on-demand not currently available) handles 11+ concurrent YuE workers or a multi-model serving setup where YuE, ACE-Step, and MusicGen all run from the same instance. At full worker utilization, the per-song cost drops below the L40S single-worker cost; the sizing sketch after the table shows the arithmetic. B200 GPU rental
| GPU | VRAM | YuE Workers | ACE-Step Workers | MusicGen Workers | On-Demand $/hr | Spot $/hr |
|---|---|---|---|---|---|---|
| RTX 5090 | 32GB | 1 | 4 | 2 | $0.86 | N/A |
| L40S | 48GB | 2-3 | 5-6 | 3-4 | $0.72 | $0.32 |
| H200 SXM | 141GB | 8+ | 16+ | 11+ | $5.58 | $1.19 |
| B200 | 192GB | 11+ | 22+ | 15+ | N/A | $2.06 |
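The worker counts in the table fall out of a simple division: usable VRAM over per-worker footprint. A rough sizing sketch, assuming a ~2GB per-GPU CUDA context allowance (an assumption, not a measured number — benchmark before committing):

```python
CUDA_OVERHEAD_GB = 2  # assumed context/fragmentation allowance per GPU
PER_WORKER_GB = {"YuE": 16, "ACE-Step": 8, "MusicGen": 12}  # fp16 minimums from the model table

def max_workers(gpu_vram_gb: int, model: str) -> int:
    # Floor-divide the VRAM left after CUDA overhead by one worker's footprint
    return (gpu_vram_gb - CUDA_OVERHEAD_GB) // PER_WORKER_GB[model]

for gpu, vram in [("RTX 5090", 32), ("L40S", 48), ("H200 SXM", 141), ("B200", 192)]:
    print(gpu, {m: max_workers(vram, m) for m in PER_WORKER_GB})
```

The results land within a worker of the table's counts; real limits move with batch size and sequence length, so treat these as starting points.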
Deployment Recipes
YuE 7B with transformers
git clone https://github.com/multimodal-art-projection/YuE
cd YuE
pip install -r requirements.txt
huggingface-cli download m-a-p/YuE-s1-7B-anneal-en-cot --local-dir checkpoints/yue-7b

Run inference with fp16 precision:
python infer.py \
--stage1_model checkpoints/yue-7b \
--genre "pop rock" \
--lyrics path/to/lyrics.txt \
--output_dir ./outputs \
--torch_dtype float16

For repeated batch requests, enable torch.compile to get a 15-20% throughput improvement:
TORCH_COMPILE=1 python infer.py --stage1_model checkpoints/yue-7b ...

ACE-Step with diffusers pipeline
from diffusers import ACEStepPipeline
import torch

pipe = ACEStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    prompt="upbeat electronic with piano, 120bpm",
    duration=30.0
).audios[0]

For batch generation, clear the CUDA cache between jobs to avoid OOM errors accumulating across runs:
import torch

with torch.no_grad():
    for i, prompt in enumerate(prompts):
        audio = pipe(prompt=prompt, duration=30.0).audios[0]
        save_audio(audio, f"output_{i}.wav")  # save_audio: your own WAV-writing helper
        torch.cuda.empty_cache()

MusicGen Stereo with transformers
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy

processor = AutoProcessor.from_pretrained("facebook/musicgen-stereo-large")
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-stereo-large"
).to("cuda")

inputs = processor(
    text=["ambient piano background, cinematic, slow"],
    padding=True,
    return_tensors="pt",
).to("cuda")

audio_values = model.generate(**inputs, max_new_tokens=1500)

scipy.io.wavfile.write(
    "output.wav",
    rate=model.config.audio_encoder.sampling_rate,
    data=audio_values[0].cpu().numpy().T
)

max_new_tokens=1500 produces roughly 30 seconds. Scale up to 3000 for ~60 seconds, though quality can degrade past 30 seconds with MusicGen.
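MusicGen's EnCodec codes run at 50 tokens per second of audio, so a target duration converts to max_new_tokens with a single multiplication:

```python
MUSICGEN_TOKENS_PER_SECOND = 50  # EnCodec frame rate used by MusicGen

def duration_to_tokens(seconds: float) -> int:
    # max_new_tokens for a desired clip length
    return int(seconds * MUSICGEN_TOKENS_PER_SECOND)

duration_to_tokens(30)  # 1500, as above
duration_to_tokens(60)  # 3000
```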
Sliding window for long-form generation
MusicGen caps cleanly at 30 seconds per call. For longer tracks, chain calls with a 10-second overlap: generate the first 30 seconds, then generate the next 30 using the last 10 seconds of the previous clip as a conditioning prefix. Trim the overlap and concatenate.
import math

import numpy as np
import soundfile as sf
import torch

SAMPLE_RATE = 32000
OVERLAP_SECONDS = 10
SEGMENT_TOKENS = 1500  # ~30 seconds of new audio at 50 tokens/sec
target_duration_seconds = 120  # 2-minute track; adjust as needed
prompt = "ambient piano background, cinematic, slow"  # any text prompt; reuses processor/model from above

# max_new_tokens counts tokens generated beyond the audio prompt, so each
# segment contributes ~30s of net-new audio once the 10s prompt region is trimmed
num_segments = max(1, math.ceil(target_duration_seconds / 30))

segments = []
conditioning = None
for i in range(num_segments):
    if conditioning is not None:
        inputs = processor(
            text=[prompt],
            audio=conditioning,
            sampling_rate=SAMPLE_RATE,
            padding=True,
            return_tensors="pt"
        ).to("cuda")
    else:
        inputs = processor(text=[prompt], padding=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=SEGMENT_TOKENS)
    audio = output[0].cpu().numpy().T  # (samples, channels)
    overlap_samples = OVERLAP_SECONDS * SAMPLE_RATE
    if i > 0:
        audio = audio[overlap_samples:]  # drop the re-decoded prompt region
    segments.append(audio)
    conditioning = output[0].cpu().numpy()[:, -overlap_samples:]  # last 10s, (channels, samples)

full_track = np.concatenate(segments)
sf.write("long_track.wav", full_track, SAMPLE_RATE)Batch Generation Pipeline for Catalog Production
Batch Generation Pipeline for Catalog Production

For generating hundreds or thousands of tracks, Ray gives you per-worker GPU isolation and queue-based dispatch.
import ray
from diffusers import ACEStepPipeline
import torch

ray.init()

@ray.remote(num_gpus=0.25)
class MusicWorker:
    def __init__(self, model_id):
        self.pipe = ACEStepPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16
        ).to("cuda")

    def generate(self, prompt, duration):
        with torch.no_grad():
            audio = self.pipe(prompt=prompt, duration=duration).audios[0]
        torch.cuda.empty_cache()
        return audio

# Deploy 4 workers sharing a single L40S (0.25 GPU each, 4 × ~8GB fits in 48GB VRAM)
workers = [MusicWorker.remote("ACE-Step/ACE-Step-v1-3.5B") for _ in range(4)]

# Dispatch prompts from a list
futures = [
    workers[i % len(workers)].generate.remote(prompt, 30.0)
    for i, prompt in enumerate(prompts)
]
results = ray.get(futures)

Throughput math: benchmark your actual generation speed first. Divide 3600 by seconds-per-track to get tracks-per-hour per worker, then multiply by worker count. With MusicGen Stereo at 30-second clips on an H200 at roughly 4-6x realtime, each clip takes 5-7.5 seconds of compute, or about 480-720 clips per hour per worker.
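The same math as a sketch, with the H200 estimate above plus one illustrative L40S figure (the 43 seconds-per-clip number is an assumption chosen to show how the 12-worker sizing in the next paragraph works out — substitute your own benchmark):

```python
def clips_per_hour(seconds_per_clip: float, workers: int = 1) -> float:
    # 3600 seconds per hour divided by wall-clock time per clip, times worker count
    return 3600 / seconds_per_clip * workers

clips_per_hour(7.5)             # ~480/hr/worker (4x realtime on 30s clips, H200)
clips_per_hour(5.0)             # ~720/hr/worker (6x realtime, H200)
clips_per_hour(43, workers=12)  # ~1,000/hr across 3 L40S instances, 4 workers each
```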
For 1,000 tracks/hour with MusicGen Stereo 30-second clips: 3 L40S instances with 4 workers each (12 workers) gets you there with margin. See the Ray Serve GPU cloud deployment guide for the full autoscaling config and serving layer setup.
Streaming Inference for DAW Integration
Autoregressive models like YuE generate the full audio before any output is available. You cannot stream mid-generation. ACE-Step's diffusion architecture allows chunked generation: the first 10-second segment can play while the next generates, making it better suited for interactive DAW workflows.
The practical architecture: run a FastAPI server on the GPU instance, and connect to it via OSC or a VST bridge in your DAW (Max for Live or a Reaper script both work). The generation buffer should be 30-60 seconds ahead of playback.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from diffusers import ACEStepPipeline
import torch
import io
import soundfile as sf

app = FastAPI()

pipe = ACEStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    torch_dtype=torch.float16
).to("cuda")

@app.post("/generate")
def generate(prompt: str, duration: float = 10.0):
    with torch.no_grad():
        audio = pipe(prompt=prompt, duration=duration).audios[0]
    torch.cuda.empty_cache()
    buf = io.BytesIO()
    sf.write(buf, audio.T, 44100, format="WAV")
    buf.seek(0)
    return StreamingResponse(buf, media_type="audio/wav")

For sub-5-second UI previews, use ACE-Step at 5-second durations. At 3.5B parameters on an L40S, generation runs at roughly 2-4x realtime, so a 5-second preview takes about 1.5-2.5 seconds of compute and arrives inside a 5-second budget even with request overhead.
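A client-side smoke test for the endpoint, assuming the server listens on port 8000 (the hostname is a placeholder; prompt and duration travel as query parameters because the handler declares them as plain function arguments):

```python
import requests

resp = requests.post(
    "http://gpu-host:8000/generate",  # placeholder hostname
    params={"prompt": "soft ambient pad, sparse", "duration": 5.0},
    timeout=60,
)
with open("preview.wav", "wb") as f:
    f.write(resp.content)
```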
Copyright-Clean Training Data Considerations
This section covers what the model licenses and papers say. It is not legal advice. Consult counsel before commercial deployment.
YuE was trained on original data curated by Multimodal Art Projection, per the paper. ACE-Step's training data composition is described in its ACL paper as licensed audio. Both use Apache 2.0 licenses, which allow commercial use of the model itself. The training data provenance is a separate question from the model license, and independent legal review is appropriate before shipping commercial products.
MusicGen was trained on Meta's internal licensed music library and is not commercially licensed for output use (CC BY-NC 4.0). Using MusicGen output in a commercial product violates the license regardless of self-hosting.
Stable Audio Open 1.5 was trained on Freesound data with CC0, CC BY, and CC BY-NC licenses. The Stability AI Community License permits commercial use under a revenue threshold. Check the current terms on the model card before shipping.
Self-hosting does not change the legal status of model output. The model's training data and license terms govern commercial viability. When in doubt, use YuE or ACE-Step (Apache 2.0) and get legal sign-off before building a commercial product around any AI-generated audio.
Cost Per Minute of Generated Audio
Live pricing from the Spheron API (fetched 26 Apr 2026):
- L40S PCIe: $0.72/hr on-demand, $0.32/hr spot
- H200 SXM5: $5.58/hr on-demand, $1.19/hr spot
- RTX 5090: $0.86/hr on-demand
YuE 7B generates a 3-minute track in roughly 5 minutes of GPU time on an L40S (estimated based on model architecture and community reports; actual throughput varies with sequence length and hardware). That gives:
- Spot: 5/60 hr * $0.32 = $0.027/song ($0.009/min of audio)
- On-demand: 5/60 hr * $0.72 = $0.060/song ($0.020/min of audio)
ACE-Step at 30-second output, roughly 2 minutes GPU time on L40S spot (conservative estimate; actual time varies with diffusion step count and may be faster):
- Spot: 2/60 hr * $0.32 = $0.011/song ($0.022/min of audio)
MusicGen Stereo at 30-second output, roughly 30 seconds GPU time on L40S spot:
- Spot: 0.5/60 hr * $0.32 = $0.003/song ($0.006/min of audio)
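The per-song arithmetic generalizes to a one-liner worth keeping around for your own benchmarked generation times:

```python
def cost_per_song(gpu_minutes: float, hourly_rate_usd: float) -> float:
    # Fraction of an hour consumed times the hourly rental rate
    return gpu_minutes / 60 * hourly_rate_usd

cost_per_song(5, 0.32)    # YuE on L40S spot      -> ~$0.027
cost_per_song(2, 0.32)    # ACE-Step on L40S spot -> ~$0.011
cost_per_song(0.5, 0.32)  # MusicGen on L40S spot -> ~$0.003
```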
Suno Pro at $96/year covers 500 songs per month (2,500 credits at 5 credits per song): $0.016/song at that plan tier, assuming you use the full allocation.
| Model | GPU | Mode | Generation Time | Cost Per Song | Cost Per Min Audio |
|---|---|---|---|---|---|
| YuE 7B | L40S | Spot | ~5 min | ~$0.027 | ~$0.009 |
| YuE 7B | L40S | On-demand | ~5 min | ~$0.060 | ~$0.020 |
| ACE-Step 3.5B | L40S | Spot | ~2 min | ~$0.011 | ~$0.022 |
| MusicGen Stereo | L40S | Spot | ~30 sec | ~$0.003 | ~$0.006 |
| Suno Pro (500/mo) | N/A | N/A | instant | ~$0.016 | ~$0.005 |
The math has a nuance. At low volumes (under 500 songs/month), Suno Pro at $0.016/song is cheaper per song than self-hosted YuE on L40S spot at $0.027/song. The case for self-hosting at lower volumes is not cost — it is commercial rights (YuE and ACE-Step are Apache 2.0; Suno restricts commercial use of generated audio) and pipeline control. Suno Pro caps at 500 songs per month, so any production pipeline that needs more has to move to a higher-cost plan. At that scale, self-hosting on spot instances gives unlimited generation, full output ownership, and marginal cost that does not increase with volume.
Pricing fluctuates based on GPU availability. The prices above are based on 26 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Scaling with Ray Serve and Autoscaling
The Ray worker pattern from the batch pipeline section pairs directly with Ray Serve's autoscaling config. The key fields are min_replicas, max_replicas, and target_num_ongoing_requests_per_replica:
from diffusers import ACEStepPipeline
import torch
import io
import soundfile as sf
from ray import serve
from starlette.responses import Response

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        "target_num_ongoing_requests_per_replica": 2,
    },
    ray_actor_options={"num_gpus": 1},
)
class MusicGenService:
    def __init__(self):
        self.pipe = ACEStepPipeline.from_pretrained(
            "ACE-Step/ACE-Step-v1-3.5B",
            torch_dtype=torch.float16
        ).to("cuda")

    async def __call__(self, request):
        data = await request.json()
        with torch.no_grad():
            audio = self.pipe(
                prompt=data["prompt"],
                duration=data.get("duration", 30.0)
            ).audios[0]
        buf = io.BytesIO()
        sf.write(buf, audio.T, 44100, format="WAV")
        buf.seek(0)
        return Response(content=buf.read(), media_type="audio/wav")

When queue depth rises above target_num_ongoing_requests_per_replica, Ray Serve adds replicas up to max_replicas. Spheron spot instances are ideal for the burst replicas: provision on demand, run until the queue clears, release. See the Ray Serve GPU cloud deployment guide for the full deployment config including HTTP ingress and model warm-up.
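Launching the deployment is two lines with Ray Serve's 2.x API; a minimal sketch, assuming it runs in the same file that defines MusicGenService:

```python
from ray import serve

serve.run(MusicGenService.bind())  # serves on http://127.0.0.1:8000/ by default

# Client side:
#   import requests
#   wav = requests.post("http://127.0.0.1:8000/",
#                       json={"prompt": "lofi beat", "duration": 30.0}).content
```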
For VRAM planning across concurrent workers, the same calculation logic applies here as for LLMs: model weights plus inference state plus activation buffers. The GPU memory requirements for LLMs guide has the formula and worked examples you can adapt for music model sizing.
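A minimal version of that formula for the models in this post. The weights term is exact at 2 bytes per parameter in fp16; the flat overhead allowances are assumptions chosen to reproduce the table's minimums, not measured numbers:

```python
def vram_per_worker_gb(params_billion: float, overhead_gb: float) -> float:
    # fp16 weights (2 bytes/param) + inference state + activation buffers
    return params_billion * 2 + overhead_gb

vram_per_worker_gb(6.0, overhead_gb=4)  # YuE: ~16GB (KV cache grows with track length)
vram_per_worker_gb(3.3, overhead_gb=5)  # MusicGen Stereo: ~12GB
vram_per_worker_gb(3.5, overhead_gb=1)  # ACE-Step: ~8GB (diffusion: no KV cache)
```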
Related Guides
- Deploy Open-Source TTS on GPU Cloud for voice narration pipelines alongside music generation.
- Voice AI GPU Infrastructure for full audio pipeline architecture covering ASR, LLM, and TTS.
- Ray Serve on GPU Cloud for production autoscaling setup.
- GPU Memory Requirements for LLMs for VRAM planning methodology applicable to music models.
Music-tech teams moving off Suno and Udio to self-hosted pipelines are running YuE and ACE-Step on L40S spot instances at a fraction of subscription costs. Rent an L40S → | Rent an H200 → | View all GPU pricing →
