Single-model inference has a quality ceiling. Past a certain point, a 70B model running alone stops improving no matter how you tune it. What changes the trajectory is running multiple models on the same query and letting a separate model synthesize their outputs. That's Mixture of Agents (MoA), not to be confused with Mixture of Experts (MoE), which is a different architecture entirely.
MoE is internal to a single model: tokens are routed to specialized sub-networks at inference time. MoA is a multi-model system: N independent LLMs each generate a complete response, and a separate aggregator LLM synthesizes them into a final answer. The Together AI paper (arXiv:2406.04692) showed a 6-proposer MoA outperforming GPT-4 Omni on AlpacaEval 2.0 despite each individual proposer being weaker.
This guide covers everything you need to deploy a production MoA stack on GPU cloud: architecture, GPU sizing for 4-6 concurrent models, a full vLLM reference implementation, latency/cost math, and when MoA is actually the right choice. See our LLM inference router guide for single-model routing; this guide covers what happens when you run all of them at once.
What Is Mixture of Agents (And What It Isn't)
MoA is an inference architecture, not a model or a training technique. A fixed set of proposer LLMs each independently answer the same query. An aggregator LLM takes all their responses as context and produces a single synthesized output. The proposers never see each other's outputs. Only the aggregator does.
The key insight is that diversity matters more than raw capability. A proposer trained on different data with a different RLHF recipe will produce systematically different responses to the same question. The aggregator can then identify the parts each proposer got right and synthesize them into something better than any individual answer.
Here's how MoA compares to related concepts:
| Concept | What It Does | Runs At | Related Post |
|---|---|---|---|
| Mixture of Agents (MoA) | N full LLMs generate responses, one aggregator synthesizes them | Inference time, multi-model | This post |
| Mixture of Experts (MoE) | One model routes tokens to specialized sub-networks internally | Inside a single forward pass | - |
| Multi-agent system | Agents with tools take turns reasoning and acting | Multiple inference steps, orchestrated | - |
| LLM-as-judge | A separate LLM scores or compares outputs for evaluation | Eval time, not production serving | LLM-as-judge guide |
MoA is a quality multiplier for inference, not an evaluation mechanism or a routing strategy.
MoA Architecture: Proposers, Aggregators, and Layered Refinement
The data flow for a single MoA query:
User Query
|
v
+----+----+----+----+
| P1 | P2 | P3 | P4 | <-- Proposers run in parallel
+----+----+----+----+
| | | |
+----+----+----+
|
v
Aggregator
|
v
   Final Output

All proposers receive the same query and run concurrently. Total latency is determined by the slowest proposer, not the sum. Once all proposer responses arrive, the aggregator processes them as a single long input and produces the final response.
Single-layer vs multi-layer MoA. The original Together AI paper used 3 refinement layers: proposers feed an aggregator, whose output is fed back as context to the same proposers for a second round, and so on. In practice, one layer covers most production use cases and keeps latency predictable. Multi-layer MoA roughly triples latency for gains of a few tenths of a point on MT-Bench - worth it for high-stakes async tasks, not worth it for interactive serving.
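If you do adopt layered refinement, the control flow is a short loop. A minimal sketch of the feed-the-synthesis-back pattern described above; `call_proposer` and `call_aggregator` here are stand-ins for the endpoint helpers in the reference implementation later in this guide:

```python
# Minimal layered-refinement loop. call_proposer / call_aggregator are
# stand-ins for the endpoint helpers in the reference implementation below.
import asyncio

async def multi_layer_moa(query: str, endpoints: list[str], num_layers: int = 3) -> str:
    answer = ""
    for layer in range(num_layers):
        # Later layers see the previous layer's synthesis as extra context
        prompt = query if layer == 0 else (
            f"{query}\n\nA previous draft answer for reference:\n{answer}"
        )
        # All proposers within a layer run concurrently
        drafts = await asyncio.gather(*(call_proposer(ep, prompt) for ep in endpoints))
        answer = await call_aggregator(query, drafts)  # synthesize this layer
    return answer
```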
The aggregator prompt template. The prompt structure that works in production:
AGGREGATOR_SYSTEM_PROMPT = """You are a synthesis model. You will receive several independent responses to the same question from different AI models. Your task is to synthesize them into a single, high-quality response.
Instructions:
- Identify the strongest points from each response
- Resolve any factual conflicts by reasoning from first principles
- Do not mention that you are synthesizing multiple responses
- Output only the final synthesized answer
Proposer responses:
{proposer_outputs}
Question: {query}
Synthesize the above responses into a single, complete answer:"""

Why model diversity matters. A proposer trained by Meta (Llama), one trained by Alibaba (Qwen), and one trained by Mistral AI will have different strengths, failure modes, and knowledge gaps from their distinct data mixtures and RLHF recipes. Running three identical models produces three nearly identical outputs and wastes compute. Running three diverse architectures produces genuinely different responses that the aggregator can synthesize into something better than any single answer.
Quality Benchmarks: MoA vs Single Frontier Models
The Together AI paper (arXiv:2406.04692) benchmarked several MoA configurations against single frontier models. These numbers reflect models available at publication time (mid-2024), so absolute scores will be higher with 2026 model generations, but the relative ordering is directionally accurate.
| Configuration | AlpacaEval 2.0 LC Win Rate | MT-Bench Score | Notes |
|---|---|---|---|
| GPT-4 Omni (single) | 57.5% (baseline) | 9.32 | Single model comparison point |
| Claude 3 Opus (single) | 40.5% | 9.00 | Strong single model |
| MoA 6 proposers | 65.1% | 9.65 | 6 proposers, beats GPT-4 Omni |
On AlpacaEval 2.0, the 6-proposer MoA beat GPT-4 Omni by ~7.6 percentage points despite each individual proposer being weaker. The MT-Bench improvement is smaller (9.32 to 9.65) but consistent.
Where MoA falls short. Two task categories where single large models outperform MoA:
- Creative long-form writing. When you need a coherent narrative voice across 2,000 words, aggregating outputs from 4 proposers that each have different styles creates obvious seams. A single 72B model with good instruction following wins here.
- Complex multi-step code generation. A function that spans 200 lines with internal consistency requirements is hard to aggregate. Proposers generate valid but incompatible implementations. The aggregator's attempt to synthesize them often produces code that compiles but has logic errors.
For structured factual Q&A, summarization, classification, and reasoning tasks, MoA's quality advantage is reliable and consistent.
GPU Footprint Planning
Before provisioning, work out VRAM requirements per model:
| Model Size | GPU Fit | VRAM Used | Recommended GPU |
|---|---|---|---|
| 7B FP16 | 1x RTX 4090 (24GB) | ~14GB | RTX 4090 |
| 7B FP8 | 1x RTX 4090 (24GB) | ~7GB | RTX 4090 |
| 14B FP8 | 1x RTX 4090 (24GB) | ~14GB | RTX 4090 |
| 30B A3B MoE FP8 | 1x A100 80GB | ~30GB (all experts stay resident; A3B cuts compute, not VRAM) | A100 80G |
| 70B FP8 | 1x H100 80GB | ~70GB (tight; cap context length) | H100 SXM5 |
| 72B FP8 | 1x H100 80GB (tight) or H200 | ~72GB | H200 SXM5 |
The aggregator VRAM requirement grows with the number of proposers. A 4-proposer stack where each proposer outputs 500 tokens adds 2,000 tokens to the aggregator's input context. For a 72B aggregator at FP8, an H200 (141GB HBM3e) gives comfortable headroom for both model weights and the enlarged KV cache. See the KV cache optimization guide for cache sizing and eviction strategies.
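For a quick pre-provisioning sanity check: weight VRAM is roughly parameter count times bytes per parameter, and KV cache scales with layers, KV heads, context length, and concurrency. A back-of-the-envelope sketch (it assumes a standard GQA transformer; exact figures vary by architecture and serving stack):

```python
def estimate_vram_gb(
    params_b: float,         # parameters, in billions
    bytes_per_param: float,  # 2.0 for FP16, 1.0 for FP8
    num_layers: int,
    kv_heads: int,           # GQA key/value heads
    head_dim: int,
    context_len: int,
    batch_size: int,
    kv_bytes: float = 1.0,   # FP8 KV cache
) -> float:
    weights = params_b * bytes_per_param
    # KV cache: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes
    kv = 2 * num_layers * kv_heads * head_dim * context_len * batch_size * kv_bytes / 1e9
    return weights + kv

# Llama-3.3-70B at FP8: 80 layers, 8 KV heads (GQA), head_dim 128, 32K context
print(estimate_vram_gb(70, 1.0, 80, 8, 128, 32768, batch_size=1))  # ~75GB: tight on one 80GB H100
```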
Recommended hardware tiers for production:
| Tier | Proposers | Aggregator | GPU Configuration | On-Demand Cost/hr | Est. Max Concurrent Requests |
|---|---|---|---|---|---|
| Budget | 3x RTX 4090 (7B FP8 each) | 1x A100 80G (32B FP8) | 4 GPU nodes total | ~$3.35/hr | 20-40 |
| Standard | 4x A100 80G (32B FP8 each) | 1x H100 SXM5 (70B FP8) | 5 GPU nodes total | ~$8.37/hr | 60-100 |
| Production | 4x H100 SXM5 (70B FP8 each) | 1x H200 SXM5 (72B FP8) | 5 GPU nodes total | ~$21.86/hr | 150-250 |
Prices above use live Spheron rates: RTX 4090 $0.77/hr, A100 80G PCIe $1.04/hr, H100 SXM5 $4.21/hr, H200 SXM5 $5.02/hr. For a comparison of Dedicated (on-demand) vs Spot instance availability on Spheron, see the instance types guide.
Reference Implementation on Spheron
Provision one on-demand Spheron instance per proposer model, sized per the table above (RTX 4090 and A100 for the smaller proposers, H100 for the 70B), plus an H200 for the aggregator. SSH into each node and launch the relevant vLLM server. If you haven't chosen a serving framework yet, see the Ollama vs vLLM comparison for a breakdown of when each makes sense, or the Spheron LLM deployment guide for step-by-step container setup on Spheron instances.
Proposer deployments:
# Proposer 1 - Llama-3.1-8B-Instruct on RTX 4090
docker run --gpus all --rm -p 8001:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--max-model-len 16384

# Proposer 2 - Qwen3-30B-A3B on A100 80G (~30GB of FP8 weights: all experts stay resident)
docker run --gpus all --rm -p 8002:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-30B-A3B \
--quantization fp8 \
--max-model-len 16384

# Proposer 3 - Mistral Small 3.2 24B on A100 80G
docker run --gpus all --rm -p 8003:8000 \
vllm/vllm-openai:latest \
--model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
--quantization fp8 \
--max-model-len 32768

# Proposer 4 - Llama-3.3-70B on H100 SXM5
docker run --gpus all --rm -p 8004:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--max-model-len 32768

Aggregator deployment (H200, large context window required):
# Aggregator - Qwen2.5-72B-Instruct on H200 SXM5
# --max-model-len must be large enough to hold all proposer outputs concatenated
# 4 proposers x 500 tokens each + original query + system prompt = ~2,500-3,000 input tokens
docker run --gpus all --rm -p 8010:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-72B-Instruct \
--quantization fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90

FastAPI orchestration layer:
import asyncio
import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
app = FastAPI()
PROPOSER_ENDPOINTS = [
"http://proposer-1-host:8001/v1",
"http://proposer-2-host:8002/v1",
"http://proposer-3-host:8003/v1",
"http://proposer-4-host:8004/v1",
]
PROPOSER_MODELS = [
"meta-llama/Llama-4-Scout-17B-16E-Instruct",
"Qwen/Qwen3-30B-A3B",
"mistralai/Mistral-Small-3.2-24B-Instruct-2506",
"meta-llama/Llama-3.3-70B-Instruct",
]
AGGREGATOR_ENDPOINT = "http://aggregator-host:8010/v1"
AGGREGATOR_MODEL = "Qwen/Qwen3-72B"
AGGREGATOR_SYSTEM_PROMPT = (
"You are a synthesis model. You will receive several independent responses to the "
"same question from different AI models. Synthesize them into a single, high-quality "
"response. Identify the strongest points from each, resolve factual conflicts by "
"reasoning from first principles, and do not mention that you are synthesizing. "
"Output only the final synthesized answer."
)
class ChatRequest(BaseModel):
messages: list[dict]
max_tokens: int = 1024
temperature: float = 0.7
async def call_proposer(client: httpx.AsyncClient, endpoint: str, model: str, request: ChatRequest) -> str:
"""Call a single proposer and return its text response."""
resp = await client.post(
f"{endpoint}/chat/completions",
json={
"model": model,
"messages": request.messages,
"max_tokens": request.max_tokens,
"temperature": request.temperature,
},
timeout=120.0,
)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
@app.post("/v1/chat/completions")
async def moa_chat(request: ChatRequest):
async with httpx.AsyncClient() as client:
# Fan out to all proposers concurrently
proposer_tasks = [
call_proposer(client, endpoint, model, request)
for endpoint, model in zip(PROPOSER_ENDPOINTS, PROPOSER_MODELS)
]
results = await asyncio.gather(*proposer_tasks, return_exceptions=True)
proposer_outputs = [r for r in results if not isinstance(r, BaseException)]
if len(proposer_outputs) < 2:
raise HTTPException(status_code=502, detail="Too few proposers succeeded")
# Build aggregator prompt
formatted_outputs = "\n\n".join(
f"Response {i+1}:\n{output}"
for i, output in enumerate(proposer_outputs)
)
original_query = next(
(m["content"] for m in reversed(request.messages) if m["role"] == "user"),
""
)
aggregator_messages = [
{"role": "system", "content": AGGREGATOR_SYSTEM_PROMPT},
{
"role": "user",
"content": (
f"Proposer responses:\n{formatted_outputs}\n\n"
f"Original question: {original_query}\n\n"
"Synthesize the above into a single, complete answer:"
),
},
]
# Call aggregator
try:
agg_resp = await client.post(
f"{AGGREGATOR_ENDPOINT}/chat/completions",
json={
"model": AGGREGATOR_MODEL,
"messages": aggregator_messages,
"max_tokens": request.max_tokens,
"temperature": 0.3,
},
timeout=180.0,
)
agg_resp.raise_for_status()
except httpx.HTTPStatusError as exc:
raise HTTPException(
status_code=502,
detail=f"Aggregator error: {exc.response.status_code}",
)
except httpx.TimeoutException:
raise HTTPException(status_code=504, detail="Aggregator request timed out")
except httpx.RequestError as exc:
raise HTTPException(status_code=502, detail=f"Aggregator connection error: {exc}")
        return agg_resp.json()

Test with curl:
curl -X POST http://moa-orchestrator:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Explain gradient checkpointing and when to use it."}],
"max_tokens": 512
}'

Latency and Quality Tradeoffs
| Optimization | Latency Impact | Quality Impact | Notes |
|---|---|---|---|
| Parallel proposer calls (baseline) | Latency = slowest proposer | Baseline quality | Default architecture |
| 3 proposers | Moderate p99 | Good quality | Safe starting point |
| 6 proposers | Higher p99 (more variance) | Better quality, diminishing returns past 4 | Slowest proposer dominates |
| FP8 quantization on proposers | -20-30% TTFT | Minimal quality loss | Use for all proposers |
| Semantic cache on proposers | -80-90% on cache hits | No change on cached responses | High-value optimization |
| Multi-layer MoA (1 to 3 layers) | 3x latency | A few tenths of an MT-Bench point | For async, non-interactive tasks only |
The most impactful single optimization is running proposers in parallel (the baseline architecture). Going from sequential to parallel cuts latency by (N-1)/N where N is proposer count. For 4 proposers at 2 seconds each, sequential = 8 seconds; parallel = 2 seconds. For a deep dive on how vLLM achieves throughput gains via continuous batching and paged attention, see the LLM serving optimization guide.
Semantic caching for proposers deserves special attention. Proposer outputs for a given query are reproducible when you fix the sampling seed (even at temperature 0.7), so the same query maps to nearly identical proposer outputs. See the semantic cache guide for full deployment instructions. Adding a cache in front of each proposer endpoint cuts GPU-hours by 60-80% on workloads with any repetition.
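A minimal sketch of the idea: an exact-match, in-process cache keyed on the normalized query, wrapping the orchestrator's `call_proposer`. A production deployment would use embedding-similarity lookup backed by Redis, as the semantic cache guide covers:

```python
import hashlib

# Exact-match in-process cache; production would use embedding similarity + Redis
_proposer_cache: dict[tuple[str, str], str] = {}

def _cache_key(model: str, query: str) -> tuple[str, str]:
    normalized = " ".join(query.lower().split())  # cheap normalization
    return (model, hashlib.sha256(normalized.encode()).hexdigest())

async def cached_call_proposer(client, endpoint: str, model: str, request) -> str:
    """Wraps the orchestrator's call_proposer with a cache check."""
    key = _cache_key(model, request.messages[-1]["content"])
    if key in _proposer_cache:
        return _proposer_cache[key]  # cache hit: no GPU call at all
    result = await call_proposer(client, endpoint, model, request)
    _proposer_cache[key] = result
    return result
```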
Cost Model: MoA vs Single Large Model
Worked example: 100K queries/month, 4-proposer stack vs single 70B model
Assumptions: 500 token average query, 400 token average response, all models at FP8.
Single 70B model on H100 SXM5 ($4.21/hr):
- Tokens per month: 100K queries x 900 tokens = 90M tokens
- At 300 tokens/sec throughput: 90M / 300 = 300,000 GPU-seconds = ~83 GPU-hours
- Monthly cost: 83 x $4.21 = ~$350/month
4-proposer MoA (4x Llama-3.3-70B proposers + 1x Qwen2.5-72B aggregator):
- Each proposer generates 400 tokens per query: 100K x 400 = 40M tokens/proposer
- At 300 tokens/sec on H100: 40M / 300 = 133,333 GPU-seconds = ~37 GPU-hours per proposer
- 4 proposers run concurrently: billed as 4 x 37 = 148 GPU-hours, while elapsed wall-clock time matches a single proposer (~37 hours)
- Aggregator input: ~2,200 tokens (4x400 proposer outputs + 200 original), output 400 tokens
- Aggregator at 250 tokens/sec on H200 (conservatively counting prefill tokens at decode throughput): 100K x 2,600 tokens / 250 = 1,040,000 GPU-seconds = ~289 GPU-hours
- Total billed GPU-hours: 148 (proposers) + 289 (aggregator) = 437 GPU-hours
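The same arithmetic as a small script, so you can plug in your own volumes and rates; the throughput and token figures are the assumptions above, not measurements. The table below summarizes the results.

```python
def moa_monthly_cost(
    queries: int = 100_000,
    out_tokens: int = 400,
    n_proposers: int = 4,
    proposer_tps: float = 300.0,     # assumed decode throughput, tokens/sec
    agg_tps: float = 250.0,
    agg_overhead_tokens: int = 200,  # original query + system prompt
    proposer_rate: float = 4.21,     # H100 SXM5 $/hr
    agg_rate: float = 5.02,          # H200 SXM5 $/hr
) -> float:
    # Proposer GPU-hours: each proposer decodes out_tokens per query
    proposer_hours = queries * out_tokens / proposer_tps / 3600 * n_proposers
    # Aggregator tokens per query: proposer outputs + overhead + its own output
    agg_tokens = n_proposers * out_tokens + agg_overhead_tokens + out_tokens
    agg_hours = queries * agg_tokens / agg_tps / 3600
    return proposer_hours * proposer_rate + agg_hours * agg_rate

print(f"${moa_monthly_cost():,.0f}/month")  # ~ $2,074 with the assumptions above
```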
| Configuration | GPU-Hours/month | Cost/month | Quality Level |
|---|---|---|---|
| Single 70B (H100 SXM5) | 83 | ~$350 | Strong (baseline) |
| 4-proposer MoA (H100 proposers + H200 aggregator) | 437 | ~$2,074 | Better (+7.6pp AlpacaEval) |
| 4-proposer MoA with H200 spot aggregator | ~437 | ~$967 (H100 proposers on-demand + H200 spot at $1.19/hr) | Better |
| Budget MoA (RTX 4090 proposers + A100 aggregator) | Varies | ~$400-600 | Good |
MoA costs more than a single model at equivalent scale. The value case is where the quality improvement justifies the cost, or where MoA enables a product capability (e.g., high-stakes factual Q&A where accuracy directly drives revenue) that a single model cannot match.
Pricing fluctuates based on GPU availability. The prices above were captured on 08 May 2026 and may have changed. Check current GPU pricing → for live rates.
Production Patterns
When to use MoA vs a single large model:
| Scenario | Recommendation |
|---|---|
| High-stakes factual Q&A | MoA - diversity catches individual model errors |
| Customer support classification | MoA - higher accuracy on edge cases |
| Document summarization | MoA - better coverage of key points |
| Reasoning over structured data | MoA - multiple models catch different logical paths |
| Creative long-form writing | Single large model - coherence requires one voice |
| Complex multi-step code generation | Single large model - consistency across 200+ line functions |
| Classification at scale | Router first, then MoA for borderline cases (see inference router guide and the sketch below) |
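The router-first pattern in the last row is worth sketching: answer with one cheap proposer, use mean token logprob as a confidence proxy, and escalate to the full MoA pass only when it falls below a threshold. A sketch that slots into the orchestrator module above (the threshold value is illustrative and needs tuning on your traffic):

```python
# Drops into the orchestrator module above (reuses its imports and constants).
CONFIDENCE_THRESHOLD = -0.5  # mean token logprob; tune on held-out traffic

async def route_or_moa(client: httpx.AsyncClient, request: ChatRequest) -> str:
    """Single cheap model first; escalate borderline queries to full MoA."""
    resp = await client.post(
        f"{PROPOSER_ENDPOINTS[0]}/chat/completions",
        json={
            "model": PROPOSER_MODELS[0],
            "messages": request.messages,
            "max_tokens": request.max_tokens,
            "logprobs": True,
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    choice = resp.json()["choices"][0]
    token_logprobs = [t["logprob"] for t in choice["logprobs"]["content"]]
    mean_logprob = sum(token_logprobs) / max(1, len(token_logprobs))
    if mean_logprob >= CONFIDENCE_THRESHOLD:
        return choice["message"]["content"]  # confident: skip MoA entirely
    moa_result = await moa_chat(request)     # borderline: full MoA pass
    return moa_result["choices"][0]["message"]["content"]
```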
Fallback paths. If a proposer is unavailable during a request, run MoA with the N-1 proposers that respond. If only one proposer is available, treat it as a single-model request. If the aggregator is unavailable, fall back to the best proposer output, ranked by a small secondary judge model (or by perplexity under a reference model).
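A sketch of that degraded-mode logic, written to slot into the orchestrator module above; `call_aggregator` and the judge-based `rank_outputs` scorer are hypothetical helpers you would implement against your own deployments:

```python
# Drops into the orchestrator module above (reuses its imports and constants).
async def moa_with_fallbacks(client: httpx.AsyncClient, request: ChatRequest) -> str:
    results = await asyncio.gather(
        *(call_proposer(client, ep, model, request)
          for ep, model in zip(PROPOSER_ENDPOINTS, PROPOSER_MODELS)),
        return_exceptions=True,
    )
    outputs = [r for r in results if not isinstance(r, BaseException)]
    if not outputs:
        raise HTTPException(status_code=502, detail="All proposers failed")
    if len(outputs) == 1:
        return outputs[0]  # degrade to single-model serving
    try:
        # call_aggregator: hypothetical helper wrapping the aggregator endpoint
        return await call_aggregator(client, request, outputs)
    except (httpx.HTTPError, HTTPException):
        # Aggregator down: return the output a small judge model scores highest.
        # rank_outputs: hypothetical synchronous scorer (judge score or perplexity).
        return max(outputs, key=rank_outputs)
```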
Semantic caching for proposer outputs. As noted in the latency section, proposer models in a MoA pipeline are ideal cache targets: the same user query maps to near-identical proposer outputs. See the semantic cache guide for a full stack deployment with GPTCache and Redis.
Monitoring in Production
Proposer disagreement metric. Compute the average pairwise cosine similarity between proposer response embeddings for each query. Low similarity (< 0.7) signals high disagreement, where MoA adds the most value. High similarity (> 0.92) signals proposers are converging, possibly because the query has a single obvious answer or because proposer models have drifted toward a common checkpoint.
import numpy as np
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
def compute_mean_similarity(proposer_outputs: list[str]) -> float:
"""Mean pairwise cosine similarity. Higher = more agreement, lower = more diverse."""
if len(proposer_outputs) < 2:
return 1.0
embeddings = embed_model.encode(proposer_outputs, normalize_embeddings=True)
n = len(embeddings)
similarities = []
for i in range(n):
for j in range(i + 1, n):
similarities.append(float(np.dot(embeddings[i], embeddings[j])))
    return float(np.mean(similarities))

Aggregator quality drift. Run a weekly LLM-as-judge pass on a sampled 1% of MoA outputs. Compare aggregator quality scores over time. A declining trend usually means proposer diversity has dropped (proposers have converged) or the aggregator's synthesis quality has degraded.
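A sketch of that weekly job; the judge endpoint, model choice, and 1-10 rubric here are illustrative, not prescriptive (see the LLM-as-judge guide for rubric design):

```python
import random
import httpx

JUDGE_ENDPOINT = "http://judge-host:8011/v1"  # assumed judge deployment
JUDGE_MODEL = "Qwen/Qwen3-32B"                # any capable judge model works

def judge_sample(logged: list[dict], sample_rate: float = 0.01) -> float:
    """Score a sampled slice of logged MoA outputs on a 1-10 scale; track weekly."""
    sample = random.sample(logged, max(1, int(len(logged) * sample_rate)))
    scores = []
    with httpx.Client(timeout=60.0) as client:
        for record in sample:
            resp = client.post(f"{JUDGE_ENDPOINT}/chat/completions", json={
                "model": JUDGE_MODEL,
                "messages": [{"role": "user", "content": (
                    "Rate the answer below from 1 to 10 for accuracy and "
                    "completeness. Reply with only the number.\n\n"
                    f"Question: {record['query']}\n\nAnswer: {record['output']}"
                )}],
                "max_tokens": 4,
                "temperature": 0.0,
            })
            resp.raise_for_status()
            # Assumes the judge complies with the numeric-only instruction
            scores.append(float(resp.json()["choices"][0]["message"]["content"].strip()))
    return sum(scores) / len(scores)
```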
Prometheus metrics to track:
from prometheus_client import Counter, Histogram, Gauge
moa_proposer_latency = Histogram(
"moa_proposer_latency_seconds",
"Latency per proposer call",
["proposer_id"],
buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)
moa_agreement_score = Gauge(
"moa_proposer_agreement_score",
"Mean pairwise cosine similarity across proposer outputs (higher = more agreement, lower = more diverse)"
)
moa_aggregator_latency = Histogram(
"moa_aggregator_latency_seconds",
"Aggregator synthesis latency",
buckets=[1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)
moa_cache_hit_rate = Counter(
"moa_cache_hits_total",
"Proposer cache hits",
["proposer_id"]
)

Track agreement score over rolling 24-hour windows. Sudden rises in mean agreement score across queries signal that proposer models have converged - check for accidental model version alignment across proposers and restore diversity by swapping in a proposer from a different training lineage.
MoA needs multiple models running concurrently - the architecture is a natural fit for on-demand bare-metal GPU access where you pay per hour, not per token. Rent H100 instances on Spheron for your proposers and an H200 for the aggregator without per-token markup.
