What is the difference between Mixture of Agents (MoA) and Mixture of Experts (MoE)?

MoE is a model architecture where a single model routes tokens to specialized sub-networks at inference time - it runs one model internally. MoA is a multi-model system where N fully independent LLMs (proposers) each generate a complete response, and a separate LLM (aggregator) synthesizes them into a final output. MoE reduces per-token compute; MoA increases output quality at the cost of running multiple full models concurrently.

How many GPUs do I need to run a complete MoA stack?

It depends on model size. A minimal 3-proposer MoA with 7B models fits on 3x RTX 4090 (24GB each). A production 4-proposer stack with 70B proposers and a 72B aggregator needs roughly 5-6 H100 80GB nodes (one per 70B model at FP8, two for the aggregator). Budget for the aggregator to have at least as much VRAM as your largest proposer.

Does MoA always outperform a single large model?

On MT-Bench and AlpacaEval 2.0 the Together AI MoA paper showed consistent wins over GPT-4-class single models. However, MoA underperforms on tasks that require tight coherence across a long response - creative writing and complex multi-step coding - where aggregating divergent outputs can introduce inconsistency. It is strongest on factual Q&A, summarization, reasoning tasks, and classification.

How do I reduce the end-to-end latency of a MoA query?

Three levers: (1) run all proposer models in parallel and wait for the last to finish - this caps latency at your slowest proposer, not the sum; (2) use FP8 quantization and continuous batching on each proposer to maximize tokens/sec per GPU; (3) add a semantic cache in front of proposers so repeated or near-duplicate queries return cached outputs instead of generating fresh responses.

What models work best as MoA proposers vs aggregators?

Proposers should be diverse: different architectures or training lineages produce more varied outputs that the aggregator can synthesize meaningfully. Good 2026 proposer combinations: Llama 4 Scout 17B, Qwen 3 30B A3B, Mistral Small 3.2 (24B dense), and Llama-3.3-70B-Instruct. The aggregator needs strong instruction-following and synthesis ability - Llama 4 Maverick 400B A17B, Qwen 3 72B, or Nemotron Ultra 253B work well.

Mixture of Agents (MoA) on GPU Cloud: Deploy Multi-LLM Voting Architectures (2026 Guide)

Single-model inference has a quality ceiling. Past a certain point, a 70B model running alone stops improving no matter how you tune it. What changes the trajectory is running multiple models on the same query and letting a separate model synthesize their outputs. That's Mixture of Agents (MoA), not to be confused with Mixture of Experts (MoE), which is a different architecture entirely.

MoE is internal to a single model: tokens are routed to specialized sub-networks at inference time. MoA is a multi-model system: N independent LLMs each generate a complete response, and a separate aggregator LLM synthesizes them into a final answer. The Together AI paper (arXiv:2406.04692) showed a 6-proposer MoA outperforming GPT-4 Turbo on AlpacaEval 2.0 despite each individual proposer being weaker than GPT-4.

This guide covers everything you need to deploy a production MoA stack on GPU cloud: architecture, GPU sizing for 4-6 concurrent models, a full vLLM reference implementation, latency/cost math, and when MoA is actually the right choice. See our LLM inference router guide for single-model routing; this guide covers what happens when you run all of them at once.

What Is Mixture of Agents (And What It Isn't)

MoA is an inference architecture, not a model or a training technique. A fixed set of proposer LLMs each independently answer the same query. An aggregator LLM takes all their responses as context and produces a single synthesized output. The proposers never see each other's outputs. Only the aggregator does.

The key insight is that diversity matters more than raw capability. A proposer trained on different data with a different RLHF recipe will produce systematically different responses to the same question. The aggregator can then identify the parts each proposer got right and synthesize them into something better than any individual answer.

Here's how MoA compares to related concepts:

Concept	What It Does	Runs At	Related Post
Mixture of Agents (MoA)	N full LLMs generate responses, one aggregator synthesizes them	Inference time, multi-model	This post
Mixture of Experts (MoE)	One model routes tokens to specialized sub-networks internally	Inside a single forward pass	-
Multi-agent system	Agents with tools take turns reasoning and acting	Multiple inference steps, orchestrated	-
LLM-as-judge	A separate LLM scores or compares outputs for evaluation	Eval time, not production serving	LLM-as-judge guide

MoA is a quality multiplier for inference, not an evaluation mechanism or a routing strategy.

MoA Architecture: Proposers, Aggregators, and Layered Refinement

The data flow for a single MoA query:

User Query
     |
     v
+----+----+----+----+
| P1 | P2 | P3 | P4 |   <-- Proposers run in parallel
+----+----+----+----+
     |    |    |    |
     +----+----+----+
              |
              v
         Aggregator
              |
              v
        Final Output

All proposers receive the same query and run concurrently. The total latency is determined by the slowest proposer, not the sum. Once all proposer responses arrive, the aggregator processes them as a single long input and produces the final response.

Single-layer vs multi-layer MoA. The original Together AI paper used 3 refinement layers: proposers feed an aggregator, whose output is fed back as context to the same proposers for a second round, and so on. In practice, one layer covers most production use cases and keeps latency predictable. Multi-layer MoA triples latency for 2-4 MT-Bench point gains - worth it for high-stakes async tasks, not worth it for interactive serving.

The aggregator prompt template. The prompt structure that works in production:

python

AGGREGATOR_SYSTEM_PROMPT = """You are a synthesis model. You will receive several independent responses to the same question from different AI models. Your task is to synthesize them into a single, high-quality response.

Instructions:
- Identify the strongest points from each response
- Resolve any factual conflicts by reasoning from first principles
- Do not mention that you are synthesizing multiple responses
- Output only the final synthesized answer

Proposer responses:
{proposer_outputs}

Question: {query}

Synthesize the above responses into a single, complete answer:"""

Why model diversity matters. A proposer trained by Meta (Llama), one trained by Alibaba (Qwen), and one trained by Mistral AI will have different strengths, failure modes, and knowledge gaps from their distinct data mixtures and RLHF recipes. Running three identical models produces three nearly identical outputs and wastes compute. Running three diverse architectures produces genuinely different responses that the aggregator can synthesize into something better than any single answer.

Quality Benchmarks: MoA vs Single Frontier Models

The Together AI paper (arXiv:2406.04692) benchmarked several MoA configurations against single frontier models. These numbers reflect models available at publication time (mid-2024), so absolute scores will be higher with 2026 model generations, but the relative ordering is directionally accurate.

Configuration	AlpacaEval 2.0 LC Win Rate	MT-Bench Score	Notes
GPT-4 Omni (single)	57.5% (baseline)	9.32	Single model comparison point
Claude 3 Opus (single)	40.5%	9.00	Strong single model
MoA 6 proposers	65.1%	9.65	6-proposer, single layer, beats GPT-4

On AlpacaEval 2.0, the 6-proposer MoA beat GPT-4 Omni by ~7.5 percentage points despite each individual proposer being weaker. The MT-Bench improvement is smaller (9.32 to 9.65) but consistent.

Where MoA falls short. Two task categories where single large models outperform MoA:

Creative long-form writing. When you need a coherent narrative voice across 2,000 words, aggregating outputs from 4 proposers that each have different styles creates obvious seams. A single 72B model with good instruction following wins here.
Complex multi-step code generation. A function that spans 200 lines with internal consistency requirements is hard to aggregate. Proposers generate valid but incompatible implementations. The aggregator's attempt to synthesize them often produces code that compiles but has logic errors.

For structured factual Q&A, summarization, classification, and reasoning tasks, MoA's quality advantage is reliable and consistent.

GPU Footprint Planning

Before provisioning, work out VRAM requirements per model:

Model Size	GPU Fit	VRAM Used	Recommended GPU
7B FP16	1x RTX 4090 (24GB)	~14GB	RTX 4090
7B FP8	1x RTX 4090 (24GB)	~7GB	RTX 4090
14B FP8	1x RTX 4090 (24GB)	~14GB	RTX 4090
30B A3B MoE FP8	1x RTX 4090 (24GB)	~14GB active	RTX 4090 or A100
70B FP8	1x H100 80GB	~40GB	H100 SXM5
72B FP8	1x H100 80GB	~40GB	H100 SXM5 or H200

The aggregator VRAM requirement grows with the number of proposers. A 4-proposer stack where each proposer outputs 500 tokens adds 2,000 tokens to the aggregator's input context. For a 72B aggregator at FP8, an H200 (141GB HBM3e) gives comfortable headroom for both model weights and the enlarged KV cache. See the KV cache optimization guide for cache sizing and eviction strategies.

Recommended hardware tiers for production:

Tier	Proposers	Aggregator	GPU Configuration	On-Demand Cost/hr	Est. Max Concurrent Requests
Budget	3x RTX 4090 (7B FP8 each)	1x A100 80G (32B FP8)	4 GPU nodes total	~$3.35/hr	20-40
Standard	4x A100 80G (32B FP8 each)	1x H100 SXM5 (70B FP8)	5 GPU nodes total	~$8.39/hr	60-100
Production	4x H100 SXM5 (70B FP8 each)	1x H200 SXM5 (72B FP8)	5 GPU nodes total	~$21.86/hr	150-250

Prices above use live Spheron rates: RTX 4090 $0.77/hr, A100 80G PCIe $1.04/hr, H100 SXM5 $4.21/hr, H200 SXM5 $5.02/hr. For a comparison of Dedicated (on-demand) vs Spot instance availability on Spheron, see the instance types guide.

Reference Implementation on Spheron

Provision one on-demand H100 instance per 70B proposer model and one H200 on Spheron for the aggregator. SSH into each node and launch the relevant vLLM server. If you haven't chosen a serving framework yet, see the Ollama vs vLLM comparison for a breakdown of when each makes sense, or the Spheron LLM deployment guide for step-by-step container setup on Spheron instances.

Proposer deployments:

bash

# Proposer 1 - Llama 4 Scout 17B on RTX 4090
docker run --gpus all --rm -p 8001:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization fp8 \
  --max-model-len 16384

bash

# Proposer 2 - Qwen 3 30B A3B on RTX 4090
docker run --gpus all --rm -p 8002:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-30B-A3B \
  --quantization fp8 \
  --max-model-len 16384

bash

# Proposer 3 - Mistral Small 4 24B on A100 80G
docker run --gpus all --rm -p 8003:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --quantization fp8 \
  --max-model-len 32768

bash

# Proposer 4 - Llama-3.3-70B on H100 SXM5
docker run --gpus all --rm -p 8004:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 32768

Aggregator deployment (H200, large context window required):

bash

# Aggregator - Qwen3-72B on H200 SXM5
# --max-model-len must be large enough to hold all proposer outputs concatenated
# 4 proposers x 500 tokens each + original query + system prompt = ~3000-4000 tokens overhead
docker run --gpus all --rm -p 8010:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-72B \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

FastAPI orchestration layer:

python

import asyncio
import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

PROPOSER_ENDPOINTS = [
    "http://proposer-1-host:8001/v1",
    "http://proposer-2-host:8002/v1",
    "http://proposer-3-host:8003/v1",
    "http://proposer-4-host:8004/v1",
]
PROPOSER_MODELS = [
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "Qwen/Qwen3-30B-A3B",
    "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    "meta-llama/Llama-3.3-70B-Instruct",
]
AGGREGATOR_ENDPOINT = "http://aggregator-host:8010/v1"
AGGREGATOR_MODEL = "Qwen/Qwen3-72B"

AGGREGATOR_SYSTEM_PROMPT = (
    "You are a synthesis model. You will receive several independent responses to the "
    "same question from different AI models. Synthesize them into a single, high-quality "
    "response. Identify the strongest points from each, resolve factual conflicts by "
    "reasoning from first principles, and do not mention that you are synthesizing. "
    "Output only the final synthesized answer."
)

class ChatRequest(BaseModel):
    messages: list[dict]
    max_tokens: int = 1024
    temperature: float = 0.7


async def call_proposer(client: httpx.AsyncClient, endpoint: str, model: str, request: ChatRequest) -> str:
    """Call a single proposer and return its text response."""
    resp = await client.post(
        f"{endpoint}/chat/completions",
        json={
            "model": model,
            "messages": request.messages,
            "max_tokens": request.max_tokens,
            "temperature": request.temperature,
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


@app.post("/v1/chat/completions")
async def moa_chat(request: ChatRequest):
    async with httpx.AsyncClient() as client:
        # Fan out to all proposers concurrently
        proposer_tasks = [
            call_proposer(client, endpoint, model, request)
            for endpoint, model in zip(PROPOSER_ENDPOINTS, PROPOSER_MODELS)
        ]
        results = await asyncio.gather(*proposer_tasks, return_exceptions=True)
        proposer_outputs = [r for r in results if not isinstance(r, BaseException)]
        if len(proposer_outputs) < 2:
            raise HTTPException(status_code=502, detail="Too few proposers succeeded")

        # Build aggregator prompt
        formatted_outputs = "\n\n".join(
            f"Response {i+1}:\n{output}"
            for i, output in enumerate(proposer_outputs)
        )
        original_query = next(
            (m["content"] for m in reversed(request.messages) if m["role"] == "user"),
            ""
        )
        aggregator_messages = [
            {"role": "system", "content": AGGREGATOR_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Proposer responses:\n{formatted_outputs}\n\n"
                    f"Original question: {original_query}\n\n"
                    "Synthesize the above into a single, complete answer:"
                ),
            },
        ]

        # Call aggregator
        try:
            agg_resp = await client.post(
                f"{AGGREGATOR_ENDPOINT}/chat/completions",
                json={
                    "model": AGGREGATOR_MODEL,
                    "messages": aggregator_messages,
                    "max_tokens": request.max_tokens,
                    "temperature": 0.3,
                },
                timeout=180.0,
            )
            agg_resp.raise_for_status()
        except httpx.HTTPStatusError as exc:
            raise HTTPException(
                status_code=502,
                detail=f"Aggregator error: {exc.response.status_code}",
            )
        except httpx.TimeoutException:
            raise HTTPException(status_code=504, detail="Aggregator request timed out")
        except httpx.RequestError as exc:
            raise HTTPException(status_code=502, detail=f"Aggregator connection error: {exc}")

        return agg_resp.json()

Test with curl:

bash

curl -X POST http://moa-orchestrator:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain gradient checkpointing and when to use it."}],
    "max_tokens": 512
  }'

Latency and Quality Tradeoffs

Optimization	Latency Impact	Quality Impact	Notes
Parallel proposer calls (baseline)	Latency = slowest proposer	Baseline quality	Default architecture
3 proposers	Moderate p99	Good quality	Safe starting point
6 proposers	Higher p99 (more variance)	Better quality, diminishing returns past 4	Slowest proposer dominates
FP8 quantization on proposers	-20-30% TTFT	Minimal quality loss	Use for all proposers
Semantic cache on proposers	-80-90% on cache hits	No change on cached responses	High-value optimization
Multi-layer MoA (1 to 3 layers)	3x latency	+2-4 MT-Bench points	For async, non-interactive tasks only

The most impactful single optimization is running proposers in parallel (the baseline architecture). Going from sequential to parallel cuts latency by (N-1)/N where N is proposer count. For 4 proposers at 2 seconds each, sequential = 8 seconds; parallel = 2 seconds. For a deep dive on how vLLM achieves throughput gains via continuous batching and paged attention, see the LLM serving optimization guide.

Semantic caching for proposers deserves special attention. Because MoA proposers respond to the same user query deterministically (temperature 0.7 with a fixed seed), the same query maps to nearly identical proposer outputs. See the semantic cache guide for full deployment instructions. Adding a cache in front of each proposer endpoint cuts GPU-hours by 60-80% on workloads with any repetition.

Cost Model: MoA vs Single Large Model

Worked example: 100K queries/month, 4-proposer stack vs single 70B model

Assumptions: 500 token average query, 400 token average response, all models at FP8.

Single 70B model on H100 SXM5 ($4.21/hr):

Tokens per month: 100K queries x 900 tokens = 90M tokens
At 300 tokens/sec throughput: 90M / 300 = 300,000 GPU-seconds = ~83 GPU-hours
Monthly cost: 83 x $4.21 = ~$350/month

4-proposer MoA (4x Llama-3.3-70B proposers + 1x Qwen3-72B aggregator):

Each proposer generates 400 tokens per query: 100K x 400 = 40M tokens/proposer
At 300 tokens/sec on H100: 40M / 300 = 133,333 GPU-seconds = ~37 GPU-hours per proposer
4 proposers run concurrently: billed as 4 x 37 = 148 GPU-hours, elapsed time = 37 GPU-hours
Aggregator input: ~2,200 tokens (4x400 proposer outputs + 200 original), output 400 tokens
Aggregator at 250 tokens/sec on H200: 100K x 2,600 tokens / 250 = 1,040,000 GPU-seconds = ~289 GPU-hours
Total billed GPU-hours: 148 (proposers) + 289 (aggregator) = 437 GPU-hours

Configuration	GPU-Hours/month	Cost/month	Quality Level
Single 70B (H100 SXM5)	83	~$350	Strong (baseline)
4-proposer MoA (H100 proposers + H200 aggregator)	437	~$2,074	Better (+7.5pp AlpacaEval)
4-proposer MoA with H200 spot aggregator	~437	~$967 (H100 proposers on-demand + H200 spot at $1.19/hr)	Better
Budget MoA (RTX 4090 proposers + A100 aggregator)	Varies	~$400-600	Good

MoA costs more than a single model at equivalent scale. The value case is where the quality improvement justifies the cost, or where MoA enables a product capability (e.g., high-stakes factual Q&A where accuracy directly drives revenue) that a single model cannot match.

Pricing fluctuates based on GPU availability. The prices above are based on 08 May 2026 and may have changed. Check current GPU pricing → for live rates.

Production Patterns

When to use MoA vs a single large model:

Scenario	Recommendation
High-stakes factual Q&A	MoA - diversity catches individual model errors
Customer support classification	MoA - higher accuracy on edge cases
Document summarization	MoA - better coverage of key points
Reasoning over structured data	MoA - multiple models catch different logical paths
Creative long-form writing	Single large model - coherence requires one voice
Complex multi-step code generation	Single large model - consistency across 200+ line functions
Classification at scale	Router first, then MoA for borderline cases (see inference router guide)

Fallback paths. If a proposer is unavailable during a request, run MoA with N-1 proposers. If only one proposer is available, treat it as a single-model request. If the aggregator is unavailable, fall back to the highest-confidence proposer output (use a secondary small judge model to rank proposer outputs by perplexity).

Semantic caching for proposer outputs. Proposer models in a MoA pipeline are ideal cache targets: the same user query maps to deterministic proposer outputs, and caching at the proposer layer cuts GPU-hours by 60-80% on repeated queries. See the semantic cache guide for full stack deployment with GPTCache and Redis.

Monitoring in Production

Proposer disagreement metric. Compute the average pairwise cosine similarity between proposer response embeddings for each query. Low similarity (< 0.7) signals high disagreement, where MoA adds the most value. High similarity (> 0.92) signals proposers are converging, possibly because the query has a single obvious answer or because proposer models have drifted toward a common checkpoint.

python

import numpy as np
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def compute_mean_similarity(proposer_outputs: list[str]) -> float:
    """Mean pairwise cosine similarity. Higher = more agreement, lower = more diverse."""
    if len(proposer_outputs) < 2:
        return 1.0
    embeddings = embed_model.encode(proposer_outputs, normalize_embeddings=True)
    n = len(embeddings)
    similarities = []
    for i in range(n):
        for j in range(i + 1, n):
            similarities.append(float(np.dot(embeddings[i], embeddings[j])))
    return float(np.mean(similarities))

Aggregator quality drift. Run a weekly LLM-as-judge pass on a sampled 1% of MoA outputs. Compare aggregator quality scores over time. A declining trend usually means proposer diversity has dropped (proposers have converged) or the aggregator's synthesis quality has degraded.

Prometheus metrics to track:

python

from prometheus_client import Counter, Histogram, Gauge

moa_proposer_latency = Histogram(
    "moa_proposer_latency_seconds",
    "Latency per proposer call",
    ["proposer_id"],
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

moa_agreement_score = Gauge(
    "moa_proposer_agreement_score",
    "Mean pairwise cosine similarity across proposer outputs (higher = more agreement, lower = more diverse)"
)

moa_aggregator_latency = Histogram(
    "moa_aggregator_latency_seconds",
    "Aggregator synthesis latency",
    buckets=[1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)

moa_cache_hit_rate = Counter(
    "moa_cache_hits_total",
    "Proposer cache hits",
    ["proposer_id"]
)

Track agreement score over rolling 24-hour windows. Sudden rises in mean agreement score across queries signal that proposer models have converged - check for accidental model version alignment across proposers and restore diversity by swapping in a proposer from a different training lineage.

MoA needs multiple models running concurrently - the architecture is a natural fit for on-demand bare-metal GPU access where you pay per hour, not per token. Rent H100 instances on Spheron for your proposers and an H200 for the aggregator without per-token markup.
Rent H100 → | Rent H200 → | View all GPU pricing →

What Is Mixture of Agents (And What It Isn't)

MoA Architecture: Proposers, Aggregators, and Layered Refinement

Quality Benchmarks: MoA vs Single Frontier Models

GPU Footprint Planning

Reference Implementation on Spheron

Latency and Quality Tradeoffs

Cost Model: MoA vs Single Large Model

Production Patterns

Monitoring in Production

Build what's next.