Deploy A2A (Agent2Agent) Multi-Agent Systems on GPU Cloud: Cross-Framework Agent Interop with Self-Hosted LLM Backends (2026 Guide)

Two agents from different frameworks usually cannot talk to each other without custom glue code between them. A2A (Agent2Agent Protocol) solves that at the protocol level. This guide covers the full deployment stack: what A2A is, how it composes with MCP, and how to run a production A2A agent mesh on GPU cloud with self-hosted vLLM backends.

TL;DR

Start both tiers with:

bash
# Orchestrator (H200 SXM5, 70B model)
docker run --gpus all --ipc=host -p 8001:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32

# Worker (A100 80GB, 14B model)
docker run --gpus all --ipc=host -p 8002:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-14B-Instruct \
  --quantization fp8 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64

Send an A2A task from orchestrator to worker:

python
import httpx

# Fetch worker Agent Card
card = httpx.get("http://worker-host/.well-known/agent-card.json").json()

# Send a task
result = httpx.post(
    "http://worker-host/tasks/send",
    headers={"Authorization": "Bearer <shared-secret>"},
    json={
        "task_id": "task-001",
        "input": {
            "query": "Summarize the top 3 GPU cloud pricing trends in 2026"
        },
        "context": {}
    },
    timeout=60.0
).json()

print(result["output"])

The 2026 Agent Protocol Stack

Two protocols now underpin almost every production multi-agent system. They solve different problems.

MCP (Model Context Protocol) defines how an agent calls a tool. A web search tool, a database query endpoint, a code execution sandbox: each is exposed via a standard JSON-RPC interface. The agent sends a tool call and gets a structured result back. The direction is vertical: agent calls tool.

A2A (Agent2Agent Protocol) defines how one agent delegates a task to another agent. The orchestrator sends a task payload to a worker agent and gets a structured result back. The direction is horizontal: agent calls agent.

Protocol | Direction      | What It Defines                                   | Standardized By
MCP      | Agent to Tool  | Tool schemas, context passing, JSON-RPC transport | Anthropic
A2A      | Agent to Agent | Task delegation, Agent Cards, result streaming    | Linux Foundation / Google

They compose rather than compete. A LangGraph orchestrator uses A2A to delegate a research sub-task to a CrewAI specialist. The CrewAI specialist uses MCP to call a web search tool. The LangGraph orchestrator never needs to know which MCP tools the CrewAI agent uses. The CrewAI agent never needs to know what framework called it.

This layering is what makes cross-framework agent meshes practical. Without A2A, connecting a LangGraph orchestrator to a CrewAI worker requires custom HTTP clients, custom schemas, and custom error handling on both sides. With A2A, you agree once on the protocol, and any two conforming agents can interoperate.

A2A Core Concepts

Agent Cards

An Agent Card is a JSON document served at /.well-known/agent-card.json (the discovery path standardized in A2A v1.0). It is the discovery and contract mechanism for A2A.

A minimal Agent Card looks like this:

json
{
  "name": "research-worker",
  "version": "1.0.0",
  "description": "Web research specialist. Accepts a query and returns a structured summary.",
  "capabilities": ["web_research", "document_summarization"],
  "task_schema": {
    "input": {
      "type": "object",
      "properties": {
        "query": {"type": "string"},
        "max_sources": {"type": "integer", "default": 5}
      },
      "required": ["query"]
    },
    "output": {
      "type": "object",
      "properties": {
        "summary": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}}
      }
    }
  },
  "authentication": {
    "schemes": ["bearer"],
    "required": true
  },
  "endpoint": "http://research-worker.internal:8080"
}

The orchestrator fetches this card at startup and builds a routing table: which agent to call for which capability. When a task arrives, the orchestrator matches the sub-task type against the capability list in each card and routes accordingly.
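
The routing table described above can be sketched in a few lines. This is a minimal illustration, not the guide's orchestrator code: `build_routing_table` and `route` are hypothetical helper names, the `synthesis-worker` card is an assumed example, and picking the first matching agent stands in for whatever selection policy you actually want.

```python
from collections import defaultdict
from typing import Any, Dict, List


def build_routing_table(cards: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each declared capability to the agents that advertise it."""
    table: Dict[str, List[str]] = defaultdict(list)
    for agent_name, card in cards.items():
        for capability in card.get("capabilities", []):
            table[capability].append(agent_name)
    return dict(table)


def route(table: Dict[str, List[str]], capability: str) -> str:
    """Pick the first agent advertising the capability (no load balancing)."""
    candidates = table.get(capability)
    if not candidates:
        raise LookupError(f"no agent advertises capability {capability!r}")
    return candidates[0]


# Cards as they might look after being fetched from each worker (assumed data).
cards = {
    "research-worker": {"capabilities": ["web_research", "document_summarization"]},
    "synthesis-worker": {"capabilities": ["document_summarization"]},
}
table = build_routing_table(cards)
print(route(table, "web_research"))
```

Keying the table by capability rather than by agent name means a second worker with the same capability shows up as an extra candidate without any change to the routing code.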

Tasks

A Task is the unit of work in an A2A exchange. It has three fields:

  • task_id: a unique identifier for this specific delegation. Used for tracing and retries.
  • input: the task payload, validated against the receiving agent's task_schema.input.
  • context: optional shared session state, such as a conversation ID or user profile the worker might need.

json
{
  "task_id": "orch-2a3b-step-1",
  "input": {
    "query": "GPU cloud pricing trends 2026",
    "max_sources": 3
  },
  "context": {
    "session_id": "user-session-xyz",
    "user_tier": "pro"
  }
}

Message Passing

The orchestrator POSTs the Task to the worker's /tasks/send endpoint. The worker executes its LLM workflow, then returns an A2A Task Result. Note: the canonical A2A transport is JSON-RPC 2.0 over a single HTTP endpoint; the original method was tasks/send, renamed to message/send in the current spec. The REST-style path shown throughout this guide is a teaching simplification. If you need strict spec compliance, use the JSON-RPC envelope instead.

json
{
  "task_id": "orch-2a3b-step-1",
  "status": "completed",
  "output": {
    "summary": "GPU cloud pricing dropped 20-30% in H1 2026...",
    "sources": ["source1.com", "source2.com"]
  }
}

Status values include completed, failed, and in_progress (for streaming). The orchestrator checks status before consuming the output.
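
A status check along these lines might look like the sketch below. It only covers the three status values this guide uses; `consume_result` and `TaskFailed` are hypothetical names, and treating an unknown status as an error is a design assumption.

```python
from typing import Any, Dict, Optional


class TaskFailed(Exception):
    """Raised when a worker reports status 'failed'."""


def consume_result(result: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the output only for completed tasks.

    Returns None while the task is still in progress and raises on failure
    or on a status value this guide does not define.
    """
    status = result.get("status")
    if status == "completed":
        return result.get("output", {})
    if status == "in_progress":
        return None
    if status == "failed":
        raise TaskFailed(f"task {result.get('task_id')} failed")
    raise ValueError(f"unexpected status: {status!r}")


done = {"task_id": "orch-2a3b-step-1", "status": "completed", "output": {"summary": "..."}}
print(consume_result(done))
```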

The key point: the research worker in this example could be built on LangGraph, CrewAI, AutoGen, or a raw FastAPI app. The orchestrator does not need to know. It sees an HTTP endpoint, an Agent Card, and a Task schema. That is all A2A requires.
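
For readers who want the JSON-RPC form mentioned in the note above, here is a rough sketch of what a message/send request envelope could look like. The outer envelope (jsonrpc, id, method, params) is standard JSON-RPC 2.0. The field names inside "message" reflect one reading of the A2A spec and are an assumption, so check them against the spec version you target. `jsonrpc_message_send` is a hypothetical helper.

```python
import json
import uuid
from typing import Any, Dict


def jsonrpc_message_send(text: str, request_id: int = 1) -> Dict[str, Any]:
    """Wrap a text query in a JSON-RPC 2.0 request for the message/send method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "message/send",
        "params": {
            # Field names inside "message" are an assumption; verify against your spec version.
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": text}],
                "messageId": str(uuid.uuid4()),
            }
        },
    }


envelope = jsonrpc_message_send("GPU cloud pricing trends 2026")
print(json.dumps(envelope, indent=2))
```

With this transport every call is a POST to the single agent URL, and the method name in the body, not the URL path, selects the operation.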

Why GPU Cloud Is the Infrastructure Layer for A2A

Every A2A agent needs its own always-resident LLM backend. Unlike a stateless REST service, where one process can serve every caller, each agent role in a mesh runs a dedicated inference endpoint with its model weights loaded in VRAM.

In a 5-agent mesh, that is 5 separate inference processes, each holding model weights in GPU memory simultaneously. A 70B orchestrator at FP8 takes roughly 70GB of VRAM for weights alone. A 14B worker at FP8 takes roughly 14GB. If you run one orchestrator and three 14B workers on separate instances, the fleet is one H200 SXM5 and three A100 80GB instances.
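
The VRAM figures above follow from a simple rule of thumb: at FP8 each parameter takes about one byte. The sketch below works through that arithmetic and estimates what is left for KV cache inside vLLM's memory budget. It is a back-of-the-envelope estimate that ignores activation memory and runtime overhead; the helper names are hypothetical, and the GPU sizes and utilization values are the ones used elsewhere in this guide.

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory: 1B parameters at 1 byte each is about 1 GB."""
    return params_billions * bytes_per_param


def kv_headroom_gb(gpu_gb: float, utilization: float, weights: float) -> float:
    """Memory left for KV cache inside the --gpu-memory-utilization budget."""
    return gpu_gb * utilization - weights


FP8_BYTES = 1.0  # bytes per parameter at FP8

orchestrator = weights_gb(70, FP8_BYTES)  # 70B planner, about 70 GB
worker = weights_gb(14, FP8_BYTES)        # 14B worker, about 14 GB

# H200 (141 GB) at 0.90 utilization: roughly 57 GB left for KV cache
print(round(kv_headroom_gb(141, 0.90, orchestrator), 1))
# A100 80GB at 0.85 utilization: roughly 54 GB left for KV cache
print(round(kv_headroom_gb(80, 0.85, worker), 1))
```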

The alternative is calling a closed model API. At 10,000 A2A task delegations per day across a 5-agent mesh (50,000 total LLM calls at 2,000 tokens average output), at $5 per million output tokens, that is $500/day in API costs. On self-hosted GPU cloud at current Spheron prices, the same workload runs on two A100 80GB workers at $1.69/hr each plus one H200 orchestrator at $3.70/hr: $7.08/hr, or roughly $170/day at 100% utilization, and you control the model weights, data, and latency. At $0.05 per delegation on the API side, that fixed GPU cost breaks even near 3,400 delegations per day; for most production agent fleets the break-even point is around 3,000-5,000 delegations per day.
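
The comparison above is easy to rerun with your own numbers. The sketch below uses only the figures from this paragraph (call volume, token counts, API price, and hourly GPU rates); the variable names are illustrative. It treats GPU cost as fixed per day and API cost as linear in delegations.

```python
calls_per_day = 50_000            # 10,000 delegations x 5 agents
output_tokens_per_call = 2_000
api_price_per_million = 5.00      # USD per million output tokens

api_cost_per_day = (
    calls_per_day * output_tokens_per_call / 1_000_000 * api_price_per_million
)

gpu_hourly = 2 * 1.69 + 3.70      # two A100 80GB workers + one H200 orchestrator
gpu_cost_per_day = gpu_hourly * 24

delegations_per_day = 10_000
api_cost_per_delegation = api_cost_per_day / delegations_per_day
break_even_delegations = gpu_cost_per_day / api_cost_per_delegation

print(f"API: ${api_cost_per_day:.2f}/day")
print(f"GPU: ${gpu_cost_per_day:.2f}/day")
print(f"Break-even: {break_even_delegations:.0f} delegations/day")
```

The break-even figure assumes the GPUs are billed around the clock. With per-minute billing and idle periods the fixed cost drops, which moves the break-even point lower.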

Spheron's GPU marketplace lets you mix and match GPU tiers, with data center partners globally and per-minute billing so you pay only while agents are actively running.

Reference Architecture: Heterogeneous Model Routing

The most cost-effective A2A deployment splits agent roles across two GPU tiers: frontier models for orchestration, smaller models for workers.

User Request
    |
    v
Orchestrator Agent (Llama-3.3-70B on H200 SXM5, $3.70/hr on-demand)
    |
    +-- A2A Task --> Research Worker (Qwen2.5-14B on A100 80GB, $1.69/hr on-demand)
    |                    |-- MCP --> Web Search Tool
    |
    +-- A2A Task --> Synthesis Worker (Qwen2.5-14B on A100 80GB, spot ~$0.82/hr)
    |
    +-- A2A Task --> Code Worker (Qwen2.5-7B on A100 80GB, spot ~$0.82/hr)

The orchestrator runs once per user request. Workers handle 90%+ of the token volume but on cheaper GPUs. The blended cost per task drops 70-80% compared to running every agent on the H200.

Agent Role           | Model         | GPU            | On-demand | Spot
Orchestrator/Planner | Llama-3.3-70B | H200 SXM5      | $3.70/hr  | $3.31/hr
Research Worker      | Qwen2.5-14B   | A100 80GB SXM4 | $1.69/hr  | $0.82/hr
Synthesis Worker     | Qwen2.5-14B   | A100 80GB SXM4 | $1.69/hr  | $0.82/hr

Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.

For the orchestrator tier, H200 on Spheron gives you 141GB HBM3e at 4.8 TB/s memory bandwidth, enough headroom to run the 70B planner model plus generous KV cache for concurrent sessions. For worker agents handling document retrieval and code tasks, A100 80GB rentals host each 14B FP8 worker comfortably on its own instance at a fraction of the orchestrator cost.

Hands-On: Stand Up Two A2A Agents

Step 1: Deploy the Orchestrator's vLLM Backend on H200

Provision an H200 SXM5 instance on Spheron. SSH in, then:

bash
docker run --gpus all --ipc=host -p 8001:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --max-model-len 8192

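An H100 80GB is a workable alternative: the same command runs on H100 on Spheron. Note that roughly 70GB of FP8 weights leaves little room for KV cache on an 80GB card, so you may need to lower --max-num-seqs or --max-model-len. Verify the endpoint is serving: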

bash
curl http://localhost:8001/v1/models
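
Loading a 70B model can take several minutes, so the curl check may fail at first. If you script the deployment, a polling loop like the sketch below can wait for the endpoint instead. `list_models` and `wait_until_ready` are hypothetical helpers built on the standard library, and the attempt count and delay are arbitrary defaults.

```python
import json
import time
import urllib.request
from typing import Callable, List


def list_models(base_url: str) -> List[str]:
    """Return the model IDs reported by an OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


def wait_until_ready(
    probe: Callable[[], List[str]], attempts: int = 60, delay: float = 5.0
) -> List[str]:
    """Call probe until it returns a non-empty model list or attempts run out."""
    for _ in range(attempts):
        try:
            models = probe()
            if models:
                return models
        except OSError:
            pass  # connection refused or timed out: server is still starting
        time.sleep(delay)
    raise TimeoutError("endpoint did not become ready")


# Usage against the orchestrator backend from this step:
# wait_until_ready(lambda: list_models("http://localhost:8001"))
```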

Step 2: Deploy the Worker's vLLM Backend on A100

Provision an A100 80GB SXM4 instance. Start the worker model:

bash
docker run --gpus all --ipc=host -p 8002:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-14B-Instruct \
  --quantization fp8 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --max-model-len 4096

SGLang is a drop-in alternative to vLLM for the worker tier and exposes a similar OpenAI-compatible API. The equivalent launch command is:

bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-14B-Instruct \
  --quantization fp8 \
  --port 8002

The VLLM_BASE constant in the agent code below works unchanged against an SGLang server - both expose the same /v1/chat/completions endpoint. Use whichever runtime benchmarks better for your specific model and batch size.

Step 3: Implement Agent Cards

Each agent serves a GET /.well-known/agent-card.json endpoint (the path standardized in A2A v1.0). Here is the worker's FastAPI implementation:

python
from fastapi import FastAPI, HTTPException, Header
from pydantic import BaseModel
from typing import Optional, Dict, Any
import httpx

app = FastAPI()

AGENT_CARD = {
    "name": "research-worker",
    "version": "1.0.0",
    "description": "Web research and summarization agent. Accepts a query, returns a structured summary.",
    "capabilities": ["web_research", "document_summarization"],
    "task_schema": {
        "input": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_sources": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        },
        "output": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"},
                "sources": {"type": "array"}
            }
        }
    },
    "authentication": {"schemes": ["bearer"], "required": True}
}

SHARED_SECRET = "your-agent-shared-secret"
VLLM_BASE = "http://localhost:8002/v1"

@app.get("/.well-known/agent-card.json")
async def get_agent_card():
    return AGENT_CARD


class A2ATask(BaseModel):
    task_id: str
    input: Dict[str, Any]
    context: Optional[Dict[str, Any]] = {}


@app.post("/tasks/send")
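async def handle_task(task: A2ATask, authorization: str = Header(...)):
    # Expect exactly "Bearer <token>"; str.replace would also strip "Bearer " mid-string.
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or token != SHARED_SECRET:
        raise HTTPException(status_code=401, detail="Invalid token")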

    query = task.input.get("query")
    if not query:
        raise HTTPException(status_code=400, detail="Missing required field: query")

    # Call the vLLM backend
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            f"{VLLM_BASE}/chat/completions",
            json={
                "model": "Qwen/Qwen2.5-14B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a research assistant. Summarize findings concisely with sources."
                    },
                    {"role": "user", "content": query}
                ],
                "max_tokens": 512
            }
        )
        response.raise_for_status()
        llm_result = response.json()

    summary = llm_result["choices"][0]["message"]["content"]

    return {
        "task_id": task.task_id,
        "status": "completed",
        "output": {
            "summary": summary,
            "sources": []
        }
    }

Step 4: Implement the Orchestrator's A2A Client

The orchestrator fetches each worker's Agent Card at startup and builds a routing table. On each user request, it runs the planner LLM to decompose the task, dispatches sub-tasks to workers over A2A, then aggregates the results.

python
import httpx
from typing import Dict, Any, List
import asyncio

WORKER_HOSTS = {
    "research-worker": "http://research-worker.internal:8080",
    "synthesis-worker": "http://synthesis-worker.internal:8080",
}
SHARED_SECRET = "your-agent-shared-secret"
ORCHESTRATOR_VLLM = "http://localhost:8001/v1"


async def load_agent_registry() -> Dict[str, Any]:
    registry = {}
    async with httpx.AsyncClient(timeout=10.0) as client:
        for name, host in WORKER_HOSTS.items():
            resp = await client.get(f"{host}/.well-known/agent-card.json")
            resp.raise_for_status()
            card = resp.json()
            registry[name] = {"card": card, "host": host}
    return registry


async def call_planner(task: str, registry: Dict) -> List[Dict]:
    """Call the orchestrator's vLLM to decompose the task."""
    capability_list = [
        f"{name}: {reg['card']['description']}"
        for name, reg in registry.items()
    ]
    system_prompt = (
        "You are a task planner. Given a user task and a list of worker agents, "
        "decompose the task into sub-tasks and assign each to the most suitable worker. "
        "Return a JSON list: [{\"worker\": \"<name>\", \"query\": \"<sub-task query>\"}]"
    )
    user_prompt = (
        f"Available workers:\n" + "\n".join(capability_list) +
        f"\n\nUser task: {task}\n\nReturn only the JSON list."
    )

    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            f"{ORCHESTRATOR_VLLM}/chat/completions",
            json={
                "model": "meta-llama/Llama-3.3-70B-Instruct",
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                "max_tokens": 512
            }
        )

    import json
    response.raise_for_status()
    content = response.json()["choices"][0]["message"]["content"]
    # LLMs sometimes prefix JSON with explanatory text; extract the JSON array.
    start, end = content.find("["), content.rfind("]") + 1
    if start == -1 or end == 0:
        return []
    try:
        return json.loads(content[start:end])
    except json.JSONDecodeError:
        return []


async def dispatch_a2a_task(
    worker_host: str, task_id: str, query: str
) -> Dict[str, Any]:
    """Send one A2A task to a worker and return its result."""
    async with httpx.AsyncClient(timeout=120.0) as client:
        result = await client.post(
            f"{worker_host}/tasks/send",
            headers={"Authorization": f"Bearer {SHARED_SECRET}"},
            json={"task_id": task_id, "input": {"query": query}, "context": {}}
        )
    result.raise_for_status()
    return result.json()


async def run_agent_mesh(user_task: str) -> str:
    registry = await load_agent_registry()
    sub_tasks = await call_planner(user_task, registry)

    # Dispatch all sub-tasks in parallel
    dispatch_calls = []
    indexed_tasks = []
    for i, sub_task in enumerate(sub_tasks):
        worker_name = sub_task["worker"]
        if worker_name not in registry:
            continue
        host = registry[worker_name]["host"]
        dispatch_calls.append(
            dispatch_a2a_task(host, f"task-{i}", sub_task["query"])
        )
        indexed_tasks.append(sub_task)

    worker_results = await asyncio.gather(*dispatch_calls, return_exceptions=True)

    # Synthesize results with the planner model
    results_text = "\n\n".join(
        f"[{indexed_tasks[i]['worker']}]: {r.get('output', {}).get('summary', '')}"
        for i, r in enumerate(worker_results)
        if not isinstance(r, Exception) and r.get("status") == "completed"
    )

    async with httpx.AsyncClient(timeout=60.0) as client:
        synthesis = await client.post(
            f"{ORCHESTRATOR_VLLM}/chat/completions",
            json={
                "model": "meta-llama/Llama-3.3-70B-Instruct",
                "messages": [
                    {"role": "system", "content": "Synthesize the following agent outputs into a coherent final answer."},
                    {"role": "user", "content": results_text}
                ],
                "max_tokens": 1024
            }
        )

    synthesis.raise_for_status()
    return synthesis.json()["choices"][0]["message"]["content"]

Step 5: Run the Two-Agent Pipeline

python
import asyncio

result = asyncio.run(run_agent_mesh(
    "Research GPU cloud pricing trends in 2026 and summarize the key findings."
))
print(result)

The orchestrator decomposes the task, dispatches the sub-task to the research worker over A2A, and synthesizes the worker's output into a final response. The worker never sees the orchestrator's framework. The orchestrator never sees how the worker retrieves information.

Scaling and Cost Optimization

Spot for Workers, On-Demand for Orchestrators

Worker agents handle individual sub-tasks that are inherently retryable. If a spot instance is reclaimed mid-task, the orchestrator catches the failed A2A response (status "failed" or a network timeout) and re-dispatches to the next available worker. Spot A100 80GB instances at ~$0.82/hr cut worker costs by 51% versus on-demand.

Orchestrators should not run on spot. The orchestrator holds per-session state: the task decomposition, the list of in-flight sub-tasks, and the partial results waiting for synthesis. A spot preemption here breaks the entire task chain. On-demand orchestrators cost more but save you from implementing complex state recovery.

Queue-Based Autoscaling

Use Redis as the task queue between orchestrator and workers:

python
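import json
import redis

r = redis.Redis(host="redis.internal", port=6379)

# Orchestrator pushes serialized tasks to worker-specific queues
task_json = json.dumps(
    {"task_id": "task-001", "input": {"query": "GPU cloud pricing trends 2026"}, "context": {}}
)
r.rpush("queue:research-worker", task_json)

# Workers block until a task arrives; blpop returns (queue_name, payload) or None on timeout
item = r.blpop("queue:research-worker", timeout=30)
if item is not None:
    _queue_name, payload = item
    task = json.loads(payload)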

Scale worker instances when queue depth exceeds your threshold:

python
queue_depth = r.llen("queue:research-worker")
if queue_depth > 20:
    # provision_new_worker_instance() is a placeholder for your own
    # wrapper around the Spheron API; it is not a library function.
    provision_new_worker_instance()

Scale in when the queue has been empty for 5 consecutive minutes. The two tiers scale independently: planner saturation (slow planning TTFT) and worker saturation (deep worker queues) have different root causes and different remedies.
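
The scale-out and scale-in rules can be combined into one small decision function that you feed with periodic queue samples. The sketch below is one possible shape, not a prescribed design: `QueueMonitor` is a hypothetical class, the thresholds are the ones stated above, and timestamps are passed in explicitly so the logic can be tested without a clock.

```python
from dataclasses import dataclass
from typing import Optional

SCALE_OUT_DEPTH = 20         # pending tasks before adding a worker
SCALE_IN_IDLE_SECONDS = 300  # queue empty for 5 consecutive minutes


@dataclass
class QueueMonitor:
    empty_since: Optional[float] = None  # when the queue was first seen empty

    def decide(self, depth: int, now: float) -> str:
        """Return 'scale_out', 'scale_in', or 'hold' for one queue sample."""
        if depth > SCALE_OUT_DEPTH:
            self.empty_since = None
            return "scale_out"
        if depth == 0:
            if self.empty_since is None:
                self.empty_since = now
            if now - self.empty_since >= SCALE_IN_IDLE_SECONDS:
                self.empty_since = None  # restart the idle clock after acting
                return "scale_in"
            return "hold"
        self.empty_since = None  # any pending task breaks the idle streak
        return "hold"


monitor = QueueMonitor()
print(monitor.decide(depth=25, now=0.0))
```

In a deployment, depth would come from r.llen on the worker queue and now from time.time(), sampled every few seconds.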

Per-Agent Cost Attribution

After each A2A task call, read the token usage from the vLLM response and compute cost:

python
usage = vllm_response["usage"]
tokens_used = usage["prompt_tokens"] + usage["completion_tokens"]

# GPU hourly cost / estimated tokens per hour at this model size
cost_per_token = hourly_gpu_cost / tokens_per_hour
task_cost = tokens_used * cost_per_token

# Store per task_id
r.hset(f"cost:{task.task_id}", mapping={
    "agent": agent_name,
    "tokens": tokens_used,
    "cost_usd": task_cost
})

Track this per agent role over time. Research workers tend to have high prompt tokens (long web scrape context). Synthesis workers have high completion tokens (long output). These cost profiles are different and point to different optimization levers: prefix caching for prompt-heavy workers, max_tokens caps for completion-heavy ones.
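
A self-contained version of the cost calculation, keeping prompt and completion tokens separate so the two cost profiles stay visible, might look like this. `task_cost_usd` is a hypothetical helper, and the throughput of 3.6M tokens per hour is an assumed figure for illustration; measure your own per model and GPU.

```python
from typing import Any, Dict


def task_cost_usd(
    usage: Dict[str, int], hourly_gpu_cost: float, tokens_per_hour: float
) -> Dict[str, Any]:
    """Attribute GPU cost to one task from the 'usage' block of a chat completion."""
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]
    tokens = prompt + completion
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "tokens": tokens,
        "cost_usd": tokens * hourly_gpu_cost / tokens_per_hour,
    }


# Assumed throughput: 3.6M tokens/hour (1,000 tokens/s) on an A100 80GB worker.
record = task_cost_usd(
    {"prompt_tokens": 1500, "completion_tokens": 300},
    hourly_gpu_cost=1.69,
    tokens_per_hour=3_600_000,
)
print(record)
```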

Observability

Tracing across A2A hops requires propagating context through each task delegation. Add a trace header to every A2A POST request:

python
import uuid

# In orchestrator, generate a trace ID for the full task
trace_id = str(uuid.uuid4())

await client.post(
    f"{worker_host}/tasks/send",
    headers={
        "Authorization": f"Bearer {SHARED_SECRET}",
        "X-Trace-ID": trace_id,
        "X-Parent-Task-ID": orchestrator_task_id
    },
    json=task_payload
)

Each worker logs the X-Trace-ID and its own task_id together. A Langfuse or Arize Phoenix collector receiving spans from all agents can then reconstruct the full task chain: which orchestrator task triggered which worker tasks, how long each hop took, and where latency accumulated.
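
On the worker side, the logging step could be as simple as the sketch below. It emits one JSON log line per handled task and does not send spans to Langfuse or Phoenix; wiring up an exporter is left out. `trace_fields` and `log_hop` are hypothetical helpers, and the "unknown" fallback for a missing trace header is an assumption.

```python
import json
import logging
import time
from typing import Dict, Mapping

logger = logging.getLogger("a2a.worker")


def trace_fields(headers: Mapping[str, str], task_id: str) -> Dict[str, str]:
    """Collect the IDs needed to join this hop to the orchestrator's trace."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "trace_id": lowered.get("x-trace-id", "unknown"),
        "parent_task_id": lowered.get("x-parent-task-id", ""),
        "task_id": task_id,
    }


def log_hop(headers: Mapping[str, str], task_id: str, started: float) -> Dict[str, object]:
    """Emit one structured log line per handled task and return the record."""
    record: Dict[str, object] = dict(trace_fields(headers, task_id))
    record["duration_ms"] = round((time.monotonic() - started) * 1000, 1)
    logger.info(json.dumps(record))
    return record


start = time.monotonic()
rec = log_hop({"X-Trace-ID": "abc-123", "X-Parent-Task-ID": "orch-1"}, "task-0", start)
print(rec["trace_id"], rec["task_id"])
```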

For GPU-level metrics, run DCGM on each instance and correlate GPU utilization spikes with A2A task volume. When a worker's GPU utilization exceeds 85% for more than 30 seconds, that is usually a sign the queue is backing up and you need another instance.

For SSH setup and instance management, see the Spheron documentation.

Security

A2A meshes have an expanded attack surface compared to single-agent systems: every inter-agent endpoint is a potential target.

Authentication between agents. Declare the accepted auth scheme in each Agent Card's authentication field. For internal meshes where all agents run in the same private network on Spheron, a shared secret per agent pair is sufficient. Generate a unique secret for each orchestrator-worker pair, not a single global secret. Rotate monthly.
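
One way to hold a distinct secret per orchestrator-worker pair and check it safely is sketched below. It is a simplified illustration: `PAIR_SECRETS` and `verify_bearer` are hypothetical names, the secrets are generated in-process only for the demo (in practice they would come from a secret store), and how the worker learns the caller's identity is left as an assumption.

```python
import hmac
import secrets
from typing import Dict, Tuple

# (caller, worker) -> secret. Generated here for the demo; load from a secret store in practice.
PAIR_SECRETS: Dict[Tuple[str, str], str] = {
    ("orchestrator", "research-worker"): secrets.token_urlsafe(32),
    ("orchestrator", "synthesis-worker"): secrets.token_urlsafe(32),
}


def verify_bearer(authorization: str, caller: str, worker: str) -> bool:
    """Check an Authorization header against the secret for this agent pair."""
    scheme, _, token = authorization.partition(" ")
    expected = PAIR_SECRETS.get((caller, worker))
    if scheme.lower() != "bearer" or expected is None:
        return False
    # compare_digest runs in constant time, so timing does not reveal a partial match.
    return hmac.compare_digest(token.encode(), expected.encode())


research_secret = PAIR_SECRETS[("orchestrator", "research-worker")]
print(verify_bearer(f"Bearer {research_secret}", "orchestrator", "research-worker"))
```

Because each pair has its own secret, a leaked research-worker credential cannot be replayed against the synthesis worker, and rotation can happen one pair at a time.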

For deployments that expose worker agents to external orchestrators (for example, offering a worker as a service to third parties), OAuth 2.0 with short-lived tokens is the right choice. The Agent Card's authentication.schemes field should list oauth2 and include the token endpoint URL.

Validate every task input. Worker agents should validate incoming task payloads against their declared task_schema before passing input to the LLM backend. A schema mismatch from a misconfigured orchestrator should return a 400, not trigger an LLM call. Pydantic makes this one line:

python
class ResearchTaskInput(BaseModel):
    query: str
    max_sources: int = 5

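# Raises pydantic.ValidationError on bad input; catch it in the handler
# and return HTTP 400 so the request never reaches the LLM backend.
validated = ResearchTaskInput(**task.input)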
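
To show the "reject before the LLM call" flow end to end without extra dependencies, the sketch below swaps Pydantic for a small hand-written check against the input schema from the Agent Card. It covers only required fields and primitive types, so treat it as an illustration of the control flow rather than a full JSON Schema validator. `schema_errors` and `precheck` are hypothetical helpers.

```python
from typing import Any, Dict, List, Tuple

TYPE_MAP = {"string": str, "integer": int, "object": dict, "array": list}

# Input schema copied from the research worker's Agent Card in this guide.
INPUT_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_sources": {"type": "integer", "default": 5},
    },
    "required": ["query"],
}


def schema_errors(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Check required fields and primitive types against a small schema subset."""
    errors = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in payload
    ]
    for name, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if name not in payload or expected is None:
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(f"{name}: expected {spec['type']}")
    return errors


def precheck(payload: Dict[str, Any], schema: Dict[str, Any]) -> Tuple[int, List[str]]:
    """Return the HTTP status a worker would answer with before any LLM call."""
    errors = schema_errors(payload, schema)
    return (400, errors) if errors else (200, [])


print(precheck({"max_sources": "3"}, INPUT_SCHEMA))
```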

Network isolation. Keep all agent HTTP endpoints on a private network. Only the user-facing orchestrator endpoint should be publicly reachable. Worker agents that accept A2A tasks from the orchestrator should bind to internal IPs only. On Spheron, use a private subnet for inter-agent traffic and expose only the orchestrator via a public IP.

When to Use A2A vs Direct HTTP

A2A adds a layer of protocol overhead: an Agent Card fetch at startup, a structured task payload, and a defined response schema. For two agents that will always be deployed together and never swap frameworks, direct HTTP with a custom schema is simpler.

Use A2A when:

  • Agents are built on different frameworks and need to interoperate (LangGraph orchestrator calling a CrewAI specialist).
  • You want to publish worker agents as reusable services that multiple orchestrators can discover and call.
  • You need a standard audit trail: A2A task IDs and Agent Cards give you a clean record of which agent did what.
  • Your team expects agent roles to change over time. A2A's Agent Card discovery means adding a new worker is just deploying a new service with a valid Agent Card. The orchestrator discovers it at next startup.

Use direct HTTP when you are prototyping, the agents are tightly coupled, or you need to cut latency to the minimum and the protocol overhead matters.


Running a multi-agent A2A mesh means each agent needs its own inference backend. Spheron's GPU marketplace lets you mix H200 capacity for orchestrators with A100 80GB for worker agents on per-minute billing, with no contracts and no idle cost.

Quick Setup Guide

  1. Design your A2A agent mesh topology

    Decide which agents you need and which model tier each requires. Map each agent role to a GPU tier: orchestrators and planners need high-VRAM GPUs (H200, H100) for 70B+ frontier models. Specialist workers (web retrieval, code synthesis, doc parsing) run on A100 80GB with 7B-14B models. Each distinct model tier becomes a separate vLLM deployment on its own GPU instance. Sketch the Agent Card for each role: what task types it accepts, what schema its output follows, and what auth token it requires.

  2. Deploy vLLM LLM backends for each agent tier

    For the orchestrator (planner) tier on H200: docker run --gpus all --ipc=host -p 8001:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.3-70B-Instruct --quantization fp8 --gpu-memory-utilization 0.90 --max-num-seqs 32. For each worker tier on A100 80GB: docker run --gpus all --ipc=host -p 8002:8000 vllm/vllm-openai:latest --model Qwen/Qwen2.5-14B-Instruct --quantization fp8 --gpu-memory-utilization 0.85 --max-num-seqs 64. The -p flag maps the host port to port 8000 inside the container, where vLLM listens by default, so do not pass a different --port to vLLM. Verify each endpoint: curl http://localhost:800X/v1/models.

  3. Implement Agent Cards and A2A HTTP endpoints

    For each agent, create a FastAPI app that serves GET /.well-known/agent-card.json returning the Agent Card JSON (name, version, description, capabilities, task_schema). Implement POST /tasks/send which accepts an A2A Task object, runs the agent's LLM workflow against the vLLM backend, and returns an A2A result. Use pydantic for task schema validation. Expose each agent service on a stable hostname or IP that other agents can reach.

  4. Wire the orchestrator to discover and route to worker agents

    In the orchestrator agent, load each worker's Agent Card at startup: agent_registry = {name: fetch(f'{host}/.well-known/agent-card.json') for name, host in WORKER_HOSTS.items()}, where fetch stands for any HTTP GET helper and each WORKER_HOSTS value already includes the http:// scheme. On each incoming user task, call the planner LLM to produce a task decomposition JSON. For each sub-task, select the worker agent whose Agent Card capabilities match, POST the sub-task to that worker's /tasks/send endpoint, and await the A2A result. Aggregate all worker results and call the planner LLM once more for synthesis.

  5. Add authentication between agents

    This guide uses two authentication modes: a shared bearer token (simplest, suitable for internal meshes) and OAuth 2.0 (for external or multi-tenant deployments). For an internal Spheron deployment, generate a shared secret per agent pair and pass it as Authorization: Bearer <token> in all A2A task POST requests. Each agent validates the token before processing. Add this auth requirement to the Agent Card's authentication field so the orchestrator knows which credential to use per worker.

  6. Configure per-agent cost attribution and autoscaling

    Track token usage per A2A agent role. After each task, read prompt_tokens and completion_tokens from the usage field of the vLLM response body. Store in Redis keyed by agent_name + task_id. Calculate cost per task as (tokens / tokens_per_hour) * hourly_GPU_cost. For autoscaling worker agents, monitor the per-worker task queue depth in Redis. When queue depth exceeds 20 pending tasks per GPU, provision an additional worker instance via the Spheron API. Scale in when the queue is empty for 5 consecutive minutes.

Frequently Asked Questions

What is the difference between A2A and MCP?

MCP (Model Context Protocol) defines how an agent calls external tools - databases, APIs, code runners - using a standardized JSON-RPC interface. A2A (Agent2Agent Protocol) defines how one agent delegates a task to another agent and receives a structured result back. MCP is vertical (agent calls tool). A2A is horizontal (agent calls agent). In a production multi-agent system, both protocols are active at once: an orchestrator agent uses A2A to delegate sub-tasks to specialist agents, and each specialist agent uses MCP to call the tools it needs. They compose, not compete.

What GPUs do I need to run an A2A multi-agent system?

Each A2A agent is backed by its own LLM inference endpoint, so GPU requirements multiply by the number of distinct agent roles in your mesh. A heterogeneous tier approach works best: run the orchestrator (planning, task decomposition) on a high-VRAM GPU like H200 SXM5 with a 70B+ model. Run worker agents (document retrieval, web search synthesis, code execution) on A100 80GB GPUs with 7B-14B models. Workers handle 90%+ of token volume, so routing them to cheaper GPUs cuts the fleet's blended cost-per-token by 70-85% compared to running every agent on a frontier model.

Can agents built on different frameworks communicate over A2A?

Yes. A2A is protocol-level interop: as long as each agent exposes an A2A-compatible HTTP endpoint with a valid Agent Card (a JSON document describing the agent's capabilities and task schema), it does not matter what framework it is built on. A LangGraph-based research agent and a CrewAI-based synthesis agent can be wired together in the same A2A mesh. The receiving agent only sees the task payload, not the orchestration framework of the caller.

What is an Agent Card?

An Agent Card is a JSON document that an A2A-compatible agent serves at a well-known URL (/.well-known/agent-card.json per the current A2A v1.0 spec). It describes the agent's identity, capabilities, supported task schemas, and authentication requirements. Agent Cards are the discovery mechanism for A2A meshes: an orchestrator agent fetches Agent Cards from known endpoints at startup to build a routing table. The A2A specification (including the Agent Card spec) is maintained by the Linux Foundation. MCP is governed by Anthropic.

Should A2A agents run on spot or on-demand GPUs?

Use spot GPUs for worker agents and on-demand for orchestrators. Worker agents handle individual sub-tasks that can be retried if a spot instance is reclaimed. Spot A100 instances typically cost 40-60% less than on-demand (spot ~$0.82/hr vs. on-demand at $1.69/hr). H200 spot savings are smaller, around 10-12% at current prices ($3.31/hr spot vs. $3.70/hr on-demand). Orchestrators hold per-session state and route all other agents, so a preemption here breaks the whole task - keep orchestrators on on-demand instances with automatic restart configured. This hybrid billing strategy cuts fleet cost by 30-50% compared to running everything on-demand.
