Azure OpenAI is the default backend for Microsoft Agent Framework. It is also the fastest path to a large bill when your agents start running at scale.
A 5-step ReAct loop with tool use can consume 10,000-25,000 tokens per user request. Run that for 100 concurrent users at 50 requests per day and you are looking at 50-125M tokens daily. At gpt-4o pricing, that is $500-$1,250 per day, before adding any orchestration overhead. MAF does not require Azure OpenAI. It accepts any OpenAI-compatible endpoint. This guide shows you how to replace the Azure backend with a self-hosted vLLM server on GPU cloud, what model to pick for your agent tier, and what the cost math looks like at scale.
Microsoft Agent Framework 1.0 in 2026: What You Actually Get
Microsoft Agent Framework 1.0 is the unified successor to AutoGen and Semantic Kernel, which reached 1.0 GA in April 2026. Rather than treating these as two separate projects, Microsoft merged the orchestration capabilities of AutoGen with the enterprise integration patterns of Semantic Kernel under a single framework.
The 1.0 GA feature set includes:
Agent runtime. A production deployment layer for managing agent lifecycle. Instead of running individual agent scripts, you deploy agents into the runtime, which handles process management, restarts, and health checks. This makes MAF suitable for always-on production deployments.
CodeAct mode. Agents write and execute Python rather than emitting JSON tool calls. Instead of calling a search tool with structured JSON, a CodeAct agent writes results = search_web(query) and runs it directly. This reduces the round-trip overhead of tool call parsing and tends to produce fewer failed tool invocations on models with weaker function-calling accuracy.
Multi-agent graph with supervisor routing. Agents connect as nodes in a directed graph. A supervisor agent receives incoming tasks, evaluates which worker agent should handle them, and routes accordingly. The graph handles the state passing between agents, so workers receive only the context they need.
Built-in tool and function calling. Standard Python functions with type annotations become tools. MAF generates the JSON schema from the annotations automatically.
MCP client integration. MAF ships a first-class MCP client. Any MCP server you run alongside your inference backend can be wired directly into your agents without building a custom tool wrapper.
Note: MAF's hosted agent feature (fully managed deployment to Azure infrastructure) is Azure-specific and remains on Azure. The self-hosted approach in this guide covers the agent orchestration layer on your own compute while connecting to a GPU cloud inference backend.
Why Running MAF on Azure OpenAI Gets Expensive at Agent Scale
The token math for agents is worse than for chatbots. A chatbot turn might consume 500-1,000 tokens. An agent turn is different:
- The system prompt carries the agent's role definition, tool schemas, and any accumulated memory: 2,000-5,000 tokens per turn
- Each tool call result gets appended to context: another 500-2,000 tokens
- In a 5-step ReAct loop, context grows with every step
A single user request might generate 20,000-30,000 tokens across the full agent run. At gpt-4o pricing on Azure (approximately $5/1M input + $15/1M output as of mid-2026, though check the Azure pricing page for current rates), a 30,000-token request with a 50/50 input/output split costs about $0.30. That sounds manageable until you multiply by volume.
At 5,000 requests per day (100 concurrent users, 50 requests each):
- Daily token volume: 5,000 requests * 25,000 tokens = 125M tokens
- Daily cost at gpt-4o rates: roughly $500-$1,000 depending on input/output ratio
- Monthly: $15,000-$30,000
For gpt-4o-mini ($0.15/1M input + $0.60/1M output), the same workload runs cheaper: roughly $37-$75/day, or $1,100-$2,250/month. But gpt-4o-mini is a small model; for multi-step agent reasoning with complex tool chains, most teams find they need at least a mid-size model to maintain acceptable task completion rates.
Self-hosting a 32B or 70B open-weight model on GPU cloud cuts this to a flat infrastructure cost regardless of token volume.
MAF Accepts Any OpenAI-Compatible Endpoint
MAF's model client layer is backend-agnostic. The framework uses a model client object to make inference calls, and that object just needs to implement the OpenAI chat completions API surface. Swapping Azure OpenAI for a self-hosted backend means changing the client configuration, not the agent code.
For reference, a standard Azure-connected MAF setup uses the Azure client with Azure-specific parameters. Switching to a self-hosted vLLM server requires only a client change:
# For Azure, you would pass azure_endpoint and credential instead of base_url:
# OpenAIChatCompletionClient(model=..., azure_endpoint=..., api_version=..., credential=...)
# For vLLM (Chat Completions API), use OpenAIChatCompletionClient with base_url:
from agent_framework.openai import OpenAIChatCompletionClient
model_client = OpenAIChatCompletionClient(
model="Qwen/Qwen3-32B", # must match the model ID loaded in vLLM
base_url="http://YOUR_GPU_IP:8000/v1",
api_key="placeholder", # vLLM ignores this by default
)The chat completions, function-calling JSON format, streaming, and tool use all flow through the same wire protocol. vLLM fully implements the OpenAI /v1/chat/completions endpoint including function calling and parallel tool calls.
If you are running multiple model backends (for example, one H100 for the supervisor and two L40S for worker agents), a proxy layer distributes requests. For that pattern, check the self-hosted OpenAI-compatible API guide for the single-backend setup, and the LiteLLM proxy for load balancing across multiple backends with a unified OpenAI-compatible API surface.
Choosing the Right Model Backend for Agent Workloads
Agent workloads stress function-calling accuracy more than general chat benchmarks measure. A model that scores well on MMLU but hallucinates tool arguments causes agent loops to fail mid-run. The open-weight models with the strongest verified tool-calling accuracy on BFCL v4 and tau-Bench as of mid-2026:
Qwen3 32B (Qwen/Qwen3-32B): The most cost-effective option for production agent worker tiers. Strong function-calling accuracy, fits in FP8 on a single L40S 48GB, and handles 32K context comfortably.
Llama 3.3 70B (meta-llama/Llama-3.3-70B-Instruct): Best open-weight option for supervisor agents where reasoning quality matters. Fits in FP8 on a single H100 SXM5 80GB. Use it for the orchestrator tier in a supervisor/worker graph.
Llama 4 Scout (meta-llama/Llama-4-Scout-17B-16E-Instruct): 17B active parameters out of 109B total with sparse MoE routing. Fits in INT4 on a single H100 SXM5 80GB (at FP8, 109B total parameters would require ~109GB, exceeding the card's 80GB VRAM; INT4 brings the footprint to ~55-60GB). Good throughput relative to its active parameter count makes it suitable for agent deployments where you need more than a 32B model but want better latency than a dense 70B. See the Llama 4 deployment guide for a full vLLM setup walkthrough.
Phi-5: Small, fast, and accurate enough for constrained tool-calling tasks. 16GB VRAM requirement in FP8 means it runs on an L40S with significant headroom for KV cache. Suitable for worker agents with narrow, well-defined tool sets.
GPU sizing table with on-demand pricing (as of 30 Jun 2026):
| Model | Quantization | VRAM Required | Recommended GPU | On-Demand $/hr |
|---|---|---|---|---|
| Qwen3 32B | FP8 | ~35 GB | L40S 48GB | $0.96/hr |
| Llama 3.3 70B | FP8 | ~70 GB | H100 SXM5 80GB | $4.41/hr |
| Llama 4 Scout | INT4 | ~55-60 GB | H100 SXM5 80GB | $4.41/hr |
| Phi-5 | FP8 | ~16 GB | L40S 48GB | $0.96/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 30 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
For the supervisor tier, an H100 SXM5 on Spheron handles Llama 3.3 70B at ~1,500 tokens/sec peak throughput. For worker agent tiers, an L40S instance runs Qwen3-32B at ~3,500 tokens/sec with 48GB VRAM allowing generous KV cache.
Deploying the vLLM Backend on Spheron
First, provision your GPU instance on Spheron and SSH in. The Spheron setup guide covers instance types, regions, and how to connect to your instance once it is running.
vLLM is the standard inference server for agent workloads. It handles continuous batching (multiple concurrent agent calls share one GPU without scheduling overhead), prefix caching (repeated system prompts hit the cache instead of recomputing), and the full OpenAI function-calling API including parallel tool calls and the tool response format.
For a Qwen3-32B FP8 deployment on L40S:
docker run --gpus all --ipc=host -p 8000:8000 \
-e HF_TOKEN=$HF_TOKEN \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-32B \
--dtype fp8 \
--gpu-memory-utilization 0.92 \
--max-model-len 32768 \
--max-num-seqs 64 \
--enable-auto-tool-choice \
--tool-call-parser hermesTwo flags matter specifically for agent workloads:
--enable-auto-tool-choice activates vLLM's tool call interception. Without it, the model generates tool call JSON as plain text and the framework has to parse it manually. With it, vLLM intercepts the generation when it detects a tool call pattern and formats it correctly as the OpenAI tool_calls response field.
--tool-call-parser hermes specifies the parser for Hermes-format models, which includes Qwen3-Instruct. For Llama 4 Scout, use --tool-call-parser llama4_json instead.
For Llama 3.3 70B on H100 SXM5, change --model to meta-llama/Llama-3.3-70B-Instruct, --tool-call-parser to llama3_json, and reduce --max-num-seqs to 32 given the tighter VRAM budget from the larger model.
For a full vLLM production deployment setup with multi-GPU tensor parallelism, log aggregation, and health checks, the vLLM production deployment guide covers that stack in detail.
SGLang is an alternative worth considering specifically for multi-turn agent workloads. Its RadixAttention prefix cache is more aggressive than vLLM's prefix caching, which helps when many agent sessions share the same large system prompt. If your agents have a 3,000+ token system prompt and you run dozens of concurrent sessions, SGLang's cache hit rate can reduce TTFT by 40-60% compared to vLLM on the same hardware. The SGLang deployment guide covers the configuration.
Wiring MAF to the Self-Hosted Endpoint
Full agent setup with tool definitions and the model client pointed at vLLM:
import asyncio
from agent_framework import tool
from agent_framework.openai import OpenAIChatCompletionClient
# Configure the model client pointing at your vLLM server
model_client = OpenAIChatCompletionClient(
model="Qwen/Qwen3-32B",
base_url="http://YOUR_GPU_IP:8000/v1",
api_key="placeholder", # vLLM ignores this by default
)
# Define tools with the @tool decorator
@tool
async def search_web(query: str) -> str:
"""Search the web for information about a topic."""
# Your actual search implementation here
return f"Search results for: {query}"
@tool
async def run_code(code: str, language: str = "python") -> str:
"""Execute code and return the output."""
# Your sandboxed execution environment here
return "Execution result"
# Create an agent with tool access
agent = model_client.as_agent(
name="research_agent",
instructions="""You are a research agent. Use available tools to gather
information and synthesize answers. When you have a complete answer, say DONE.""",
tools=[search_web, run_code],
)
async def main():
result = await agent.run(
"Research the current state of GPU cloud pricing for LLM inference."
)
return result
asyncio.run(main())Tool definitions flow as standard OpenAI function-calling JSON. MAF generates the schema from the function signature and docstring, sends it to vLLM in the tools field of the chat completion request, and vLLM routes the model's tool call response back in the standard tool_calls format.
For MCP integration, MAF's built-in MCP client connects to any MCP server. Point it at your self-hosted GPU-backed MCP server for inference-heavy tools (document processing, embedding, reranking), or at filesystem and browser MCP servers for standard agent tool use. The MCP connection adds no overhead to the inference path since tool execution and model inference run on separate infrastructure.
Multi-Agent Graph Workflows and the Agent Runtime
The supervisor/worker topology is the most common production pattern for MAF at scale. A supervisor agent handles routing decisions (a relatively cheap inference call), while specialized worker agents handle domain-specific tasks (more expensive, higher quality).
Example setup: supervisor on Llama 3.3 70B (H100 SXM5), worker agents on Qwen3-32B (L40S):
import asyncio
from agent_framework import WorkflowBuilder, tool
from agent_framework.openai import OpenAIChatCompletionClient
# search_web and run_code defined above (see single-agent example)
supervisor_client = OpenAIChatCompletionClient(
model="meta-llama/Llama-3.3-70B-Instruct",
base_url="http://SUPERVISOR_GPU_IP:8000/v1",
api_key="placeholder",
)
worker_client = OpenAIChatCompletionClient(
model="Qwen/Qwen3-32B",
base_url="http://WORKER_GPU_IP:8000/v1",
api_key="placeholder",
)
supervisor = supervisor_client.as_agent(
name="supervisor",
instructions="""You coordinate the research and analyst agents.
Route tasks to the appropriate agent based on what needs to be done.""",
)
research_agent = worker_client.as_agent(
name="researcher",
instructions="You gather information from external sources.",
tools=[search_web],
)
analysis_agent = worker_client.as_agent(
name="analyst",
instructions="You analyze data and produce structured summaries.",
tools=[run_code],
)
# MAF WorkflowBuilder: define the graph topology with edges from supervisor to workers
builder = WorkflowBuilder(start_executor=supervisor)
builder.add_edge(supervisor, research_agent)
builder.add_edge(supervisor, analysis_agent)
workflow = builder.build()
async def main():
async for event in workflow.run(
"Research and analyze the current state of GPU cloud pricing for LLM inference.",
stream=True,
):
if event.type == "output":
print(f"Completed: {event.data}")
asyncio.run(main())For CodeAct agents that write and run code directly, the pattern is similar but agents generate Python execution steps rather than tool call JSON. The agent runtime provides process isolation for code execution, which matters when you are running untrusted agent-generated code at scale.
For KV cache sizing across a multi-agent graph, the multi-agent GPU infrastructure guide covers the math in depth. The short version: each concurrent agent session holds its own KV cache in VRAM. At 32K context per session with Llama 3.3 70B in FP8 on H100 SXM5, you can run roughly 8-12 concurrent agent sessions before KV cache pressure starts degrading TTFT.
For autoscaling patterns when agent fleet size varies with traffic, see autoscaling agent fleets for the triggering metrics and the spot/on-demand split strategy.
Latency and Concurrency for Agent Fleets
Time-to-first-token requirements vary by agent tier:
Interactive agents (user is watching in real time): TTFT under 500ms. For a 32K input prefill on Llama 3.3 70B, H100 SXM5 achieves roughly 300-400ms TTFT at typical load. On a busier instance with full KV cache, TTFT can stretch to 800-1,200ms. If you need consistent sub-500ms for interactive use, keep the H100 instance below 60% KV cache utilization.
Background agents (batch processing, overnight pipelines): TTFT can reach 5-10 seconds without user impact. Use spot instances for these tiers to cut cost.
vLLM's continuous batching handles N concurrent MAF agent calls sharing one GPU. There is no scheduler overhead between calls; requests enter the batch as GPU cycles free up. The practical concurrency limit comes from KV cache:
Available KV cache VRAM = Total VRAM - Model weight VRAM - System reserve
Max concurrent sessions = Available KV cache VRAM / (KV cache per session at max context)For H100 SXM5 80GB with Llama 3.3 70B FP8 (~70GB weights):
- Available for KV cache: ~8GB (tight, increase GPU count for more sessions)
- For production use with more than 8-10 concurrent sessions, run two H100s with tensor parallelism
For H100 SXM5 80GB with Qwen3-32B FP8 (~35GB weights):
- Available for KV cache: ~42GB
- At 32K context per session: roughly 40-64 concurrent sessions depending on prefix cache hit rate
For a deeper treatment of KV cache math and prefix caching configuration, the context engineering guide for production AI agents covers the sizing formulas and NVMe KV offload options for when VRAM runs out.
Cost Comparison: Self-Hosted vs Azure OpenAI
Scenario: 100 concurrent users, 50 requests/day each, 20,000 tokens per agent run (5-step ReAct, 50% input/output split).
Daily tokens: 100M. Monthly: 3B.
| Scenario | Azure gpt-4o (~$5/$15 per 1M) | Azure gpt-4o-mini (~$0.15/$0.60 per 1M) | Spheron H100 + Llama 3.3 70B ($4.41/hr) |
|---|---|---|---|
| 10M tokens/day | ~$100/day | ~$3.75/day | $105.84/day (flat) |
| 100M tokens/day | ~$1,000/day | ~$37.50/day | $105.84/day |
| Monthly at 100M/day | ~$30,000 | ~$1,125 | ~$3,175 |
For gpt-4o class quality at agent-scale volumes: the H100 crossover happens around 10M tokens/day. Above that, self-hosting wins on cost while delivering a larger, higher-quality model.
For gpt-4o-mini equivalent tasks using Qwen3-32B on L40S ($0.96/hr = $23.04/day):
| Scenario | Azure gpt-4o-mini | Spheron L40S + Qwen3-32B ($0.96/hr) |
|---|---|---|
| 10M tokens/day | ~$3.75/day | $23.04/day (Azure wins) |
| 100M tokens/day | ~$37.50/day | $23.04/day (self-hosting wins) |
| Monthly at 100M/day | ~$1,125 | ~$691 |
The L40S crossover for gpt-4o-mini equivalent tasks falls around 60-65M tokens/day. Below that threshold, using Azure gpt-4o-mini directly is cheaper. Above it, self-hosting wins and keeps winning as volume grows.
Note that these comparisons assume 24/7 GPU utilization for the self-hosted case. If your workload is bursty (business hours only), per-minute billing on Spheron means you only pay for the hours you use, which shifts the crossover point lower.
Azure OpenAI prices change without notice. The figures above are based on publicly available pricing as of 30 Jun 2026 and are used for illustrative comparison only. Check the Azure pricing page for current rates.
Pricing fluctuates based on GPU availability. The prices above are based on 30 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
What to Try First
The lowest-risk entry point is running one agent's model backend on a self-hosted endpoint. Keep the MAF agent code unchanged, swap the model client from Azure to your vLLM server, and run your existing test suite against the new endpoint. If accuracy is acceptable (check your task completion rate, not just unit tests), expand to the full agent fleet.
The configuration change is three lines of Python. The infrastructure change is one Docker command. The cost difference at scale is substantial.
MAF agent workloads that run at scale need a backend that can absorb concurrent tool-calling requests without blowing the token budget. An on-demand H100 SXM5 on Spheron handles the model weight and KV cache for a Llama 3.3 70B instance serving dozens of agent sessions simultaneously.
Check H100 availability → | L40S for smaller models → | View live GPU pricing →
Quick Setup Guide
Log in to the Spheron dashboard and select the GPU instance sized to your model. For Qwen3-32B in FP8: one L40S 48GB. For Llama 3.3 70B in FP8: one H100 SXM5 80GB. For Llama 4 Scout in INT4: one H100 SXM5 80GB (Scout's 109B total parameters require ~109GB in FP8, which exceeds 80GB VRAM; INT4 brings it to ~55-60GB). For multi-agent supervisor plus multiple worker agents: consider separate instances per tier. SSH in and confirm VRAM with nvidia-smi before proceeding.
Pull and run the vLLM OpenAI server container with Docker: docker run --gpus all --ipc=host -p 8000:8000 -e HF_TOKEN=$HF_TOKEN vllm/vllm-openai:latest --model Qwen/Qwen3-32B --dtype fp8 --gpu-memory-utilization 0.92 --max-model-len 32768 --max-num-seqs 64 --enable-auto-tool-choice --tool-call-parser hermes. The --enable-auto-tool-choice and --tool-call-parser flags are required for agent tool-calling workloads. Wait for 'Application startup complete' in logs before connecting.
Install the MAF OpenAI provider: pip install agent-framework-openai. Configure the model client to point at your vLLM server: from agent_framework.openai import OpenAIChatCompletionClient; model_client = OpenAIChatCompletionClient(model='Qwen/Qwen3-32B', base_url='http://YOUR_GPU_IP:8000/v1', api_key='placeholder'). Create agents with model_client.as_agent(name='my_agent', instructions='...', tools=[...]) instead of the Azure client.
Decorate tool functions with @tool from agent_framework. Pass them to model_client.as_agent() via the tools parameter. For MCP server integration, use MAF's built-in MCP client to connect to any MCP server running alongside your inference backend. The tool call JSON flows through the OpenAI wire format that vLLM already handles, so no extra configuration is needed at the inference layer.
Instrument the vLLM server with Prometheus metrics (exposed at /metrics by default) and scrape into Grafana for token throughput, TTFT, and GPU utilization. Set autoscaling triggers on queue depth rather than CPU: when pending requests exceed your TTFT budget, add another GPU instance and register it with your load balancer. Use on-demand instances for interactive agent sessions and spot for batch agent pipelines, using MAF's state checkpointing to resume from the last completed step after a spot eviction.
Frequently Asked Questions
Microsoft Agent Framework 1.0 (MAF) is the unified successor to AutoGen and Semantic Kernel, which reached 1.0 GA in April 2026. The framework ships a production agent runtime for managing agent lifecycles, CodeAct mode where agents write and execute Python rather than emit JSON tool calls, built-in multi-agent graph with supervisor routing, first-class MCP client integration, and hosted agent support.
Yes. MAF decouples the agent orchestration layer from the model backend. You configure a model client pointing at any OpenAI-compatible endpoint using the base_url and api_key parameters. A vLLM server running on GPU cloud exposes the exact same REST API surface as OpenAI, so MAF treats it identically. No Azure subscription or Azure OpenAI resource is required.
For general-purpose agent workloads with strong tool-calling accuracy: Qwen3-32B in FP8 on an L40S 48GB for cost efficiency, or Llama 3.3 70B in FP8 on an H100 SXM5 80GB for higher quality. For multi-agent supervisor roles requiring complex reasoning: Llama 3.3 70B or Llama 4 Scout on H100 SXM5. For high-throughput worker agents where cost per call matters most: Phi-5 in FP8 on an L40S. Qwen3 models have the strongest function-calling and ReAct accuracy among publicly benchmarked open-weight models as of mid-2026.
MAF uses a model client abstraction. Instead of AzureOpenAIChatCompletionClient, use OpenAIChatCompletionClient with base_url set to your vLLM server address (e.g. http://your-gpu-ip:8000/v1) and api_key set to any non-empty string (vLLM does not validate it by default). Set model to the model ID you loaded in vLLM (e.g. Qwen/Qwen3-32B). Create the agent with model_client.as_agent(name=..., instructions=..., tools=[...]). Everything else, including tool definitions and streaming, works through the standard OpenAI wire format.
At 10M tokens per day, Azure gpt-4o costs roughly $100/day while an H100 SXM5 on Spheron running Llama 3.3 70B costs about $105.84/day (running 24/7), making it essentially break-even with a much larger model. At 100M tokens/day, gpt-4o reaches $1,000/day while the H100 stays at $105.84/day, a 9x cost advantage. For smaller worker agents using Qwen3-32B on an L40S ($0.96/hr = $23.04/day), self-hosting beats gpt-4o-mini above roughly 60M tokens per day.
