Finance and healthcare teams with data-residency requirements cannot send documents to the Cohere API. Command A's 256K context window and reliable tool calling make it the strongest open-weight option for that constraint, but self-hosting 111 billion parameters on GPU infrastructure is not trivial. This guide covers the hardware sizing, verified vLLM deployment commands for Spheron H200 nodes, a full RAG pipeline with Cohere Embed v4 and Rerank 3, and the cost math comparing self-hosting against the managed API.
What Is Cohere Command A
Command A is Cohere's flagship open-weight language model, released in early 2025. It has 111 billion parameters and a 256K token context window, making it one of the few open-weight models capable of processing full legal contracts, lengthy research corpora, or multi-session conversation histories without chunking.
The model uses a modified transformer architecture with improved attention efficiency for long contexts. Cohere trained it on multilingual data spanning 23 languages, with particular focus on enterprise retrieval, tool calling, and agentic task planning.
The model ID on Hugging Face is CohereForAI/c4ai-command-a-03-2025. Access is gated by a license agreement review, which typically takes 24-48 hours for enterprise accounts.
Key improvements over Command R+:
- 256K context vs 128K in Command R+
- More reliable structured output for tool calls (the preamble-based format is less prone to malformed JSON than Command R+'s function call format)
- Better performance on multi-hop agentic tasks where the model needs to decide what to retrieve next
- Higher throughput per GPU at equivalent precision due to architectural improvements
For the embedding and reranking steps in a full RAG stack, see the Cohere Embed v4 and multimodal embedding guide for deployment details.
Command A vs Command R+ vs Llama 4 vs Qwen 3.6: RAG and Agentic Benchmarks
| Model | Params | Context | MMLU | MATH | BFCL Tool Calling | RAG BEIR nDCG@10 | License |
|---|---|---|---|---|---|---|---|
| Cohere Command A | 111B | 256K | 89.5 | 75.2 | 80.7 | 51.4 | CC-BY-NC 4.0 |
| Cohere Command R+ | 104B | 128K | 75.7 | 60.3 | 64.8 | 54.0 | CC BY-NC 4.0 |
| Llama 4 Scout | 109B (17Bx16E) | 128K | 79.6 | 72.1 | 70.4 | 46.3 | Llama 4 Community |
| Qwen 3.6 | 235B A22B | 128K | 87.0 | 85.4 | 74.2 | 47.8 | Apache 2.0 |
MMLU and MATH figures are from published technical reports. BFCL (Berkeley Function Calling Leaderboard) and BEIR scores reflect model-card benchmarks at release.
Command A leads on tool calling reliability and long-context RAG accuracy. That BFCL score of 80.7 matters in production: agentic RAG pipelines that chain 4-7 tool calls per user turn fail more often at the function-call formatting layer than at the retrieval or reasoning layers, and Command A's structured output is noticeably more consistent than its predecessors.
Command R+ still leads on BEIR RAG score (54.0 vs 51.4) because it was specifically trained on dense retrieval tasks. Command A closes much of that gap while adding the 256K context window, which changes how you can structure your retrieval pipeline entirely.
Llama 4 Scout and Qwen 3.6 are competitive on pure reasoning benchmarks (especially MATH, where Qwen 3.6 at 85.4 outperforms everything in this table). For teams without a Cohere embedding stack and without data-residency requirements, Llama 4 Scout on a single H200 is often the lower-friction choice: permissive license, lower VRAM at FP16 (active 17B parameters), and strong general-purpose quality.
Hardware Requirements: VRAM Sizing at FP16, FP8, and AWQ INT4
| Precision | Model Size (GB) | KV Cache Headroom (GB) | Total VRAM Needed | Minimum GPU Config | Recommended Config |
|---|---|---|---|---|---|
| FP16 | ~222 GB | ~20 GB | ~242 GB | Dual H200 SXM5 (tp=2, 282GB total) | Dual H200 SXM5 (282GB total) |
| FP8 | ~111 GB | ~20 GB | ~131 GB | Single H200 SXM5 (141GB, fits) | Single H200 SXM5 FP8 |
| AWQ INT4 | ~56 GB | ~15 GB | ~71 GB | Single H100 SXM5 (80GB, tight) | Single H100 SXM5 with FP8 KV cache |
The VRAM math is straightforward:
# FP16: 111B parameters × 2 bytes = 222GB
# FP8: 111B parameters × 1 byte = 111GB
# INT4: 111B parameters × 0.5 bytes = 55.5GB
model_vram_gb = (params_in_billions * bytes_per_param)The KV cache headroom is where things get complicated at Command A's 256K context. At full 256K context with batch size 8, FP16 attention KV cache can run 40-80GB. This means:
- On a single H200 at FP8, do not start with
--max-model-len 262144. Start at 65536 and increase while monitoringnvidia-smiVRAM utilization after each increase. Full 256K context in production needs the KV cache itself quantized with--kv-cache-dtype fp8. - On dual H100 at FP8, the practical max-model-len is around 32768. The KV cache at longer contexts would consume the remaining VRAM budget. FP16 is not viable on dual H100 — 222GB weights exceed 160GB combined VRAM.
- On B200 with 192GB, FP16 + a 256K context window becomes feasible with careful tuning, but FP8 is still recommended for headroom.
For a detailed GPU-by-GPU spec comparison, see H100 vs H200 GPU specs and benchmarks and H200 vs B200 vs GB200 comparison.
Cohere Command A License: What Enterprise Self-Hosting Means
Command A is released under CC-BY-NC 4.0 (Cohere Labs Acceptable Use Policy). This is not Apache 2.0. The key practical implications:
- Internal enterprise use is permitted without a separate agreement with Cohere.
- Commercial redistribution, including serving Command A as a public API product or including it in a SaaS offering to third parties, requires a separate commercial agreement with Cohere.
- Fine-tuned weight publishing also requires Cohere's approval.
For data residency, self-hosting is the cleanest path. When you run Command A on your own GPU infrastructure, all inference happens on your hardware. No document content, no user queries, and no responses transit Cohere's infrastructure. This is what makes it viable for HIPAA, FedRAMP, and EU AI Act Article 22 scenarios.
By contrast, the Cohere API routes all data through Cohere's infrastructure. It's the right choice for low-volume or prototype workloads, but it creates a compliance gap for regulated industries.
Practical first step: go to the Hugging Face model page for CohereForAI/c4ai-command-a-03-2025 and submit the access request before provisioning hardware. Weight download time after approval is 3-4 hours on a 10Gbps connection (the full FP16 download is ~222GB). Plan accordingly.
Deploy Command A with vLLM on Spheron H200
This is the primary recommended deployment path: a single H200 SXM5 node at FP8. The H200's 141GB HBM3e and 4.8TB/s memory bandwidth give you the best cost-per-token ratio in the current Spheron catalog for this model size.
Prerequisites
- A Spheron H200 SXM5 instance (141GB HBM3e), or dual H100 SXM5 nodes for the tensor-parallel path
- Ubuntu 22.04, CUDA 12.4, Python 3.10+
- Hugging Face access approved for
CohereForAI/c4ai-command-a-03-2025 - ~500GB attached storage for model weights (recommend NVMe, not network-attached, for load time)
- vLLM 0.8.x (specifically
>=0.8.5for Cohere tool-call parser support)
Install and Download Weights
pip install "vllm>=0.8.5" hf-transfer
# Authenticated download
huggingface-cli login --token $HF_TOKEN
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
CohereForAI/c4ai-command-a-03-2025 \
--local-dir /data/c4ai-command-ahf-transfer uses a Rust-based parallel downloader. Without it, a 222GB download on a 10Gbps link takes 3-4 hours. With it, the same download finishes in under 30 minutes.
Launch vLLM (H200 FP8, Single Node)
vllm serve /data/c4ai-command-a \
--dtype fp8 \
--max-model-len 65536 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--enable-chunked-prefill \
--max-num-seqs 32 \
--tool-call-parser cohere_command3 \
--port 8000--tool-call-parser cohere_command3 requires the cohere_melody package (pip install cohere_melody) — install it before starting vLLM.
What each flag does:
--dtype fp8: quantizes model weights to FP8, halving VRAM from ~222GB to ~111GB--kv-cache-dtype fp8: quantizes the KV cache, critical for any context over 32K--enable-chunked-prefill: for long documents, splits prefill into chunks so decode can overlap with prefill processing, improving TTFT under concurrent load--max-num-seqs 32: continuous batching queue depth; 32 is good for interactive RAG, increase to 64-128 for batch processing--tool-call-parser cohere_command3: enables structured tool-call parsing specific to Command A's format; without this flag, tool calls appear as raw text in the response body
Start --max-model-len at 65536. Once the server is stable and you've confirmed VRAM headroom via nvidia-smi, increase to 131072 or 262144 and monitor again.
Launch vLLM (Dual H100 FP8, Tensor Parallel)
vllm serve /data/c4ai-command-a \
--dtype fp8 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.88 \
--enable-chunked-prefill \
--max-num-seqs 16 \
--tool-call-parser cohere_command3 \
--port 8000FP8 is required for dual H100. FP16 weights total ~222GB, which exceeds 160GB combined VRAM even before the KV cache — the FAQ confirms this explicitly. With FP8, model weights drop to ~111GB, split evenly at ~55GB per GPU, leaving ~25GB per GPU for the KV cache. --max-model-len 32768 is conservative for this configuration. Install cohere_melody before starting (pip install cohere_melody).
Validate the Endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/c4ai-command-a",
"messages": [{"role": "user", "content": "What is tensor parallelism?"}],
"max_tokens": 200
}'If you get a connection refused immediately, the model is still loading. vLLM loads all weights before accepting connections, which takes 8-15 minutes for Command A on first start. Watch vllm serve stdout for the "Loaded model" log line before sending requests.
For a broader vLLM production configuration guide, see the vLLM production deployment guide.
Deploy with SGLang and TensorRT-LLM (Alternative Serving Engines)
SGLang
SGLang's RadixAttention makes it particularly good for agentic RAG workloads where multiple requests share a common system prompt or document prefix. If your application sends the same 10K-token policy document as a prefix to every query, SGLang caches that prefix in the radix tree and subsequent requests skip the prefill step entirely for that shared portion.
Install and launch:
pip install "sglang[all]>=0.4.0"
python -m sglang.launch_server \
--model-path /data/c4ai-command-a \
--dtype fp8 \
--tp 1 \
--mem-fraction-static 0.85 \
--port 30000 \
--chat-template cohere \
--enable-torch-compileThe --chat-template cohere flag tells SGLang to use Cohere's tool-use template. If your SGLang version does not recognize cohere as a named template, remove the flag — SGLang 0.4.0+ automatically loads the chat template from the model's tokenizer_config.json, which has the same effect. Without a Cohere-specific template, Command A's tool-use blocks will not parse correctly. Verify with the same curl test from the vLLM section, pointing to port 30000.
SGLang's GPU memory management (--mem-fraction-static) controls the fraction reserved for the KV cache pool. Start at 0.85 and increase if you see OOM errors on long contexts. For the full SGLang production configuration guide, see SGLang deployment guide.
TensorRT-LLM
TRT-LLM gives the highest peak throughput for Command A, but the engine-build step costs 45-90 minutes on an H200. The resulting engine is hardware-specific: an engine built for H200 will not run on H100 without a rebuild.
# Install TRT-LLM
pip install tensorrt-llm==0.17.0
# Convert checkpoint to TRT-LLM format with AWQ INT4 quantization
python -c "
import tensorrt_llm
from tensorrt_llm.quantization import QuantAlgo
from tensorrt_llm.models import CohereForCausalLM
# Convert and quantize (45-90 min)
CohereForCausalLM.convert_and_build(
model_dir='/data/c4ai-command-a',
output_dir='/data/c4ai-command-a-trt-int4',
max_batch_size=32,
max_input_len=8192,
max_seq_len=16384,
quantization=QuantAlgo.W4A16_AWQ,
)
"
# Serve the built engine
python -m tensorrt_llm.serve \
--engine_dir /data/c4ai-command-a-trt-int4 \
--port 8000INT4 AWQ quantization brings Command A's memory footprint to ~56GB, which fits on a single H100 SXM5. Peak output throughput on H200 with a TRT-LLM INT4 engine typically reaches 3,000-4,000 tok/s for this model class, compared to ~2,200 tok/s with vLLM FP8.
For throughput comparisons across all three engines, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
Tool Calling and Function Call Setup for Command A
Command A uses a preamble-based system prompt format for tool definitions, not the OpenAI functions JSON-schema format. The tools go into the system message, and the model returns tool_calls blocks in the response. vLLM 0.8.x handles the format conversion automatically when --tool-call-parser cohere_command3 is active.
Here is a working Python example using the openai client library against vLLM:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
tools = [
{
"type": "function",
"function": {
"name": "search_documents",
"description": "Search the document store for relevant passages",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"top_k": {"type": "integer", "default": 5}
},
"required": ["query"]
}
}
}
]
response = client.chat.completions.create(
model="/data/c4ai-command-a",
messages=[{"role": "user", "content": "Find information about transformer architecture"}],
tools=tools,
tool_choice="auto"
)
# Parse tool call from response
if response.choices[0].message.tool_calls:
tool_call = response.choices[0].message.tool_calls[0]
print(f"Function: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")The most common failure mode: if --tool-call-parser cohere_command3 is missing from the vLLM launch command, Command A generates the tool call as raw text in message.content instead of a structured tool_calls block. The client then receives what looks like a normal text response with embedded XML-style tool syntax. The fix is to restart vLLM with --tool-call-parser cohere_command3 (and pip install cohere_melody if not already installed) and not to add custom chat template handling that conflicts with the built-in Cohere parser.
For multi-tool agentic loops, pass the previous assistant message (including tool_calls) back to the model along with the tool result. Command A handles multi-turn tool conversations reliably up to ~10 tool call rounds in our testing.
RAG Pipeline: Command A + Cohere Embed v4 + Rerank 3 on a Single GPU Cluster
The full data flow:
User query
→ Cohere Embed v4 (encode query, 1024-dim vector (Matryoshka, also supports 256/512/1536))
→ Qdrant / Milvus (ANN search, return top-50 passages)
→ Cohere Rerank-v3.5 (score top-50, return top-10)
→ Command A (256K context: system prompt + top-10 passages + query)
→ Grounded response with citationsFor a single H200 node deployment, the typical allocation:
- Command A: vLLM using 85% of GPU memory (
--gpu-memory-utilization 0.85) - Cohere Embed v4: via the Infinity embedding server in a separate Docker container. For low-to-medium batch sizes, Embed v4 can run on the remaining GPU memory. For high-throughput embedding, provision a second H100 or L40S specifically for the embedding workload.
- Rerank-v3.5: runs on CPU for under 100ms latency on 50-passage batches. Move to GPU if reranking becomes a bottleneck under high concurrency.
Docker Compose for all three services:
version: "3.8"
services:
command-a:
image: vllm/vllm-openai:latest
command: >
--model /data/c4ai-command-a
--dtype fp8
--kv-cache-dtype fp8
--max-model-len 65536
--gpu-memory-utilization 0.85
--enable-chunked-prefill
--max-num-seqs 32
--tool-call-parser cohere_command3
--port 8000
volumes:
- /data:/data
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
ports:
- "8000:8000"
embed-v4:
# Embed v4 requires Infinity with multimodal support (michaelf34/infinity:latest or later)
image: michaelf34/infinity:latest
command: >
v2
--model-id Cohere/embed-v4.0
--port 7997
--engine torch
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
ports:
- "7997:7997"
rerank:
image: michaelf34/infinity:latest
command: >
v2
--model-id Cohere/rerank-v3.5
--port 7998
--engine torch
ports:
- "7998:7998"Full pipeline Python call sequence:
import cohere
from openai import OpenAI
import qdrant_client
from qdrant_client.models import Distance, VectorParams
# Clients
co = cohere.Client(base_url="http://localhost:7997", api_key="none")
reranker_co = cohere.Client(base_url="http://localhost:7998", api_key="none")
llm_client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
qdrant = qdrant_client.QdrantClient(host="localhost", port=6333)
def rag_query(user_query: str) -> str:
# Step 1: Embed query
query_embedding = co.embed(
texts=[user_query],
model="embed-v4.0",
input_type="search_query"
).embeddings[0]
# Step 2: Retrieve top-50 from Qdrant
search_results = qdrant.search(
collection_name="documents",
query_vector=query_embedding,
limit=50
)
passages = [r.payload["text"] for r in search_results]
# Step 3: Rerank to top-10
rerank_results = reranker_co.rerank(
model="rerank-v3.5",
query=user_query,
documents=passages,
top_n=10
)
top_passages = [passages[r.index] for r in rerank_results.results]
# Step 4: Generate with Command A
context = "\n\n".join(top_passages)
response = llm_client.chat.completions.create(
model="/data/c4ai-command-a",
messages=[
{
"role": "system",
"content": f"Answer using only the following context:\n\n{context}"
},
{"role": "user", "content": user_query}
],
max_tokens=1024
)
return response.choices[0].message.contentFor a deeper treatment of the full stack architecture, see the agentic RAG GPU infrastructure guide and the self-hosted vector database deployment guide.
Inference Latency Tuning: Continuous Batching, KV Cache, Chunked Prefill
Three levers matter most for Command A in production:
1. Continuous batching (--enable-chunked-prefill --max-num-seqs 32)
vLLM's continuous batching allows new requests to enter the batch mid-generation rather than waiting for the current batch to complete. This avoids head-of-line blocking, where one 200K-token context request stalls all shorter requests behind it.
For interactive RAG (P95 latency target below 2 seconds): --max-num-seqs 16-32. For offline batch processing where throughput matters more than latency: --max-num-seqs 64-128.
2. KV cache quantization (--kv-cache-dtype fp8)
At 256K context, the KV cache is the dominant memory consumer, not the model weights. Without FP8 KV cache, a single 256K-token request can consume 60-80GB of VRAM just for attention keys and values. With --kv-cache-dtype fp8, that drops to 30-40GB. This doubles the context length you can serve at a given VRAM budget, with less than 1% quality regression in practice.
Measure TTFT before and after enabling FP8 KV cache on your workload. If you see a regression, compare generation quality on 5-10 sample queries to confirm the quality impact is acceptable.
3. Chunked prefill (--enable-chunked-prefill --max-chunked-prefill-size 512)
For requests with 50K-256K input tokens, the prefill stage (processing the input) can block all concurrent requests for 2-10 seconds. Chunked prefill splits the input into 512-token chunks and interleaves prefill work with decode, so concurrent shorter requests continue generating while the long input is processed.
The tradeoff: chunked prefill increases total prefill latency slightly (the input takes longer to process overall) but dramatically reduces P99 latency for concurrent shorter requests. For RAG pipelines where most requests have long document contexts, this tradeoff is strongly in your favor.
| Setting | Default | Recommended for Interactive (P95 latency) | Recommended for Batch (throughput) |
|---|---|---|---|
--max-num-seqs | 256 | 16-32 | 64-128 |
--kv-cache-dtype | fp16 | fp8 | fp8 |
--max-chunked-prefill-size | 512 | 256-512 | 1024-2048 |
--gpu-memory-utilization | 0.9 | 0.88-0.90 | 0.92-0.95 |
--max-model-len | model max | 65536 (start here) | 131072+ (if VRAM allows) |
Cost Comparison: Self-Hosted Command A vs Cohere API
Cohere API pricing for Command A as of June 2026: $2.50/M input tokens and $10.00/M output tokens. These figures are from Cohere's published pricing page. Check cohere.ai/pricing for current rates as they change over time.
Spheron GPU costs from live pricing at the time of writing:
| Deployment | GPU Config | $/hr (spot) | Throughput (tok/s, output) | $/M output tokens | $/M input tokens | Best For |
|---|---|---|---|---|---|---|
| Cohere API | Managed | N/A | N/A | $10.00 | $2.50 | Low-volume or no-GPU teams |
| Spheron H100 spot (dual, FP8) | 2x H100 SXM5 | $2.98/hr | ~1,800 tok/s | ~$0.46 | ~$0.06 | Cost-sensitive with 300M+ tokens/month |
| Spheron H200 spot (single, FP8) | 1x H200 SXM5 | $3.31/hr | ~2,200 tok/s | ~$0.42 | ~$0.05 | Primary self-host path: best cost/performance |
| Spheron B200 (single, FP8) | 1x B200 SXM6 | $2.68/hr | ~3,200 tok/s | ~$0.23 | ~$0.03 | Highest throughput single node |
Pricing fluctuates based on GPU availability. The prices above are based on 04 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
Break-even calculation for H200 spot vs Cohere API:
Monthly H200 cost = $3.31/hr × 24hr × 30 days = $2,383/month
Output token savings per million = $10.00 - $0.42 = $9.58/M
Input token savings per million = $2.50 - $0.05 = $2.45/M
# For a workload with 3:1 input:output ratio:
# Per 1M processed tokens: 750K input + 250K output
# Cohere API cost: (0.75 × $2.50) + (0.25 × $10.00) = $4.375/M
# Self-hosted H200: ~$0.30/M (combined input+output at ~3,000 mixed tok/s)
# Savings per 1M tokens: $4.375 - $0.30 = $4.075/M
monthly_break_even = $2,383 / $4.075 × 1,000,000 = ~585M tokens/monthFor a simpler approximation: if your team sends over 400M tokens per month (output-weighted workloads) or 600M tokens per month (input-heavy document processing), self-hosted H200 pays for itself. Below those volumes, the Cohere API has lower total cost of ownership when you factor in engineering time.
For a broader cloud pricing comparison, see GPU cloud pricing comparison 2026.
When to Use Cohere API vs Self-Hosted Command A vs Alternative Open-Weight Models
| Scenario | Recommended Choice | Reason |
|---|---|---|
| Starting out, under 100M tokens/month | Cohere API | No GPU management overhead, lower total cost |
| Enterprise, data residency required | Self-hosted Command A on H200 | All inference stays on your infrastructure |
| High throughput, cost-sensitive (over 500M tokens/month) | Self-hosted on Spheron H200/B200 spot | ~95% cost reduction vs managed API at scale |
| Need 70B-class quality at lower cost | Llama 4 Scout or Qwen3 72B | No Cohere license dependency, lower VRAM requirements |
| Existing Cohere Embed + Rerank stack | Self-hosted Command A | Native format compatibility, no translation layer needed |
| Need over 256K context | Command A | One of few open-weight options at this context length |
Regulated enterprises with existing Cohere embedding pipelines get the strongest ROI from self-hosting Command A. The native compatibility between Command A, Cohere Embed v4, and Rerank 3 means no format translation between components, which eliminates a common source of subtle retrieval quality bugs.
Teams without a Cohere dependency and without data-residency requirements should evaluate via Ollama or vLLM before committing to Command A. Llama 4 Scout's mixture-of-experts architecture activates only 17B parameters per token, giving much lower serving cost at quality that's competitive for most RAG tasks.
For NIM-based enterprise deployments, the NVIDIA NIM self-host deployment guide covers an alternative stack that pairs NVIDIA's LLM NIMs with NeMo Retriever for teams that prefer an NVIDIA-supported container stack over direct model hosting.
Cohere Command A gives enterprise teams 256K context and built-in tool calling without sending data to a managed API. Run it on a single H200 node at FP8 for the best cost-per-token among enterprise-grade open-weight models.
Quick Setup Guide
Calculate total VRAM needed: (111B parameters × 2 bytes for FP16) = ~222GB, (× 1 byte for FP8) = ~111GB, (× 0.5 bytes for AWQ INT4) = ~56GB. Add 15-20% headroom for KV cache. Map to hardware: FP16 needs ~242GB total VRAM — dual H200 SXM5 (282GB total, tp=2) is the minimum. Dual H100 SXM5 (160GB total) cannot fit FP16 weights; use FP8 on dual H100 with tensor parallelism tp=2 instead. Single H200 (141GB) needs FP8; B200 (192GB) fits FP16. Choose FP8 on H200 for best cost-to-performance.
Visit the Cohere Command A Hugging Face repository (CohereForAI/c4ai-command-a-03-2025), review the CC-BY-NC 4.0 (Cohere Labs Acceptable Use Policy) license terms, and request access if gated. Once approved, run huggingface-cli login and huggingface-cli download CohereForAI/c4ai-command-a-03-2025 to pull weights to your GPU node. Weights total approximately 222GB in FP16.
Log into app.spheron.ai, navigate to GPU Cloud, and provision an H200 SXM5 (141GB) or dual H100 SXM5 node. Select Ubuntu 22.04 with CUDA 12.4. Once the node is live, SSH in and verify VRAM with nvidia-smi. For the H200 path use a single node; for dual H100 ensure both GPUs are visible and NVLink bandwidth is available.
Install vLLM 0.8+ with pip install vllm cohere_melody. For H200 single-node FP8: run vllm serve CohereForAI/c4ai-command-a-03-2025 --dtype fp8 --tensor-parallel-size 1 --max-model-len 65536 --gpu-memory-utilization 0.90 --tool-call-parser cohere_command3. For dual H100 FP8: add --tensor-parallel-size 2 --kv-cache-dtype fp8. Enable the OpenAI-compatible endpoint.
Send a test request to /v1/chat/completions using the Cohere chat format. Include tools in the messages array using the preamble-tool format specific to Command A. Verify the response contains tool_use blocks. Test with a retrieval-style query using a dummy Qdrant or FAISS retriever to confirm end-to-end function calling works before wiring in production data.
Deploy Cohere Embed v4 as a separate container using the Infinity Embedding server (michaelf34/infinity container). Deploy Cohere Rerank-v3.5 similarly. Wire: embed query → retrieve from Qdrant/Milvus → rerank top-50 to top-10 → pass top-10 to Command A with 256K context window. Co-locate all containers on the same GPU node to minimize cross-host latency.
Enable continuous batching (--enable-chunked-prefill in vLLM), set --max-num-seqs 32, and tune --gpu-memory-utilization to 0.92. Run load tests with a realistic input/output token ratio. Calculate cost per million tokens: (GPU hourly cost / tokens per hour) × 1,000,000. Compare against Cohere API rates for your expected monthly volume to determine the self-host crossover point.
Frequently Asked Questions
Command A has 111B parameters. At FP16 it needs approximately 222GB VRAM - requiring two H100 80GB nodes (160GB total is insufficient at FP16 without quantization). At FP8 quantization memory drops to ~111GB, fitting a single H200 SXM (141GB). With AWQ INT4 quantization memory drops further to ~56GB, which fits on a single H100 80GB with headroom for the KV cache.
At FP16, no - Command A needs ~222GB and a single H100 has 80GB. At FP8 it needs ~111GB, still exceeding a single H100. With AWQ INT4 quantization (~56GB) a single H100 SXM5 can serve Command A, leaving ~24GB for the KV cache. For production throughput, dual H100 with FP8 or a single H200/B200 at FP8 is the recommended setup.
Command R+ is Cohere's 104B retrieval-augmented generation model with 128K context and strong RAG performance. Command A is newer - 111B parameters with 256K context, improved tool calling, agentic capabilities, and better throughput efficiency. Command A is the preferred choice for new deployments.
Yes. vLLM 0.8+ supports Cohere's chat template including tool-use and function-calling modes. You need to pass --chat-template and set the tokenizer to use Cohere's official chat template. Command A uses a distinct tool-use format compared to OpenAI function calling - the prompt must include the tool definitions in the correct system-level format.
Cohere API prices Command A at $2.50/M input tokens and $10.00/M output tokens (as of June 2026). On Spheron H200 at spot pricing (~$3.31/hr from live API), running at ~2,200 tokens/second throughput gives a per-million-output-token cost of roughly $0.42/M. The crossover point is around 400-500M tokens/month for most enterprise workloads.
