Arcee AI's Trinity-Large-Thinking is a 400B sparse MoE reasoning model, licensed under Apache 2.0 and trained with agentic RL post-training to generate extended chain-of-thought before answering. It competes with proprietary frontier models on math, science, and coding benchmarks at a self-hosted cost that undercuts the Claude Opus 4.6 API significantly at scale. This guide covers the full deployment path: VRAM planning, vLLM setup with reasoning parser and auto tool-choice, and a concrete cost comparison against the Claude Opus 4.6 API. For MoE-specific inference behavior, expert routing, and interconnect saturation, the MoE inference optimization guide is essential reading before provisioning hardware.
What Is Arcee Trinity-Large-Thinking
Trinity-Large-Thinking is Arcee's reasoning-native variant of their Trinity-Large base model. Both share a 400B sparse MoE architecture; the Thinking variant is a separate fine-tuned checkpoint produced through agentic RL post-training using verifiable reward signals on coding, math, and tool-use tasks. The model generates internal chain-of-thought in <think>...</think> tags before producing a final answer, similar to DeepSeek R1's approach but with Arcee's own RL training pipeline.
Key architecture properties:
| Property | Trinity-Large-Thinking |
|---|---|
| Total parameters | 400B |
| Active parameters per token | 13B (4-of-256 expert routing) |
| Architecture | Sparse Mixture-of-Experts |
| Context window | 256K native (validated to 512K via NIAH; 128K on preview API, 262K on OpenRouter) |
| License | Apache 2.0 |
| HuggingFace ID | arcee-ai/Trinity-Large-Thinking (verify at release) |
Arcee published 13B active parameters per token via 4-of-256 expert routing. This is the core efficiency advantage: inference runs at 13B-active latency while drawing on 400B of trained knowledge. For VRAM planning purposes, all 400B expert weights must reside in VRAM regardless of which experts activate per forward pass. The routing sparsity improves speed, but it does not reduce memory requirements.
Trinity-Large vs Trinity-Large-Thinking: Trinity-Large is the base instruction-following checkpoint. Trinity-Large-Thinking is a separate fine-tuned checkpoint trained to reason before acting. You cannot toggle reasoning on and off at runtime on the base model; you need the Thinking checkpoint directly.
Benchmark comparison (scores below are from published Arcee and Anthropic sources; verify current numbers against official model cards):
| Benchmark | Trinity-Large-Thinking | Claude Opus 4.6 |
|---|---|---|
| AIME 2025 | 96.3 | ~80.0 |
| MMLU-Pro | 83.4 | ~87.2 |
| GPQA Diamond | 76.3 | ~74.9 |
| PinchBench (agentic) | 91.9 | 93.3 |
Sources: Trinity-Large-Thinking scores from Arcee's technical report and model card (arcee-ai/Trinity-Large-Thinking on HuggingFace). Claude Opus 4.6 PinchBench score from third-party coverage at launch; AIME/MMLU-Pro/GPQA figures from Anthropic's published benchmarks. Verify all numbers against the official model cards before citing in production.
The Apache 2.0 license is the key operational advantage over Claude Opus 4.6. There are no per-query API restrictions, no usage caps, and the weights stay on infrastructure you control. For regulated industries handling sensitive data, this matters as much as the price difference.
Hardware and VRAM Requirements
Unlike dense models where only loaded weight slices matter, sparse MoE models like Trinity-Large-Thinking require all expert weight matrices in VRAM simultaneously. The active parameter count per token is lower than 400B, but every expert must be resident because any expert can be selected by the routing function for any token. This is a hard constraint with no workaround.
VRAM formula: total_params × bytes_per_param × 1.15 (overhead) + KV_cache_budget
VRAM by precision:
| Precision | Bytes/Param | Weight VRAM | Runtime (15% overhead) | Notes |
|---|---|---|---|---|
| BF16 | 2 | ~800 GB | ~920 GB | Impractical for production |
| FP8 | 1 | ~400 GB | ~460 GB | Recommended for H100/H200 |
| W4A16 (INT4) | 0.5 | ~200 GB | ~230 GB | Lower cost; validate accuracy on reasoning tasks |
GPU configuration options:
| Config | Total VRAM | Quantization | KV Cache Headroom | Use Case |
|---|---|---|---|---|
| 8x H100 SXM5 | 640 GB | FP8 | ~180 GB | Production recommended |
| 4x H200 SXM5 | 564 GB | FP8 | ~104 GB | Production alternative |
| 4x H100 SXM5 | 320 GB | W4A16 | ~90 GB | Dev/low-concurrency |
| 3x H200 SXM5 | 423 GB | W4A16 | ~193 GB | Budget production |
| 2x H200 SXM5 | 282 GB | W4A16 | ~52 GB | Too tight for KV at scale; not recommended |
The SXM5 form factor is required for NVLink. NVLink 4.0 on H100 SXM5 runs tensor parallelism all-reduces at 900 GB/s; PCIe variants top out at 64 GB/s. MoE expert routing generates frequent all-to-all communication across GPUs, and the bandwidth difference is the bottleneck. Use SXM5 for production. PCIe H100 variants are usable for development and testing but will show significant throughput degradation under production load.
A note on INT4 for reasoning workloads: quantization errors compound through long thinking chains. A precision error that has minor impact on a 500-token standard response can cascade through a 20,000-token reasoning trace and shift logical inferences. FP8 causes less than 1-2% accuracy loss on math and coding benchmarks compared to BF16. W4A16 can cause 5-10% degradation on reasoning-specific benchmarks depending on calibration quality. Validate accuracy on your specific task set before deploying W4A16 to production, using the same approach described in the FP8 quantization guide.
You can provision H100 SXM5 on Spheron for the 8-GPU production configuration, or H200 SXM5 rental for the 4-GPU FP8 alternative with lower total node cost.
Step-by-Step vLLM Deployment
Step 1: Provision a Spheron multi-GPU instance
Log in at app.spheron.ai and go to GPU Cloud. Select H100 SXM5 with 8 GPUs for the FP8 production path, or H200 SXM5 with 4 GPUs. Use the PyTorch 2.6 / CUDA 12.4 base image. Provision at least 600 GB of persistent storage for model weights and vLLM cache.
After SSH access, verify NVLink topology:
nvidia-smi topo -mAll H100 SXM5 GPUs in the same node should show NV18 (NVLink) connections to each other. If you see PIX or NODE instead of NVLink connections, stop and contact support before downloading weights.
Step 2: Install vLLM and download weights
pip install vllm==0.23.0
export HF_TOKEN=your_huggingface_token
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download arcee-ai/Trinity-Large-Thinking \
--local-dir /models/trinity-large-thinkingThe FP8 checkpoint is approximately 400 GB. With HF_HUB_ENABLE_HF_TRANSFER=1, download speed on a 10 Gbps NIC should complete in 30-40 minutes.
Before downloading, verify the exact model ID on the official Arcee AI HuggingFace page. HuggingFace IDs sometimes change between announcement and release. If the ID differs from arcee-ai/Trinity-Large-Thinking, update the path accordingly.
Step 3: Launch vLLM with reasoning parser and auto tool-choice
For 8x H100 SXM5 at FP8:
vllm serve /models/trinity-large-thinking \
--served-model-name trinity-large-thinking \
--dtype fp8 \
--tensor-parallel-size 8 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--enable-chunked-prefill \
--kv-cache-dtype fp8_e5m2 \
--port 8000The --reasoning-parser deepseek_r1 argument tells vLLM to parse <think>...</think> blocks and expose the chain-of-thought separately in the response. The deepseek_r1 parser handles models using this tag convention. If Arcee has contributed a dedicated arcee parser by the time you deploy, prefer it. Run vllm --help | grep reasoning-parser to list parsers available at your installed version, or check the vLLM reasoning outputs documentation directly.
For W4A16 on 4x H100 SXM5, replace the flags:
vllm serve /models/trinity-large-thinking-awq \
--served-model-name trinity-large-thinking \
--quantization awq \
--tensor-parallel-size 4 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port 8000Use --quantization compressed_tensors if the provided checkpoint uses that format instead of AWQ. Check the model card for the checkpoint format at release.
Step 4: Transformers path for custom agentic loops
When you need step-level tensor hooks or custom sampling logic, the Transformers library gives direct access to model internals:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"arcee-ai/Trinity-Large-Thinking",
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
"arcee-ai/Trinity-Large-Thinking",
trust_remote_code=True,
torch_dtype=torch.float8_e4m3fn,
device_map="auto"
)The trust_remote_code=True flag is required because Trinity uses custom model code hosted on HuggingFace. This means the HuggingFace repository executes arbitrary Python code during model loading. Before using this in production environments handling sensitive data, audit the model files and pin to a specific commit hash rather than the latest main branch.
Use vLLM for production serving. Transformers does not implement continuous batching or paged attention; throughput drops significantly under concurrent requests compared to vLLM.
Step 5: Test reasoning output and tool-calling
Verify chain-of-thought parsing:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "trinity-large-thinking",
"messages": [{"role": "user", "content": "What is the sum of all prime numbers less than 50? Show your work."}],
"max_tokens": 8192
}' | python3 -c "
import sys, json
resp = json.load(sys.stdin)
msg = resp['choices'][0]['message']
reasoning = msg.get('reasoning_content')
if reasoning:
print('Reasoning trace present. Length:', len(reasoning))
print('Reasoning preview:', reasoning[:200])
print('Final answer:', msg.get('content', '').strip()[:200])
else:
print('No reasoning_content field found. Check --reasoning-parser configuration.')
print('Raw content:', msg.get('content', '')[:200])
"Test auto tool-choice:
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}]
response = client.chat.completions.create(
model="trinity-large-thinking",
messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
tools=tools,
tool_choice="auto",
max_tokens=4096
)
print(response.choices[0].message.tool_calls)Monitor GPU utilization and KV cache during inference:
nvidia-smi dmon -s u # GPU utilization; expect >80% during reasoning
# Or check vLLM Prometheus metrics:
curl -s http://localhost:8000/metrics | grep gpu_cache_usageAlert if vllm:gpu_cache_usage_perc exceeds 0.85.
Cost Analysis: Self-Hosting vs Claude Opus 4.6 API
Live pricing from the Spheron GPU API (DEDICATED on-demand offers only):
| Configuration | Cost | Per-GPU Rate | Estimated Throughput | Cost per 1M Output Tokens |
|---|---|---|---|---|
| Claude Opus 4.6 API | $25/M output | N/A | N/A | $25.00 |
| 8x H100 SXM5 on-demand | $31.36/hr | $3.92/hr/GPU | 600-900K tokens/hr | $35-52/M |
| 8x H100 SXM5 spot | $23.28/hr | $2.91/hr/GPU | 600-900K tokens/hr | $26-39/M |
| 4x H200 SXM5 on-demand | $14.80/hr | $3.70/hr/GPU | 700K-1M tokens/hr | $15-21/M |
| 4x H200 SXM5 spot | $7.04/hr | $1.76/hr/GPU | 700K-1M tokens/hr | $7-10/M |
Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.
At Claude Opus 4.6's $25/M output rate, the 8x H100 SXM5 on-demand path ($35-52/M) does not beat the API on a per-token basis. That node generates 600-900K output tokens per hour at $31.36/hr, while the same volume costs only $15-22.50 in API fees at $25/M. The cost advantage lies with the H200 configuration: the 4x H200 SXM5 node at $14.80/hr on-demand generates 700K-1M tokens/hr, putting the effective cost at $15-21/M, which undercuts the $25/M API rate. On spot ($7.04/hr), the H200 cost falls to $7-10/M, roughly 60-72% below the API price. For background on how reasoning model throughput affects per-token economics, see the reasoning model inference cost guide.
Breakeven in monthly output tokens: the 4x H200 node at $14.80/hr runs ~$10,656/month. Break-even against Claude Opus 4.6 at $25/M requires 10,656 / 25 × 1M = 426M output tokens per month. Below that, the API is cheaper per token; above it, the node wins. The 4x H200 node can produce up to 720M output tokens per month at peak (1M tokens/hr × 720 hours), so the 426M breakeven is physically reachable at roughly 60% utilization. The 8x H100 on-demand path has no breakeven: at $35-52/M cost-per-token versus the $25/M API rate, it costs more per token than the API at every utilization level. No volume justifies H100 on-demand over the Claude Opus 4.6 API on a pure cost-per-token basis.
In practice, teams running agentic workloads at scale typically reach 426M output tokens per month on the H200 path within weeks of production rollout. The Apache 2.0 license removes usage caps entirely, so there is no cost ceiling that triggers renegotiation as query volume grows.
For on-demand access to the exact GPU nodes used in this guide, the H100 SXM5 and H200 SXM5 product pages on Spheron list current availability and pricing.
Fine-Tuning and Quantizing Your Own Variant
Arcee trained Trinity-Large-Thinking using agentic RL post-training, specifically targeting verifiable reward signals on coding, math, and tool-use tasks. You can apply the same approach to tune the model for your specific domain. The GRPO fine-tuning guide covers the RL training setup in detail.
Hardware minimums for fine-tuning:
| Approach | Minimum Hardware | Notes |
|---|---|---|
| LoRA + GRPO (FP8 forward pass) | 8x H100 SXM5 | BF16 gradients for adapter; FP8 activations only |
| Full-parameter GRPO | 16x H100 SXM5 or 8x H200 SXM5 | Large batch required for stable RL signal |
For W4A16 quantization using AutoAWQ:
pip install autoawqfrom awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model = AutoAWQForCausalLM.from_pretrained(
"arcee-ai/Trinity-Large-Thinking",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
"arcee-ai/Trinity-Large-Thinking",
trust_remote_code=True
)
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./trinity-large-thinking-awq")One non-obvious requirement: the calibration dataset for W4A16 quantization of a reasoning model must include chain-of-thought examples, not just standard instruction data. Using a generic instruction calibration set (like wikitext or ShareGPT) leaves the quantization unaware of the <think> token distribution and tends to produce larger accuracy gaps on reasoning benchmarks than the numbers suggest for dense models. Generate or collect a calibration set that covers reasoning traces of similar length and structure to your production queries.
For future hardware paths: FP4 on Blackwell B200 can reduce the per-node cost further while maintaining accuracy closer to FP8 than INT4. B200 supply is currently limited, but once availability stabilizes, FP4 quantization is worth evaluating for high-volume deployments (see the FP4 quantization on Blackwell guide for the tradeoffs).
Production Serving: Continuous Batching, KV Cache, and SLO Tuning
vLLM handles continuous batching automatically. The key parameters for reasoning workloads differ from standard generation:
--max-num-seqs: default is 256. Trinity-Large-Thinking reasoning traces run 8K-40K tokens per request. Under concurrent load, this many in-flight sequences quickly exhaust KV cache. For production reasoning workloads, set this to 32-64 to prevent eviction cascades.
--max-model-len: set to the 95th-percentile trace length for your workload, not the model's theoretical maximum. A lower max-model-len frees KV cache for more concurrent sequences. Start at 32768 and profile your actual trace lengths before raising it.
--kv-cache-dtype fp8_e5m2: halves KV cache memory with negligible impact on reasoning accuracy. Use this in all production deployments.
SLO tuning table:
| SLO Priority | Recommended Settings |
|---|---|
| Minimize TTFT | --enable-chunked-prefill --chunked-prefill-size 512 |
| Maximize throughput | --max-num-seqs 128 --max-num-batched-tokens 65536 |
| Long reasoning traces | --max-model-len 65536 --kv-cache-dtype fp8_e5m2 |
| Tool-use with JSON | --guided-decoding-backend lm-format-enforcer |
Monitor vllm:gpu_cache_usage_perc via the Prometheus endpoint at /metrics. Set an alert at 0.85; above that, requests begin waiting for KV cache slots, TTFT spikes, and queue depth grows faster than throughput can clear it.
For full production deployment patterns including load balancing, multi-node setups, KV cache sizing, and autoscaling, the vLLM production deployment guide covers all of it.
Related Guides
Trinity-Large-Thinking sits in the same tier as DeepSeek R2 and Magistral for open-weight 400B-class MoE reasoning models:
- Deploy DeepSeek R2: comparable multi-GPU MoE deployment using similar tensor parallelism and FP8 configs
- Magistral self-hosting guide: smaller Apache 2.0 reasoning model (Magistral Small, 24B) that fits on a single H100
- Deploy Nemotron Ultra 253B: dense reasoning model on a single 8x H100 node, no MoE complexity
Trinity-Large-Thinking's 400B MoE architecture needs multi-GPU nodes. H100 SXM5 and H200 SXM5 clusters on Spheron come with NVLink fabric, per-minute billing, and no lock-in to the weights you own.
Quick Setup Guide
At FP8 (1 byte/param), Trinity-Large-Thinking's 400B weights require approximately 400 GB of VRAM plus 15% activation overhead, totaling ~460 GB. Choose 8x H100 SXM5 (640 GB) or 4x H200 SXM5 (564 GB) for FP8 production deployments. At W4A16 (INT4), weights compress to ~200 GB plus overhead (~230 GB), fitting on 4x H100 SXM5 (320 GB) or 3x H200 SXM5 (423 GB). BF16 requires ~920 GB and is not practical for production. Provision a node with NVLink to ensure tensor parallelism communication runs at 900 GB/s rather than the 64 GB/s on PCIe.
Log in to app.spheron.ai, navigate to GPU Cloud, and select H100 SXM5 (8-GPU bundle) or H200 SXM5 (4-GPU bundle). Use the PyTorch 2.6 / CUDA 12.4 base image. Provision at least 600 GB of persistent storage for model weights. SSH in and verify GPU topology with 'nvidia-smi topo -m' to confirm NVLink is active between all GPUs.
Run 'pip install vllm==0.23.0'. Set HF_HUB_ENABLE_HF_TRANSFER=1 and your HF_TOKEN, then run 'huggingface-cli download arcee-ai/Trinity-Large-Thinking --local-dir /models/trinity-large-thinking'. The FP8 checkpoint is approximately 400 GB. Verify the exact model ID on the official Arcee AI HuggingFace page; IDs can change between announcement and release.
For 8x H100 SXM5 at FP8: 'vllm serve /models/trinity-large-thinking --dtype fp8 --tensor-parallel-size 8 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --max-model-len 32768 --gpu-memory-utilization 0.90 --enable-chunked-prefill --kv-cache-dtype fp8_e5m2 --port 8000'. Check the vLLM reasoning outputs documentation for the exact --reasoning-parser name supported for Trinity-Large-Thinking at your installed vLLM version, as parser names are model-specific.
For agentic loops that require direct tensor access: 'from transformers import AutoModelForCausalLM, AutoTokenizer; model = AutoModelForCausalLM.from_pretrained("arcee-ai/Trinity-Large-Thinking", trust_remote_code=True, torch_dtype=torch.float8_e4m3fn, device_map="auto")'. The trust_remote_code flag is required because Trinity uses custom model code on HuggingFace. Use device_map="auto" to distribute across GPUs automatically.
Send a multi-step math or agentic task to /v1/chat/completions and verify the reasoning_content field in the response is populated. When --reasoning-parser is active, vLLM strips <think>...</think> tags from the content field and exposes the chain-of-thought separately under message.reasoning_content. Test tool-choice by passing a tools array with a function definition and confirm the model generates a tool_calls JSON response instead of plain text when a function call is appropriate. Monitor KV cache usage via vLLM's /metrics Prometheus endpoint and alert if vllm:gpu_cache_usage_perc exceeds 85%.
Frequently Asked Questions
Trinity-Large-Thinking is a 400B sparse MoE model. All expert weights must reside in VRAM simultaneously. At FP8 (1 byte/param), the weights require ~400 GB plus 15% overhead, totaling ~460 GB. The recommended minimum is 8x H100 SXM5 (640 GB total, with ~180 GB KV cache headroom) or 4x H200 SXM5 (564 GB total). At W4A16 (INT4), weights compress to ~200 GB, fitting on 4x H100 SXM5 or 3x H200 SXM5 with meaningful KV cache budget. BF16 requires ~920 GB, which is impractical for production.
vLLM supports a --reasoning-parser flag with support for several reasoning model architectures. Trinity-Large-Thinking generates internal chain-of-thought in <think>...</think> tags. Pass --reasoning-parser deepseek_r1 (or the appropriate parser name per the official HuggingFace model card) and --enable-auto-tool-choice --tool-call-parser hermes to activate both the reasoning trace and native function calling in a single deployment.
It depends on the hardware config. Claude Opus 4.6 is $5 per million input tokens and $25 per million output tokens. The 8x H100 SXM5 node on Spheron on-demand costs $31.36/hr and generates roughly 600-900K output tokens per hour, putting its effective cost at $35-52/M output tokens, higher than the $25/M API rate, so H100 on-demand does not win on cost. The 4x H200 SXM5 node is the cost winner: at $14.80/hr and 700K-1M tokens/hr throughput, the effective cost drops to $15-21/M output tokens, below the API rate. At spot pricing ($7.04/hr), the H200 cost falls to $7-10/M. The breakeven for 4x H200 on-demand is roughly 426M output tokens per month.
Yes, using RL-based fine-tuning (GRPO or PPO) on your task-specific reward signal, which is the same post-training approach Arcee used to produce the reasoning-capable Trinity-Large-Thinking from the base Trinity-Large weights. For quantization-aware fine-tuning, generate a W4A16 adapter using AutoAWQ or llm-compressor and then serve the merged checkpoint via vLLM with the same --quantization awq flag. A minimum of 4x H100 SXM5 is required for LoRA fine-tuning at FP8; 8x H100 SXM5 for full-parameter GRPO runs.
Trinity-Large is Arcee's base 400B sparse MoE instruction-following model. Trinity-Large-Thinking is the reasoning-enhanced variant produced through agentic RL post-training: it generates extended internal chain-of-thought before answering. The Thinking variant scores measurably higher on AIME, MATH-500, and GPQA Diamond. Both models use the same 400B MoE weights as the base; the Thinking variant has a separate fine-tuned checkpoint, not a runtime flag.
