Tutorial

Deploy Arcee Trinity-Large-Thinking on GPU Cloud: Self-Host the 400B Apache-2.0 Reasoning Agent That Rivals Claude Opus at a Fraction of the Cost (2026 Guide)

Arcee AI's Trinity-Large-Thinking is a 400B sparse MoE reasoning model, licensed under Apache 2.0 and trained with agentic RL post-training to generate extended chain-of-thought before answering. It competes with proprietary frontier models on math, science, and coding benchmarks, and on the right hardware (4x H200 SXM5 in the cost analysis later in this guide) its self-hosted cost per output token undercuts the Claude Opus 4.6 API at scale. This guide covers the full deployment path: VRAM planning, vLLM setup with reasoning parser and auto tool-choice, and a concrete cost comparison against the Claude Opus 4.6 API. For MoE-specific inference behavior, expert routing, and interconnect saturation, the MoE inference optimization guide is essential reading before provisioning hardware.

What Is Arcee Trinity-Large-Thinking

Trinity-Large-Thinking is Arcee's reasoning-native variant of their Trinity-Large base model. Both share a 400B sparse MoE architecture; the Thinking variant is a separate fine-tuned checkpoint produced through agentic RL post-training using verifiable reward signals on coding, math, and tool-use tasks. The model generates internal chain-of-thought in <think>...</think> tags before producing a final answer, similar to DeepSeek R1's approach but with Arcee's own RL training pipeline.

Key architecture properties:

| Property | Trinity-Large-Thinking |
| --- | --- |
| Total parameters | 400B |
| Active parameters per token | 13B (4-of-256 expert routing) |
| Architecture | Sparse Mixture-of-Experts |
| Context window | 256K native (validated to 512K via NIAH; 128K on preview API, 262K on OpenRouter) |
| License | Apache 2.0 |
| HuggingFace ID | arcee-ai/Trinity-Large-Thinking (verify at release) |

Arcee published 13B active parameters per token via 4-of-256 expert routing. This is the core efficiency advantage: inference runs at 13B-active latency while drawing on 400B of trained knowledge. For VRAM planning purposes, all 400B expert weights must reside in VRAM regardless of which experts activate per forward pass. The routing sparsity improves speed, but it does not reduce memory requirements.

Trinity-Large vs Trinity-Large-Thinking: Trinity-Large is the base instruction-following checkpoint. Trinity-Large-Thinking is a separate fine-tuned checkpoint trained to reason before acting. You cannot toggle reasoning on and off at runtime on the base model; you need the Thinking checkpoint directly.

Benchmark comparison (scores below are from published Arcee and Anthropic sources; verify current numbers against official model cards):

| Benchmark | Trinity-Large-Thinking | Claude Opus 4.6 |
| --- | --- | --- |
| AIME 2025 | 96.3 | ~80.0 |
| MMLU-Pro | 83.4 | ~87.2 |
| GPQA Diamond | 76.3 | ~74.9 |
| PinchBench (agentic) | 91.9 | 93.3 |

Sources: Trinity-Large-Thinking scores from Arcee's technical report and model card (arcee-ai/Trinity-Large-Thinking on HuggingFace). Claude Opus 4.6 PinchBench score from third-party coverage at launch; AIME/MMLU-Pro/GPQA figures from Anthropic's published benchmarks. Verify all numbers against the official model cards before citing in production.

The Apache 2.0 license is the key operational advantage over Claude Opus 4.6. There are no per-query API restrictions, no usage caps, and the weights stay on infrastructure you control. For regulated industries handling sensitive data, this matters as much as the price difference.

Hardware and VRAM Requirements

The active parameter count does not determine the memory footprint. Sparse MoE models like Trinity-Large-Thinking require all expert weight matrices in VRAM simultaneously: only 13B parameters are active per token, but every expert must be resident because the routing function can select any expert for any token. This is a hard constraint with no workaround.

VRAM formula: total_params × bytes_per_param × 1.15 (overhead) + KV_cache_budget

VRAM by precision:

| Precision | Bytes/Param | Weight VRAM | Runtime VRAM (weights + 15% overhead) | Notes |
| --- | --- | --- | --- | --- |
| BF16 | 2 | ~800 GB | ~920 GB | Impractical for production |
| FP8 | 1 | ~400 GB | ~460 GB | Recommended for H100/H200 |
| W4A16 (INT4) | 0.5 | ~200 GB | ~230 GB | Lower cost; validate accuracy on reasoning tasks |
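
The table rows follow directly from the VRAM formula. As a minimal sketch (the function name `runtime_vram_gb` is illustrative, and the KV cache budget is deliberately left out so the result matches the "Runtime" column), the arithmetic looks like this:

```python
# Bytes per parameter for the three precisions discussed in this guide.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "w4a16": 0.5}


def runtime_vram_gb(total_params_b: float, precision: str, overhead: float = 0.15) -> float:
    """Weights plus fractional runtime overhead, in GB. KV cache is NOT included."""
    # 1B parameters at 1 byte each is roughly 1 GB.
    weight_gb = total_params_b * BYTES_PER_PARAM[precision]
    return weight_gb * (1.0 + overhead)


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{round(runtime_vram_gb(400, precision))} GB before KV cache")
```

Add your KV cache budget on top of the returned figure to get the total a node must provide.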

GPU configuration options:

| Config | Total VRAM | Quantization | KV Cache Headroom | Use Case |
| --- | --- | --- | --- | --- |
| 8x H100 SXM5 | 640 GB | FP8 | ~180 GB | Production recommended |
| 4x H200 SXM5 | 564 GB | FP8 | ~104 GB | Production alternative |
| 4x H100 SXM5 | 320 GB | W4A16 | ~90 GB | Dev/low-concurrency |
| 3x H200 SXM5 | 423 GB | W4A16 | ~193 GB | Budget production |
| 2x H200 SXM5 | 282 GB | W4A16 | ~52 GB | Too tight for KV at scale; not recommended |
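
The headroom column is total node VRAM minus the runtime footprint. The sketch below reproduces it; the optional `utilization` parameter is an addition of this example (not part of the table) that shows how much headroom remains once a setting such as --gpu-memory-utilization 0.90 caps what vLLM may allocate.

```python
# Per-GPU memory in GB, matching the totals used in the table (8 x 80 = 640, 4 x 141 = 564).
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def kv_headroom_gb(gpu: str, count: int, runtime_gb: float, utilization: float = 1.0) -> float:
    """VRAM left for KV cache after weights and overhead, in GB."""
    usable = GPU_MEMORY_GB[gpu] * count * utilization
    return usable - runtime_gb


configs = [("H100", 8, 460), ("H200", 4, 460), ("H100", 4, 230), ("H200", 3, 230), ("H200", 2, 230)]
for gpu, count, runtime in configs:
    full = kv_headroom_gb(gpu, count, runtime)
    capped = kv_headroom_gb(gpu, count, runtime, utilization=0.90)
    print(f"{count}x {gpu}: {full:.0f} GB headroom, {capped:.0f} GB at 0.90 utilization")
```

Under these assumptions the 4x H200 FP8 configuration has noticeably less room at 0.90 utilization than the table's raw figure, which is worth checking before you commit to long context lengths.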

The SXM5 form factor is required for NVLink. NVLink 4.0 on H100 SXM5 runs tensor parallelism all-reduces at 900 GB/s; PCIe variants top out at 64 GB/s. Tensor parallelism issues all-reduces in every layer, and MoE expert routing adds further cross-GPU traffic, so interconnect bandwidth becomes the bottleneck. Use SXM5 for production. PCIe H100 variants are usable for development and testing but will show significant throughput degradation under production load.

A note on INT4 for reasoning workloads: quantization errors compound through long thinking chains. A precision error that has minor impact on a 500-token standard response can cascade through a 20,000-token reasoning trace and shift logical inferences. FP8 typically costs no more than 1-2% accuracy on math and coding benchmarks compared to BF16. W4A16 can cause 5-10% degradation on reasoning-specific benchmarks depending on calibration quality. Validate accuracy on your specific task set before deploying W4A16 to production, using the same approach described in the FP8 quantization guide.

You can provision H100 SXM5 on Spheron for the 8-GPU production configuration, or H200 SXM5 rental for the 4-GPU FP8 alternative with lower total node cost.

Step-by-Step vLLM Deployment

Step 1: Provision a Spheron multi-GPU instance

Log in at app.spheron.ai and go to GPU Cloud. Select H100 SXM5 with 8 GPUs for the FP8 production path, or H200 SXM5 with 4 GPUs. Use the PyTorch 2.6 / CUDA 12.4 base image. Provision at least 600 GB of persistent storage for model weights and vLLM cache.

After SSH access, verify NVLink topology:

```bash
nvidia-smi topo -m
```

All H100 SXM5 GPUs in the same node should show NV18 (NVLink) connections to each other. If you see PIX or NODE instead of NVLink connections, stop and contact support before downloading weights.
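
If you want to automate this check in a provisioning script, the matrix can be parsed. The sketch below assumes the usual layout of `nvidia-smi topo -m` (a header row of GPU names, then one row per GPU whose first columns are the GPU-to-GPU link types); the `SAMPLE` text is made up for illustration, with one deliberately bad link.

```python
import re

GPU_NAME = re.compile(r"^GPU\d+$")


def non_nvlink_pairs(topo_text: str) -> list[tuple[str, str, str]]:
    """Return (gpu_a, gpu_b, link) for every GPU pair not connected via NVLink."""
    rows = []
    for line in topo_text.splitlines():
        tokens = line.split()
        # Data rows start with a GPU name followed by a link type;
        # the header row consists of GPU names only, so it is skipped.
        if len(tokens) >= 2 and GPU_NAME.match(tokens[0]) and not GPU_NAME.match(tokens[1]):
            rows.append(tokens)
    gpu_count = len(rows)
    bad = []
    for i, tokens in enumerate(rows):
        links = tokens[1:1 + gpu_count]  # first gpu_count columns are GPU-to-GPU entries
        for j, link in enumerate(links):
            if i != j and not link.startswith("NV"):
                bad.append((tokens[0], rows[j][0], link))
    return bad


# Illustrative sample only: GPU0 <-> GPU2 falls back to PCIe (PIX).
SAMPLE = """\
\tGPU0\tGPU1\tGPU2\tCPU Affinity
GPU0\t X \tNV18\tPIX\t0-47
GPU1\tNV18\t X \tNV18\t0-47
GPU2\tPIX\tNV18\t X \t0-47
"""

for gpu_a, gpu_b, link in non_nvlink_pairs(SAMPLE):
    print(f"{gpu_a} -> {gpu_b} uses {link}, not NVLink")
```

On a real node, pass in the captured output instead of `SAMPLE`, for example `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout`, and treat any returned pair as a reason to stop before downloading weights.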

Step 2: Install vLLM and download weights

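```bash
pip install vllm==0.23.0
# The hf_transfer package must be installed for HF_HUB_ENABLE_HF_TRANSFER=1 to work
pip install hf_transfer

export HF_TOKEN=your_huggingface_token
export HF_HUB_ENABLE_HF_TRANSFER=1

huggingface-cli download arcee-ai/Trinity-Large-Thinking \
  --local-dir /models/trinity-large-thinking
```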

The FP8 checkpoint is approximately 400 GB. With HF_HUB_ENABLE_HF_TRANSFER=1 on a 10 Gbps NIC, the download should complete in roughly 30-40 minutes.
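
To sanity-check the download estimate for your own link, the arithmetic is simple. In this sketch the `efficiency` parameter is an assumption standing in for everything that keeps a real transfer below line rate (Hub throughput, disk write speed, protocol overhead); the 0.15 value is only an example.

```python
def transfer_minutes(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Minutes to move size_gb over a link_gbps link at the given effective efficiency."""
    seconds = (size_gb * 8) / (link_gbps * efficiency)  # GB -> gigabits, then divide by Gbps
    return seconds / 60


line_rate = transfer_minutes(400, 10)          # theoretical best case
realistic = transfer_minutes(400, 10, 0.15)    # assumed ~15% effective throughput
print(f"line rate: {line_rate:.1f} min, at 15% efficiency: {realistic:.1f} min")
```

At full line rate the transfer would take only a few minutes, so the 30-40 minute figure corresponds to a fraction of the NIC's capacity. If your download runs much slower than that, check disk write throughput before blaming the network.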

Before downloading, verify the exact model ID on the official Arcee AI HuggingFace page. HuggingFace IDs sometimes change between announcement and release. If the ID differs from arcee-ai/Trinity-Large-Thinking, update the path accordingly.

Step 3: Launch vLLM with reasoning parser and auto tool-choice

For 8x H100 SXM5 at FP8:

```bash
# FP8 is selected with --quantization; --dtype only accepts
# auto, half, float16, bfloat16, float, or float32.
vllm serve /models/trinity-large-thinking \
  --served-model-name trinity-large-thinking \
  --quantization fp8 \
  --tensor-parallel-size 8 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8_e5m2 \
  --port 8000
```

The --reasoning-parser deepseek_r1 argument tells vLLM to parse <think>...</think> blocks and expose the chain-of-thought separately in the response. The deepseek_r1 parser handles models using this tag convention. If Arcee has contributed a dedicated arcee parser by the time you deploy, prefer it. Run vllm serve --help | grep -A 3 reasoning-parser to list the parsers available at your installed version (some recent releases group the help output, in which case vllm serve --help=all should print the full flag list), or check the vLLM reasoning outputs documentation directly.

For W4A16 on 4x H100 SXM5, replace the flags:

```bash
vllm serve /models/trinity-large-thinking-awq \
  --served-model-name trinity-large-thinking \
  --quantization awq \
  --tensor-parallel-size 4 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

Use --quantization compressed_tensors if the provided checkpoint uses that format instead of AWQ. Check the model card for the checkpoint format at release.

Step 4: Transformers path for custom agentic loops

When you need step-level tensor hooks or custom sampling logic, the Transformers library gives direct access to model internals:

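```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "arcee-ai/Trinity-Large-Thinking"
REVISION = "main"  # replace with an audited commit hash before production use

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    revision=REVISION,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    revision=REVISION,
    trust_remote_code=True,
    # "auto" keeps the dtype and quantization config stored in the checkpoint.
    # torch.float8_e4m3fn is a storage dtype that most PyTorch ops cannot
    # compute in, so passing it as torch_dtype is not a working FP8 path.
    torch_dtype="auto",
    device_map="auto",  # shard layers across all visible GPUs
)
```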

The trust_remote_code=True flag is required because Trinity uses custom model code hosted on HuggingFace. With this flag set, loading the model executes Python code downloaded from the HuggingFace repository. Before using this in production environments handling sensitive data, audit the model files and pin to a specific commit hash rather than the latest main branch.

Use vLLM for production serving. Transformers does not implement continuous batching or paged attention; throughput drops significantly under concurrent requests compared to vLLM.

Step 5: Test reasoning output and tool-calling

Verify chain-of-thought parsing:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "trinity-large-thinking",
    "messages": [{"role": "user", "content": "What is the sum of all prime numbers less than 50? Show your work."}],
    "max_tokens": 8192
  }' | python3 -c "
import sys, json
resp = json.load(sys.stdin)
msg = resp['choices'][0]['message']
# Some vLLM versions name this field reasoning rather than reasoning_content
reasoning = msg.get('reasoning_content') or msg.get('reasoning')
# content can be null, so fall back to an empty string before slicing
content = (msg.get('content') or '').strip()
if reasoning:
    print('Reasoning trace present. Length:', len(reasoning))
    print('Reasoning preview:', reasoning[:200])
    print('Final answer:', content[:200])
else:
    print('No reasoning field found. Check --reasoning-parser configuration.')
    print('Raw content:', content[:200])
"
```

Test auto tool-choice:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="trinity-large-thinking",
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
    max_tokens=4096
)
print(response.choices[0].message.tool_calls)
```
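
Printing tool_calls only confirms the model asked for a function. In an agentic loop you also need to run the call and hand the result back. The following is a minimal sketch of that dispatch step; `get_weather` here is a hypothetical stub and `TOOL_REGISTRY` and `run_tool_call` are names invented for this example.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical stub; a real handler would query a weather service."""
    return {"city": city, "conditions": "placeholder"}


# Maps the function names advertised in the tools array to local handlers.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return a JSON string for a role='tool' message."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)  # the model emits arguments as a JSON string
        result = handler(**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names: report back instead of crashing the loop.
        return json.dumps({"error": f"bad arguments: {exc}"})
    return json.dumps(result)


print(run_tool_call("get_weather", '{"city": "Tokyo"}'))
```

With the OpenAI client, each entry in `response.choices[0].message.tool_calls` exposes `call.function.name`, `call.function.arguments`, and `call.id`. Append the assistant message, then one `{"role": "tool", "tool_call_id": call.id, "content": result}` message per call, and send the conversation back to get the final answer.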

Monitor GPU utilization and KV cache during inference:

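```bash
nvidia-smi dmon -s u  # GPU utilization; expect >80% during reasoning
# Or check vLLM Prometheus metrics. The metric name varies by vLLM version
# (gpu_cache_usage_perc or kv_cache_usage_perc), so match both:
curl -s http://localhost:8000/metrics | grep -E "(gpu|kv)_cache_usage"
```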

Alert if vllm:gpu_cache_usage_perc exceeds 0.85.

Cost Analysis: Self-Hosting vs Claude Opus 4.6 API

Pricing from the Spheron GPU API (dedicated offers; on-demand and spot rates shown):

| Configuration | Cost | Per-GPU Rate | Estimated Throughput | Cost per 1M Output Tokens |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 API | $25/M output | N/A | N/A | $25.00 |
| 8x H100 SXM5 on-demand | $31.36/hr | $3.92/hr/GPU | 600-900K tokens/hr | $35-52/M |
| 8x H100 SXM5 spot | $23.28/hr | $2.91/hr/GPU | 600-900K tokens/hr | $26-39/M |
| 4x H200 SXM5 on-demand | $14.80/hr | $3.70/hr/GPU | 700K-1M tokens/hr | $15-21/M |
| 4x H200 SXM5 spot | $7.04/hr | $1.76/hr/GPU | 700K-1M tokens/hr | $7-10/M |

Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.

At Claude Opus 4.6's $25/M output rate, the 8x H100 SXM5 on-demand path ($35-52/M) does not beat the API on a per-token basis. That node generates 600-900K output tokens per hour at $31.36/hr, while the same volume costs only $15-22.50 in API fees at $25/M. The cost advantage lies with the H200 configuration: the 4x H200 SXM5 node at $14.80/hr on-demand generates 700K-1M tokens/hr, putting the effective cost at $15-21/M, which undercuts the $25/M API rate. On spot ($7.04/hr), the H200 cost falls to $7-10/M, roughly 60-72% below the API price. For background on how reasoning model throughput affects per-token economics, see the reasoning model inference cost guide.
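
The per-token figures above are the hourly node price divided by hourly output. A small sketch of that calculation, using the prices and throughput ranges from the table (the function name is illustrative), lets you re-run the comparison when rates change:

```python
def cost_per_million(hourly_usd: float, tokens_per_hour: float) -> float:
    """Effective USD per 1M output tokens for a node running at the given throughput."""
    return hourly_usd / (tokens_per_hour / 1_000_000)


API_USD_PER_MILLION = 25.0

# (hourly price, low throughput, high throughput) taken from the pricing table.
nodes = {
    "8x H100 on-demand": (31.36, 600_000, 900_000),
    "8x H100 spot": (23.28, 600_000, 900_000),
    "4x H200 on-demand": (14.80, 700_000, 1_000_000),
    "4x H200 spot": (7.04, 700_000, 1_000_000),
}

for name, (hourly, low_tps, high_tps) in nodes.items():
    best = cost_per_million(hourly, high_tps)   # best case: highest throughput
    worst = cost_per_million(hourly, low_tps)   # worst case: lowest throughput
    verdict = "beats" if worst < API_USD_PER_MILLION else "does not reliably beat"
    print(f"{name}: ${best:.2f}-${worst:.2f}/M, {verdict} the API")
```

Replace the throughput figures with numbers measured on your own workload; reasoning-heavy traffic with long traces can land well outside these estimates.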

Breakeven in monthly output tokens: the 4x H200 node at $14.80/hr runs ~$10,656/month. Break-even against Claude Opus 4.6 at $25/M requires 10,656 / 25 × 1M = 426M output tokens per month. Below that, the API is cheaper per token; above it, the node wins. The 4x H200 node can produce up to 720M output tokens per month at peak (1M tokens/hr × 720 hours), so the 426M breakeven is physically reachable at roughly 60% utilization. The 8x H100 on-demand path has no breakeven: at $35-52/M cost-per-token versus the $25/M API rate, it costs more per token than the API at every utilization level. No volume justifies H100 on-demand over the Claude Opus 4.6 API on a pure cost-per-token basis.
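
The breakeven reasoning can be checked numerically. This sketch uses the same 720-hour month and the prices quoted above; a required utilization above 1.0 means the node cannot reach breakeven even when running flat out.

```python
HOURS_PER_MONTH = 720


def breakeven_million_tokens(hourly_usd: float, api_usd_per_million: float) -> float:
    """Monthly output volume (in millions of tokens) where node cost equals API cost."""
    return hourly_usd * HOURS_PER_MONTH / api_usd_per_million


def required_utilization(hourly_usd: float, api_usd_per_million: float,
                         peak_tokens_per_hour: float) -> float:
    """Fraction of peak throughput needed, sustained all month, to break even."""
    peak_million_per_month = peak_tokens_per_hour * HOURS_PER_MONTH / 1_000_000
    return breakeven_million_tokens(hourly_usd, api_usd_per_million) / peak_million_per_month


h200 = required_utilization(14.80, 25.0, 1_000_000)
h100 = required_utilization(31.36, 25.0, 900_000)
print(f"4x H200: breakeven {breakeven_million_tokens(14.80, 25.0):.0f}M tokens, {h200:.0%} of peak")
print(f"8x H100: {h100:.0%} of peak required, so no breakeven")
```

Note that the 4x H200 utilization figure depends on which end of the throughput range you assume: at 700K tokens/hr instead of 1M, the same breakeven volume requires a much busier node.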

In practice, agentic workloads at scale can reach 426M output tokens per month on the H200 path within weeks of production rollout, since each task may emit tens of thousands of reasoning tokens; measure your own volume before committing to a node. The Apache 2.0 license removes usage caps entirely, so there is no cost ceiling that triggers renegotiation as query volume grows.

For on-demand access to the exact GPU nodes used in this guide, the H100 SXM5 and H200 SXM5 product pages on Spheron list current availability and pricing.

Fine-Tuning and Quantizing Your Own Variant

Arcee trained Trinity-Large-Thinking using agentic RL post-training, specifically targeting verifiable reward signals on coding, math, and tool-use tasks. You can apply the same approach to tune the model for your specific domain. The GRPO fine-tuning guide covers the RL training setup in detail.

Hardware minimums for fine-tuning:

| Approach | Minimum Hardware | Notes |
| --- | --- | --- |
| LoRA + GRPO (FP8 forward pass) | 8x H100 SXM5 | BF16 gradients for adapter; FP8 activations only |
| Full-parameter GRPO | 16x H100 SXM5 or 8x H200 SXM5 | Large batch required for stable RL signal |

For W4A16 quantization using AutoAWQ:

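```bash
pip install autoawq
```

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "arcee-ai/Trinity-Large-Thinking"
OUTPUT_DIR = "./trinity-large-thinking-awq"

model = AutoAWQForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True
)

# 4-bit weights, group size 128, zero-point quantization, GEMM kernels
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Runs calibration and quantizes the weights in place
model.quantize(tokenizer, quant_config=quant_config)

# Save the tokenizer alongside the weights so the directory can be served directly
model.save_quantized(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
```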

One non-obvious requirement: the calibration dataset for W4A16 quantization of a reasoning model must include chain-of-thought examples, not just standard text or instruction data. A generic calibration set (such as wikitext or ShareGPT conversations) never exercises the activation patterns of long <think> traces, so the resulting quantization scales fit them poorly and the accuracy gap on reasoning benchmarks tends to be larger than W4A16 results reported for dense models would suggest. Generate or collect a calibration set that covers reasoning traces of similar length and structure to your production queries.
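
As a rough illustration of what such a calibration set might look like, the sketch below turns logged prompt/reasoning/answer records into plain strings. The record keys, the helper name `build_calibration_texts`, and the raw-text layout are all assumptions of this example; a production set should be rendered with the model's own chat template.

```python
def build_calibration_texts(records: list[dict], max_chars: int = 2000) -> list[str]:
    """Format reasoning records as calibration strings that include the <think> trace."""
    texts = []
    for record in records:
        text = (
            f"{record['prompt']}\n"
            f"<think>\n{record['reasoning']}\n</think>\n"
            f"{record['answer']}"
        )
        # Crude length cap; tune it to the sequence limit of your quantization tool.
        texts.append(text[:max_chars])
    return texts


# Hypothetical record for illustration; use real traces from your workload.
records = [{
    "prompt": "Is 91 prime?",
    "reasoning": "91 = 7 * 13, so it has divisors other than 1 and itself.",
    "answer": "No, 91 = 7 x 13.",
}]
calibration_texts = build_calibration_texts(records)
print(calibration_texts[0])

# Assumption to verify against your installed AutoAWQ version: quantize() is
# expected to accept a list of strings via calib_data, for example
#   model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_texts)
```

Some quantization tools skip or truncate calibration samples beyond a fixed token length, so check how your version handles long traces; otherwise the reasoning examples may be dropped before calibration runs.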

For future hardware paths: FP4 on Blackwell B200 can reduce the per-node cost further while maintaining accuracy closer to FP8 than INT4. B200 supply is currently limited, but once availability stabilizes, FP4 quantization is worth evaluating for high-volume deployments (see the FP4 quantization on Blackwell guide for the tradeoffs).

Production Serving: Continuous Batching, KV Cache, and SLO Tuning

vLLM handles continuous batching automatically. The key parameters for reasoning workloads differ from standard generation:

--max-num-seqs: default is 256. Trinity-Large-Thinking reasoning traces run 8K-40K tokens per request. Under concurrent load, this many in-flight sequences quickly exhaust KV cache. For production reasoning workloads, set this to 32-64 to prevent eviction cascades.

--max-model-len: set to the 95th-percentile trace length for your workload, not the model's theoretical maximum. A lower max-model-len frees KV cache for more concurrent sequences. Start at 32768 and profile your actual trace lengths before raising it.

--kv-cache-dtype fp8_e5m2: halves KV cache memory with negligible impact on reasoning accuracy. Use this in all production deployments.
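
To pick a --max-model-len from data rather than guesswork, compute the 95th percentile of logged request lengths. This sketch assumes you have total token counts (prompt plus generated tokens) per request; rounding up to a multiple of 1024 is a convenience choice of this example, not a vLLM requirement.

```python
def p95_trace_length(lengths: list[int]) -> int:
    """95th percentile of request lengths using the nearest-rank method."""
    if not lengths:
        raise ValueError("need at least one logged request length")
    ordered = sorted(lengths)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def suggested_max_model_len(lengths: list[int], multiple: int = 1024) -> int:
    """Round the p95 length up to the next multiple for a tidy flag value."""
    p95 = p95_trace_length(lengths)
    return -(-p95 // multiple) * multiple  # ceiling division, then scale back up


# Illustrative lengths only: 20 requests from 1,000 to 20,000 tokens.
logged_lengths = list(range(1000, 21000, 1000))
print(p95_trace_length(logged_lengths), suggested_max_model_len(logged_lengths))
```

Requests longer than the chosen limit will be rejected or truncated, so decide how your client should handle the remaining 5% before lowering the value.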

SLO tuning table:

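| SLO Priority | Recommended Settings |
| --- | --- |
| Minimize TTFT | --enable-chunked-prefill --max-num-batched-tokens 512 (vLLM sets the prefill chunk size through --max-num-batched-tokens; small values mainly protect inter-token latency, so benchmark TTFT before settling on a value) |
| Maximize throughput | --max-num-seqs 128 --max-num-batched-tokens 65536 |
| Long reasoning traces | --max-model-len 65536 --kv-cache-dtype fp8_e5m2 |
| Tool-use with JSON | --guided-decoding-backend lm-format-enforcer |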

Monitor vllm:gpu_cache_usage_perc via the Prometheus endpoint at /metrics. Set an alert at 0.85; above that, requests begin waiting for KV cache slots, TTFT spikes, and queue depth grows faster than throughput can clear it.

For full production deployment patterns including load balancing, multi-node setups, KV cache sizing, and autoscaling, the vLLM production deployment guide covers all of it.

Trinity-Large-Thinking sits in the same tier as DeepSeek R2 and Magistral for open-weight 400B-class MoE reasoning models.

Trinity-Large-Thinking's 400B MoE architecture needs multi-GPU nodes. H100 SXM5 and H200 SXM5 clusters on Spheron come with NVLink fabric and per-minute billing, and because the weights are Apache 2.0 and stay on infrastructure you control, there is no lock-in.


Quick Setup Guide

  1. Plan GPU configuration based on quantization choice

    At FP8 (1 byte/param), Trinity-Large-Thinking's 400B weights require approximately 400 GB of VRAM plus 15% activation overhead, totaling ~460 GB. Choose 8x H100 SXM5 (640 GB) or 4x H200 SXM5 (564 GB) for FP8 production deployments. At W4A16 (INT4), weights compress to ~200 GB plus overhead (~230 GB), fitting on 4x H100 SXM5 (320 GB) or 3x H200 SXM5 (423 GB). BF16 requires ~920 GB and is not practical for production. Provision a node with NVLink to ensure tensor parallelism communication runs at 900 GB/s rather than the 64 GB/s on PCIe.

  2. Provision a multi-GPU instance on Spheron

    Log in to app.spheron.ai, navigate to GPU Cloud, and select H100 SXM5 (8-GPU bundle) or H200 SXM5 (4-GPU bundle). Use the PyTorch 2.6 / CUDA 12.4 base image. Provision at least 600 GB of persistent storage for model weights. SSH in and verify GPU topology with 'nvidia-smi topo -m' to confirm NVLink is active between all GPUs.

  3. Install vLLM and download Trinity-Large-Thinking weights

    Run 'pip install vllm==0.23.0'. Set HF_HUB_ENABLE_HF_TRANSFER=1 and your HF_TOKEN, then run 'huggingface-cli download arcee-ai/Trinity-Large-Thinking --local-dir /models/trinity-large-thinking'. The FP8 checkpoint is approximately 400 GB. Verify the exact model ID on the official Arcee AI HuggingFace page; IDs can change between announcement and release.

  4. Launch vLLM with reasoning parser and tool-choice

    For 8x H100 SXM5 at FP8: 'vllm serve /models/trinity-large-thinking --quantization fp8 --tensor-parallel-size 8 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --max-model-len 32768 --gpu-memory-utilization 0.90 --enable-chunked-prefill --kv-cache-dtype fp8_e5m2 --port 8000'. FP8 is selected with --quantization, since --dtype does not accept fp8. Check the vLLM reasoning outputs documentation for the exact --reasoning-parser name supported for Trinity-Large-Thinking at your installed vLLM version, as parser names are model-specific.

  5. Load via Transformers with trust_remote_code

    For agentic loops that require direct tensor access: 'from transformers import AutoModelForCausalLM, AutoTokenizer; model = AutoModelForCausalLM.from_pretrained("arcee-ai/Trinity-Large-Thinking", trust_remote_code=True, torch_dtype="auto", device_map="auto")'. Using torch_dtype="auto" keeps the dtype and quantization config stored in the checkpoint; a raw float8 torch dtype is not a working compute path in Transformers. The trust_remote_code flag is required because Trinity uses custom model code on HuggingFace. Use device_map="auto" to distribute across GPUs automatically.

  6. Test reasoning output and verify tool-choice

    Send a multi-step math or agentic task to /v1/chat/completions and verify the reasoning_content field in the response is populated. When --reasoning-parser is active, vLLM strips <think>...</think> tags from the content field and exposes the chain-of-thought separately under message.reasoning_content. Test tool-choice by passing a tools array with a function definition and confirm the model generates a tool_calls JSON response instead of plain text when a function call is appropriate. Monitor KV cache usage via vLLM's /metrics Prometheus endpoint and alert if vllm:gpu_cache_usage_perc exceeds 85%.

Frequently Asked Questions

What GPU hardware do I need to run Trinity-Large-Thinking?

Trinity-Large-Thinking is a 400B sparse MoE model. All expert weights must reside in VRAM simultaneously. At FP8 (1 byte/param), the weights require ~400 GB plus 15% overhead, totaling ~460 GB. The recommended minimum is 8x H100 SXM5 (640 GB total, with ~180 GB KV cache headroom) or 4x H200 SXM5 (564 GB total). At W4A16 (INT4), weights compress to ~200 GB, fitting on 4x H100 SXM5 or 3x H200 SXM5 with meaningful KV cache budget. BF16 requires ~920 GB, which is impractical for production.

How do I enable reasoning output and tool calling in vLLM?

vLLM provides a --reasoning-parser flag that covers several reasoning model architectures. Trinity-Large-Thinking generates internal chain-of-thought in <think>...</think> tags. Pass --reasoning-parser deepseek_r1 (or the appropriate parser name per the official HuggingFace model card) and --enable-auto-tool-choice --tool-call-parser hermes to activate both the reasoning trace and native function calling in a single deployment.

Is self-hosting Trinity-Large-Thinking cheaper than the Claude Opus 4.6 API?

It depends on the hardware config. Claude Opus 4.6 is $5 per million input tokens and $25 per million output tokens. The 8x H100 SXM5 node on Spheron on-demand costs $31.36/hr and generates roughly 600-900K output tokens per hour, putting its effective cost at $35-52/M output tokens, higher than the $25/M API rate, so H100 on-demand does not win on cost. The 4x H200 SXM5 node is the cost winner: at $14.80/hr and 700K-1M tokens/hr throughput, the effective cost drops to $15-21/M output tokens, below the API rate. At spot pricing ($7.04/hr), the H200 cost falls to $7-10/M. The breakeven for 4x H200 on-demand is roughly 426M output tokens per month.

Can I fine-tune Trinity-Large-Thinking for my own domain?

Yes, using RL-based fine-tuning (GRPO or PPO) on your task-specific reward signal, which is the same post-training approach Arcee used to produce the reasoning-capable Trinity-Large-Thinking from the base Trinity-Large weights. To serve a fine-tuned variant at W4A16, merge the adapter into the base weights, quantize the merged checkpoint with AutoAWQ or llm-compressor, and serve it via vLLM with --quantization awq (or --quantization compressed_tensors for llm-compressor output). A minimum of 8x H100 SXM5 is required for LoRA fine-tuning with an FP8 forward pass; full-parameter GRPO runs need 16x H100 SXM5 or 8x H200 SXM5.

What is the difference between Trinity-Large and Trinity-Large-Thinking?

Trinity-Large is Arcee's base 400B sparse MoE instruction-following model. Trinity-Large-Thinking is the reasoning-enhanced variant produced through agentic RL post-training: it generates extended internal chain-of-thought before answering. The Thinking variant scores measurably higher on AIME, MATH-500, and GPQA Diamond. Both models share the same 400B MoE architecture, but the Thinking variant is a separate fine-tuned checkpoint, not a runtime flag.
