Tutorial

Deploy the Hierarchical Reasoning Model (HRM) on GPU Cloud: Self-Host a 27M-Parameter Reasoner (2026)

Written by Mitrasish, Co-founder · May 2, 2026

Running a 70B reasoning model on an H100 costs around $3.10/hr on-demand. Running HRM on a single RTX 4090 costs $0.79/hr. On structured reasoning benchmarks, HRM at 27M parameters matches or beats 70B-class chain-of-thought models. That gap only makes sense when you understand what HRM actually does differently.

HRM does not generate reasoning tokens. It reasons in embedding space through a deliberation loop, then emits a final answer. The result: no thinking-token overhead, no KV cache explosion, and a model small enough to fit in the margin of GPU memory you would have spent anyway. For the broader economics of why reasoning model inference is expensive, the reasoning model inference cost optimization guide covers the token-explosion problem in detail.

What Is HRM?

HRM is a 27M-parameter hierarchical recurrent reasoner published in 2025. Its defining property: it solves structured reasoning tasks through internal deliberation rather than chain-of-thought token generation.

Where a standard CoT model emits thousands of thinking tokens that consume GPU memory and compute, HRM runs multiple passes of an executor network over compressed sub-goal representations. The final answer is produced directly, with no intermediate tokens in the output stream.

| Model | Parameters | Weights | VRAM Needed | Reasoning Tokens | Relative Cost |
| --- | --- | --- | --- | --- | --- |
| HRM | 27M | ~50MB (FP16) | ~200MB | None (internal) | Baseline |
| DeepSeek R2 7B distill | 7B | ~14GB (FP16) | ~17GB+ | 2,000-8,000 | 50-100x |
| DeepSeek R1 671B | 671B | ~336GB (FP8) | 8x H100 | 4,000-12,000 | 10,000x+ |
| Nemotron Ultra 253B | 253B | ~127GB (FP8) | 4x H100 | 2,000-8,000 | 3,000x+ |

The parameter count matters for GPU economics because VRAM is the bottleneck in inference. At 27M parameters, HRM fits in the VRAM overhead that other models waste. You can run hundreds of concurrent HRM instances on a single RTX 4090 while a single 7B model would fill it.

Architecture: Planner, Executor, and the Deliberation Loop

HRM uses a two-level hierarchy: a planner module that decomposes problems and a low-level executor that solves each sub-goal iteratively.

How the data flows:

  1. Input arrives at the planner
  2. Planner decomposes the problem into a sequence of sub-goals (represented as embeddings, not tokens)
  3. Each sub-goal goes to the executor
  4. Executor runs N iterations over the sub-goal using KV-cache reuse across iterations
  5. Executor produces a sub-goal solution when it converges
  6. Planner assembles sub-goal solutions into a final answer
  7. Final answer is emitted as output tokens (short, since there is no CoT trace)

The deliberation loop terminates when the executor converges or hits a configurable max-step limit. Default is 8 steps; harder tasks benefit from 12-16. Critically, no intermediate tokens are sampled during deliberation. All reasoning happens in the model's internal embedding space.
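
The control flow is easy to sketch. The snippet below is a hand-written illustration with hypothetical `planner` and `executor` interfaces (`decompose`, `init_state`, `iterate`, and `assemble` are invented names, not the repo's API); the real loop lives inside HRM's forward pass:

```python
import torch

def deliberate(planner, executor, x, max_steps: int = 8, tol: float = 1e-4):
    """Illustrative deliberation loop; `planner`/`executor` are hypothetical."""
    subgoals = planner.decompose(x)            # sub-goals as embeddings, not tokens
    solutions = []
    for goal in subgoals:
        state = executor.init_state(goal)
        for _ in range(max_steps):             # capped by the max-step limit
            new_state = executor.iterate(goal, state)  # reuses cached representations
            converged = torch.norm(new_state - state) < tol
            state = new_state
            if converged:                      # stop early once the state settles
                break
        solutions.append(state)
    return planner.assemble(solutions)         # short final answer, no CoT trace
```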

This differs fundamentally from CoT models like DeepSeek R2 or QwQ. Those models sample tokens at every reasoning step, which grows the KV cache linearly with reasoning depth. HRM's KV cache footprint stays nearly constant regardless of deliberation depth because the executor reuses cached representations across iterations rather than extending the sequence.
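
A back-of-envelope calculation makes the difference concrete. The CoT-side numbers below assume an illustrative dense 7B-class configuration (32 layers, 32 KV heads, head dimension 128, no grouped-query attention); they are not measured figures:

```python
# KV-cache growth for a CoT model that samples reasoning tokens (FP16).
layers, kv_heads, head_dim = 32, 32, 128                 # assumed 7B-class config
bytes_per_token = layers * 2 * kv_heads * head_dim * 2   # K and V, 2 bytes each

for reasoning_tokens in (2_000, 8_000):
    mb = reasoning_tokens * bytes_per_token / 1e6
    print(f"CoT cache after {reasoning_tokens:,} thinking tokens: ~{mb:,.0f} MB")
# -> roughly 1 GB at 2,000 tokens and 4+ GB at 8,000 tokens, per request.
# HRM's executor iterates over a fixed set of sub-goal embeddings instead,
# so its cache footprint stays roughly constant at any deliberation depth.
```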

For the theoretical background on compute vs token tradeoffs, see inference-time compute scaling.

GPU Sizing for HRM

Single-Instance: RTX 4090

HRM's FP16 weights are roughly 50MB, and the full runtime footprint stays under 200MB. A single RTX 4090 has 24GB of VRAM, which leaves roughly 23.8GB for batch queues and executor KV-cache. That headroom handles hundreds of concurrent HRM instances even at aggressive batch sizes.

  • On-demand price: $0.79/hr
  • Batch size: up to 512 concurrent requests at default deliberation depth
  • Best for: development, low-to-medium concurrency production, cost-sensitive pipelines

Run HRM on an RTX 4090 for any workload processing under 50,000 queries/hour. Above that, move to the A100 for its larger memory headroom and higher memory bandwidth.
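
A quick way to sanity-check concurrency on your own hardware; the 50MB per-request working set below is a placeholder to replace with a measurement from your workload, not a published figure:

```python
# Back-of-envelope concurrency estimate for a 24GB RTX 4090.
total_vram_gb = 24.0
weights_gb = 0.2        # HRM FP16 weights plus runtime footprint
reserve_gb = 2.0        # CUDA context, allocator slack, safety margin
per_request_mb = 50     # placeholder: measure your real per-request working set

available_gb = total_vram_gb - weights_gb - reserve_gb
print(f"~{int(available_gb * 1024 / per_request_mb)} concurrent requests (est.)")
# -> ~446 with these assumptions, consistent with "hundreds" on one card
```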

Batched Serving: A100 80G

The A100 80G SXM4 has 80GB of HBM2e and significantly higher memory bandwidth than the RTX 4090. At production batch sizes (64+), the bandwidth advantage translates to better throughput per dollar.

  • On-demand price: $1.64/hr
  • Spot price: $0.45/hr
  • Best for: production APIs, multi-tenant serving, embedding HRM in a larger inference router

HRM's executor iterations are bandwidth-bound at small batch sizes and compute-bound at large ones. The A100's 2 TB/s bandwidth handles the transition well. For workloads that are interruptible (batch jobs, async queues), A100 spot at $0.45/hr makes HRM essentially free at scale.

When to Escalate to H100

HRM itself never needs an H100. The H100 is for your fallback model when HRM cannot handle a query. The architecture here is a two-tier stack: HRM on RTX 4090 for structured reasoning, H100 with DeepSeek R2 for everything else.

See the LLM inference router guide for how to build the routing layer between tiers.

| GPU | VRAM | HRM Instances (FP16, est.) | On-Demand Price/hr | Best Use |
| --- | --- | --- | --- | --- |
| RTX 4090 | 24GB | ~400 | $0.79 | Dev / low concurrency |
| A100 80G SXM4 | 80GB | ~1,000+ | $1.64 | Production serving |
| H100 SXM5 | 80GB | N/A (overkill) | $3.10 | Fallback model only |

Pricing fluctuates based on GPU availability. The prices above are based on 02 May 2026 and may have changed. Check current GPU pricing → for live rates.

Prerequisites

  • Spheron account with GPU access (app.spheron.ai)
  • PyTorch 2.6 + CUDA 12.4
  • Python 3.11+
  • Hugging Face account (for model weights download)
  • 8GB+ disk space for weights and cache
  • Ray 2.9+ (for Ray Serve wrapper)

Step-by-Step Deployment

1. Provision a GPU Instance

Rent an RTX 4090 on Spheron. For detailed provisioning and SSH connection steps, see the Spheron getting-started docs. For batch workloads, use spot pricing on the A100 for the lowest cost. See GPU pricing for current rates.

```bash
# After SSHing into your Spheron instance, verify the GPU
nvidia-smi
# Expected: NVIDIA GeForce RTX 4090 or NVIDIA A100-SXM4-80GB
```

2. Install Dependencies

```bash
pip install torch==2.6.0+cu124 torchvision --index-url https://download.pytorch.org/whl/cu124
pip install "ray[serve]"
git clone https://github.com/sapientinc/HRM hrm
cd hrm && pip install -e .
```

Verify that PyTorch can see the GPU:

```python
import torch
print(torch.cuda.is_available())        # True
print(torch.cuda.get_device_name(0))    # e.g. NVIDIA GeForce RTX 4090
```

3. Download Model Weights

```python
from huggingface_hub import snapshot_download

# HRM ships three task-specific checkpoints, not one general model:
# - ARC-AGI-2:          sapientinc/HRM-checkpoint-ARC-2
# - Sudoku 9x9 Extreme: sapientinc/HRM-checkpoint-sudoku-extreme
# - Maze 30x30 Hard:    sapientinc/HRM-checkpoint-maze-30x30-hard
# Download the ARC-AGI-2 checkpoint for ARC-style reasoning tasks:
snapshot_download(
    repo_id="sapientinc/HRM-checkpoint-ARC-2",
    local_dir="./hrm-weights",
    revision="main",
)
```

FP32 checkpoint: ~100MB. FP16 checkpoint: ~50MB. Always verify the SHA checksum after download.
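
One way to do that with the standard library; compare the printed digests against the values published on the checkpoint's Hugging Face page:

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for p in sorted(Path("./hrm-weights").rglob("*")):
    if p.is_file():
        print(f"{sha256sum(p)}  {p}")
```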

HRM is not a single general model. The official repo ships three separate checkpoints, each trained from scratch on approximately 1,000 task-specific examples: ARC-AGI-2, Sudoku 9x9 Extreme, and Maze 30x30 Hard. Each checkpoint solves problems within its own task family. Downloading the ARC-AGI-2 checkpoint gives you a model for ARC-style visual reasoning patterns, not a drop-in for arbitrary constraint satisfaction or logic puzzles outside its training distribution. Pick the checkpoint that matches your target task before building your serving stack.

4. Configure the Inference Loop

```python
# pseudocode: verify class names against https://github.com/sapientinc/HRM before running
import torch
from hrm import HRMConfig

config = HRMConfig(
    model_path="./hrm-weights",
    max_deliberation_steps=8,  # increase to 16 for harder tasks
    executor_kv_cache_reuse=True,
    seed=42,
    device="cuda:0",
    dtype=torch.float16,
)
```

Set max_deliberation_steps based on task difficulty. ARC-AGI tasks typically converge in 4-6 steps. Constraint satisfaction with many variables may need 12-16.

5. Serve with FastAPI + Ray Serve

HRM is a custom recurrent architecture. vLLM cannot serve it because it is not a transformer language model. The deliberation loop runs natively with PyTorch. The code below shows the serving structure with pseudocode for the HRM-specific parts. Verify the actual checkpoint loading API against evaluate.py in the official repo before deploying.

```python
# pseudocode: verify HRM checkpoint loading against evaluate.py in the official repo
# Usage reference: torchrun --nproc-per-node 1 evaluate.py checkpoint=./hrm-weights

import ray
import torch
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class HRMServer:
    def __init__(self):
        # pseudocode: replace with actual checkpoint loading
        # checkpoint = torch.load("./hrm-weights/checkpoint.pt", map_location="cuda")
        # self.model = build_hrm_model(checkpoint)
        # self.model.eval()
        self.model = None  # placeholder until real checkpoint loading is wired in

    @app.post("/infer")
    async def infer(self, request: dict):
        # pseudocode deliberation loop; implement with the actual model forward pass:
        # with torch.no_grad():
        #     result = self.model.forward(request["prompt"], max_steps=8)
        # return {"output": result.answer, "depth": result.steps_taken}
        return {"output": None, "depth": 0}  # placeholder response

ray.init()
serve.run(HRMServer.bind())
```

```bash
# Test the endpoint
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Solve: [example structured reasoning task]"}'
```

6. Enable torch.compile for Throughput

```python
# pseudocode: as in step 4, verify class names against the official repo
import torch
from hrm import HRMConfig, HRMInference

config = HRMConfig(
    model_path="./hrm-weights",
    max_deliberation_steps=8,
    device="cuda:0",
    dtype=torch.float16,
)
model = torch.compile(HRMInference(config), mode="reduce-overhead")
```

reduce-overhead mode improves executor iteration throughput by 15-25% on RTX 4090. The first call incurs compilation overhead. Warm up the model with 5-10 requests before benchmarking.
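
A minimal warm-up-then-benchmark pattern, reusing the `model` callable from the block above; `make_example_input()` is a hypothetical helper that builds an input in your task's format:

```python
import time
import torch

example = make_example_input()  # hypothetical helper: one representative input

# Warm up: the first calls pay torch.compile's graph capture and compilation cost.
with torch.no_grad():
    for _ in range(10):
        _ = model(example)
torch.cuda.synchronize()

# Benchmark only after warm-up.
n = 100
start = time.perf_counter()
with torch.no_grad():
    for _ in range(n):
        _ = model(example)
torch.cuda.synchronize()
print(f"{n / (time.perf_counter() - start):.1f} queries/sec")
```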

Benchmarks

HRM's advantage is specific to tasks with a finite, verifiable answer space. On open-ended tasks, it loses badly.

| Model | ARC-AGI Accuracy | Throughput (queries/sec, RTX 4090) | Cost per 1,000 Correct Answers |
| --- | --- | --- | --- |
| HRM (27M, depth 8) | 40.3% | 50-200+ | ~$0.01-0.05 |
| o3-mini-high | 34.5% | API only | API pricing |
| Claude 3.7 8K | 21.2% | API only | API pricing |

Accuracy figures from the HRM paper (arXiv 2506.21734). HRM also achieves near-perfect accuracy on Sudoku 9x9 Extreme and Maze 30x30 Hard, the other two benchmarks reported in the paper.

HRM throughput varies significantly with deliberation depth and batch size. The ranges above reflect default depth 8 at batch sizes 32-256 on a single RTX 4090. Actual numbers depend on task complexity and your hardware.

Pricing fluctuates based on GPU availability. The prices above are based on 02 May 2026 and may have changed. Check current GPU pricing → for live rates.

The cost-per-correct-answer advantage is large for structured tasks. On ARC-AGI, HRM at 40.3% accuracy (self-hosted for ~$0.01-0.05 per 1K correct answers) beats o3-mini-high (34.5%) and Claude 3.7 8K (21.2%) while running on a single RTX 4090 at a fraction of API pricing.

When HRM Wins

HRM is the right call when:

  • Tasks fall within HRM's trained distributions: ARC-AGI-2, Sudoku 9x9 Extreme, or Maze 30x30 Hard patterns. Each checkpoint is task-specific and will not generalize to reasoning tasks outside its training distribution.
  • Answer space is closed: the output is one of N options or satisfies explicit constraints
  • Workload is high-volume and structured: thousands of constraint satisfaction or logic puzzles per hour
  • You need deterministic, reproducible reasoning outputs (fixed seed + deliberation loop)
  • Budget matters: cost-per-correct-answer is the primary metric
  • Low latency matters: no thinking-token generation means fast time-to-first-token

The deliberation loop finds solutions without generating intermediate tokens. On closed-answer-space tasks within its training distribution, that is exactly the right trade-off.

When HRM Loses

Escalate to a larger model when:

  • Open-ended generation: use DeepSeek R1 671B or a frontier model
  • Code synthesis: HRM has no training signal for code generation; use DeepSeek R2 or Qwen-Coder
  • Long-context summarization (>8K tokens): HRM has a limited context window
  • Broad world knowledge: 27M parameters cannot store enough factual knowledge for general QA
  • Unstructured reasoning: free-form reasoning chains require CoT models

For escalation routing, see the DeepSeek R2 deployment guide. For a comparison of open-weight frontier models as fallback options, see the open-weight frontier model showdown.

Production Checklist

  1. Deliberation depth monitoring - log deliberation_depth per request. Alert if P95 depth exceeds 12. That signals tasks beyond the model's capability, and those should route to a larger model.
  2. Early-exit heuristics - terminate the executor loop when confidence score exceeds 0.95. This avoids wasting compute on easy tasks that converge quickly.
  3. Fallback routing - when HRM's answer-distribution entropy exceeds a threshold, route to DeepSeek R2 7B distill or larger (a minimal sketch follows this list). The LLM inference router guide covers the routing layer in full.
  4. KV-cache sizing - allocate at least 4GB KV-cache per GPU for executor iterations across concurrent requests.
  5. Batch size tuning - HRM benefits from large batch sizes. Target batch size 64+ on A100 80G.
  6. Spot vs on-demand - HRM inference is stateless per request. Spot instances are safe for the HRM tier. Reserve on-demand for the fallback model only.
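
A minimal sketch of the entropy check from item 3, assuming HRM exposes per-option logits over its closed answer space (the threshold is illustrative; tune it on a validation set against your fallback budget):

```python
import torch
import torch.nn.functional as F

ENTROPY_THRESHOLD = 1.5  # nats; illustrative starting point, not a published value

def should_escalate(answer_logits: torch.Tensor) -> bool:
    """Route to the fallback model when HRM's answer distribution is flat."""
    probs = F.softmax(answer_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return bool(entropy.item() > ENTROPY_THRESHOLD)

# Usage inside the /infer handler (sketch):
# if should_escalate(result.answer_logits):
#     return await route_to_fallback(request)  # hypothetical router call
```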

Cost Analysis

| Setup | GPU | Price/hr | Queries/hr (est.) | Cost per 1K Queries |
| --- | --- | --- | --- | --- |
| HRM single | RTX 4090 | $0.79 | ~5,000-20,000 | ~$0.04-0.16 |
| HRM batched | A100 80G SXM4 | $1.64 (on-demand) | ~20,000-80,000 | ~$0.02-0.08 |
| HRM batched (spot) | A100 80G SXM4 | $0.45 (spot) | ~20,000-80,000 | ~$0.006-0.02 |
| DeepSeek R2 7B distill | A100 80G SXM4 | $1.64 | ~500-2,000 | ~$0.82-3.28 |
| DeepSeek R1 671B | 8x H100 SXM5 | ~$24.80 | ~100-400 | ~$62-248 |

Throughput estimates for HRM vary by deliberation depth and batch size. Figures above assume depth 8, batch size 64-256. For deeper cost optimization techniques, see the reasoning model inference cost guide.
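
The cost column is straightforward arithmetic you can rerun with live prices and your own measured throughput:

```python
def cost_per_1k(price_per_hr: float, queries_per_hr: float) -> float:
    """Dollars per 1,000 queries at a given hourly price and throughput."""
    return price_per_hr / queries_per_hr * 1_000

print(f"${cost_per_1k(0.79, 20_000):.3f}")  # RTX 4090, high-throughput end: ~$0.04
print(f"${cost_per_1k(0.79, 5_000):.3f}")   # RTX 4090, low end: ~$0.16
print(f"${cost_per_1k(0.45, 80_000):.4f}")  # A100 spot, high end: ~$0.006
```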

Pricing fluctuates based on GPU availability. The prices above are based on 02 May 2026 and may have changed. Check current GPU pricing → for live rates.


HRM proves that reasoning quality does not require scale. A 27M-parameter model on a single RTX 4090 handles structured reasoning tasks that would otherwise demand a multi-GPU H100 setup. Spheron's on-demand RTX 4090 and A100 instances are the cheapest place to run the small-GPU side of a two-tier reasoning stack.

Rent RTX 4090 → | Rent A100 → | View all GPU pricing →

Deploy HRM on Spheron →
