IBM Granite 4.1 ships with Apache 2.0 licensing, a matching safety classifier (Granite Guardian), and cryptographic signing on every checkpoint. That combination is rare in open-weight models and makes it worth understanding for teams with compliance requirements. This guide covers everything you need to run Granite 4.1 on H100 and H200 hardware using vLLM and LMDeploy: GPU sizing, the hybrid architecture's implications for vLLM config, signature verification, Guardian deployment, and a cost comparison against OpenAI and Anthropic enterprise tiers. If you haven't set up vLLM yet, the vLLM production deployment guide covers the baseline first.
What Is IBM Granite 4.1
IBM released the Granite 4.1 family over the course of April 2026. The family includes Granite 4.1 3B, 8B, and 30B general-purpose models plus specialized variants: Granite Code (coding tasks), Granite Vision (multimodal), and Granite Speech (ASR and TTS).
The 30B is the headlining variant: it uses a hybrid Mamba-Transformer architecture that interleaves SSM (Structured State Space Model) layers with standard attention layers through the stack. The 3B and 8B use standard dense Transformer architecture. All three have a 512K context window, though the KV cache implications of serving at that length are significant (more on this below).
| Variant | Params | Context | Architecture | License | HuggingFace ID |
|---|---|---|---|---|---|
| Granite 4.1 3B | 3B | 512K | Dense Transformer | Apache 2.0 | ibm-granite/granite-4.1-3b |
| Granite 4.1 8B | 8B | 512K | Dense Transformer | Apache 2.0 | ibm-granite/granite-4.1-8b |
| Granite 4.1 30B | 30B | 512K | Hybrid (Mamba+Attn) | Apache 2.0 | ibm-granite/granite-4.1-30b |
| Granite Guardian 8B | 8B | n/a | Dense Transformer | Apache 2.0 | ibm-granite/granite-guardian-4.1-8b |
Note on model IDs: These IDs follow IBM's established naming convention for the ibm-granite Hugging Face org. Verify the exact repository names on the IBM Granite Hugging Face page before downloading, as IBM may have adjusted naming between announcement and release.
Why Enterprises Choose Granite 4.1 Over Llama 4 and Qwen 3
Three things set Granite 4.1 apart when you're comparing it against alternatives for enterprise deployment.
Apache 2.0 with IBM backing. Llama 4 ships under Meta's Llama 4 Community License, which includes certain commercial restrictions. Qwen 3 uses Apache 2.0 but is backed by Alibaba, which creates different geopolitical considerations for some buyers. Granite 4.1's Apache 2.0 comes with IBM's enterprise support tier and commercial indemnification. Legal teams treat a vendor-backed release differently from a community-maintained one, and in regulated industries that difference matters.
Granite Guardian as a first-class safety layer. Rather than bolt-on guardrails, IBM ships a separate classifier trained specifically against Granite 4.1's data distribution. The result is a safety layer that understands the same output space as the model it's protecting. For teams also running NeMo-based orchestration, the NeMo Guardrails guide covers stacking Granite Guardian with Colang policy enforcement as complementary approaches.
Cryptographic signing. IBM signs each Granite 4.1 checkpoint. Your deployment pipeline can verify the exact artifact before serving. For regulated industries where model provenance is part of the audit trail, this is the difference between "we checked" and "we can prove it." The confidential GPU computing guide covers pairing signature verification with TEE attestation for the highest assurance environments.
For teams navigating EU AI Act compliance, Apache 2.0 plus IBM's GPAI documentation obligations simplify the compliance posture substantially. The EU AI Act compliance guide covers how Granite 4.1's Apache 2.0 license and IBM GPAI documentation fit into a compliant AI infrastructure stack for high-risk applications.
Hybrid Mamba-Transformer Architecture: What It Means for GPU Memory
The 30B variant is where the architecture gets interesting for GPU deployments. Standard transformer attention layers compute query-key products across the full sequence at every layer, so attention compute grows quadratically with sequence length while the KV cache grows linearly. At 512K context on a pure transformer, that is (512K)^2 ≈ 262 billion attention score pairs per head per layer to compute (and to store, if the score matrix is materialized rather than streamed by a fused kernel such as FlashAttention).
SSM (Mamba) layers replace the attention computation with a recurrent state. That state stays bounded at the state dimension regardless of sequence length. Granite 4.1 30B interleaves SSM layers with standard attention layers, giving you linear memory scaling from the SSM layers and complex reasoning capability from the attention layers.
This pattern will be familiar to readers who've deployed Nemotron 3 Super, which uses a similar hybrid approach. The same SSM-aware vLLM configuration is covered in the deployment section below. The Nemotron 3 Super deployment guide explains the hybrid MoE variant of this pattern in detail.
Practical impact: at 128K context, the hybrid 30B requires substantially less KV cache VRAM than a pure-transformer 30B of similar size.
| Configuration | KV Cache at 32K ctx | KV Cache at 128K ctx | KV Cache at 512K ctx |
|---|---|---|---|
| Pure-transformer 30B | ~20 GB | ~80 GB | ~320 GB |
| Granite 4.1 30B hybrid | ~12 GB | ~45 GB | ~90 GB (estimated) |
Estimates for illustration. Actual values depend on GQA configuration, batch size, and dtype.
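The linear growth in the pure-transformer row follows from the usual KV cache size formula: two tensors (keys and values) per attention layer, each of shape KV heads × head dimension × sequence length. The sketch below is a rough estimator only. The layer counts, KV head count, and head dimension are hypothetical values chosen for illustration, not Granite 4.1's published configuration, so the outputs will not match the table above.

```python
def kv_cache_gib(seq_len: int, n_attn_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """Estimate KV cache size in GiB for the attention layers only.

    The leading factor of 2 covers keys and values. SSM layers keep a
    fixed-size recurrent state instead, which this formula does not count.
    """
    total_bytes = (2 * n_attn_layers * n_kv_heads * head_dim
                   * seq_len * bytes_per_elem * batch)
    return total_bytes / 2**30


# Hypothetical stacks: 48 attention layers vs. a hybrid with 12 attention layers.
for ctx in (32_768, 131_072, 524_288):
    dense = kv_cache_gib(ctx, n_attn_layers=48, n_kv_heads=8, head_dim=128)
    hybrid = kv_cache_gib(ctx, n_attn_layers=12, n_kv_heads=8, head_dim=128)
    print(f"ctx={ctx:>7}: all-attention {dense:6.1f} GiB, hybrid {hybrid:6.1f} GiB")
```

Swapping in the real layer and head counts from the model's config.json gives a first-order figure to compare against what vLLM reports at startup.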
The 512K context number is the architectural maximum. For most production workloads, 32K-128K is the practical range. Reserve 512K for specific use cases: long document Q&A, contract analysis, or repository-level code review where the full context genuinely matters.
Hardware Sizing on Spheron: 3B, 8B, and 30B
VRAM budget table using the same methodology as the Nemotron Ultra deployment guide:
| Model | Precision | Weight VRAM | Overhead (15%) | KV Cache (128K ctx) | Total | Min GPUs |
|---|---|---|---|---|---|---|
| Granite 4.1 3B | BF16 | ~6 GB | ~1 GB | ~20 GB | ~27 GB | 1x A100 40GB (RTX 4090 only at ≤32K context) |
| Granite 4.1 3B | INT4 (AWQ) | ~2 GB | ~0.4 GB | ~20 GB | ~22 GB | 1x A100 40GB |
| Granite 4.1 8B | BF16 | ~16 GB | ~2.4 GB | ~30 GB | ~48 GB | 1x H100 SXM5 |
| Granite 4.1 8B | FP8 | ~8 GB | ~1.2 GB | ~30 GB | ~39 GB | 1x A100 80GB |
| Granite 4.1 30B | BF16 | ~60 GB | ~9 GB | ~45 GB | ~114 GB | 2x H100 SXM5 |
| Granite 4.1 30B | FP8 | ~30 GB | ~4.5 GB | ~45 GB | ~79 GB | 1x H100 SXM5 (tight) |
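The arithmetic behind each row can be reproduced with a few lines. This is a minimal sketch of the table's methodology (weights = parameters × bytes per parameter, plus 15% overhead, plus a KV cache figure you supply); the function and dictionary names are illustrative, and the INT4 entry ignores the quantization scales that push real AWQ checkpoints slightly higher.

```python
# Approximate bytes per parameter for each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def vram_budget_gb(params_billion: float, precision: str,
                   kv_cache_gb: float, overhead: float = 0.15) -> dict:
    """Return a weights / overhead / KV cache / total breakdown in GB."""
    weights = params_billion * BYTES_PER_PARAM[precision]
    overhead_gb = weights * overhead
    return {
        "weights": weights,
        "overhead": overhead_gb,
        "kv_cache": kv_cache_gb,
        "total": weights + overhead_gb + kv_cache_gb,
    }


# Rows from the table above, using its KV cache estimates at 128K context.
print(vram_budget_gb(8, "bf16", kv_cache_gb=30))
print(vram_budget_gb(30, "fp8", kv_cache_gb=45))
```

The 30B FP8 row lands within a fraction of a gigabyte of an 80 GB card at 128K context, which is why the node recommendations cap the context length for that configuration.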
Node recommendations:
- 3B: Single A100 40GB or RTX 4090 (RTX 4090 viable at ≤32K context only). From $1.10/hr on-demand on Spheron.
- 8B BF16: Single H100 SXM5. H100 SXM5 instances start from $5.07/hr on-demand (spot from $2.91/hr).
- 30B FP8: Single H100 SXM5 at tight utilization. Use --gpu-memory-utilization 0.92 and limit --max-model-len to 32768-65536 to avoid OOM.
- 30B BF16: 2x H100 SXM5 or single H200. H200 SXM5 is available from $5.92/hr on-demand, with spot pricing from $1.40/hr.
Deploy Granite 4.1 with vLLM on Spheron H100 / H200
Prerequisites:
- CUDA 12.4+ (H100 FP8 support)
- Python 3.10+
- vLLM with SSM/Mamba kernel support (check the vLLM changelog for the current minimum version required for Granite 4.1 30B hybrid)
- Persistent storage volume (50+ GB for 8B BF16, 70+ GB for 30B FP8)
Step 1: Provision the node
Log into app.spheron.ai, navigate to the GPU catalog, select H100 SXM5. For the 30B BF16, select a 2-GPU configuration or an H200 node.
Step 2: Verify the checkpoint signature (see dedicated section below)
Step 3: Install vLLM
pip install vllm # use the latest release; check changelog for SSM kernel support

Step 4a: Launch for 8B (single GPU)
vllm serve ibm-granite/granite-4.1-8b \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--max-model-len 131072 \
--port 8000

Step 4b: Launch for 30B hybrid
For single H100 SXM5 with FP8 quantization:
vllm serve ibm-granite/granite-4.1-30b \
--quantization fp8 \
--tensor-parallel-size 1 \
--no-enable-chunked-prefill \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--max-model-len 65536 \
--port 8000

For 2x H100 SXM5 with BF16 (no quantization):
vllm serve ibm-granite/granite-4.1-30b \
--tensor-parallel-size 2 \
--no-enable-chunked-prefill \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--max-model-len 131072 \
--port 8000

Treat the --no-enable-chunked-prefill flag as required for the hybrid 30B. SSM layers cannot correctly initialize their recurrent state across chunk boundaries unless the chunking is SSM-aware. This is a correctness issue, not a performance issue: the failure mode is wrong outputs, not just slower ones. Keep chunked prefill disabled by default and re-enable it only after validating correct outputs on your specific workload and vLLM version.
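To see why state handling at chunk boundaries matters, consider a toy scalar recurrence. This is not Mamba's selective scan or anything vLLM runs; the coefficients and inputs are made up. It only illustrates that a recurrent layer gives the same answer when the state is carried from one chunk to the next, and a different answer when the state is reset at the boundary.

```python
def scan(xs, a=0.9, b=0.1, h0=0.0):
    """Run the toy recurrence h_t = a * h_{t-1} + b * x_t and return the final state."""
    h = h0
    for x in xs:
        h = a * h + b * x
    return h


xs = [float(i) for i in range(1, 9)]

# Reference: process the whole sequence in one pass.
full = scan(xs)

# Chunked, with the state handed from the first chunk to the second.
state = 0.0
for chunk in (xs[:4], xs[4:]):
    state = scan(chunk, h0=state)
carried = state

# Chunked, with the state wrongly reset at the boundary (the failure mode).
reset = scan(xs[4:])

print(f"full={full:.6f} carried={carried:.6f} reset={reset:.6f}")
```

The carried result matches the single-pass result, while the reset variant silently drops everything the first chunk contributed. A serving stack that chunks prefill has to do the equivalent of the carried case for every SSM layer.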
Step 5: Validate with a test request
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibm-granite/granite-4.1-8b",
"messages": [{"role": "user", "content": "Explain KV cache in three sentences."}],
"max_tokens": 256
}'

For full Spheron-specific vLLM setup, see the Spheron vLLM server quickstart.
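The same validation request can be sent from Python using only the standard library. This is a minimal sketch assuming the server from Step 4a is listening on localhost:8000; the helper names build_chat_payload and chat are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(base_url: str, payload: dict, timeout: float = 60.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# payload = build_chat_payload("ibm-granite/granite-4.1-8b",
#                              "Explain KV cache in three sentences.")
# print(chat("http://localhost:8000", payload))
```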
Deploy Granite 4.1 with LMDeploy and TurboMind
LMDeploy's TurboMind backend gives lower memory footprint for AWQ quantization and tends to have better latency on A100-class hardware. This section covers the dense 3B and 8B variants only. The 30B hybrid Mamba-Transformer may not be supported by TurboMind's attention-optimized kernels out of the box. Check LMDeploy's architecture compatibility list before attempting the 30B with TurboMind.
Install LMDeploy:
pip install lmdeploy

Serve the 8B with TurboMind:
lmdeploy serve api_server ibm-granite/granite-4.1-8b \
--backend turbomind \
--tp 1 \
--server-port 8000

AWQ quantization for the 8B (reduces to ~4 bits):
lmdeploy lite auto_awq ibm-granite/granite-4.1-8b \
--work-dir ./granite-4.1-8b-awq \
--calib-dataset pileval

Then serve the quantized checkpoint:
lmdeploy serve api_server ./granite-4.1-8b-awq \
--backend turbomind \
--tp 1 \
--server-port 8000

AWQ on the 8B reduces weight VRAM from ~16 GB to ~4 GB, letting a single A100 40GB serve it with substantial KV cache headroom. For the full LMDeploy setup on Spheron, see docs.spheron.ai/quick-guides/llms/frameworks/lmdeploy.
Verifying the Cryptographic Signature Before Deployment
IBM signs Granite 4.1 checkpoints using sigstore-based tooling. Before serving any Granite 4.1 checkpoint in a production environment, verify the signature. This confirms the artifact matches what IBM published and has not been tampered with in transit.
Step 1: Install sigstore
pip install sigstore

Step 2: Download the model and signature
Download the checkpoint from the ibm-granite Hugging Face org. The model card for each variant documents the signature file location and IBM's certificate identity.
huggingface-cli download ibm-granite/granite-4.1-8b \
--local-dir /models/granite-4.1-8b

Step 3: Verify the signature
sigstore verify identity \
--cert-identity <IBM cert identity from model card> \
--cert-oidc-issuer https://token.actions.githubusercontent.com \
/models/granite-4.1-8b/model.safetensors

Confirm that verification succeeds (a zero exit status and an OK line for the file) before proceeding to deployment. The exact certificate identity and any additional flags are documented on the Hugging Face model card for each Granite 4.1 variant.
Why this matters: for regulated industries, cryptographic signing provides an auditable chain of custody from IBM's publishing infrastructure to your deployment environment. When paired with TEE attestation, you can prove to an auditor that the exact IBM-published weights ran in an isolated hardware environment. For EU AI Act high-risk system documentation requirements, signing and attestation fit into the broader model governance checklist (both topics are covered by the guides linked earlier in this post).
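For the audit trail itself, one option is to record a digest of every weight file at the moment verification passes. The sketch below is an assumption about what such a record could look like, not an IBM or sigstore requirement, and it does not replace signature verification. It hashes each .safetensors file in a model directory with SHA-256; the function names and the file pattern are illustrative.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large checkpoints never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(model_dir: str, pattern: str = "*.safetensors") -> dict:
    """Map each matching file name in model_dir to its SHA-256 hex digest."""
    root = Path(model_dir)
    return {path.name: sha256_file(path) for path in sorted(root.glob(pattern))}


# Example: write the manifest next to your deployment logs.
# manifest = build_manifest("/models/granite-4.1-8b")
# Path("granite-4.1-8b.manifest.json").write_text(json.dumps(manifest, indent=2))
```

Re-running build_manifest at serve time and comparing against the stored manifest gives a cheap check that the files on disk are still the ones that were verified.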
Adding Granite Guardian for Input/Output Safety Classification
Granite Guardian is a separate inference endpoint that classifies inputs and outputs for safety. Run it alongside your main Granite 4.1 server.
Serve Granite Guardian in a separate process:
vllm serve ibm-granite/granite-guardian-4.1-8b \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.85 \
--port 8001

Note: The Guardian model ID above (granite-guardian-4.1-8b) is confirmed from the ibm-granite Hugging Face org. Verify availability before deploying.
Proxy logic (Python pseudo-code):
import httpx  # HTTP client the two helper coroutines are expected to use

# classify_with_guardian and generate_with_granite are placeholder helpers,
# not library functions: each would POST to the vLLM endpoint on the given port.
async def route_request(user_message: str) -> str:
    # Step 1: classify input through Guardian
    guard_resp = await classify_with_guardian(user_message, port=8001)
    if not guard_resp["safe"]:
        return "I can't help with that request."

    # Step 2: forward to main Granite 4.1 model
    main_resp = await generate_with_granite(user_message, port=8000)
    output_text = main_resp["choices"][0]["message"]["content"]

    # Step 3: classify output through Guardian
    out_guard = await classify_with_guardian(output_text, port=8001)
    if not out_guard["safe"]:
        return "I can't provide that response."
    return output_text

GPU budgeting with Guardian stacked:
- 8B + Guardian 8B on 2x H100: assign main model to GPU 0, Guardian to GPU 1 via CUDA_VISIBLE_DEVICES.
- Single H100 SXM5: run both processes on the full 80 GB GPU with process-level memory limits rather than MIG. Partial-GPU MIG profiles on H100 SXM5 (for example 1g.10gb, 2g.20gb, and 3g.40gb) top out at the 40 GB class. A 6g.60gb profile does not exist, and a 40 GB-class slice (~36 GB usable on 3g.40gb) is less than the ~48 GB needed for Granite 4.1 8B BF16 at 128K context. Instead, set CUDA_VISIBLE_DEVICES=0 for both processes and use --gpu-memory-utilization 0.62 for the main 8B BF16 model (~50 GB) and --gpu-memory-utilization 0.22 for Guardian FP8 (~18 GB), keeping combined allocation within the 80 GB physical limit.
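Since --gpu-memory-utilization is a fraction of the GPU's total memory, the shared-GPU split is easy to sanity-check before launching either process. A small sketch, with a hypothetical helper name and an 80 GB card assumed:

```python
def check_gpu_split(total_gb: float, fractions: dict) -> dict:
    """Convert per-process utilization fractions to GB and reject oversubscription."""
    combined = sum(fractions.values())
    if combined > 1.0:
        raise ValueError(f"fractions sum to {combined:.2f}, which exceeds the GPU")
    return {name: round(total_gb * frac, 1) for name, frac in fractions.items()}


# The split suggested above for a single 80 GB H100 SXM5.
print(check_gpu_split(80, {"granite-8b-bf16": 0.62, "guardian-fp8": 0.22}))
```

The two fractions sum to 0.84, which leaves roughly 13 GB unallocated for the CUDA context, allocator fragmentation, and other processes.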
For teams who want Colang policy enforcement on top of Guardian's classifier-based approach, NeMo Guardrails can stack with Guardian as complementary layers (see the NeMo Guardrails guide linked earlier in this post).
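The proxy logic earlier in this section calls two helpers, classify_with_guardian and generate_with_granite, without defining them. One possible shape for them is sketched below using only the standard library. Two assumptions are worth flagging: that Guardian is queried through the OpenAI-compatible chat endpoint with its default risk definition, and that its reply starts with "Yes" when a risk is detected and "No" otherwise. Check the Guardian model card for the actual prompt format and output schema before relying on this.

```python
import asyncio
import json
import urllib.request

GUARDIAN_MODEL = "ibm-granite/granite-guardian-4.1-8b"
GRANITE_MODEL = "ibm-granite/granite-4.1-8b"


def _post_chat(port: int, payload: dict) -> dict:
    """Blocking POST to a local OpenAI-compatible chat completions endpoint."""
    request = urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def parse_guardian_verdict(text: str) -> dict:
    """Assumed format: reply starts with 'No' when no risk is found.

    Anything else, including an empty or malformed reply, counts as unsafe.
    """
    return {"safe": text.strip().lower().startswith("no")}


async def classify_with_guardian(text: str, port: int = 8001) -> dict:
    payload = {
        "model": GUARDIAN_MODEL,
        "messages": [{"role": "user", "content": text}],
        "max_tokens": 16,
        "temperature": 0,
    }
    body = await asyncio.to_thread(_post_chat, port, payload)
    return parse_guardian_verdict(body["choices"][0]["message"]["content"])


async def generate_with_granite(text: str, port: int = 8000) -> dict:
    payload = {
        "model": GRANITE_MODEL,
        "messages": [{"role": "user", "content": text}],
        "max_tokens": 512,
    }
    return await asyncio.to_thread(_post_chat, port, payload)
```

The parser fails closed on purpose: if Guardian returns something unexpected, the request is blocked rather than passed through.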
Granite Code for Self-Hosted AI Coding Assistants
IBM's Granite Code series is a practical alternative to GitHub Copilot for teams with data residency requirements. The key differentiator over other coding models is Apache 2.0 licensing combined with cryptographic signing: legal can verify the exact artifact, and your code never leaves your network.
Note: As of this writing, a Granite Code 4.1 variant has not been published to the ibm-granite Hugging Face org. The current Granite Code release is part of the Granite 3.x line. Check the ibm-granite organization on Hugging Face for the latest Granite Code model ID before deploying.
For teams comparing coding models, Qwen2.5-Coder delivers strong HumanEval scores and is the most-deployed self-hosted coding model as of early 2026. Granite Code trades some benchmark coverage for the Apache 2.0 + IBM signing combination. If your organization has an IBM enterprise agreement or operates in a GPAI-regulated environment where model provenance documentation is mandatory, Granite Code's compliance story is cleaner than Qwen's.
Deployment follows the same pattern as Granite 4.1 8B. Substitute the correct Granite Code model ID from the ibm-granite org:
vllm serve ibm-granite/<granite-code-model-id> \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--max-model-len 65536 \
--port 8000

Tool Calling and Instruction Following
Granite 4.1 supports tool calling without extended chain-of-thought reasoning. This produces more predictable latency for production agents: you get a bounded response time rather than variable-length thinking traces. For comparison, models like Nemotron Ultra or QwQ generate extended reasoning before answering.
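On the client side, a tool-calling round trip needs a small dispatch step once the model answers with a tool call. The sketch below assumes the standard OpenAI-style response shape, where the assistant message carries a tool_calls list and each call's arguments arrive as a JSON string. The get_weather implementation is a hypothetical stub.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"Weather lookup for {city} is not implemented in this sketch."


# Registry mapping tool names (as declared in the request) to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and build the 'tool' messages to send back."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        function = TOOLS[call["function"]["name"]]
        arguments = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": function(**arguments),
        })
    return results
```

Append the assistant message and the returned tool messages to the conversation, then call the endpoint again to get the final answer. Note that vLLM generally needs to be launched with --enable-auto-tool-choice and a --tool-call-parser for "tool_choice": "auto" to work; which parser matches Granite 4.1 is something to confirm in the vLLM documentation.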
Standard tool calling via OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ibm-granite/granite-4.1-8b",
"messages": [{"role": "user", "content": "What is the weather in London?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"]
}
}
}],
"tool_choice": "auto"
}'

Cost Per Million Tokens on Spheron vs OpenAI/Anthropic Enterprise
Pricing fetched from the Spheron GPU offers API on 01 Jun 2026. Per-GPU rates using price / gpuCount:
| Config | Spheron $/hr | Throughput (tok/s est.) | $/1M output tokens | Notes |
|---|---|---|---|---|
| Granite 4.1 8B BF16, 1x H100 SXM5 | $5.07 | ~2,500 | ~$0.56 | On-demand |
| Granite 4.1 8B BF16, 1x H100 SXM5 (spot) | $2.91 | ~2,500 | ~$0.32 | Spot |
| Granite 4.1 30B FP8, 1x H100 SXM5 | $5.07 | ~800 | ~$1.76 | On-demand, single GPU tight |
| Granite 4.1 30B BF16, 1x H200 SXM5 | $5.92 | ~600 | ~$2.74 | On-demand |
| Granite 4.1 30B BF16, 1x H200 SXM5 (spot) | $1.40 | ~600 | ~$0.65 | Spot |
| OpenAI GPT-4o Enterprise | n/a | n/a | ~$15.00 | API tier |
| Anthropic Claude 3.5 Sonnet Enterprise | n/a | n/a | ~$15.00 | API tier |
Throughput estimates are approximate and depend on batch size, sequence length, and quantization.
Break-even calculation for the 8B on-demand at $0.56/M tokens vs GPT-4o at $15/M: at full utilization, self-hosting undercuts the API price by about 96% per token. A dedicated GPU bills by the hour whether or not it is busy, though, so the real break-even depends on volume. A dedicated H100 SXM5 costs about $3,650 per month (720 hours at $5.07/hr). API and self-hosting costs are equal at $3,650 / ($15/M) ≈ 243M tokens/month, roughly 8M tokens/day. At 10M tokens/day (300M tokens/month), API spend at $15/M is $4,500 per month vs $3,650 for the dedicated GPU.
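Both numbers in this section come from simple arithmetic that is worth re-running with your own prices and measured throughput. A minimal sketch, with illustrative function names and the same 720-hour month used above:

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Cost per 1M output tokens for a GPU kept fully busy."""
    millions_of_tokens_per_hour = tokens_per_second * 3600 / 1_000_000
    return gpu_hourly_usd / millions_of_tokens_per_hour


def break_even_tokens_millions(gpu_hourly_usd: float, api_usd_per_million: float,
                               hours_per_month: float = 720) -> float:
    """Monthly token volume (in millions) where a dedicated GPU costs the same as the API."""
    return gpu_hourly_usd * hours_per_month / api_usd_per_million


print(f"8B on-demand: ${cost_per_million_tokens(5.07, 2500):.2f} per 1M tokens")
print(f"Break-even vs $15/M API: {break_even_tokens_millions(5.07, 15.0):.0f}M tokens/month")
```

The first function assumes the GPU is saturated at the estimated throughput. At lower utilization the effective per-token cost rises proportionally, which is why the break-even volume is the more useful planning figure.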
Pricing fluctuates based on GPU availability. The prices above are as of 01 Jun 2026 and may have changed. Check current GPU pricing for live rates.
512K Context vs Practical Deployment
The 512K context number needs some context. Serving at 512K requires enormous KV cache VRAM even with the hybrid 30B's SSM efficiency advantage. Most production workloads don't need 512K. The practical range is 32K-128K:
- 32K: Chat applications, document Q&A on shorter docs, code review on individual files
- 128K: Multi-file repository analysis, long contract review, research paper summarization
- 512K: Full repository ingestion, book-length document analysis, extended multi-session workflows
Increase --max-model-len only when your actual use case needs the length. Every additional token of max context costs VRAM that could otherwise go to larger batch sizes and higher throughput.
Production Checklist
- GPU provisioned with correct VRAM tier for chosen precision
- CUDA 12.4+ verified (nvcc --version)
- vLLM installed with SSM kernel support (check vLLM changelog for current minimum version)
- Checkpoint signature verified via sigstore before first serve
- vLLM launched with --no-enable-chunked-prefill for the 30B hybrid
- Health endpoint verified: curl http://localhost:8000/health
- Granite Guardian running on port 8001 if safety classification required
- --max-model-len set to your actual required context, not 512K unless you need it
- GPU utilization monitored via nvidia-smi dmon -s u
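The health-endpoint item is easy to automate in a deployment script, since large checkpoints can take several minutes to load. A minimal sketch, assuming the server exposes /health on port 8000 as in the checklist; the function names and retry settings are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe: Callable[[], bool] = http_health_probe,
                       attempts: int = 60, interval_s: float = 5.0) -> bool:
    """Poll the probe until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(interval_s)
    return False


# Example: block a deploy script until the server is ready.
# if not wait_until_healthy():
#     raise SystemExit("vLLM did not become healthy in time")
```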
IBM Granite 4.1's Apache 2.0 license and 512K context make it a strong default for enterprise teams who need a vendor-backed open-weight model with data residency control. Spheron provides bare-metal H100 and H200 nodes with per-minute billing and no hyperscaler lock-in.
Quick Setup Guide
Determine whether you need the 3B (fits on A100 40GB or RTX 4090), 8B BF16 (single H100 SXM5), 8B FP8 (single A100 80GB), 30B FP8 (single H100 SXM5 tight), or 30B BF16 (2x H100 SXM5 or 1x H200). Use the VRAM budget table in this guide before provisioning. Note that 512K context is the architectural maximum; budget your KV cache based on your actual expected context length (32K-128K for most workloads).
Log into app.spheron.ai, navigate to the GPU catalog, and select H100 SXM5 for 8B BF16 and 30B deployments, A100 80GB for budget-sensitive 8B FP8, or A100 40GB for 3B work. Note your public IP and SSH key after provisioning.
Install the sigstore CLI with pip install sigstore. Download the Granite 4.1 checkpoint from ibm-granite/ on Hugging Face along with the accompanying signature file. Run: sigstore verify identity --cert-identity <IBM cert identity from model card> --cert-oidc-issuer https://token.actions.githubusercontent.com <model file>. Confirm that verification succeeds before proceeding. The exact certificate identity and any additional flags are documented on the Hugging Face model card for each Granite 4.1 variant.
Install a recent version of vLLM with SSM kernel support: pip install vllm (check the vLLM changelog for the minimum version required for Granite 4.1 30B hybrid). Download the checkpoint using huggingface-cli: huggingface-cli download ibm-granite/granite-4.1-8b --local-dir /models/granite-4.1-8b. For the 30B hybrid: huggingface-cli download ibm-granite/granite-4.1-30b --local-dir /models/granite-4.1-30b.
For the 8B single-GPU: vllm serve ibm-granite/granite-4.1-8b --tensor-parallel-size 1 --gpu-memory-utilization 0.90 --max-model-len 131072 --port 8000. For the 30B FP8 single-GPU: vllm serve ibm-granite/granite-4.1-30b --quantization fp8 --tensor-parallel-size 1 --no-enable-chunked-prefill --trust-remote-code --gpu-memory-utilization 0.92 --max-model-len 65536 --port 8000. For 30B BF16 on 2x H100: use --tensor-parallel-size 2 without --quantization fp8.
Serve Granite Guardian in a separate process: vllm serve ibm-granite/granite-guardian-4.1-8b --tensor-parallel-size 1 --gpu-memory-utilization 0.85 --port 8001. Route every user input through Guardian on port 8001 first. If the response indicates safe, forward the request to your main Granite 4.1 server on port 8000. Pipe the model output back through Guardian for output classification before returning to the user.
Frequently Asked Questions
Granite 4.1 8B at BF16 requires about 48 GB total VRAM including KV cache at 128K context, so a single H100 SXM5 (80 GB) handles it comfortably with headroom for batching. For budget deployments, Granite 4.1 8B at FP8 fits on a single A100 80GB. The 3B variant runs on a single A100 40GB or RTX 4090.
vLLM supports the dense 3B and 8B variants directly. For the 30B hybrid (SSM + attention interleaved), you need a vLLM release with Mamba kernel support; check the vLLM changelog for the minimum version that covers the Granite 4.1 30B hybrid. Pass --no-enable-chunked-prefill on initial launch for the 30B hybrid, since SSM layers cannot correctly initialize recurrent state across chunk boundaries without SSM-aware chunking. Re-enable chunked prefill only after validating correct outputs on your workload.
Granite Guardian is a separate IBM safety classifier trained on the same data distribution as Granite 4.1 to detect harmful, risky, or policy-violating content. Deploy it as a second vLLM endpoint (port 8001) and route every request through it before forwarding to your main Granite 4.1 server. This adds 15-40ms per request at p50 with co-location on the same node.
The three practical differentiators are licensing, safety tooling, and supply chain. Granite 4.1 ships under Apache 2.0 with commercial indemnification from IBM. Llama 4 uses Meta's Llama 4 Community License, which includes certain commercial restrictions. Qwen 3 uses Apache 2.0 but without IBM's enterprise support tier. Granite ships Granite Guardian as a matching safety classifier and cryptographically signs each checkpoint so deployment pipelines can verify artifact integrity before serving.
Checkpoint signatures can be verified before deployment. IBM signs Granite 4.1 checkpoints using sigstore-based tooling. Download the model from the ibm-granite Hugging Face org along with its signature file, then use the sigstore CLI to verify the signature against the certificate identity IBM publishes on the model card. The model card for each Granite 4.1 variant documents the exact verification commands.
