The specific attack that blocks GPU cloud adoption in healthcare, finance, and defense is not DDoS or credential theft. It is this: a privileged process on the cloud host, whether a hypervisor, a management daemon, or a rogue admin, can read your model weights and inference inputs directly out of GPU VRAM while the GPU is computing. No TLS, no disk encryption, no network firewall stops this. The data is plaintext in VRAM. NVIDIA Confidential Computing mode closes this gap at the hardware level. For the regulatory context driving these requirements, see our EU AI Act compliance guide for GPU cloud deployments. This guide covers the hardware support matrix, how CC mode actually works, a working deployment walkthrough on H100 and B200, KMS integration, and an honest look at performance trade-offs.
One delivery note before the technical sections: Spheron does not expose CC mode on the on-demand GPU marketplace. CC mode is available only through reserved commitments, where Spheron works with the data center partner to enable CC at the BIOS/VBIOS level on the target host before the instance is handed over. The rest of this guide assumes that reserved delivery path.
The Threat Model
Standard GPU cloud security covers three things: TLS for data in transit, disk encryption for data at rest, and network isolation via VPC and security groups. All three are necessary. None of them protect data while the GPU is computing.
VRAM is plaintext during computation on non-CC GPUs. The hypervisor has direct memory access to the physical GPU. A compromised management layer, a live migration operation, or a malicious insider with hypervisor access can attach to the GPU's address space and read whatever is in HBM. This is not theoretical. GPU memory scanning tools are publicly available and trivial to use from a privileged host context.
Cold-boot attacks on HBM. High-bandwidth memory retains contents briefly after power loss. Physical access to a GPU node before memory fully discharges enables extraction of model weights and recent inference inputs. This is a more constrained attack vector but relevant for colocation environments and high-security data center scenarios.
PCIe bus interception. Data moves between CPU and GPU over PCIe. Without PCIe bus encryption, a hardware DMA analyzer connected to the PCIe slot reads all traffic in the clear. This requires physical access but is the kind of threat model that defense and regulated finance teams actually think about.
Multi-tenant hypervisor boundaries. On shared-node GPU instances, VRAM isolation between tenant VMs depends entirely on the hypervisor. A hypervisor vulnerability or a misconfigured vGPU partition breaks that isolation. The boundary is software, not hardware.
What CC mode provides: hardware-enforced VRAM encryption using a key generated inside the GPU security processor and never exposed to host software, PCIe bus encryption so traffic between CPU and GPU is ciphertext, and a cryptographic attestation mechanism that lets you verify the GPU's firmware and security state before trusting it with sensitive data.
The threat model changes from "trust the cloud provider's operational controls" to "trust the hardware cryptography, verify via attestation."
Hardware Support Matrix
A CC-capable GPU needs three things that older architectures lack: a dedicated security processor on the GPU die, hardware encryption circuits integrated into the HBM memory subsystem, and encrypted PCIe support on the host CPU side. NVIDIA introduced this with Hopper. Ampere (A100) and older do not have the on-die security processor and cannot run CC mode.
| GPU | Architecture | CC Mode | VRAM Encryption | PCIe Bus Encryption | NVLink Encryption | Attestation Method | VRAM Capacity |
|---|---|---|---|---|---|---|---|
| H100 SXM5 | Hopper | Yes | AES-256-GCM | Yes (SEV-SNP/TDX) | Not available | NRAS / local SDK | 80 GB HBM3 |
| H100 PCIe | Hopper | Yes | AES-256-GCM | Yes (SEV-SNP/TDX) | N/A | NRAS / local SDK | 80 GB HBM3 |
| H200 SXM5 | Hopper | Yes | AES-256-GCM | Yes (SEV-SNP/TDX) | Not available | NRAS / local SDK | 141 GB HBM3e |
| B200 SXM6 | Blackwell | Yes | AES-256-GCM | Yes (PCIe 5.0 encrypted) | Yes | NRAS / local SDK | 192 GB HBM3e |
| GB200 | Grace-Blackwell | Yes | AES-256-GCM | Yes (unified memory) | Yes | NRAS + CPU TEE | 192 GB HBM3e per GPU |
| MI300X | CDNA 3 | AMD SEV-SNP (host-level) | Host VM envelope | Host-level only | N/A | AMD SEV attestation | 192 GB HBM3 |
The MI300X row requires a precise note: AMD's approach integrates the GPU under the host VM's SEV-SNP envelope rather than providing a dedicated GPU-level TEE. The GPU is inside the trusted execution boundary, but attestation covers the entire VM, not specifically the GPU's firmware and VRAM encryption state. For regulated buyers who need GPU-level attestation specifically, this distinction matters: NVIDIA CC mode produces a GPU-specific attestation report signed by the GPU's security processor, not a VM-level attestation.
Spheron H200 SXM5 hardware supports the full Hopper CC mode stack with 141 GB HBM3e, making it the highest-VRAM option for confidential inference. Blackwell B200 hosts add NVLink encryption for multi-GPU confidential workloads. CC mode on Spheron is delivered through reserved commitments only: the on-demand marketplace ships standard (non-CC) instances, and CC mode is enabled at the BIOS/VBIOS level by the data center partner as part of the reservation setup. If you just need standard H200 or B200 capacity, see H200 on Spheron and bare-metal B200 instances.
How NVIDIA Confidential Computing Mode Works
Encrypted VRAM and the VRAM Security Boundary
The H100 and newer GPUs include a dedicated Confidential Computing Engine (CCE) integrated on the GPU die. When CC mode is active, every write to HBM is encrypted using AES-256-GCM before the data leaves the CCE. The encryption key is generated inside the GPU security processor during device initialization and never leaves the chip. Host software, including the hypervisor, the cloud provider's management plane, and the CUDA driver running on the host, cannot access the key or read plaintext VRAM.
The AES-256-GCM cipher is hardware-accelerated in dedicated circuits alongside the HBM controller. It does not run through the CUDA compute pipeline (tensor cores, SMs, etc.), which means the encryption/decryption overhead is decoupled from compute throughput. NVIDIA's benchmarks put this overhead at under 3% for large matrix operations typical of transformer inference.
From the workload's perspective, nothing changes. The CUDA programming model is identical. The GPU driver inside the trusted execution environment handles encryption transparently. The model weights are decrypted inside the CCE before reaching the compute units, and activations are encrypted when written back to VRAM.
PCIe Bus and NVLink Encryption
When the CPU sends data to the GPU over PCIe (model weights during load, KV cache updates, etc.), that data needs to be encrypted on the PCIe bus to prevent the bus interception and DMA analyzer attacks described above.
On H100 and H200, PCIe bus encryption is handled by integrating the GPU into the CPU's trusted execution environment. On AMD EPYC servers, this uses SEV-SNP. On Intel Xeon servers, it uses TDX. The CPU's memory encryption engine encrypts PCIe traffic, so data on the PCIe bus is ciphertext. An analyzer on the PCIe slot sees only encrypted payloads.
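A quick way to sanity-check which CPU-side TEE a platform reports is to scan the CPU flags. This is an illustrative sketch, not an official NVIDIA, AMD, or Intel tool: the flag names ("sev_snp", "tdx_guest") are kernel-dependent assumptions, and the sample string below is fabricated for demonstration. On a real host you would read /proc/cpuinfo.

```python
# Illustrative helper (not an official tool): scan /proc/cpuinfo-style text
# for CPU TEE flags. Flag names are kernel-dependent assumptions:
# "sev_snp" on AMD EPYC hosts, "tdx_guest" inside an Intel TDX guest.
def detect_cpu_tee(cpuinfo_text: str) -> str:
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    if "sev_snp" in flags:
        return "SEV-SNP"
    if "tdx_guest" in flags:
        return "TDX"
    return "none"

# Demonstration with a fabricated cpuinfo fragment; read /proc/cpuinfo
# on a real host instead.
sample = "processor\t: 0\nflags\t\t: fpu vme sev sev_es sev_snp\n"
print(detect_cpu_tee(sample))  # → SEV-SNP
```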
B200 and GB200 add NVLink encryption in addition to PCIe encryption. In SXM configurations where multiple GPUs communicate over NVLink (model parallelism, all-reduce operations), NVLink traffic between GPUs is also encrypted. This closes the last plaintext gap in multi-GPU confidential workloads.
Remote Attestation Flow
Encrypted VRAM and PCIe encryption protect the data, but a critical question remains: how does the workload operator verify that CC mode is actually active and the firmware has not been tampered with? This is what remote attestation answers.
The attestation flow:
- The workload operator (or automated deployment pipeline) calls the NVIDIA Remote Attestation Service (NRAS) via the nv-attestation-sdk library.
- The GPU security processor generates a signed attestation report. This report includes a hash of the current GPU firmware, the CC mode status flag, the GPU's hardware identity (a device certificate signed by NVIDIA's CA), and a nonce from the verifier to prevent replay attacks.
- The signed report is submitted to the NVIDIA NRAS endpoint (https://nras.attestation.nvidia.com/v4/attest/gpu) for verification. NRAS checks the device certificate against NVIDIA's root CA and validates the firmware hash against NVIDIA's approved firmware list.
- NRAS returns a signed JWT (attestation token). A valid token confirms: CC mode is active, the firmware is an unmodified NVIDIA release, and the hardware is a genuine NVIDIA GPU.
- The operator stores the attestation token as audit evidence, or passes it to a KMS as proof-of-environment for key release.
Generating an attestation report locally:
pip install nv-attestation-sdk
python3 << 'EOF'
from nv_attestation_sdk import attestation
import json
client = attestation.Attestation()
client.set_service_key("YOUR_NVIDIA_API_KEY") # Get from NVIDIA API Catalog
NRAS_URL = "https://nras.attestation.nvidia.com/v4/attest/gpu"
client.add_verifier(attestation.Devices.GPU, attestation.Environment.REMOTE, NRAS_URL, "")
evidence_list = client.get_evidence()
result = client.attest(evidence_list)
if not result:
    raise RuntimeError("Attestation failed, CC mode may not be active")
token = client.get_token()
print(json.dumps(token if isinstance(token, dict) else {"REMOTE_GPU_CLAIMS": token}, indent=2))
EOF

The output is a JSON structure with the signed attestation JWT under the REMOTE_GPU_CLAIMS key, containing firmware measurements and CC mode flags.
Deploy a Confidential Inference Stack on Spheron
Step 1: Request a CC-Enabled Reserved Instance and Verify CC Mode
CC mode is not available on Spheron's on-demand GPU marketplace. The on-demand catalog ships standard H100, H200, and B200 instances; CC mode is not toggled on those hosts. To run confidential inference on Spheron, you need a reserved commitment: Spheron coordinates with the data center partner to enable CC mode at the BIOS/VBIOS level on the target host, flash the matching CC-capable VBIOS where needed, and hand over the instance with CC mode active. H100 SXM5 is the practical starting point for most regulated inference workloads; H200 and B200 are available on the same reserved commitment path. For standard (non-CC) capacity, see H100, H200, and B200.
After the reserved instance is delivered, SSH in and verify:
# Check CC mode status
nvidia-smi -q | grep -i "confidential"
# Expected output on CC-enabled instance:
# Confidential Compute : Protected
# Verify driver version supports CC mode (check NVIDIA Secure AI Compatibility Matrix for exact requirements)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

If nvidia-smi -q | grep -i "confidential" returns nothing, the host has not been switched into CC mode. On Spheron this means the BIOS/VBIOS toggle was not applied by the data center partner; reach out so the host can be re-prepared before you start the attestation flow.
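If you want this check as a fail-fast script rather than a manual grep, a minimal sketch follows. The exact field wording in nvidia-smi -q varies by driver branch, so this is an assumption-laden parser, not a guaranteed interface; it accepts the common value forms.

```python
import re
import subprocess

# Minimal preflight sketch. The field wording in `nvidia-smi -q` varies by
# driver branch, so we match loosely and accept common values.
def cc_mode_active(smi_query_output: str) -> bool:
    for line in smi_query_output.splitlines():
        if re.search(r"confidential|cc mode", line, re.IGNORECASE):
            value = line.rsplit(":", 1)[-1].strip().lower()
            if value in ("protected", "on", "enabled"):
                return True
    return False

def preflight() -> None:
    # Abort provisioning before any model weights touch VRAM.
    out = subprocess.run(["nvidia-smi", "-q"],
                         capture_output=True, text=True).stdout
    if not cc_mode_active(out):
        raise SystemExit("CC mode not active; stop before loading model weights")
```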
Step 2: Install and Run the NVIDIA Attestation SDK
set -e
pip install nv-attestation-sdk
# Generate attestation token via NRAS
python3 << 'EOF' > attestation_report.json
from nv_attestation_sdk import attestation
import json
client = attestation.Attestation()
client.set_service_key("YOUR_NVIDIA_API_KEY")
NRAS_URL = "https://nras.attestation.nvidia.com/v4/attest/gpu"
client.add_verifier(attestation.Devices.GPU, attestation.Environment.REMOTE, NRAS_URL, "")
evidence_list = client.get_evidence()
result = client.attest(evidence_list)
if not result:
    raise RuntimeError("Attestation failed, CC mode may not be active")
token = client.get_token()
print(json.dumps(token if isinstance(token, dict) else {"REMOTE_GPU_CLAIMS": token}, indent=2))
EOF
cat attestation_report.json | python3 -c "
import json, sys, base64
r = json.load(sys.stdin)
jwt_str = r.get('REMOTE_GPU_CLAIMS', '')
if jwt_str:
    parts = jwt_str.split('.')
    if len(parts) < 3:
        raise ValueError(f'Not a valid JWT: {jwt_str[:60]}')
    padding = '=' * (-len(parts[1]) % 4)
    claims = json.loads(base64.urlsafe_b64decode(parts[1] + padding))
    print('CC Mode enabled:', claims.get('x-nvidia-cc-mode-enabled'))
    print('Overall result:', claims.get('x-nvidia-overall-att-result'))
"

Step 3: Verify the Attestation Token
# Cryptographically verify the JWT signature against NRAS public keys
# Requires: pip install pyjwt[crypto]
python3 << 'EOF'
import json, jwt
with open("attestation_report.json") as f:
    report = json.load(f)
jwt_str = report.get("REMOTE_GPU_CLAIMS", "")
if not jwt_str:
    raise RuntimeError("No attestation token found in report")
# Fetch NRAS JWKS and verify the ECDSA signature
jwks_client = jwt.PyJWKClient("https://nras.attestation.nvidia.com/.well-known/jwks.json")
signing_key = jwks_client.get_signing_key_from_jwt(jwt_str)
claims = jwt.decode(
jwt_str,
signing_key.key,
algorithms=["ES384"],
options={"verify_aud": False},
issuer="https://nras.attestation.nvidia.com"
)
print(json.dumps({
"cc_mode_enabled": claims.get("x-nvidia-cc-mode-enabled"),
"driver_rim_validated": claims.get("x-nvidia-gpu-driver-rim-schema-validated"),
"vbios_rim_validated": claims.get("x-nvidia-gpu-vbios-rim-schema-validated"),
"overall_result": claims.get("x-nvidia-overall-att-result"),
"issuer": claims.get("iss")
}, indent=2))
EOF

A valid token shows "cc_mode_enabled": true and "overall_result": "SUCCESS". The token is signed by NRAS using ECDSA P-384 keys. Store the token in your audit log before proceeding.
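The audit-log step can be a simple append-only JSONL trail. A hypothetical helper follows; the file name and entry schema are our invention, not part of the NVIDIA attestation SDK.

```python
import hashlib
import json
import time

# Hypothetical audit-trail helper; the JSONL path and entry schema are
# illustrative inventions, not part of the NVIDIA attestation SDK.
def record_attestation(token: str, log_path: str = "attestation_audit.jsonl") -> str:
    digest = hashlib.sha256(token.encode()).hexdigest()
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "token_sha256": digest,
        "token": token,  # store only the digest if retention policy requires
    }
    with open(log_path, "a") as f:  # append-only: never rewrite prior entries
        f.write(json.dumps(entry) + "\n")
    return digest
```

Keying each entry by the token digest lets a later audit reference a specific attestation event without re-parsing every stored JWT.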
Step 4: Deploy vLLM Inside the Confidential Environment
For full vLLM configuration and throughput tuning, see our vLLM production deployment guide. The CC mode-specific change is the --enforce-eager flag:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--dtype fp8 \
--enforce-eager \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128

The --enforce-eager flag disables CUDA graph captures. Some CC mode firmware versions are incompatible with CUDA graph pre-capture because the graph capture process requires pinning kernel launch parameters before runtime, which conflicts with the CC mode memory encryption initialization sequence. In vLLM v0.17.0+, CUDA graph compatibility with CC mode improved significantly. Check your firmware version against the vLLM changelog before removing --enforce-eager.
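One way to keep that decision explicit in deployment code is to gate the flag behind a verified-compatibility switch. A hypothetical arg builder is sketched below; the CC_CUDA_GRAPHS_OK env var is our invention, and the CLI flags mirror the vLLM invocation above.

```python
import os

# Hypothetical launch-arg builder. CC_CUDA_GRAPHS_OK is an invented env var;
# the flags mirror the vLLM CLI shown above. --enforce-eager stays on by
# default until CUDA graph compatibility is verified for your firmware
# and vLLM version.
def vllm_args(model: str, cuda_graphs_ok: bool = False) -> list:
    args = [
        "--model", model,
        "--dtype", "fp8",
        "--max-model-len", "16384",
        "--gpu-memory-utilization", "0.90",
        "--max-num-seqs", "128",
    ]
    if not cuda_graphs_ok:
        args.append("--enforce-eager")  # safe default under CC mode
    return args

print(vllm_args("meta-llama/Llama-3.3-70B-Instruct",
                cuda_graphs_ok=os.environ.get("CC_CUDA_GRAPHS_OK") == "1"))
```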
Step 5: Verify the Service
# Test from inside the instance
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"prompt": "System check: respond with OK",
"max_tokens": 5
}'
# Check vLLM metrics endpoint to confirm GPU is active
curl http://localhost:8000/metrics | grep vllm_gpu_cache_usage

Performance Overhead: CC Mode vs Standard
For LLM inference workloads, CC mode overhead is typically under 5%. The hardware encryption circuits are parallel to the compute pipeline, not in it.
| GPU | Mode | Model | Throughput (tokens/sec) | TTFT (ms) | Overhead |
|---|---|---|---|---|---|
| H100 SXM5 | Standard | Llama 3 70B FP8 | ~3,200 | ~85 | baseline |
| H100 SXM5 | CC mode | Llama 3 70B FP8 | ~3,100 | ~87 | ~3% |
| B200 SXM6 | Standard | DeepSeek V3 FP4 | ~8,500 | ~45 | baseline |
| B200 SXM6 | CC mode | DeepSeek V3 FP4 | ~8,200 | ~47 | ~3.5% |
Illustrative figures, not independently benchmarked by Spheron: they are derived from NVIDIA's published 2-5% CC mode overhead characterization for transformer inference and typical Llama 3 70B / DeepSeek V3 throughput on H100/B200. Specific numbers depend on model size, batch configuration, and firmware version. Verify against NVIDIA's H100 security documentation and Blackwell CC mode release notes for the exact overhead on your configuration.
Attestation overhead is one-time at instance startup, not per request. A typical attestation flow (SDK call + OCSP verification) takes 1-3 seconds during provisioning. This is irrelevant for steady-state inference latency.
The exception worth disclosing: the --enforce-eager flag in vLLM disables CUDA graph captures. CUDA graphs pre-compile kernel launch sequences and avoid per-iteration Python overhead, so disabling them adds CPU-side overhead on the critical path, typically a 10-15% throughput reduction on top of the base CC mode overhead. If your CC mode stack is compatible with CUDA graphs (H100 on driver branch 550 or later with vLLM v0.17.0+), remove --enforce-eager and re-benchmark. This is the gotcha practitioners need to know before projecting regulated inference capacity.
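For capacity planning, the two overheads compose roughly multiplicatively. A back-of-envelope sketch using the illustrative numbers above (~3% CC mode, ~12% as a midpoint of the 10-15% eager-mode range; both are assumptions, not measurements):

```python
# Back-of-envelope projection under stated assumptions: ~3% CC mode overhead
# and ~12% additional loss from --enforce-eager (midpoint of the 10-15%
# range). Multiplicative composition is itself an approximation.
def projected_throughput(baseline_tps: float,
                         cc_overhead: float = 0.03,
                         eager_overhead: float = 0.12) -> float:
    return baseline_tps * (1 - cc_overhead) * (1 - eager_overhead)

# Using the illustrative ~3,200 tok/s H100 baseline from the table above:
print(round(projected_throughput(3200)))                      # → 2732 (CC + eager)
print(round(projected_throughput(3200, eager_overhead=0.0)))  # → 3104 (CC only)
```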
KMS Integration: Customer-Managed Keys
AWS KMS with NVIDIA Attestation
Important: AWS KMS's kms:RecipientAttestation condition key is native to AWS Nitro Enclaves (which use CBOR-COSE attestation documents) and will not parse NVIDIA GPU attestation JWTs. NVIDIA CC attestation requires an intermediate attestation broker: a Lambda function or API Gateway authorizer that validates the NVIDIA JWT against the NRAS JWKS endpoint, then calls KMS or STS only after verification passes.
The broker validates the attestation JWT before calling KMS:
import jwt, requests, json

def validate_nvidia_cc_token(attest_token: str) -> bool:
    """Validate NVIDIA attestation JWT against NRAS public keys."""
    try:
        jwks = requests.get(
            "https://nras.attestation.nvidia.com/.well-known/jwks.json",
            timeout=5
        ).json()
    except (requests.exceptions.RequestException, ValueError):
        return False
    if not isinstance(jwks, dict):
        return False
    for key in jwks.get("keys", []):
        try:
            pub_key = jwt.algorithms.ECAlgorithm.from_jwk(json.dumps(key))
            claims = jwt.decode(
                attest_token,
                pub_key,
                algorithms=["ES384"],
                options={"verify_aud": False},
                issuer="https://nras.attestation.nvidia.com"
            )
            return (claims.get("x-nvidia-cc-mode-enabled") is True and
                    claims.get("x-nvidia-overall-att-result") == "SUCCESS")
        except (jwt.exceptions.InvalidSignatureError, jwt.exceptions.InvalidKeyError):
            continue
        except jwt.exceptions.PyJWTError:
            return False
    return False

# After validation, the broker calls kms:Decrypt under InferenceServiceRole

A KMS key policy granting decrypt to the verified inference role:
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "AllowDecryptViaAttestationBroker",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::ACCOUNT_ID:role/InferenceServiceRole"},
"Action": "kms:Decrypt",
"Resource": "*"
}]
}

Apply with:
aws kms put-key-policy \
--key-id alias/inference-model-key \
--policy-name default \
--policy file://kms-policy.json

HashiCorp Vault with JWT Auth
HashiCorp Vault's JWT auth method validates the NVIDIA attestation JWT as a bearer token. The bound_claims configuration checks CC mode status before issuing a Vault token with key access policies.
Configure Vault:
# Enable JWT auth method
vault auth enable jwt
# Configure the JWT auth to accept NVIDIA attestation tokens
vault write auth/jwt/config \
jwks_url="https://nras.attestation.nvidia.com/.well-known/jwks.json" \
bound_issuer="https://nras.attestation.nvidia.com"
# Create a role that requires CC mode active
vault write auth/jwt/role/inference-cc \
role_type="jwt" \
bound_claims='{"x-nvidia-cc-mode-enabled": true, "x-nvidia-overall-att-result": "SUCCESS"}' \
token_policies="kv-inference-read" \
token_ttl="1h"

Issue tokens during inference startup:
# Get attestation JWT from the report saved in Step 2
ATTEST_TOKEN=$(python3 -c "
import json
token = json.load(open('attestation_report.json'))
print(token.get('REMOTE_GPU_CLAIMS', ''))
")
# Authenticate with Vault using the NVIDIA attestation token
VAULT_TOKEN=$(vault write -field=token auth/jwt/login \
role=inference-cc \
jwt="$ATTEST_TOKEN")
# Read the model decryption key
vault kv get -field=model_key secret/inference/model-key

Regulated Use Cases
HIPAA Inference: Protecting PHI in GPU VRAM
HIPAA's Technical Safeguards (45 CFR § 164.312) require covered entities and business associates to protect electronic protected health information (ePHI) through access controls, audit logging, integrity controls, and transmission security. The standard GPU cloud deployment creates a specific gap in the technical safeguards: PHI passed through a model during inference is plaintext in VRAM, accessible to anyone with hypervisor-level access to the host.
CC mode closes this specific gap. PHI is encrypted in VRAM during computation; the hypervisor cannot read it. Combined with TLS for inference API traffic and encrypted model storage at rest, CC mode completes the encryption chain: data is encrypted everywhere except inside the GPU's trusted compute boundary.
CC mode alone is not a HIPAA certification. A BAA with the cloud provider remains required, and audit logging of PHI access, role-based access controls on the inference API, and proper data retention policies are all still necessary. The broader compliance stack, including risk classification and governance requirements, is covered in our EU AI Act compliance guide; its technical-safeguards discussion applies to HIPAA as well.
PCI-DSS Model Serving: Cardholder Data in AI Pipelines
LLM inference is appearing in payment processing pipelines: fraud detection models that receive PANs and transaction amounts, customer service bots in cardholder contact centers, and document analysis pipelines that handle account statements. Each of these can result in PANs or CVVs passing through GPU VRAM during inference.
PCI-DSS v4.0 Requirement 3 covers protection of stored cardholder data and Requirement 4 covers protection of cardholder data in transit. CC mode addresses the gap between these two: data during computation, not stored and not in transit, but also not protected by either standard control.
For payment AI pipelines, CC mode means PAN and transaction data passed to the model is never plaintext outside the GPU's trusted compute boundary. The QSA scoping question for cloud AI is still evolving as PCI-DSS v4.0 assessors work through AI pipeline use cases. Talk to your QSA before assuming CC mode satisfies a specific requirement. The technology is in place; the assessment criteria are catching up.
Defense ITAR Workloads
ITAR-controlled technical data processed by AI models (design specifications, export-controlled research, military hardware documentation) requires hardware-level isolation guarantees. The standard shared-cloud multi-tenant model does not provide these guarantees: the isolation depends on software controls that a determined adversary with access to cloud provider systems could circumvent.
CC mode provides hardware-enforced boundaries: the GPU's security processor controls access to VRAM, and the attestation chain goes back to NVIDIA's hardware root of trust. This is a hardware assurance, not a software promise.
ITAR compliance involves far more than GPU security. Facility accreditation, personnel controls, export authorization, and customer agreements with ITAR-compliant provisions are all required. CC mode addresses one specific technical control in a larger compliance program. ITAR workloads running on GPU cloud infrastructure require legal review of the specific data flows, not just technical controls.
Cost Comparison: Confidential GPU Cloud vs On-Prem HSM-Backed Inference
| Approach | GPU Cost (per GPU/hr) | HSM/Security Hardware Cost | Attestation Overhead | Operational Complexity | Typical Regulated Use Case |
|---|---|---|---|---|---|
| On-prem H100 + HSM (Luna Network) | ~$0 after capex | $40K-$100K upfront | In-house managed | Very high (rack, network, power, HSM mgmt) | Large enterprise with dedicated infra budget |
| On-prem H100 (software-only) | ~$0 after capex | $0 | None | High | Startups avoiding compliance until scale |
| Azure Confidential VM (H100 NC v5) | ~$5.00-$8.00 (Q1 2026 est.; check Azure pricing) | Included (Azure Key Vault) | Azure-managed | Medium (Azure ecosystem required) | Azure-native enterprise teams |
| GCP Confidential H100 VM | Limited availability | Included (Cloud KMS) | GCP-managed | Medium (GCP ecosystem required) | GCP-native enterprise teams |
| Spheron H100 SXM5 CC mode (reserved) | Custom reserved pricing | Bring your own KMS | Self-managed via NVIDIA SDK | Low-medium | Flexible regulated inference |
| Spheron B200 SXM6 CC mode (reserved) | Custom reserved pricing | Bring your own KMS | Self-managed via NVIDIA SDK | Low-medium | High-throughput Blackwell CC |
CC mode on Spheron is a reserved commitment, not an on-demand SKU: the data center partner enables CC at the BIOS/VBIOS level for the reserved host, so pricing is quoted per engagement based on term length, GPU count, and region. For reference, standard (non-CC) H100 and B200 on-demand rates are published on the pricing page.
Spheron's advantage over hyperscalers is not primarily price. The more significant difference is KMS flexibility: Azure Confidential Computing is deeply integrated with Azure Key Vault, which is convenient if you are already in the Azure ecosystem but creates vendor lock-in on your attestation and key management chain. HashiCorp Vault or AWS KMS integration works on Spheron without any provider-specific attestation intermediary.
Spheron vs Hyperscaler Confidential AI
| Feature | Spheron CC GPU | Azure Confidential Computing (H100) | GCP Confidential H100 VM |
|---|---|---|---|
| CC mode support | H100 SXM5, H200, B200 (reserved commitment only) | H100 (NC H100 v5 series) | H100 (limited preview) |
| Availability model | Reserved commitment with partner data center | On-demand and reserved | On-demand and reserved |
| KMS flexibility | Any KMS (AWS KMS, Vault, etc.) | Azure Key Vault preferred | Cloud KMS preferred |
| Attestation transparency | Direct NRAS + SDK access | Microsoft-mediated attestation | Google-mediated attestation |
| Pricing model | Custom reserved quote per engagement | Reserved instance commitments typical | Reserved instance commitments typical |
| Region selection | Partner data center network globally | 12+ Azure regions | Selected GCP regions |
| Vendor lock-in | Low | Medium-High (Azure AD, AKV, EAP) | Medium (GCP IAM, Cloud KMS) |
| Bare metal option | Yes | No (VM layer required) | No (VM layer required) |
For regulated buyers who need hardware-level security without being anchored to one cloud's IAM and KMS chain, Spheron's reserved CC GPU instances make sense. You manage attestation verification via the NVIDIA SDK directly, pipe tokens to whatever KMS you already use, and keep your compliance stack portable. If a regulator asks for your attestation logs, they come from NVIDIA's OCSP service, not filtered through a hyperscaler's management plane. The trade-off is the delivery model: CC mode requires a reserved commitment on Spheron so the data center partner can flip the BIOS/VBIOS CC toggle on the host before handover. If you need confidential GPU capacity for a short burst rather than a steady workload, a hyperscaler's on-demand confidential VM will be faster to spin up.
When Azure Confidential is the right call: teams already deep in the Azure ecosystem with Azure AD as their identity provider, Azure Key Vault as their KMS, and an existing EA agreement that narrows the pricing gap. Building a second attestation chain and a second KMS integration purely to avoid Azure lock-in has its own operational cost. If your security and compliance team has already built Azure-native controls, don't rebuild them elsewhere for a marginal cost difference.
The honest framing: Azure Confidential Computing is mature, well-documented, on-demand, and integrated with Microsoft's compliance certifications. Spheron gives you more control and KMS flexibility for steady-state regulated inference, but CC capacity has to be reserved, and you own the integration work. For teams that care about KMS portability and attestation transparency on a long-running workload, that trade-off is worth it. For teams that want a fully managed, on-demand confidential AI stack within one vendor's ecosystem, Azure is reasonable.
Confidential GPU computing bridges the gap between cloud cost efficiency and regulated workload requirements. Spheron offers CC mode on H100, H200, and B200 through reserved commitments, where we coordinate with the data center partner to enable CC at the BIOS/VBIOS level before the host is delivered. You get encrypted VRAM and remote attestation on hardware you control, without hyperscaler KMS lock-in.
Reach out for a reserved CC quote, or browse standard on-demand capacity: H100 | H200 | B200 | All GPU pricing
