The specific attack that blocks GPU cloud adoption in healthcare, finance, and defense is not DDoS or credential theft. It is this: a privileged process on the cloud host, whether a hypervisor, a management daemon, or a rogue admin, can read your model weights and inference inputs directly out of GPU VRAM while the GPU is computing. No TLS, no disk encryption, no network firewall stops this. The data is plaintext in VRAM. NVIDIA Confidential Computing mode closes this gap at the hardware level. For the regulatory context driving these requirements, see our EU AI Act compliance guide for GPU cloud deployments. This guide covers the hardware support matrix, how CC mode actually works, a working deployment walkthrough on H100 and B200, KMS integration, and an honest look at performance trade-offs.
One delivery note before the technical sections: Spheron does not expose CC mode on the on-demand GPU marketplace. CC mode is available only through reserved commitments, where Spheron works with the data center partner to enable CC at the BIOS/VBIOS level on the target host before the instance is handed over. The rest of this guide assumes that reserved delivery path.
The Threat Model
Standard GPU cloud security covers three things: TLS for data in transit, disk encryption for data at rest, and network isolation via VPC and security groups. All three are necessary. None of them protect data while the GPU is computing.
VRAM is plaintext during computation on non-CC GPUs. The hypervisor has direct memory access to the physical GPU. A compromised management layer, a live migration operation, or a malicious insider with hypervisor access can attach to the GPU's address space and read whatever is in HBM. This is not theoretical. GPU memory scanning tools are publicly available and trivial to use from a privileged host context.
Cold-boot attacks on HBM. High-bandwidth memory retains contents briefly after power loss. Physical access to a GPU node before memory fully discharges enables extraction of model weights and recent inference inputs. This is a more constrained attack vector but relevant for colocation environments and high-security data center scenarios.
PCIe bus interception. Data moves between CPU and GPU over PCIe. Without PCIe bus encryption, a hardware DMA analyzer connected to the PCIe slot reads all traffic in the clear. This requires physical access but is the kind of threat model that defense and regulated finance teams actually think about.
Multi-tenant hypervisor boundaries. On shared-node GPU instances, VRAM isolation between tenant VMs depends entirely on the hypervisor. A hypervisor vulnerability or a misconfigured vGPU partition breaks that isolation. The boundary is software, not hardware.
What CC mode provides: hardware-enforced VRAM encryption using a key generated inside the GPU security processor and never exposed to host software, PCIe bus encryption so traffic between CPU and GPU is ciphertext, and a cryptographic attestation mechanism that lets you verify the GPU's firmware and security state before trusting it with sensitive data.
The threat model changes from "trust the cloud provider's operational controls" to "trust the hardware cryptography, verify via attestation."
Hardware Support Matrix
A CC-capable GPU needs three things that older architectures lack: a dedicated security processor on the GPU die, hardware encryption circuits integrated into the HBM memory subsystem, and encrypted PCIe support on the host CPU side. NVIDIA introduced this with Hopper. Ampere (A100) and older do not have the on-die security processor and cannot run CC mode.
| GPU | Architecture | CC Mode | VRAM Encryption | PCIe Bus Encryption | NVLink Encryption | Attestation Method | VRAM Capacity |
|---|---|---|---|---|---|---|---|
| H100 SXM5 | Hopper | Yes | AES-256-GCM | Yes (SEV-SNP/TDX) | Not available | NRAS / local SDK | 80 GB HBM3 |
| H100 PCIe | Hopper | Yes | AES-256-GCM | Yes (SEV-SNP/TDX) | N/A | NRAS / local SDK | 80 GB HBM3 |
| H200 SXM5 | Hopper | Yes | AES-256-GCM | Yes (SEV-SNP/TDX) | Not available | NRAS / local SDK | 141 GB HBM3e |
| B200 SXM6 | Blackwell | Yes | AES-256-GCM | Yes (PCIe 5.0 encrypted) | Yes | NRAS / local SDK | 192 GB HBM3e |
| GB200 | Grace-Blackwell | Yes | AES-256-GCM | Yes (unified memory) | Yes | NRAS + CPU TEE | 192 GB HBM3e per GPU |
| MI300X | CDNA 3 | AMD SEV-SNP (host-level) | Host VM envelope | Host-level only | N/A | AMD SEV attestation | 192 GB HBM3 |
The MI300X row requires a precise note: AMD's approach integrates the GPU under the host VM's SEV-SNP envelope rather than providing a dedicated GPU-level TEE. The GPU is inside the trusted execution boundary, but attestation covers the entire VM, not specifically the GPU's firmware and VRAM encryption state. For regulated buyers who need GPU-level attestation specifically, this distinction matters: NVIDIA CC mode produces a GPU-specific attestation report signed by the GPU's security processor, not a VM-level attestation.
Spheron H200 SXM5 hardware supports the full Hopper CC mode stack with 141 GB HBM3e, making it the highest-VRAM option for confidential inference. Blackwell B200 hosts add NVLink encryption for multi-GPU confidential workloads. CC mode on Spheron is delivered through reserved commitments only: the on-demand marketplace ships standard (non-CC) instances, and CC mode is enabled at the BIOS/VBIOS level by the data center partner as part of the reservation setup. If you just need standard H200 or B200 capacity, see H200 on Spheron and bare-metal B200 instances.
How NVIDIA Confidential Computing Mode Works
Encrypted VRAM and the VRAM Security Boundary
The H100 and newer GPUs include a dedicated Confidential Computing Engine (CCE) integrated on the GPU die. When CC mode is active, every write to HBM is encrypted using AES-256-GCM before the data leaves the CCE. The encryption key is generated inside the GPU security processor during device initialization and never leaves the chip. Host software, including the hypervisor, the cloud provider's management plane, and the CUDA driver running on the host, cannot access the key or read plaintext VRAM.
The AES-256-GCM cipher is hardware-accelerated in dedicated circuits alongside the HBM controller. It does not run through the CUDA compute pipeline (tensor cores, SMs, etc.), which means the encryption/decryption overhead is decoupled from compute throughput. NVIDIA's benchmarks put this overhead at under 3% for large matrix operations typical of transformer inference.
From the workload's perspective, nothing changes. The CUDA programming model is identical. The GPU driver inside the trusted execution environment handles encryption transparently. The model weights are decrypted inside the CCE before reaching the compute units, and activations are encrypted when written back to VRAM.
PCIe Bus and NVLink Encryption
When the CPU sends data to the GPU over PCIe (model weights during load, KV cache updates, etc.), that data needs to be encrypted on the PCIe bus to prevent the bus interception and DMA analyzer attacks described above.
On H100 and H200, PCIe bus encryption is handled by integrating the GPU into the CPU's trusted execution environment. On AMD EPYC servers, this uses SEV-SNP. On Intel Xeon servers, it uses TDX. The CPU's memory encryption engine encrypts PCIe traffic, so data on the PCIe bus is ciphertext. An analyzer on the PCIe slot sees only encrypted payloads.
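A quick way to sanity-check which CPU-side TEE a platform reports is to scan the CPU flags. This is an illustrative sketch, not an official NVIDIA, AMD, or Intel tool: the flag names ("sev_snp", "tdx_guest") are kernel-dependent assumptions, and the sample string below is fabricated for demonstration. On a real host you would read /proc/cpuinfo.

```python
# Illustrative helper (not an official tool): scan /proc/cpuinfo-style text
# for CPU TEE flags. Flag names are kernel-dependent assumptions:
# "sev_snp" on AMD EPYC hosts, "tdx_guest" inside an Intel TDX guest.
def detect_cpu_tee(cpuinfo_text: str) -> str:
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    if "sev_snp" in flags:
        return "SEV-SNP"
    if "tdx_guest" in flags:
        return "TDX"
    return "none"

# Demonstration with a fabricated cpuinfo fragment; read /proc/cpuinfo
# on a real host instead.
sample = "processor\t: 0\nflags\t\t: fpu vme sev sev_es sev_snp\n"
print(detect_cpu_tee(sample))  # → SEV-SNP
```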
B200 and GB200 add NVLink encryption in addition to PCIe encryption. In SXM configurations where multiple GPUs communicate over NVLink (model parallelism, all-reduce operations), NVLink traffic between GPUs is also encrypted. This closes the last plaintext gap in multi-GPU confidential workloads.
Remote Attestation Flow
Encrypted VRAM and PCIe encryption protect the data, but a critical question remains: how does the workload operator verify that CC mode is actually active and the firmware has not been tampered with? This is what remote attestation answers.
The attestation flow:
- The workload operator (or automated deployment pipeline) calls the NVIDIA Remote Attestation Service (NRAS) via the nv-attestation-sdk library.
- The GPU security processor generates a signed attestation report. This report includes a hash of the current GPU firmware, the CC mode status flag, the GPU's hardware identity (a device certificate signed by NVIDIA's CA), and a nonce from the verifier to prevent replay attacks.
- The signed report is submitted to the NVIDIA NRAS endpoint (https://nras.attestation.nvidia.com/v4/attest/gpu) for verification. NRAS checks the device certificate against NVIDIA's root CA and validates the firmware hash against NVIDIA's approved firmware list.
- NRAS returns a signed JWT (attestation token). A valid token confirms: CC mode is active, the firmware is an unmodified NVIDIA release, and the hardware is a genuine NVIDIA GPU.
- The operator stores the attestation token as audit evidence, or passes it to a KMS as proof-of-environment for key release.
Generating an attestation report locally:
pip install nv-attestation-sdk
python3 << 'EOF'
from nv_attestation_sdk import attestation
import json
client = attestation.Attestation()
client.set_service_key("YOUR_NVIDIA_API_KEY") # Get from NVIDIA API Catalog
NRAS_URL = "https://nras.attestation.nvidia.com/v4/attest/gpu"
client.add_verifier(attestation.Devices.GPU, attestation.Environment.REMOTE, NRAS_URL, "")
evidence_list = client.get_evidence()
result = client.attest(evidence_list)
if not result:
    raise RuntimeError("Attestation failed, CC mode may not be active")
token = client.get_token()
print(json.dumps(token if isinstance(token, dict) else {"REMOTE_GPU_CLAIMS": token}, indent=2))
EOF

The output is a JSON structure with the signed attestation JWT under the REMOTE_GPU_CLAIMS key, containing firmware measurements and CC mode flags.
Deploy a Confidential Inference Stack on Spheron
Step 1: Request a CC-Enabled Reserved Instance and Verify CC Mode
CC mode is not available on Spheron's on-demand GPU marketplace. The on-demand catalog ships standard H100, H200, and B200 instances; CC mode is not toggled on those hosts. To run confidential inference on Spheron, you need a reserved commitment: Spheron coordinates with the data center partner to enable CC mode at the BIOS/VBIOS level on the target host, flash the matching CC-capable VBIOS where needed, and hand over the instance with CC mode active. H100 SXM5 is the practical starting point for most regulated inference workloads; H200 and B200 are available on the same reserved commitment path. For standard (non-CC) capacity, see H100, H200, and B200.
After the reserved instance is delivered, SSH in and verify:
# Check CC mode status
nvidia-smi -q | grep -i "confidential"
# Expected output on CC-enabled instance:
# Confidential Compute : Protected
# Verify driver version supports CC mode (check NVIDIA Secure AI Compatibility Matrix for exact requirements)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

If nvidia-smi -q | grep -i "confidential" returns nothing, the host has not been switched into CC mode. On Spheron this means the BIOS/VBIOS toggle was not applied by the data center partner; reach out so the host can be re-prepared before you start the attestation flow.
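If you want this check as a fail-fast script rather than a manual grep, a minimal sketch follows. The exact field wording in nvidia-smi -q varies by driver branch, so this is an assumption-laden parser, not a guaranteed interface; it accepts the common value forms.

```python
import re
import subprocess

# Minimal preflight sketch. The field wording in `nvidia-smi -q` varies by
# driver branch, so we match loosely and accept common values.
def cc_mode_active(smi_query_output: str) -> bool:
    for line in smi_query_output.splitlines():
        if re.search(r"confidential|cc mode", line, re.IGNORECASE):
            value = line.rsplit(":", 1)[-1].strip().lower()
            if value in ("protected", "on", "enabled"):
                return True
    return False

def preflight() -> None:
    # Abort provisioning before any model weights touch VRAM.
    out = subprocess.run(["nvidia-smi", "-q"],
                         capture_output=True, text=True).stdout
    if not cc_mode_active(out):
        raise SystemExit("CC mode not active; stop before loading model weights")
```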
Step 2: Install and Run the NVIDIA Attestation SDK
set -e
pip install nv-attestation-sdk
# Generate attestation token via NRAS
python3 << 'EOF' > attestation_report.json
from nv_attestation_sdk import attestation
import json
client = attestation.Attestation()
client.set_service_key("YOUR_NVIDIA_API_KEY")
NRAS_URL = "https://nras.attestation.nvidia.com/v4/attest/gpu"
client.add_verifier(attestation.Devices.GPU, attestation.Environment.REMOTE, NRAS_URL, "")
evidence_list = client.get_evidence()
result = client.attest(evidence_list)
if not result:
    raise RuntimeError("Attestation failed, CC mode may not be active")
token = client.get_token()
print(json.dumps(token if isinstance(token, dict) else {"REMOTE_GPU_CLAIMS": token}, indent=2))
EOF
cat attestation_report.json | python3 -c "
import json, sys, base64
r = json.load(sys.stdin)
jwt_str = r.get('REMOTE_GPU_CLAIMS', '')
if jwt_str:
    parts = jwt_str.split('.')
    if len(parts) < 3:
        raise ValueError(f'Not a valid JWT: {jwt_str[:60]}')
    padding = '=' * (-len(parts[1]) % 4)
    claims = json.loads(base64.urlsafe_b64decode(parts[1] + padding))
    print('CC Mode enabled:', claims.get('x-nvidia-cc-mode-enabled'))
    print('Overall result:', claims.get('x-nvidia-overall-att-result'))
"

Step 3: Verify the Attestation Token
# Cryptographically verify the JWT signature against NRAS public keys
# Requires: pip install pyjwt[crypto]
python3 << 'EOF'
import json, jwt
with open("attestation_report.json") as f:
    report = json.load(f)
jwt_str = report.get("REMOTE_GPU_CLAIMS", "")
if not jwt_str:
    raise RuntimeError("No attestation token found in report")
# Fetch NRAS JWKS and verify the ECDSA signature
jwks_client = jwt.PyJWKClient("https://nras.attestation.nvidia.com/.well-known/jwks.json")
signing_key = jwks_client.get_signing_key_from_jwt(jwt_str)
claims = jwt.decode(
jwt_str,
signing_key.key,
algorithms=["ES384"],
options={"verify_aud": False},
issuer="https://nras.attestation.nvidia.com"
)
print(json.dumps({
"cc_mode_enabled": claims.get("x-nvidia-cc-mode-enabled"),
"driver_rim_validated": claims.get("x-nvidia-gpu-driver-rim-schema-validated"),
"vbios_rim_validated": claims.get("x-nvidia-gpu-vbios-rim-schema-validated"),
"overall_result": claims.get("x-nvidia-overall-att-result"),
"issuer": claims.get("iss")
}, indent=2))
EOF

A valid token shows "cc_mode_enabled": true and "overall_result": "SUCCESS". The token is signed by NRAS using ECDSA P-384 keys. Store the token in your audit log before proceeding.
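The audit-log step can be a simple append-only JSONL trail. A hypothetical helper follows; the file name and entry schema are our invention, not part of the NVIDIA attestation SDK.

```python
import hashlib
import json
import time

# Hypothetical audit-trail helper; the JSONL path and entry schema are
# illustrative inventions, not part of the NVIDIA attestation SDK.
def record_attestation(token: str, log_path: str = "attestation_audit.jsonl") -> str:
    digest = hashlib.sha256(token.encode()).hexdigest()
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "token_sha256": digest,
        "token": token,  # store only the digest if retention policy requires
    }
    with open(log_path, "a") as f:  # append-only: never rewrite prior entries
        f.write(json.dumps(entry) + "\n")
    return digest
```

Keying each entry by the token digest lets a later audit reference a specific attestation event without re-parsing every stored JWT.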
Step 4: Deploy vLLM Inside the Confidential Environment
For full vLLM configuration and throughput tuning, see our vLLM production deployment guide. The CC mode-specific change is the --enforce-eager flag:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN=your_token_here \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--dtype fp8 \
--enforce-eager \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128

The --enforce-eager flag disables CUDA graph captures. Some CC mode firmware versions are incompatible with CUDA graph pre-capture because the graph capture process requires pinning kernel launch parameters before runtime, which conflicts with the CC mode memory encryption initialization sequence. In vLLM v0.17.0+, CUDA graph compatibility with CC mode improved significantly. Check your firmware version against the vLLM changelog before removing --enforce-eager.
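One way to keep that decision explicit in deployment code is to gate the flag behind a verified-compatibility switch. A hypothetical arg builder is sketched below; the CC_CUDA_GRAPHS_OK env var is our invention, and the CLI flags mirror the vLLM invocation above.

```python
import os

# Hypothetical launch-arg builder. CC_CUDA_GRAPHS_OK is an invented env var;
# the flags mirror the vLLM CLI shown above. --enforce-eager stays on by
# default until CUDA graph compatibility is verified for your firmware
# and vLLM version.
def vllm_args(model: str, cuda_graphs_ok: bool = False) -> list:
    args = [
        "--model", model,
        "--dtype", "fp8",
        "--max-model-len", "16384",
        "--gpu-memory-utilization", "0.90",
        "--max-num-seqs", "128",
    ]
    if not cuda_graphs_ok:
        args.append("--enforce-eager")  # safe default under CC mode
    return args

print(vllm_args("meta-llama/Llama-3.3-70B-Instruct",
                cuda_graphs_ok=os.environ.get("CC_CUDA_GRAPHS_OK") == "1"))
```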
Step 5: Verify the Service
# Test from inside the instance
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"prompt": "System check: respond with OK",
"max_tokens": 5
}'
# Check vLLM metrics endpoint to confirm GPU is active
curl http://localhost:8000/metrics | grep vllm_gpu_cache_usage

Performance Overhead: CC Mode vs Standard
For LLM inference workloads, CC mode overhead is typically under 5%. The hardware encryption circuits are parallel to the compute pipeline, not in it.
| GPU | Mode | Model | Throughput (tokens/sec) | TTFT (ms) | Overhead |
|---|---|---|---|---|---|
| H100 SXM5 | Standard | Llama 3 70B FP8 | ~3,200 | ~85 | baseline |
| H100 SXM5 | CC mode | Llama 3 70B FP8 | ~3,100 | ~87 | ~3% |
| B200 SXM6 | Standard | DeepSeek V3 FP4 | ~8,500 | ~45 | baseline |
| B200 SXM6 | CC mode | DeepSeek V3 FP4 | ~8,200 | ~47 | ~3.5% |
Illustrative figures, not independently benchmarked by Spheron: they are derived from NVIDIA's published 2-5% CC mode overhead characterization for transformer inference and typical Llama 3 70B / DeepSeek V3 throughput on H100/B200. Specific numbers depend on model size, batch configuration, and firmware version. Verify against NVIDIA's H100 security documentation and Blackwell CC mode release notes for the exact overhead on your configuration.
Attestation overhead is one-time at instance startup, not per request. A typical attestation flow (SDK call + OCSP verification) takes 1-3 seconds during provisioning. This is irrelevant for steady-state inference latency.
The exception worth disclosing: the --enforce-eager flag in vLLM disables CUDA graph captures. CUDA graphs pre-compile kernel launch sequences and avoid per-iteration Python overhead, so disabling them adds CPU-side overhead on the critical path, typically a 10-15% throughput reduction on top of the base CC mode overhead. If your CC mode stack is compatible with CUDA graphs (H100 on driver branch 550 or later with vLLM v0.17.0+), remove --enforce-eager and re-benchmark. This is the gotcha practitioners need to know before projecting regulated inference capacity.
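For capacity planning, the two overheads compose roughly multiplicatively. A back-of-envelope sketch using the illustrative numbers above (~3% CC mode, ~12% as a midpoint of the 10-15% eager-mode range; both are assumptions, not measurements):

```python
# Back-of-envelope projection under stated assumptions: ~3% CC mode overhead
# and ~12% additional loss from --enforce-eager (midpoint of the 10-15%
# range). Multiplicative composition is itself an approximation.
def projected_throughput(baseline_tps: float,
                         cc_overhead: float = 0.03,
                         eager_overhead: float = 0.12) -> float:
    return baseline_tps * (1 - cc_overhead) * (1 - eager_overhead)

# Using the illustrative ~3,200 tok/s H100 baseline from the table above:
print(round(projected_throughput(3200)))                      # → 2732 (CC + eager)
print(round(projected_throughput(3200, eager_overhead=0.0)))  # → 3104 (CC only)
```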
KMS Integration: Customer-Managed Keys
AWS KMS with NVIDIA Attestation
Important: AWS KMS's kms:RecipientAttestation condition key is native to AWS Nitro Enclaves (which use CBOR-COSE attestation documents) and will not parse NVIDIA GPU attestation JWTs. NVIDIA CC attestation requires an intermediate attestation broker: a Lambda function or API Gateway authorizer that validates the NVIDIA JWT against the NRAS JWKS endpoint, then calls KMS or STS only after verification passes.
The broker validates the attestation JWT before calling KMS:
import jwt, requests, json

def validate_nvidia_cc_token(attest_token: str) -> bool:
    """Validate NVIDIA attestation JWT against NRAS public keys."""
    try:
        jwks = requests.get(
            "https://nras.attestation.nvidia.com/.well-known/jwks.json",
            timeout=5
        ).json()
    except (requests.exceptions.RequestException, ValueError):
        return False
    if not isinstance(jwks, dict):
        return False
    for key in jwks.get("keys", []):
        try:
            pub_key = jwt.algorithms.ECAlgorithm.from_jwk(json.dumps(key))
            claims = jwt.decode(
                attest_token,
                pub_key,
                algorithms=["ES384"],
                options={"verify_aud": False},
                issuer="https://nras.attestation.nvidia.com"
            )
            return (claims.get("x-nvidia-cc-mode-enabled") is True and
                    claims.get("x-nvidia-overall-att-result") == "SUCCESS")
        except (jwt.exceptions.InvalidSignatureError, jwt.exceptions.InvalidKeyError):
            continue
        except jwt.exceptions.PyJWTError:
            return False
    return False

# After validation, the broker calls kms:Decrypt under InferenceServiceRole

A KMS key policy granting decrypt to the verified inference role:
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "AllowDecryptViaAttestationBroker",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::ACCOUNT_ID:role/InferenceServiceRole"},
"Action": "kms:Decrypt",
"Resource": "*"
}]
}

Apply with:
aws kms put-key-policy \
--key-id alias/inference-model-key \
--policy-name default \
--policy file://kms-policy.json

HashiCorp Vault with JWT Auth
HashiCorp Vault's JWT auth method validates the NVIDIA attestation JWT as a bearer token. The bound_claims configuration checks CC mode status before issuing a Vault token with key access policies.
Configure Vault:
# Enable JWT auth method
vault auth enable jwt
# Configure the JWT auth to accept NVIDIA attestation tokens
vault write auth/jwt/config \
jwks_url="https://nras.attestation.nvidia.com/.well-known/jwks.json" \
bound_issuer="https://nras.attestation.nvidia.com"
# Create a role that requires CC mode active
vault write auth/jwt/role/inference-cc \
role_type="jwt" \
bound_claims='{"x-nvidia-cc-mode-enabled": true, "x-nvidia-overall-att-result": "SUCCESS"}' \
token_policies="kv-inference-read" \
token_ttl="1h"

Issue tokens during inference startup:
# Get attestation JWT from the report saved in Step 2
ATTEST_TOKEN=$(python3 -c "
import json
token = json.load(open('attestation_report.json'))
print(token.get('REMOTE_GPU_CLAIMS', ''))
")
# Authenticate with Vault using the NVIDIA attestation token
VAULT_TOKEN=$(vault write -field=token auth/jwt/login \
role=inference-cc \
jwt="$ATTEST_TOKEN")
# Read the model decryption key
vault kv get -field=model_key secret/inference/model-key

Regulated Use Cases
HIPAA Inference: Protecting PHI in GPU VRAM
HIPAA's Technical Safeguards (45 CFR § 164.312) require covered entities and business associates to protect electronic protected health information (ePHI) through access controls, audit logging, integrity controls, and transmission security. The standard GPU cloud deployment creates a specific gap in the technical safeguards: PHI passed through a model during inference is plaintext in VRAM, accessible to anyone with hypervisor-level access to the host.
CC mode closes this specific gap. PHI is encrypted in VRAM during computation; the hypervisor cannot read it. Combined with TLS for inference API traffic and encrypted model storage at rest, CC mode completes the encryption chain: data is encrypted everywhere except inside the GPU's trusted compute boundary.
CC mode alone is not a HIPAA certification. A BAA with the cloud provider remains required, and audit logging of PHI access, role-based access controls on the inference API, and proper data retention policies are all still necessary. The broader compliance stack, including risk classification and governance requirements, is covered in our EU AI Act compliance guide; its technical-safeguards discussion applies to HIPAA as well.
PCI-DSS Model Serving: Cardholder Data in AI Pipelines
LLM inference is appearing in payment processing pipelines: fraud detection models that receive PANs and transaction amounts, customer service bots in cardholder contact centers, and document analysis pipelines that handle account statements. Each of these can result in PANs or CVVs passing through GPU VRAM during inference.
PCI-DSS v4.0 Requirement 3 covers protection of stored cardholder data and Requirement 4 covers protection of cardholder data in transit. CC mode addresses the gap between these two: data during computation, not stored and not in transit, but also not protected by either standard control.
For payment AI pipelines, CC mode means PAN and transaction data passed to the model is never plaintext outside the GPU's trusted compute boundary. The QSA scoping question for cloud AI is still evolving as PCI-DSS v4.0 assessors work through AI pipeline use cases. Talk to your QSA before assuming CC mode satisfies a specific requirement. The technology is in place; the assessment criteria are catching up.
Defense ITAR Workloads
ITAR-controlled technical data processed by AI models (design specifications, export-controlled research, military hardware documentation) requires hardware-level isolation guarantees. The standard shared-cloud multi-tenant model does not provide these guarantees: the isolation depends on software controls that a determined adversary with access to cloud provider systems could circumvent.
CC mode provides hardware-enforced boundaries: the GPU's security processor controls access to VRAM, and the attestation chain goes back to NVIDIA's hardware root of trust. This is a hardware assurance, not a software promise.
ITAR compliance involves far more than GPU security. Facility accreditation, personnel controls, export authorization, and customer agreements with ITAR-compliant provisions are all required. CC mode addresses one specific technical control in a larger compliance program. ITAR workloads running on GPU cloud infrastructure require legal review of the specific data flows, not just technical controls.
Cost Comparison: Confidential GPU Cloud vs On-Prem HSM-Backed Inference
| Approach | GPU Cost (per GPU/hr) | HSM/Security Hardware Cost | Attestation Overhead | Operational Complexity | Typical Regulated Use Case |
|---|---|---|---|---|---|
| On-prem H100 + HSM (Luna Network) | ~$0 after capex | $40K-$100K upfront | In-house managed | Very high (rack, network, power, HSM mgmt) | Large enterprise with dedicated infra budget |
| On-prem H100 (software-only) | ~$0 after capex | $0 | None | High | Startups avoiding compliance until scale |
| Azure Confidential VM (H100 NC v5) | ~$5.00-$8.00 (Q1 2026 est.; check Azure pricing) | Included (Azure Key Vault) | Azure-managed | Medium (Azure ecosystem required) | Azure-native enterprise teams |
| GCP Confidential H100 VM | Limited availability | Included (Cloud KMS) | GCP-managed | Medium (GCP ecosystem required) | GCP-native enterprise teams |
| Spheron H100 SXM5 CC mode (reserved) | Custom reserved pricing | Bring your own KMS | Self-managed via NVIDIA SDK | Low-medium | Flexible regulated inference |
| Spheron B200 SXM6 CC mode (reserved) | Custom reserved pricing | Bring your own KMS | Self-managed via NVIDIA SDK | Low-medium | High-throughput Blackwell CC |
CC mode on Spheron is a reserved commitment, not an on-demand SKU: the data center partner enables CC at the BIOS/VBIOS level for the reserved host, so pricing is quoted per engagement based on term length, GPU count, and region. For reference, standard (non-CC) H100 and B200 on-demand rates are published on the pricing page.
Spheron's advantage over hyperscalers is not primarily price. The more significant difference is KMS flexibility: Azure Confidential Computing is deeply integrated with Azure Key Vault, which is convenient if you are already in the Azure ecosystem but creates vendor lock-in on your attestation and key management chain. HashiCorp Vault or AWS KMS integration works on Spheron without any provider-specific attestation intermediary.
Spheron vs Hyperscaler Confidential AI
| Feature | Spheron CC GPU | Azure Confidential Computing (H100) | GCP Confidential H100 VM |
|---|---|---|---|
| CC mode support | H100 SXM5, H200, B200 (reserved commitment only) | H100 (NC H100 v5 series) | H100 (limited preview) |
| Availability model | Reserved commitment with partner data center | On-demand and reserved | On-demand and reserved |
| KMS flexibility | Any KMS (AWS KMS, Vault, etc.) | Azure Key Vault preferred | Cloud KMS preferred |
| Attestation transparency | Direct NRAS + SDK access | Microsoft-mediated attestation | Google-mediated attestation |
| Pricing model | Custom reserved quote per engagement | Reserved instance commitments typical | Reserved instance commitments typical |
| Region selection | Partner data center network globally | 12+ Azure regions | Selected GCP regions |
| Vendor lock-in | Low | Medium-High (Azure AD, AKV, EAP) | Medium (GCP IAM, Cloud KMS) |
| Bare metal option | Yes | No (VM layer required) | No (VM layer required) |
For regulated buyers who need hardware-level security without being anchored to one cloud's IAM and KMS chain, Spheron's reserved CC GPU instances make sense. You manage attestation verification via the NVIDIA SDK directly, pipe tokens to whatever KMS you already use, and keep your compliance stack portable. If a regulator asks for your attestation logs, they come from NVIDIA's OCSP service, not filtered through a hyperscaler's management plane. The trade-off is the delivery model: CC mode requires a reserved commitment on Spheron so the data center partner can flip the BIOS/VBIOS CC toggle on the host before handover. If you need confidential GPU capacity for a short burst rather than a steady workload, a hyperscaler's on-demand confidential VM will be faster to spin up.
When Azure Confidential is the right call: teams already deep in the Azure ecosystem with Azure AD as their identity provider, Azure Key Vault as their KMS, and an existing EA agreement that narrows the pricing gap. Building a second attestation chain and a second KMS integration purely to avoid Azure lock-in has its own operational cost. If your security and compliance team has already built Azure-native controls, don't rebuild them elsewhere for a marginal cost difference.
The honest framing: Azure Confidential Computing is mature, well-documented, on-demand, and integrated with Microsoft's compliance certifications. Spheron gives you more control and KMS flexibility for steady-state regulated inference, but CC capacity has to be reserved, and you own the integration work. For teams that care about KMS portability and attestation transparency on a long-running workload, that trade-off is worth it. For teams that want a fully managed, on-demand confidential AI stack within one vendor's ecosystem, Azure is reasonable.
Confidential GPU computing bridges the gap between cloud cost efficiency and regulated workload requirements. Spheron offers CC mode on H100, H200, and B200 through reserved commitments, where we coordinate with the data center partner to enable CC at the BIOS/VBIOS level before the host is delivered. You get encrypted VRAM and remote attestation on hardware you control, without hyperscaler KMS lock-in.
Reach out for a reserved CC quote, or browse standard on-demand capacity: H100 | H200 | B200 | All GPU pricing
