Most teams budgeting for AI inference focus on one number: the GPU hourly rate. It is clean, predictable, and easy to model. The electricity bill does not show up until the first month of on-premise or colocation operations, and by then the budget is already set.
Power consumption is not a minor footnote. A single H100 SXM5 draws 700W under load. Scale that to 125 nodes (1,000 GPUs) and you are looking at continuous power demand north of 1 MW, with cooling overhead stacked on top. Depending on where those servers sit, the monthly electricity cost varies from $50,800 to $317,500 for the exact same hardware.
This guide covers what actually drives inference power costs: GPU TDP specifications, server overhead, cooling PUE, regional electricity rate variance, and how to translate raw wattage into dollars per token. It also walks through the on-prem vs GPU cloud electricity economics so you can make the comparison accurately.
Why Power Is the New GPU Bottleneck in 2026
For most of 2023 and 2024, the scarce resource was GPU availability. Long procurement queues, chip shortages, and limited H100 supply meant teams were constrained by hardware access, not by where to plug it in.
That changed. Data center power capacity is now the more pressing constraint for new AI infrastructure deployments. The IEA's 2025 "Energy and AI" report projected data center electricity consumption could double globally by 2030, with AI workloads accounting for the majority of incremental demand. Major markets including Northern Virginia, Silicon Valley, and Northern Europe have seen power approval timelines stretch to 24-36 months for new facilities, regardless of hardware availability. Power, not GPUs, is what limits scale.
This matters for inference cost modeling because most teams still plan budgets around GPU $/hr and treat power as an abstraction baked into that number. For cloud deployments, that is fine. For on-premise or colocation deployments, power is a separate, variable cost that compounds over time and does not respond to negotiation the way hardware procurement does.
For broader context on how infrastructure costs map to per-token economics, see the AI infrastructure cost economics guide.
GPU Power Draw by Hardware: H100, B200, A100, H200 TDP Reference
Every GPU has a thermal design power (TDP) rating: the maximum sustained power draw under full load. TDP is not a worst-case spike figure. It is what the hardware is designed to sustain continuously when fully utilized.
| GPU | Architecture | TDP (Watts) | Typical Inference Draw | 8x Server Node Power |
|---|---|---|---|---|
| A100 SXM4 80G | Ampere | 400W | 340-380W | ~5.5-6.5 kW |
| H100 SXM5 80G | Hopper | 700W | 600-680W | ~10-10.5 kW |
| H200 SXM5 141G | Hopper | 700W | 620-700W | ~10-10.5 kW |
| B200 SXM6 192G | Blackwell | 1,000-1,200W (configuration-dependent) | 900-1,100W | ~13-15 kW |
Actual inference draw typically runs 85-95% of TDP for large models where the GPU is consistently loaded. At low batch sizes or with small models, you may see 60-75% of TDP because the GPU spends cycles waiting on memory reads rather than executing tensor operations.
Server overhead matters here: the node power column above includes PSU inefficiency, cooling fans, baseboard management, and NVLink switch power. The eight GPUs account for roughly 56% of total node power; non-GPU components (dual CPUs, NVLink switches, 512 GB RAM, PSUs at load) contribute approximately 4.5 kW. A single 8x H100 node draws approximately 10 kW under inference load, not 5.6 kW (8 x 700W).
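A quick sketch of that node-level arithmetic in Python, using the rough per-GPU draw and the ~4.5 kW non-GPU overhead figure above (approximate values, not measurements):

```python
# Rough node power estimate from the component figures above.
# All inputs are approximations, not measured values.
def node_power_kw(gpu_count: int, gpu_draw_w: float, non_gpu_overhead_kw: float = 4.5) -> float:
    """Estimate total server node power in kW under inference load."""
    return (gpu_count * gpu_draw_w) / 1000 + non_gpu_overhead_kw

# 8x H100 SXM5 at ~700W each plus ~4.5 kW of non-GPU components -> ~10.1 kW
print(node_power_kw(8, 700))
```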
Teams running 70B-scale models on H100 hardware can check Spheron H100 instances for on-demand pricing that already includes power and cooling at the hourly rate.
Calculating True Inference Cost: Compute, Power, and Cooling Overhead
On-premise total cost of ownership per GPU per hour involves several components that cloud pricing bundles into a single number:
On-prem $/hr per GPU =
Hardware depreciation (capital / months / hours)
+ Electricity: (TDP_kW x server_overhead x PUE x $/kWh)
+ Cooling: embedded in PUE
+ Networking (InfiniBand, ToR switches amortized)
+ Staff (infra engineer FTE / GPUs managed)

Walking through a single H100 SXM5 node at $0.12/kWh and US median electricity rates:
Hardware depreciation: An 8x H100 SXM5 server costs approximately $350,000. Amortized over 36 months at 720 hours/month: $350,000 / 36 / 720 = $13.50/hr for the node, or $1.69/GPU/hr.
Electricity: 0.7 kW x 1.80 (server overhead) x 1.4 (PUE) x $0.12 = $0.21/hr per GPU.
Cooling: Already included in the PUE multiplier. A PUE of 1.4 means that for every dollar spent on IT power, an additional 40 cents goes to cooling and facility overhead.
Staff: One infrastructure engineer with a loaded cost of $200,000/year. Spreading that cost across the GPUs the engineer manages: $200,000 / 8,760 hr = $22.83/hr. For a single 8-GPU node: $22.83 / 8 = $2.85/GPU/hr. This drops sharply at scale: the same engineer managing 128 GPUs instead of 8 cuts the per-GPU staff cost 16x, to about $0.18/GPU/hr.
Total on-prem TCO (single 8-GPU node): $1.69 + $0.21 + $2.85 = approximately $4.75/hr per GPU, before networking, maintenance contracts, and facility lease costs.
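A minimal sketch of that walkthrough in Python, using the same assumed figures (server price, overhead factor, PUE, electricity rate, staff cost); swap in your own quotes to re-run the math:

```python
# Sketch of the on-prem $/hr per GPU walkthrough above.
# Figures (server price, overhead factor, PUE, salary) are the article's assumptions.

SERVER_PRICE = 350_000        # 8x H100 SXM5 node, USD
GPUS_PER_NODE = 8
AMORTIZATION_MONTHS = 36
HOURS_PER_MONTH = 720

GPU_TDP_KW = 0.7              # H100 SXM5
SERVER_OVERHEAD = 1.80        # non-GPU node components
PUE = 1.4                     # cooling + facility overhead
RATE_PER_KWH = 0.12           # US median commercial rate

STAFF_COST_PER_YEAR = 200_000
GPUS_MANAGED = 8

depreciation = SERVER_PRICE / AMORTIZATION_MONTHS / HOURS_PER_MONTH / GPUS_PER_NODE
electricity = GPU_TDP_KW * SERVER_OVERHEAD * PUE * RATE_PER_KWH
staff = STAFF_COST_PER_YEAR / 8_760 / GPUS_MANAGED

total = depreciation + electricity + staff
print(f"depreciation ${depreciation:.2f}/hr, electricity ${electricity:.2f}/hr, "
      f"staff ${staff:.2f}/hr -> ~${total:.2f}/GPU/hr")
# depreciation $1.69/hr, electricity $0.21/hr, staff $2.85/hr -> ~$4.75/GPU/hr
```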
Spheron's current on-demand H100 rate is $2.90/hr. The difference between that number and a naive "hardware only" comparison ($1.69/hr depreciation alone) accounts for the power, cooling, networking, and staff that cloud providers absorb. The power component alone, at $0.12/kWh, adds $152/month per GPU.
Electricity Price Variance: Why Location Changes Your GPU Bill by 3x
The electricity cost formula above uses $0.12/kWh as a baseline. Current EIA data puts the US commercial average closer to $0.13-0.14/kWh as of early 2026, so treat $0.12/kWh as a conservative floor rather than a precise figure. Actual rates vary substantially by market.
The following table models a 1,000-GPU H100 cluster: 1,000 x 700W x 1.80 server overhead x 1.4 PUE = approximately 1.76 MW continuous draw.
| Market | Rate ($/kWh) | Monthly Electricity Cost (1,000x H100) | Notes |
|---|---|---|---|
| Pacific NW (hydro) | ~$0.04 | ~$50,800 | Data center wholesale hydro rates |
| US Midwest average | ~$0.07 | ~$88,900 | Typical commercial industrial |
| US national median | ~$0.12 | ~$152,400 | EIA commercial average |
| California | ~$0.18-0.22 | ~$228,600-$279,400 | Commercial utility rates |
| Western Europe | ~$0.15-0.25 | ~$190,500-$317,500 | Germany, Netherlands market rates |
| Singapore / Tokyo | ~$0.18-0.22 | ~$228,600-$279,400 | Typical APAC data center rates |
The same 1,000-GPU cluster costs between $50,800 and $317,500 per month in electricity depending on geography. Over a 3-year hardware cycle, that spread is $9.6M on electricity alone. This is the number that on-premise cost models frequently omit when comparing to cloud.
Note: rates above are approximate commercial market rates; actual contracted data center rates depend on facility vintage, contract terms, and local utility tariffs.
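The table values fall out of the same formula. A short sketch using the approximate market rates listed above (illustrative rates only):

```python
# Regional electricity cost for the same 1,000x H100 cluster, different $/kWh.
# Rates are the article's approximate market figures, not contracted tariffs.

GPU_COUNT = 1_000
GPU_TDP_KW = 0.7
SERVER_OVERHEAD = 1.80
PUE = 1.4
HOURS_PER_MONTH = 720

cluster_kw = GPU_COUNT * GPU_TDP_KW * SERVER_OVERHEAD * PUE   # ~1,764 kW continuous

rates = {
    "Pacific NW (hydro)": 0.04,
    "US Midwest": 0.07,
    "US median": 0.12,
    "California (low end)": 0.18,
    "Western Europe (high end)": 0.25,
}

for market, rate in rates.items():
    monthly = cluster_kw * HOURS_PER_MONTH * rate
    print(f"{market}: ~${monthly:,.0f}/month")
# Pacific NW (hydro): ~$50,803/month ... Western Europe (high end): ~$317,520/month
```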
How Liquid Cooling Changes the Economics
PUE is not a fixed input. It depends on how the facility cools its IT equipment, and the gap between efficient and inefficient cooling directly scales the electricity cost of every GPU in the room.
Air-cooled racks: PUE 1.3-1.5. Every 100W of GPU power requires 30-50W of additional overhead for cooling and facility systems. Industry average for existing enterprise data centers sits around 1.5.
Liquid-assisted cooling (rear-door heat exchangers, direct liquid cooling): PUE 1.1-1.2. Significantly reduces cooling overhead by extracting heat at the source rather than conditioning room air.
Immersion cooling: PUE 1.03-1.05 at scale. Servers submerged in dielectric fluid. The most efficient option, though with higher upfront infrastructure cost.
For the B200 at 1,000W or more per GPU, dense rack configurations essentially require liquid cooling. Air cooling an 8x B200 node means rejecting roughly 13-15 kW of heat from a single ~10U chassis, a density that traditional CRAC units struggle to handle at scale (GPU TDP alone is 8 kW or more, and full server node power including CPUs, NVLink switches, RAM, and PSUs brings the total to 13-15 kW per node).
A practical savings example: dropping PUE from 1.45 to 1.15 on a 1,000-GPU H100 cluster at $0.12/kWh saves approximately $32,700/month. The IT load is 1,000 x 700W x 1.80 server overhead = 1,260 kW; the PUE reduction frees 1,260 kW x (1.45 - 1.15) = 378 kW of overhead power, or 378 kW x 720 hr x $0.12/kWh ≈ $32,700/month. Over 36 months, that is roughly $1.18M in electricity savings from cooling improvements alone, before any change to GPU count or utilization.
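The same arithmetic as a short sketch, using the assumed load and rate from that example:

```python
# PUE-savings arithmetic for the example above (1,000x H100, $0.12/kWh).

IT_LOAD_KW = 1_000 * 0.7 * 1.80      # 1,260 kW of IT load (GPUs + server overhead)
PUE_BEFORE, PUE_AFTER = 1.45, 1.15
RATE_PER_KWH = 0.12
HOURS_PER_MONTH = 720

overhead_freed_kw = IT_LOAD_KW * (PUE_BEFORE - PUE_AFTER)     # 378 kW of facility overhead
monthly_savings = overhead_freed_kw * HOURS_PER_MONTH * RATE_PER_KWH
print(f"~${monthly_savings:,.0f}/month, ~${monthly_savings * 36 / 1e6:.2f}M over 36 months")
# ~$32,659/month, ~$1.18M over 36 months
```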
The token factory economics guide covers this further in the context of maximizing tokens per watt across hardware tiers.
Inference vs Training Power Profiles
Training and inference consume power differently, and the difference matters for long-term budget planning.
Training workloads are bounded compute jobs. You run a training campaign for a defined number of steps, it finishes, the cluster goes idle. The power cost is a project expense with a defined end date. You can estimate it upfront from the GPU-hours budget.
Inference workloads are continuous. A model serving production traffic runs 24 hours a day, 7 days a week, as long as the product is live. Power draw scales with traffic but never reaches zero. The infrastructure cost is ongoing and grows with user adoption.
This is why, at scale, 80-90% of AI compute energy goes to inference rather than training. Companies train models once (or a few times with fine-tuning cycles). They run inference indefinitely. The ratio of inference-to-training compute hours at a company serving millions of users might be 100:1 or higher.
Budget planning implications are direct: inference power should be modeled as a utility cost, like bandwidth or database storage, not as a one-time project cost. Year 1, year 2, and year 3 infrastructure costs for a deployed model grow with usage. On-premise, this means the electricity bill grows with your product. On cloud, the hourly rate is fixed regardless of how power costs evolve in the underlying data centers.
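To make the bounded-vs-continuous distinction concrete, here is an illustrative comparison; the training GPU-hours and the serving fleet size are hypothetical assumptions, not figures from this guide:

```python
# Illustrative only: a bounded training campaign vs an always-on inference fleet.
# The node-hours and fleet size below are hypothetical assumptions.

NODE_KW = 10.0                      # 8x H100 node under load (from the TDP section)
HOURS_PER_YEAR = 8_760

training_node_hours = 50_000        # hypothetical: one bounded training campaign
training_kwh = training_node_hours * NODE_KW

inference_nodes = 16                # hypothetical: always-on serving fleet
inference_kwh_per_year = inference_nodes * NODE_KW * HOURS_PER_YEAR

print(f"training campaign: {training_kwh:,.0f} kWh (one-time)")
print(f"inference fleet:   {inference_kwh_per_year:,.0f} kWh/year (recurring)")
```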
For the full picture of how inference economics scale from single-GPU to production fleet, see the AI inference cost economics playbook linked earlier in this post.
GPU Cloud vs On-Premise: Who Pays the Electricity Bill
The core difference between cloud and on-premise GPU economics from a power standpoint:
| Factor | GPU Cloud (Spheron) | On-Premise / Colo |
|---|---|---|
| Electricity cost | Bundled in hourly rate | Separate metered cost |
| PUE overhead | Absorbed by provider | Your negotiated PUE |
| Cooling cost | Included | Your capex or facility charge |
| Price visibility | Fixed $/hr | Variable, quarterly bill |
| Exposure to rate changes | None | Full exposure |
| Location optimization | Provider-managed | Requires multi-region ops |
On GPU cloud, the hourly rate is the total cost. Power, cooling, networking, facility lease, and hardware maintenance are the provider's costs to manage. You see one number.
On-premise or colocation, electricity arrives as a separate line item that varies month to month based on utilization, rate adjustments, and season. You also carry PUE negotiation risk: a colo facility promising PUE 1.3 but delivering 1.5 in summer means your electricity bill is 15% higher than modeled.
The on-premise vs GPU cloud break-even analysis covers the full TCO comparison including utilization thresholds and hybrid strategies.
For teams evaluating workloads suited for an A100-class budget, A100 instances on Spheron offer on-demand pricing from $1.64/hr with no separate power or cooling cost.
Optimizing Tokens per Watt: Right-Sizing, Quantization, and Batch Tuning
Given that power draw is largely determined by the hardware you choose, the practical way to improve the economics is to extract more inference tokens per watt. Three levers control this.
1. Right-Sizing GPU Selection
Not every model needs the fastest GPU. For smaller models (7B-13B parameters), an A100 at 400W may produce comparable or better tokens-per-watt than an H100 at 700W when the model fits comfortably in A100 VRAM without hitting memory bandwidth limits.
Rough comparison for Llama 3.1 8B (a model well within A100 VRAM):
| GPU | TDP | Est. Throughput (Llama 3.1 8B, batch 32) | Tokens/Watt |
|---|---|---|---|
| A100 SXM4 80G | 400W | ~4,000 tok/s | ~10.0 |
| H100 SXM5 80G | 700W | ~6,500 tok/s | ~9.3 |
| L40S | 350W | ~3,200 tok/s | ~9.1 |
For an 8B model at this batch size, the A100 delivers more tokens per watt than the H100, purely because the H100's extra compute and memory bandwidth headroom is not needed. The H100 advantage materializes at larger models (70B+) where memory bandwidth becomes the bottleneck.
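The tokens-per-watt column is simply throughput divided by TDP. A small sketch using the rough estimates from the table (not benchmark results):

```python
# Tokens-per-watt from the rough throughput estimates above (Llama 3.1 8B, batch 32).
estimates = {
    "A100 SXM4 80G": (400, 4_000),   # (TDP watts, est. tokens/sec)
    "H100 SXM5 80G": (700, 6_500),
    "L40S":          (350, 3_200),
}

for gpu, (tdp_w, tok_per_s) in estimates.items():
    print(f"{gpu}: ~{tok_per_s / tdp_w:.1f} tokens/watt")
# A100 ~10.0, H100 ~9.3, L40S ~9.1
```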
2. Quantization
INT8 and FP8 quantization reduce activation memory pressure and KV cache size, which allows larger batch sizes per watt. At the same power draw, a larger batch produces proportionally more tokens.
FP8 on H100 is well-supported in vLLM and TensorRT-LLM and maintains output quality close to BF16 for most production models. The typical effect is 30-40% more tokens per second at identical power draw compared to BF16 at the same batch size. Over a month, that translates to roughly a 23-29% reduction in electricity cost per token on the same hardware (a 30% throughput gain cuts cost per token by a factor of 1/1.30, a 23.1% reduction; a 40% gain cuts it by 1/1.40, a 28.6% reduction).
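The cost-per-token arithmetic in that parenthetical, as a sketch:

```python
# Cost-per-token reduction from a throughput gain at identical power draw.
def cost_per_token_reduction(throughput_gain: float) -> float:
    """Fractional reduction in electricity cost per token."""
    return 1 - 1 / (1 + throughput_gain)

print(f"{cost_per_token_reduction(0.30):.1%}")   # 23.1% cheaper per token at +30% throughput
print(f"{cost_per_token_reduction(0.40):.1%}")   # 28.6% cheaper per token at +40% throughput
```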
3. Batch Tuning with Continuous Batching
Under-batched inference is the most wasteful power configuration. A GPU at 20% utilization serving single requests one at a time draws nearly the same power as a GPU at 80% utilization serving 64 concurrent requests in a continuous batch. The electricity cost per token at 80% utilization is one-quarter of the cost at 20%.
Continuous batching (implemented in vLLM, SGLang, and TensorRT-LLM) keeps the GPU saturated across variable-length requests by dynamically merging in-flight and queued sequences. Typical utilization improvement from naive to continuous batching is 3-4x, with proportional improvement in tokens-per-watt.
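A rough illustration of the effect, assuming power draw stays nearly flat while throughput scales with utilization; the power and throughput numbers below are illustrative assumptions, not measurements:

```python
# Why under-batched inference wastes power: GPU power is roughly flat while
# throughput scales with utilization. Figures are illustrative only.

RATE_PER_KWH = 0.12

scenarios = {
    # name: (approx GPU power in kW, tokens/sec served)
    "single requests, ~20% util": (0.60, 1_500),
    "continuous batching, ~80% util": (0.68, 6_000),
}

for name, (power_kw, tok_per_s) in scenarios.items():
    tokens_per_hour = tok_per_s * 3_600
    cost_per_hour = power_kw * RATE_PER_KWH
    cost_per_million_tokens = cost_per_hour / tokens_per_hour * 1e6
    print(f"{name}: ~${cost_per_million_tokens:.4f} electricity per 1M tokens")
# Roughly 3-4x cheaper electricity per token at high utilization.
```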
See vLLM vs TensorRT-LLM vs SGLang benchmarks for measured throughput and the KV cache optimization guide for memory management techniques that extend effective batch capacity.
Spheron GPU Cloud Power-Inclusive Pricing Breakdown
Using live Spheron pricing (fetched 20 Apr 2026) and the on-prem TCO formula from the earlier section ($0.12/kWh, PUE 1.4, 36-month hardware cycle):
| Scenario | On-Prem TCO (est.) | Spheron On-Demand | Spheron Spot |
|---|---|---|---|
| Single H100 SXM5, 720 hr (1 month) | ~$3,420 | $2,088 | N/A |
| 8x H100 SXM5 node, 720 hr (1 month) | ~$27,360 | $16,704 | N/A |
| Single A100 SXM4, 720 hr (1 month) | ~$2,560 | $1,181 | N/A |
| 8x A100 SXM4 node, 720 hr (1 month) | ~$20,480 | $9,446 | N/A |
On-prem TCO includes hardware depreciation, electricity, and staff. It excludes networking hardware amortization, maintenance contracts, and facility lease, which would push the figures higher.
The on-prem staff cost dominates at small scale (1-8 GPUs): an infra engineer's loaded cost allocated to 8 GPUs is $2.85/GPU/hr, larger than the electricity component. At 128 GPUs managed by the same engineer, the staff cost drops to $0.18/GPU/hr. Cloud pricing has no equivalent floor: you pay the same per-GPU rate whether you run 1 or 1,000.
Electricity costs on-prem for the single H100 example: $0.21/hr x 720 hr = $152/month. At $0.20/kWh (California), that rises to $254/month per GPU. These figures are included in the on-prem TCO estimates above at the $0.12/kWh baseline.
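A sketch reproducing the single-GPU rows of the comparison table, using the on-prem $/hr figures from the TCO walkthrough and the listed Spheron on-demand rates:

```python
# Monthly single-GPU comparison from the table above. On-prem $/hr comes from the
# TCO walkthrough at $0.12/kWh; cloud rates are the listed Spheron on-demand prices.

HOURS = 720
on_prem_hourly = {"H100 SXM5": 4.75, "A100 SXM4": 2_560 / 720}   # est. TCO $/GPU/hr
cloud_hourly   = {"H100 SXM5": 2.90, "A100 SXM4": 1.64}          # on-demand $/GPU/hr

for gpu in on_prem_hourly:
    on_prem = on_prem_hourly[gpu] * HOURS
    cloud = cloud_hourly[gpu] * HOURS
    print(f"{gpu}: on-prem ~${on_prem:,.0f}/month vs cloud ${cloud:,.0f}/month")
# H100 SXM5: on-prem ~$3,420/month vs cloud $2,088/month
# A100 SXM4: on-prem ~$2,560/month vs cloud $1,181/month
```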
For current on-demand and spot rates across all GPU types, see Spheron GPU pricing.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 20 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Running inference on GPU cloud means your electricity cost is already priced in: no separate power bill, no PUE negotiation, no exposure to regional rate changes. Compare current hardware and spot pricing on Spheron before committing to on-premise infrastructure.
View A100 pricing → | View H200 pricing → | See all GPU pricing →
