Most teams budgeting for AI inference focus on one number: the GPU hourly rate. It is clean, predictable, and easy to model. The electricity bill does not show up until the first month of on-premise or colocation operations, and by then the budget is already set.
Power consumption is not a minor footnote. A single H100 SXM5 draws 700W under load. Scale that to 125 nodes (1,000 GPUs) and you are looking at continuous power demand north of 1 MW, with cooling overhead stacked on top. Depending on where those servers sit, the monthly electricity cost varies from $50,800 to $317,500 for the exact same hardware.
This guide covers what actually drives inference power costs: GPU TDP specifications, server overhead, cooling PUE, regional electricity rate variance, and how to translate raw wattage into dollars per token. It also walks through the on-prem vs GPU cloud electricity economics so you can make the comparison accurately.
Why Power Is the New GPU Bottleneck in 2026
For most of 2023 and 2024, the scarce resource was GPU availability. Long procurement queues, chip shortages, and limited H100 supply meant teams were constrained by hardware access, not by where to plug it in.
That changed. Data center power capacity is now the more pressing constraint for new AI infrastructure deployments. The IEA's 2025 "Energy and AI" report projected data center electricity consumption could double globally by 2030, with AI workloads accounting for the majority of incremental demand. Major markets including Northern Virginia, Silicon Valley, and Northern Europe have seen power approval timelines stretch to 24-36 months for new facilities, regardless of hardware availability. Power, not GPUs, is what limits scale.
This matters for inference cost modeling because most teams still plan budgets around GPU $/hr and treat power as an abstraction baked into that number. For cloud deployments, that is fine. For on-premise or colocation deployments, power is a separate, variable cost that compounds over time and does not respond to negotiation the way hardware procurement does.
For broader context on how infrastructure costs map to per-token economics, see the AI infrastructure cost economics guide.
GPU Power Draw by Hardware: H100, B200, A100, H200 TDP Reference
Every GPU has a thermal design power (TDP) rating: the maximum sustained power draw under full load. TDP is not a worst-case spike figure. It is what the hardware is designed to sustain continuously when fully utilized.
| GPU | Architecture | TDP (Watts) | Typical Inference Draw | 8x Server Node Power |
|---|---|---|---|---|
| A100 SXM4 80G | Ampere | 400W | 340-380W | ~5.5-6.5 kW |
| H100 SXM5 80G | Hopper | 700W | 600-680W | ~10-10.5 kW |
| H200 SXM5 141G | Hopper | 700W | 620-700W | ~10-10.5 kW |
| B200 SXM6 192G | Blackwell | 1,000-1,200W (configuration-dependent) | 900-1,100W | ~13-15 kW |
Actual inference draw typically runs 85-95% of TDP for large models where the GPU is consistently loaded. At low batch sizes or with small models, you may see 60-75% of TDP because the GPU spends cycles waiting on memory reads rather than executing tensor operations.
Server overhead matters here: the node power column above includes PSU inefficiency, cooling fans, baseboard management, and NVLink switch power. The eight GPUs account for roughly 56% of total node power; non-GPU components (dual CPUs, NVLink switches, 512 GB RAM, PSUs at load) contribute approximately 4.5 kW. A single 8x H100 node draws approximately 10 kW under inference load, not 5.6 kW (8 x 700W).
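A quick sketch of that node-level arithmetic in Python, using the rough per-GPU draw and the ~4.5 kW non-GPU overhead figure above (approximate values, not measurements):

```python
# Rough node power estimate from the component figures above.
# All inputs are approximations, not measured values.
def node_power_kw(gpu_count: int, gpu_draw_w: float, non_gpu_overhead_kw: float = 4.5) -> float:
    """Estimate total server node power in kW under inference load."""
    return (gpu_count * gpu_draw_w) / 1000 + non_gpu_overhead_kw

# 8x H100 SXM5 at ~700W each plus ~4.5 kW of non-GPU components -> ~10.1 kW
print(node_power_kw(8, 700))
```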
Teams running 70B-scale models on H100 hardware can check Spheron H100 instances for on-demand pricing that already includes power and cooling at the hourly rate.
Calculating True Inference Cost: Compute, Power, and Cooling Overhead
On-premise total cost of ownership per GPU per hour involves several components that cloud pricing bundles into a single number:
On-prem $/hr per GPU =
Hardware depreciation (capital / months / hours)
+ Electricity: (TDP_kW x server_overhead x PUE x $/kWh)
+ Cooling: embedded in PUE
+ Networking (InfiniBand, ToR switches amortized)
+ Staff (infra engineer FTE / GPUs managed)

Walking through a single H100 SXM5 node at $0.12/kWh and US median electricity rates:
Hardware depreciation: An 8x H100 SXM5 server costs approximately $350,000. Amortized over 36 months at 720 hours/month: $350,000 / 36 / 720 = $13.50/hr for the node, or $1.69/GPU/hr.
Electricity: 0.7 kW x 1.80 (server overhead) x 1.4 (PUE) x $0.12 = $0.21/hr per GPU.
Cooling: Already included in the PUE multiplier. A PUE of 1.4 means that for every dollar spent on IT power, an additional 40 cents goes to cooling and facility overhead.
Staff: One infrastructure engineer with a loaded cost of $200,000/year. Spreading that cost across the GPUs the engineer manages: $200,000 / 8,760 hr = $22.83/hr. For a single 8-GPU node: $22.83 / 8 = $2.85/GPU/hr. This drops sharply at scale: the same engineer managing 128 GPUs instead of 8 cuts the per-GPU staff cost 16x, to about $0.18/GPU/hr.
Total on-prem TCO (single 8-GPU node): $1.69 + $0.21 + $2.85 = approximately $4.75/hr per GPU, before networking, maintenance contracts, and facility lease costs.
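A minimal sketch of that walkthrough in Python, using the same assumed figures (server price, overhead factor, PUE, electricity rate, staff cost); swap in your own quotes to re-run the math:

```python
# Sketch of the on-prem $/hr per GPU walkthrough above.
# Figures (server price, overhead factor, PUE, salary) are the article's assumptions.

SERVER_PRICE = 350_000        # 8x H100 SXM5 node, USD
GPUS_PER_NODE = 8
AMORTIZATION_MONTHS = 36
HOURS_PER_MONTH = 720

GPU_TDP_KW = 0.7              # H100 SXM5
SERVER_OVERHEAD = 1.80        # non-GPU node components
PUE = 1.4                     # cooling + facility overhead
RATE_PER_KWH = 0.12           # US median commercial rate

STAFF_COST_PER_YEAR = 200_000
GPUS_MANAGED = 8

depreciation = SERVER_PRICE / AMORTIZATION_MONTHS / HOURS_PER_MONTH / GPUS_PER_NODE
electricity = GPU_TDP_KW * SERVER_OVERHEAD * PUE * RATE_PER_KWH
staff = STAFF_COST_PER_YEAR / 8_760 / GPUS_MANAGED

total = depreciation + electricity + staff
print(f"depreciation ${depreciation:.2f}/hr, electricity ${electricity:.2f}/hr, "
      f"staff ${staff:.2f}/hr -> ~${total:.2f}/GPU/hr")
# depreciation $1.69/hr, electricity $0.21/hr, staff $2.85/hr -> ~$4.75/GPU/hr
```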
Spheron's current on-demand H100 rate is $2.90/hr. The difference between that number and a naive "hardware only" comparison ($1.69/hr depreciation alone) accounts for the power, cooling, networking, and staff that cloud providers absorb. The power component alone, at $0.12/kWh, adds $152/month per GPU.
Electricity Price Variance: Why Location Changes Your GPU Bill by 3x
The electricity cost formula above uses $0.12/kWh as a baseline. Current EIA data puts the US commercial average closer to $0.13-0.14/kWh as of early 2026, so treat $0.12/kWh as a conservative floor rather than a precise figure. Actual rates vary substantially by market.
The following table models a 1,000-GPU H100 cluster: 1,000 x 700W x 1.80 server overhead x 1.4 PUE = approximately 1.76 MW continuous draw.
| Market | Rate ($/kWh) | Monthly Electricity Cost (1,000x H100) | Notes |
|---|---|---|---|
| Pacific NW (hydro) | ~$0.04 | ~$50,800 | Data center wholesale hydro rates |
| US Midwest average | ~$0.07 | ~$88,900 | Typical commercial industrial |
| US national median | ~$0.12 | ~$152,400 | EIA commercial average |
| California | ~$0.18-0.22 | ~$228,600-$279,400 | Commercial utility rates |
| Western Europe | ~$0.15-0.25 | ~$190,500-$317,500 | Germany, Netherlands market rates |
| Singapore / Tokyo | ~$0.18-0.22 | ~$228,600-$279,400 | Typical APAC data center rates |
The same 1,000-GPU cluster costs between $50,800 and $317,500 per month in electricity depending on geography. Over a 3-year hardware cycle, that spread is $9.6M on electricity alone. This is the number that on-premise cost models frequently omit when comparing to cloud.
Note: rates above are approximate commercial market rates; actual contracted data center rates depend on facility vintage, contract terms, and local utility tariffs.
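The table values fall out of the same formula. A short sketch using the approximate market rates listed above (illustrative rates only):

```python
# Regional electricity cost for the same 1,000x H100 cluster, different $/kWh.
# Rates are the article's approximate market figures, not contracted tariffs.

GPU_COUNT = 1_000
GPU_TDP_KW = 0.7
SERVER_OVERHEAD = 1.80
PUE = 1.4
HOURS_PER_MONTH = 720

cluster_kw = GPU_COUNT * GPU_TDP_KW * SERVER_OVERHEAD * PUE   # ~1,764 kW continuous

rates = {
    "Pacific NW (hydro)": 0.04,
    "US Midwest": 0.07,
    "US median": 0.12,
    "California (low end)": 0.18,
    "Western Europe (high end)": 0.25,
}

for market, rate in rates.items():
    monthly = cluster_kw * HOURS_PER_MONTH * rate
    print(f"{market}: ~${monthly:,.0f}/month")
# Pacific NW (hydro): ~$50,803/month ... Western Europe (high end): ~$317,520/month
```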
How Liquid Cooling Changes the Economics
PUE is not a fixed input. It depends on how the facility cools its IT equipment, and the gap between efficient and inefficient cooling directly scales the electricity cost of every GPU in the room.
Air-cooled racks: PUE 1.3-1.5. Every 100W of GPU power requires 30-50W of additional overhead for cooling and facility systems. Industry average for existing enterprise data centers sits around 1.5.
Liquid-assisted cooling (rear-door heat exchangers, direct liquid cooling): PUE 1.1-1.2. Significantly reduces cooling overhead by extracting heat at the source rather than conditioning room air.
Immersion cooling: PUE 1.03-1.05 at scale. Servers submerged in dielectric fluid. The most efficient option, though with higher upfront infrastructure cost.
For the B200 at 1,000W or more per GPU, dense rack configurations essentially require liquid cooling. Air cooling an 8x B200 node means rejecting roughly 13-15 kW of heat from a single ~10U chassis, a density that traditional CRAC units struggle to handle at scale (GPU TDP alone is 8 kW or more, and full server node power including CPUs, NVLink switches, RAM, and PSUs brings the total to 13-15 kW per node).
A practical savings example: dropping PUE from 1.45 to 1.15 on a 1,000-GPU H100 cluster at $0.12/kWh saves approximately $32,700/month. The IT load is 1,000 x 700W x 1.80 server overhead = 1,260 kW; the PUE reduction frees 1,260 kW x (1.45 - 1.15) = 378 kW of overhead power, or 378 kW x 720 hr x $0.12/kWh ≈ $32,700/month. Over 36 months, that is roughly $1.18M in electricity savings from cooling improvements alone, before any change to GPU count or utilization.
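The same arithmetic as a short sketch, using the assumed load and rate from that example:

```python
# PUE-savings arithmetic for the example above (1,000x H100, $0.12/kWh).

IT_LOAD_KW = 1_000 * 0.7 * 1.80      # 1,260 kW of IT load (GPUs + server overhead)
PUE_BEFORE, PUE_AFTER = 1.45, 1.15
RATE_PER_KWH = 0.12
HOURS_PER_MONTH = 720

overhead_freed_kw = IT_LOAD_KW * (PUE_BEFORE - PUE_AFTER)     # 378 kW of facility overhead
monthly_savings = overhead_freed_kw * HOURS_PER_MONTH * RATE_PER_KWH
print(f"~${monthly_savings:,.0f}/month, ~${monthly_savings * 36 / 1e6:.2f}M over 36 months")
# ~$32,659/month, ~$1.18M over 36 months
```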
The token factory economics guide covers this further in the context of maximizing tokens per watt across hardware tiers.
Inference vs Training Power Profiles
Training and inference consume power differently, and the difference matters for long-term budget planning.
Training workloads are bounded compute jobs. You run a training campaign for a defined number of steps, it finishes, the cluster goes idle. The power cost is a project expense with a defined end date. You can estimate it upfront from the GPU-hours budget.
Inference workloads are continuous. A model serving production traffic runs 24 hours a day, 7 days a week, as long as the product is live. Power draw scales with traffic but never reaches zero. The infrastructure cost is ongoing and grows with user adoption.
This is why, at scale, 80-90% of AI compute energy goes to inference rather than training. Companies train models once (or a few times with fine-tuning cycles). They run inference indefinitely. The ratio of inference-to-training compute hours at a company serving millions of users might be 100:1 or higher.
Budget planning implications are direct: inference power should be modeled as a utility cost, like bandwidth or database storage, not as a one-time project cost. Year 1, year 2, and year 3 infrastructure costs for a deployed model grow with usage. On-premise, this means the electricity bill grows with your product. On cloud, the hourly rate is fixed regardless of how power costs evolve in the underlying data centers.
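To make the bounded-vs-continuous distinction concrete, here is an illustrative comparison; the training GPU-hours and the serving fleet size are hypothetical assumptions, not figures from this guide:

```python
# Illustrative only: a bounded training campaign vs an always-on inference fleet.
# The node-hours and fleet size below are hypothetical assumptions.

NODE_KW = 10.0                      # 8x H100 node under load (from the TDP section)
HOURS_PER_YEAR = 8_760

training_node_hours = 50_000        # hypothetical: one bounded training campaign
training_kwh = training_node_hours * NODE_KW

inference_nodes = 16                # hypothetical: always-on serving fleet
inference_kwh_per_year = inference_nodes * NODE_KW * HOURS_PER_YEAR

print(f"training campaign: {training_kwh:,.0f} kWh (one-time)")
print(f"inference fleet:   {inference_kwh_per_year:,.0f} kWh/year (recurring)")
```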
For the full picture of how inference economics scale from single-GPU to production fleet, see the AI inference cost economics playbook linked earlier in this post.
GPU Cloud vs On-Premise: Who Pays the Electricity Bill
The core difference between cloud and on-premise GPU economics from a power standpoint:
| Factor | GPU Cloud (Spheron) | On-Premise / Colo |
|---|---|---|
| Electricity cost | Bundled in hourly rate | Separate metered cost |
| PUE overhead | Absorbed by provider | Your negotiated PUE |
| Cooling cost | Included | Your capex or facility charge |
| Price visibility | Fixed $/hr | Variable, quarterly bill |
| Exposure to rate changes | None | Full exposure |
| Location optimization | Provider-managed | Requires multi-region ops |
On GPU cloud, the hourly rate is the total cost. Power, cooling, networking, facility lease, and hardware maintenance are the provider's costs to manage. You see one number.
On-premise or colocation, electricity arrives as a separate line item that varies month to month based on utilization, rate adjustments, and season. You also carry PUE negotiation risk: a colo facility promising PUE 1.3 but delivering 1.5 in summer means your electricity bill is 15% higher than modeled.
The on-premise vs GPU cloud break-even analysis covers the full TCO comparison including utilization thresholds and hybrid strategies.
For teams evaluating workloads suited for an A100-class budget, A100 instances on Spheron offer on-demand pricing from $1.64/hr with no separate power or cooling cost.
Optimizing Tokens per Watt: Right-Sizing, Quantization, and Batch Tuning
Given that power draw is largely determined by the hardware you choose, the practical way to improve the economics is to extract more inference tokens per watt. Three levers control this.
1. Right-Sizing GPU Selection
Not every model needs the fastest GPU. For smaller models (7B-13B parameters), an A100 at 400W may produce comparable or better tokens-per-watt than an H100 at 700W when the model fits comfortably in A100 VRAM without hitting memory bandwidth limits.
Rough comparison for Llama 3.1 8B (a model well within A100 VRAM):
| GPU | TDP | Est. Throughput (Llama 3.1 8B, batch 32) | Tokens/Watt |
|---|---|---|---|
| A100 SXM4 80G | 400W | ~4,000 tok/s | ~10.0 |
| H100 SXM5 80G | 700W | ~6,500 tok/s | ~9.3 |
| L40S | 350W | ~3,200 tok/s | ~9.1 |
For an 8B model at this batch size, the A100 delivers more tokens per watt than the H100, purely because the H100's extra compute and memory bandwidth headroom is not needed. The H100 advantage materializes at larger models (70B+) where memory bandwidth becomes the bottleneck.
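The tokens-per-watt column is simply throughput divided by TDP. A small sketch using the rough estimates from the table (not benchmark results):

```python
# Tokens-per-watt from the rough throughput estimates above (Llama 3.1 8B, batch 32).
estimates = {
    "A100 SXM4 80G": (400, 4_000),   # (TDP watts, est. tokens/sec)
    "H100 SXM5 80G": (700, 6_500),
    "L40S":          (350, 3_200),
}

for gpu, (tdp_w, tok_per_s) in estimates.items():
    print(f"{gpu}: ~{tok_per_s / tdp_w:.1f} tokens/watt")
# A100 ~10.0, H100 ~9.3, L40S ~9.1
```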
2. Quantization
INT8 and FP8 quantization reduce activation memory pressure and KV cache size, which allows larger batch sizes per watt. At the same power draw, a larger batch produces proportionally more tokens.
FP8 on H100 is well-supported in vLLM and TensorRT-LLM and maintains output quality close to BF16 for most production models. The typical effect is 30-40% more tokens per second at identical power draw compared to BF16 at the same batch size. Over a month, that translates to roughly a 23-29% reduction in electricity cost per token on the same hardware (a 30% throughput gain cuts cost per token by a factor of 1/1.30, a 23.1% reduction; a 40% gain cuts it by 1/1.40, a 28.6% reduction).
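The cost-per-token arithmetic in that parenthetical, as a sketch:

```python
# Cost-per-token reduction from a throughput gain at identical power draw.
def cost_per_token_reduction(throughput_gain: float) -> float:
    """Fractional reduction in electricity cost per token."""
    return 1 - 1 / (1 + throughput_gain)

print(f"{cost_per_token_reduction(0.30):.1%}")   # 23.1% cheaper per token at +30% throughput
print(f"{cost_per_token_reduction(0.40):.1%}")   # 28.6% cheaper per token at +40% throughput
```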
3. Batch Tuning with Continuous Batching
Under-batched inference is the most wasteful power configuration. A GPU at 20% utilization serving single requests one at a time draws nearly the same power as a GPU at 80% utilization serving 64 concurrent requests in a continuous batch. The electricity cost per token at 80% utilization is one-quarter of the cost at 20%.
Continuous batching (implemented in vLLM, SGLang, and TensorRT-LLM) keeps the GPU saturated across variable-length requests by dynamically merging in-flight and queued sequences. Typical utilization improvement from naive to continuous batching is 3-4x, with proportional improvement in tokens-per-watt.
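A rough illustration of the effect, assuming power draw stays nearly flat while throughput scales with utilization; the power and throughput numbers below are illustrative assumptions, not measurements:

```python
# Why under-batched inference wastes power: GPU power is roughly flat while
# throughput scales with utilization. Figures are illustrative only.

RATE_PER_KWH = 0.12

scenarios = {
    # name: (approx GPU power in kW, tokens/sec served)
    "single requests, ~20% util": (0.60, 1_500),
    "continuous batching, ~80% util": (0.68, 6_000),
}

for name, (power_kw, tok_per_s) in scenarios.items():
    tokens_per_hour = tok_per_s * 3_600
    cost_per_hour = power_kw * RATE_PER_KWH
    cost_per_million_tokens = cost_per_hour / tokens_per_hour * 1e6
    print(f"{name}: ~${cost_per_million_tokens:.4f} electricity per 1M tokens")
# Roughly 3-4x cheaper electricity per token at high utilization.
```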
See vLLM vs TensorRT-LLM vs SGLang benchmarks for measured throughput and the KV cache optimization guide for memory management techniques that extend effective batch capacity.
Spheron GPU Cloud Power-Inclusive Pricing Breakdown
Using live Spheron pricing (fetched 20 Apr 2026) and the on-prem TCO formula from the earlier section ($0.12/kWh, PUE 1.4, 36-month hardware cycle):
| Scenario | On-Prem TCO (est.) | Spheron On-Demand | Spheron Spot |
|---|---|---|---|
| Single H100 SXM5, 720 hr (1 month) | ~$3,420 | $2,088 | N/A |
| 8x H100 SXM5 node, 720 hr (1 month) | ~$27,360 | $16,704 | N/A |
| Single A100 SXM4, 720 hr (1 month) | ~$2,560 | $1,181 | N/A |
| 8x A100 SXM4 node, 720 hr (1 month) | ~$20,480 | $9,446 | N/A |
On-prem TCO includes hardware depreciation, electricity, and staff. It excludes networking hardware amortization, maintenance contracts, and facility lease, which would push the figures higher.
The on-prem staff cost dominates at small scale (1-8 GPUs): an infra engineer's loaded cost allocated to 8 GPUs is $2.85/GPU/hr, larger than the electricity component. At 128 GPUs managed by the same engineer, the staff cost drops to $0.18/GPU/hr. Cloud pricing has no equivalent floor: you pay the same per-GPU rate whether you run 1 or 1,000.
Electricity costs on-prem for the single H100 example: $0.21/hr x 720 hr = $152/month. At $0.20/kWh (California), that rises to $254/month per GPU. These figures are included in the on-prem TCO estimates above at the $0.12/kWh baseline.
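A sketch reproducing the single-GPU rows of the comparison table, using the on-prem $/hr figures from the TCO walkthrough and the listed Spheron on-demand rates:

```python
# Monthly single-GPU comparison from the table above. On-prem $/hr comes from the
# TCO walkthrough at $0.12/kWh; cloud rates are the listed Spheron on-demand prices.

HOURS = 720
on_prem_hourly = {"H100 SXM5": 4.75, "A100 SXM4": 2_560 / 720}   # est. TCO $/GPU/hr
cloud_hourly   = {"H100 SXM5": 2.90, "A100 SXM4": 1.64}          # on-demand $/GPU/hr

for gpu in on_prem_hourly:
    on_prem = on_prem_hourly[gpu] * HOURS
    cloud = cloud_hourly[gpu] * HOURS
    print(f"{gpu}: on-prem ~${on_prem:,.0f}/month vs cloud ${cloud:,.0f}/month")
# H100 SXM5: on-prem ~$3,420/month vs cloud $2,088/month
# A100 SXM4: on-prem ~$2,560/month vs cloud $1,181/month
```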
For current on-demand and spot rates across all GPU types, see Spheron GPU pricing.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 20 Apr 2026 and may have changed. Check current GPU pricing → for live rates.
Running inference on GPU cloud means your electricity cost is already priced in: no separate power bill, no PUE negotiation, no exposure to regional rate changes. Compare current hardware and spot pricing on Spheron before committing to on-premise infrastructure.
View A100 pricing → | View H200 pricing → | See all GPU pricing →
