Your GPU bill shows $12,000/month for H100 compute. Your actual AWS bill is $14,200. The $2,200 difference is egress, and it appears nowhere in the GPU rental comparison you ran before choosing AWS.
This is the most predictable surprise in AI infrastructure billing. Hyperscalers publish GPU compute rates prominently and bury data transfer pricing in a separate, tiered pricing page. For text-only inference at low volume, egress barely registers. For multimodal pipelines, high-frequency streaming APIs, or workloads that pull model weights from S3 on every restart, the bandwidth bill can exceed 20% of total monthly spend.
This guide breaks down where every egress dollar goes, shows you the math for common AI workload patterns, and covers both quick optimizations and a full migration path to zero-egress infrastructure.
Egress Pricing: What the Hyperscalers Actually Charge
AWS, GCP, and Azure all use tiered egress pricing. The first tier covers most small and mid-size AI deployments, but the rate doesn't drop enough to matter at production scale.
| Provider | First 10 TB/month | 10-50 TB/month | 50-150 TB/month | Free tier |
|---|---|---|---|---|
| AWS | $0.09/GB | $0.085/GB | $0.07/GB | 100 GB/month |
| GCP | $0.08/GB | $0.06/GB | $0.05/GB | None |
| Azure | $0.087/GB | $0.083/GB | $0.07/GB | 5 GB/month |
| Spheron | $0/GB | $0/GB | $0/GB | N/A |
| CoreWeave | $0/GB | $0/GB | $0/GB | N/A |
| Lambda | $0/GB | $0/GB | $0/GB | N/A |
Verify these rates against current provider pricing pages before committing. Hyperscaler egress pricing does change. AWS revised P5 compute pricing in mid-2025, but egress tiers have remained more stable.
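To make the tier structure concrete, here is a minimal sketch of a tiered egress calculator. It uses the AWS row from the table above and rests on two simplifying assumptions: 1 TB is treated as 1,000 GB, and the free tier is ignored. The function and constant names are illustrative, not part of any provider SDK.

```python
# Tier sizes (GB) and rates (USD/GB) copied from the AWS row of the table.
# Assumptions: 1 TB = 1,000 GB, free tier ignored.
AWS_TIERS = [
    (10_000, 0.09),    # first 10 TB
    (40_000, 0.085),   # next 40 TB (10-50 TB)
    (100_000, 0.07),   # next 100 TB (50-150 TB)
]


def tiered_egress_cost(gb: float, tiers=AWS_TIERS) -> float:
    """Return the monthly egress charge in USD for `gb` of outbound traffic."""
    if gb < 0:
        raise ValueError("volume must be non-negative")
    remaining = gb
    cost = 0.0
    for tier_size_gb, rate in tiers:
        billed = min(remaining, tier_size_gb)  # portion that falls in this tier
        cost += billed * rate
        remaining -= billed
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("volume exceeds the tiers listed in the table")
    return cost


if __name__ == "__main__":
    for volume_gb in (1_500, 10_000, 60_000):
        print(f"{volume_gb:>7,} GB -> ${tiered_egress_cost(volume_gb):,.2f}")
```

Swapping in the GCP or Azure row only requires a different `tiers` list. Note how little the blended rate moves at 60 TB, which is the point made above about tiers not dropping enough to matter.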
There's also a second cost category that many engineers overlook: cross-AZ data transfer. When your inference pods run in one availability zone and your load balancer or API gateway sits in another, AWS charges $0.01/GB in each direction. For a high-traffic endpoint with a standard multi-AZ deployment, this adds up independently of internet egress. Keep your inference GPU, load balancer, and downstream services in the same AZ where possible to eliminate this line item entirely.
For teams dealing with other unexpected AWS costs, egress is typically the largest hidden expense after reserved instance complexity and cross-service data transfer charges.
The Token Streaming Math
Text-based LLM endpoints don't generate as much egress as most engineers assume. Understanding the exact byte budget helps you prioritize which workloads to optimize first.
Bytes per token: UTF-8 encodes English text at roughly 3-5 bytes per token using a standard BPE tokenizer. For a typical mix of English prose and code, 4 bytes per token is a solid working estimate. Multilingual output skews higher (Chinese and Japanese characters are 3 bytes each but map to fewer tokens). Code-heavy outputs with long identifiers run lower.
Streaming overhead: Each SSE event carries framing on top of the token text: a data: prefix, a JSON wrapper, and newline delimiters, plus HTTP chunk framing underneath. That framing adds roughly 20-40 bytes per streamed chunk depending on your response envelope. For a server that emits one token per SSE event, the framing alone is several times the 4-byte token payload, so the overhead is significant. The 1.5x to 2x effective multiplier used in the rest of this guide is realistic when the server coalesces many tokens into each event; measure your own endpoint if it sends one event per token.
| Daily token volume | Raw output (bytes/token = 4) | With 1.5x streaming | Monthly egress | AWS cost/month |
|---|---|---|---|---|
| 1M tokens/day | 4 MB/day | 6 MB/day | 0.18 GB | $0.02 |
| 10M tokens/day | 40 MB/day | 60 MB/day | 1.8 GB | $0.16 |
| 100M tokens/day | 400 MB/day | 600 MB/day | 18 GB | $1.62 |
| 1B tokens/day | 4 GB/day | 6 GB/day | 180 GB | $16.20 |
Pure text at these volumes is manageable. The AI inference cost economics analysis covers the full unit economics picture, including compute and batching effects. Egress is rarely the dominant cost for text-only workloads.
Where it shifts: when response payloads grow. An LLM that appends retrieved document excerpts to its response, returns base64-encoded images, or serves audio transcripts in the response body moves the egress needle fast. The same math applies, but with KB-per-response instead of bytes-per-response.
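The table in this section follows from one line of arithmetic, sketched below so you can plug in your own volumes. The defaults (4 bytes per token, 1.5x streaming multiplier, 30-day month, $0.09/GB) come from the text; the function names are illustrative. The second helper covers the per-response case, where payload size rather than token count drives the total.

```python
AWS_FIRST_TIER_USD_PER_GB = 0.09  # first-tier rate from the pricing table


def monthly_text_egress_gb(tokens_per_day: float,
                           bytes_per_token: float = 4.0,
                           streaming_multiplier: float = 1.5,
                           days: int = 30) -> float:
    """Monthly egress in decimal GB for a streaming text endpoint."""
    return tokens_per_day * bytes_per_token * streaming_multiplier * days / 1e9


def monthly_payload_egress_gb(requests_per_day: float,
                              bytes_per_response: float,
                              days: int = 30) -> float:
    """Monthly egress in decimal GB when each response carries a fixed payload."""
    return requests_per_day * bytes_per_response * days / 1e9


if __name__ == "__main__":
    for tokens in (1e6, 1e7, 1e8, 1e9):
        gb = monthly_text_egress_gb(tokens)
        usd = gb * AWS_FIRST_TIER_USD_PER_GB
        print(f"{tokens:>13,.0f} tokens/day -> {gb:7.2f} GB/month, ${usd:,.2f}")
```

The same request count that is negligible at a few hundred bytes per response becomes hundreds of GB per month once each response carries tens of kilobytes.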
High-Egress AI Workloads
LLM Token Streaming Endpoints
Standard streaming text inference is the lowest-egress AI pattern. The numbers above show that even a billion tokens per day generates only about 180 GB/month of egress, with the 1.5x streaming overhead already included. The concern here is framing overhead at very high request rates. At 10,000 streaming requests/second, the cumulative cost of connection establishment and SSE framing can push effective bytes-per-token above the 1.5x multiplier. HTTP/2 multiplexing reduces the connection-level share of this significantly.
Video Generation Pipelines
Video generation creates the most egress of any AI workload by a wide margin. A single 720p, 5-second clip generated by a diffusion model runs 3-8 MB depending on codec and quality settings. At 1080p the frame holds 2.25x as many pixels as 720p, so budget for roughly double the file size and measure your actual encoder output. A pipeline generating 10,000 clips per day at 5 MB each produces 50 GB/day of outbound traffic, or 1.5 TB/month. That's $135/month in egress on AWS, on top of the significant GPU compute costs for generation. Serving video outputs through a CDN with edge caching eliminates most of this, since generated clips are static artifacts.
Audio Synthesis APIs
Text-to-speech output runs from under 10 KB to about 200 KB per second of audio, depending on codec. WAV files at 16kHz/16-bit/mono: about 32 KB/second. MP3 at 128kbps: 16 KB/second. For a TTS API serving 1 million 30-second responses per day, that's 960 GB/day in WAV format or 480 GB/day at 128kbps MP3. Encoding output as Opus at 48kbps (standard for voice-quality audio) cuts this to 180 GB/day. Codec selection matters more for egress than for compute cost.
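The codec figures above can be checked with a short sketch. It assumes constant-bitrate output and ignores container headers (a WAV header is only a few dozen bytes); the helper names are illustrative.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Uncompressed PCM (WAV payload) data rate in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


def cbr_bytes_per_second(kbps: float) -> float:
    """Constant-bitrate codec data rate in bytes per second (1 kbps = 1,000 bits/s)."""
    return kbps * 1000 / 8


def daily_audio_gb(bytes_per_second: float,
                   seconds_per_response: float,
                   responses_per_day: float) -> float:
    """Daily outbound audio volume in decimal GB."""
    return bytes_per_second * seconds_per_response * responses_per_day / 1e9


if __name__ == "__main__":
    rates = {
        "WAV 16kHz/16-bit/mono": pcm_bytes_per_second(16_000, 16, 1),
        "MP3 128 kbps": cbr_bytes_per_second(128),
        "Opus 48 kbps": cbr_bytes_per_second(48),
    }
    for label, rate in rates.items():
        print(f"{label:<24} {daily_audio_gb(rate, 30, 1_000_000):6.0f} GB/day")
```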
Multimodal RAG
RAG pipelines that retrieve and return images compound the egress cost fast. The typical pattern: retrieve 3-5 document chunks, where some are image pages from a PDF, and embed thumbnails or base64-encoded images in the JSON response. A 300x300px JPEG thumbnail is roughly 15-25 KB. If your average RAG response includes two thumbnails, you've added 30-50 KB to every request, and base64 encoding inflates that by about a third. At 500,000 daily requests and 50 KB per response, that's an additional 25 GB/day of egress, or 750 GB/month ($67.50/month at AWS rates).
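When images travel inside a JSON body they are usually base64-encoded, which adds about a third to the raw size. The sketch below shows the exact padded-base64 length and checks it against Python's own encoder. The 20 KB thumbnail size is an assumed midpoint of the range above, and the helper name is illustrative.

```python
import base64


def base64_encoded_len(raw_bytes: int) -> int:
    """Length of standard padded base64 output for `raw_bytes` of input."""
    return 4 * ((raw_bytes + 2) // 3)


if __name__ == "__main__":
    thumbnail_bytes = 20_000      # assumed size of one JPEG thumbnail
    thumbnails_per_response = 2
    requests_per_day = 500_000

    # Confirm the formula against the standard library encoder.
    assert len(base64.b64encode(bytes(thumbnail_bytes))) == base64_encoded_len(thumbnail_bytes)

    raw = thumbnail_bytes * thumbnails_per_response
    encoded = base64_encoded_len(thumbnail_bytes) * thumbnails_per_response
    print(f"raw image bytes per response:     {raw:,}")
    print(f"base64 bytes per response:        {encoded:,}")
    print(f"daily egress from thumbnails:     {encoded * requests_per_day / 1e9:.1f} GB")
```

Returning image URLs instead of inline base64 avoids the inflation and lets a CDN serve the bytes.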
Embedding APIs at Scale
Float32 embeddings are surprisingly large. A 1536-dimension embedding vector (OpenAI text-embedding-ada-002 output dimension) encodes to 6,144 bytes in raw float32. Returning 1,000 embeddings per batch request sends 6 MB per response. At 100,000 embedding requests per day with average batch size of 100, that's 61 GB/day or 1.8 TB/month. Switching to float16 or int8 quantized embeddings halves or quarters this immediately.
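The per-vector sizes are easy to verify with the standard library. This sketch uses `struct.calcsize` to get the packed binary size for float32, float16, and int8, then scales to the daily volume from the example above. It measures raw binary only; a JSON array of decimal floats will be considerably larger, which is an assumption worth checking against your own API's wire format.

```python
import struct

# struct format codes: f = float32, e = float16, b = int8
DTYPE_CODES = {"float32": "f", "float16": "e", "int8": "b"}


def embedding_bytes(dims: int, dtype: str = "float32") -> int:
    """Packed binary size in bytes of one embedding vector."""
    return struct.calcsize(f"<{dims}{DTYPE_CODES[dtype]}")


def daily_embedding_gb(requests_per_day: int, batch_size: int,
                       dims: int, dtype: str = "float32") -> float:
    """Daily outbound volume in decimal GB for an embedding API."""
    return requests_per_day * batch_size * embedding_bytes(dims, dtype) / 1e9


if __name__ == "__main__":
    for dtype in DTYPE_CODES:
        per_vector = embedding_bytes(1536, dtype)
        per_day = daily_embedding_gb(100_000, 100, 1536, dtype)
        print(f"{dtype:<8} {per_vector:>5} bytes/vector  {per_day:6.2f} GB/day")
```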
Real-World Cost Model: 50M Tokens/Day
Here's a worked comparison for a streaming inference endpoint serving 50M output tokens per day on an 8x H100 SXM5 pod. This is a representative mid-scale production workload, roughly equivalent to a B2B API with 50-100K daily active users running sessions of average length.
Spheron H100 SXM5: $3.90/hr per GPU on-demand (from live API, 27 May 2026)
AWS P5.48xlarge (8x H100 SXM5): approximately $6.88/hr per GPU on-demand ($55.04/hr for the full 8-GPU instance). See AWS H100 pricing breakdown for current rates.
50M tokens/day egress calculation:
50,000,000 tokens/day × 4 bytes/token × 1.5x streaming overhead × 30 days
= 9,000,000,000 bytes/month
= ~8.4 GB/month outbound

| Cost Component | AWS P5 (8x H100) | Spheron H100 SXM5 (8x, on-demand) | Monthly Savings |
|---|---|---|---|
| GPU compute (720 hrs) | $39,629 | $22,464 | $17,165 |
| Egress (8.4 GB text) | $0.76 | $0 | $0.76 |
| Cross-AZ transfer (est.) | $5-$20 | $0 | $5-$20 |
| S3 GET (model weights, 1x/day load) | $378 | $0 | $378 |
| Monthly total (approx.) | ~$40,020 | $22,464 | ~$17,556 |
For a text-only endpoint, egress itself is small. The compounding savings come from zero compute premium, no S3 egress for model weight loading, and no cross-AZ charges.
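Note that the model weight loading line assumes the inference server pulls its weights from S3 on each container restart, a common pattern when using AWS ECS or EKS without persistent volumes. A 70B FP16 model is ~140 GB. If you restart once per day: 140 GB × 30 days × $0.09/GB = $378/month in S3 GET egress. The $0.09/GB figure is the internet egress rate, so it applies when the pull leaves AWS or is billed as internet traffic; AWS does not charge data transfer between S3 and EC2 in the same region, so check how your pulls are actually billed before counting this line. Keeping model weights on a persistent volume attached to the inference node eliminates the charge entirely, regardless of cloud provider.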
For video or multimodal workloads, swap the 8.4 GB egress figure for the actual payload sizes from the previous section. At 1.5 TB/month (video generation pipeline), the egress line becomes $135/month and grows from there.
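The comparison table can be reproduced with a small model, which makes it easier to substitute your own rates and volumes. All figures are the ones quoted in this section. The one added assumption is the cross-AZ line, set here to $12.50 as the midpoint of the $5-$20 estimate. Class and function names are illustrative.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 720
INTERNET_EGRESS_USD_PER_GB = 0.09  # first AWS tier


@dataclass(frozen=True)
class MonthlyBill:
    compute: float
    egress: float
    cross_az: float
    weight_reloads: float

    @property
    def total(self) -> float:
        return self.compute + self.egress + self.cross_az + self.weight_reloads


def hyperscaler_bill(gpu_hourly: float = 6.88, gpus: int = 8,
                     egress_gb: float = 8.4,
                     cross_az_usd: float = 12.5,   # assumed midpoint of $5-$20
                     weights_gb: float = 140,
                     reloads_per_month: int = 30) -> MonthlyBill:
    return MonthlyBill(
        compute=gpu_hourly * gpus * HOURS_PER_MONTH,
        egress=egress_gb * INTERNET_EGRESS_USD_PER_GB,
        cross_az=cross_az_usd,
        weight_reloads=weights_gb * reloads_per_month * INTERNET_EGRESS_USD_PER_GB,
    )


def zero_egress_bill(gpu_hourly: float = 3.90, gpus: int = 8) -> MonthlyBill:
    return MonthlyBill(compute=gpu_hourly * gpus * HOURS_PER_MONTH,
                       egress=0.0, cross_az=0.0, weight_reloads=0.0)


if __name__ == "__main__":
    aws, neo = hyperscaler_bill(), zero_egress_bill()
    print(f"AWS total:     ${aws.total:,.0f}")
    print(f"Spheron total: ${neo.total:,.0f}")
    print(f"Savings:       ${aws.total - neo.total:,.0f}")
```

For a video pipeline, passing `egress_gb=1500` moves the egress line to $135 while the other lines stay unchanged.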
Pricing fluctuates based on GPU availability. The prices above are based on 27 May 2026 and may have changed. Check current GPU pricing for live rates.
If you've done a GPU cloud pricing comparison and found the compute delta already compelling, the egress analysis makes the case stronger for any workload that serves data at scale.
Cost-Cutting Techniques That Don't Require Migration
For teams committed to their current hyperscaler, these optimizations reduce egress without a provider change.
Response compression. JSON LLM responses compress aggressively: 60-80% reduction with gzip, slightly better with brotli. A 10 KB JSON response body compresses to 2-4 KB. If your inference API server (vLLM, TGI, or custom FastAPI) isn't already enabling gzip on responses, this is a one-line config change that immediately cuts egress volume. Most API gateways can handle this at the edge rather than at the inference server.
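Before changing server config, it is worth measuring how well your own responses compress. The sketch below gzips a sample JSON body with Python's standard library and reports the ratio. The response shape is a hypothetical stand-in, and the sample text is deliberately repetitive, so it compresses far better than real output will; run it on captured production responses for a meaningful number.

```python
import gzip
import json


def sample_response() -> bytes:
    """A made-up JSON completion body, used only to exercise the measurement."""
    body = {
        "id": "resp-123",  # hypothetical field names
        "choices": [{"text": "Egress is billed per gigabyte. " * 200}],
    }
    return json.dumps(body).encode("utf-8")


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(gzip.compress(payload, compresslevel=level)) / len(payload)


if __name__ == "__main__":
    payload = sample_response()
    ratio = compression_ratio(payload)
    print(f"original:   {len(payload):,} bytes")
    print(f"compressed: {int(len(payload) * ratio):,} bytes")
    print(f"egress reduction: {(1 - ratio) * 100:.0f}%")
```

Compression pays off most on complete response bodies. Small streamed chunks give the compressor little context to work with, so expect smaller gains on token-by-token SSE output.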
CDN fronting. Cloudflare's free tier caches static and semi-static responses with global edge PoPs. For deterministic inference endpoints (same prompt always produces the same output, which applies to zero-temperature completions and classification models), a 5-minute CDN TTL can serve the majority of repeat requests without hitting the origin. Even for non-deterministic endpoints, caching identical prompts at the edge for a short window (an exact-match cache; a semantic cache extends this to near-duplicate prompts) reduces origin traffic and eliminates the egress charge for cached hits.
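The caching idea can be sketched independently of any particular CDN. The example below builds a deterministic cache key from the model name, prompt, and sampling parameters, and pairs it with a small in-memory TTL cache. Both are illustrative stand-ins for whatever your edge or proxy layer provides, not a real CDN API.

```python
import hashlib
import json
import time


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Stable key: identical model, prompt, and parameters hash to the same value."""
    canonical = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """Minimal exact-match cache with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]   # expired: drop and report a miss
            return None
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (self._clock(), value)


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=300)
    key = cache_key("example-model", "Classify: great product", {"temperature": 0})
    if cache.get(key) is None:
        cache.put(key, "positive")   # in practice, the origin's response
    print(cache.get(key))
```

Including the sampling parameters in the key matters: a zero-temperature request and a high-temperature request for the same prompt should not share a cached answer.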
Same-AZ deployment. AWS charges $0.01/GB in each direction for cross-AZ traffic. If your inference GPU is in us-east-1a and your application server is in us-east-1b, every response crosses an AZ boundary and is billed on both sides, an effective $0.02/GB. Moving to a same-AZ deployment eliminates this. For multi-AZ redundancy, the cost is unavoidable, but verify you actually need multi-AZ HA before paying for the traffic.
Batch asynchronous workloads. For non-interactive AI jobs (batch summarization, embedding generation, offline classification), streaming results back to the client in real time is unnecessary. Accumulate outputs and deliver them in a single response rather than SSE-streaming each token. This eliminates the 1.5-2x streaming multiplier and reduces connection count, which also cuts TCP overhead.
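To see how much framing a streamed response carries compared with a single body, the sketch below counts application-layer bytes for both delivery modes. The `{"delta": ...}` and `{"text": ...}` envelopes are hypothetical minimal shapes; real envelopes carry more fields. HTTP chunk framing, TLS, and TCP headers are not counted, so the true gap on the wire is larger.

```python
import json


def sse_stream_bytes(tokens: list, tokens_per_event: int = 1) -> int:
    """Total bytes of SSE frames when `tokens_per_event` tokens share one event."""
    if tokens_per_event < 1:
        raise ValueError("tokens_per_event must be at least 1")
    total = 0
    for start in range(0, len(tokens), tokens_per_event):
        chunk = "".join(tokens[start:start + tokens_per_event])
        frame = "data: " + json.dumps({"delta": chunk}) + "\n\n"
        total += len(frame.encode("utf-8"))
    return total


def single_body_bytes(tokens: list) -> int:
    """Bytes of one JSON body carrying the whole completion."""
    return len(json.dumps({"text": "".join(tokens)}).encode("utf-8"))


if __name__ == "__main__":
    tokens = ["word"] * 100  # 100 four-byte tokens
    raw = sum(len(t.encode("utf-8")) for t in tokens)
    for per_event in (1, 10, 100):
        streamed = sse_stream_bytes(tokens, per_event)
        print(f"{per_event:>3} tokens/event: {streamed:>5} bytes ({streamed / raw:.2f}x raw)")
    print(f"single body:      {single_body_bytes(tokens):>5} bytes")
```

The multiplier depends heavily on how many tokens share an event, which is why measuring your own endpoint is more reliable than assuming a fixed factor.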
Avoid S3 model weight loading on every restart. Keep model weights on local NVMe or a persistent block volume attached to the inference instance. The first load is unavoidable, but every subsequent restart pulls from local disk instead of S3. For a 70B model on an instance that restarts daily, this saves ~$380/month in S3 egress alone.
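One way to implement this is a startup check that downloads weights only when the local copy is missing. The sketch below is provider-neutral: `fetch` is a placeholder callable you would supply (for example, a wrapper around your object-storage client), and the `.complete` marker file is an assumed convention for detecting a finished download.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(local_dir: Path,
                   fetch: Callable[[Path], None],
                   marker_name: str = ".complete") -> bool:
    """Download weights into `local_dir` only if no completed copy exists.

    Returns True when a download happened, False when the local copy was reused.
    """
    marker = local_dir / marker_name
    if marker.exists():
        return False                       # persistent volume already populated
    local_dir.mkdir(parents=True, exist_ok=True)
    fetch(local_dir)                       # caller-supplied download routine
    marker.touch()                         # written last, so partial downloads retry
    return True


if __name__ == "__main__":
    def fake_fetch(target: Path) -> None:
        # Stand-in for a real object-storage download.
        (target / "model.safetensors").write_bytes(b"\x00" * 16)

    with tempfile.TemporaryDirectory() as tmp:
        weights_dir = Path(tmp) / "llm-70b"
        print("first start downloaded:", ensure_weights(weights_dir, fake_fetch))
        print("restart downloaded:    ", ensure_weights(weights_dir, fake_fetch))
```

Writing the marker only after the fetch returns means an interrupted download is retried on the next start rather than being mistaken for a complete copy.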
Architecting to Minimize Egress
Regional Model Placement
Co-locate your inference GPU in the same region as your largest user concentration. AWS, GCP, and Azure all charge more for egress to certain geographies. Traffic from us-east-1 to Europe costs more than traffic staying within North America. If 60% of your users are in Europe, a Frankfurt or Paris inference endpoint may have lower effective egress costs than serving from Virginia, even accounting for any compute price differences between regions.
For the GPU cost optimization playbook, regional placement is step one before any further optimization. Routing users to the nearest inference node cuts both latency and egress charges.
Edge Inference
Quantized models at 4-bit or 8-bit precision run on consumer-grade or entry-level datacenter GPUs at the network edge. A Llama 3 8B Q4 model needs about 5 GB of VRAM for its weights, runs adequately on an RTX 4090 or an L40S, and can be deployed in dozens of edge PoPs close to user concentrations. Egress from edge nodes typically qualifies for lower rates or is zero on platforms designed for edge workloads. The tradeoff is model quality and maintenance complexity.
For workloads where a smaller quantized model meets quality requirements, edge inference eliminates the egress charge and improves latency simultaneously.
Hybrid Topology
High-egress generation workloads (video, audio) run on a neocloud with zero egress fees. Orchestration, routing, and user-facing APIs stay on a hyperscaler or managed platform. The hybrid model concentrates bandwidth-intensive operations where egress is free and keeps stateful, compliance-sensitive workloads where the hyperscaler's governance features are valuable.
This architecture requires your inference endpoints to be addressable from outside the orchestration plane, which is standard for containerized deployments. The egress from orchestrator to inference node is internal and typically zero or near-zero on most platforms.
Migration Checklist: Moving Egress-Heavy Workloads
Moving an inference endpoint from a hyperscaler to a zero-egress neocloud is straightforward for stateless workloads. The process typically takes a few hours of engineering time for a well-containerized service.
- Rank workloads by monthly egress GB. Use your cloud provider's billing export to identify the top 3-5 highest-egress services. Start with the highest-volume one where the compute configuration is simple (single GPU, stateless inference).
- Containerize your inference stack. Package the model server (vLLM, TGI, or custom) with its runtime dependencies into a Docker image. Test locally with docker run before targeting a new provider. If you're already on Kubernetes or ECS, you likely have container images ready.
- Externalize all state to S3-compatible object storage. The new provider needs to be able to access your model weights, checkpoints, and any persistent artifacts. Most neoclouds support S3-compatible endpoints or allow mounting volumes pre-loaded with weights. Confirm before migrating.
- Benchmark on the target provider. Provision equivalent GPU capacity and run your standard load tests against the new endpoint. Compare tokens/sec, latency percentiles, and memory usage. Differences in NVLink topology, NIC bandwidth, or driver versions can affect inference throughput.
- Shadow-deploy and validate output parity. Route a small percentage of production traffic to the new endpoint (or mirror production requests to it) and compare outputs. For non-deterministic endpoints, compare distributions rather than exact outputs. Run this for 24-48 hours before cutting over.
- Cut over and decommission. Shift all traffic to the new endpoint, monitor for 24 hours, then terminate the original instance. The guide to migrating from AWS to a neocloud covers the full mechanics of switching load balancers and updating DNS.
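For the shadow-deploy step, output parity on a deterministic endpoint can be checked with a simple comparison of mirrored requests. The sketch below assumes you have collected outputs from both endpoints keyed by request ID; the function name and report fields are illustrative. For non-deterministic endpoints, exact matching is the wrong test, and you would compare distributions as the checklist says.

```python
def parity_report(primary: dict, shadow: dict) -> dict:
    """Compare outputs keyed by request ID from two endpoints.

    Trailing and leading whitespace is ignored when comparing outputs.
    """
    shared = primary.keys() & shadow.keys()
    mismatched = sorted(
        rid for rid in shared if primary[rid].strip() != shadow[rid].strip()
    )
    matched = len(shared) - len(mismatched)
    return {
        "compared": len(shared),
        "match_rate": matched / len(shared) if shared else 0.0,
        "mismatched_ids": mismatched,
        "missing_from_shadow": sorted(primary.keys() - shadow.keys()),
    }


if __name__ == "__main__":
    primary = {"req-1": "positive", "req-2": "negative", "req-3": "neutral"}
    shadow = {"req-1": "positive", "req-2": "neutral"}
    report = parity_report(primary, shadow)
    for field, value in report.items():
        print(f"{field}: {value}")
```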
For A100 instances on Spheron or an H100 GPU rental, you can provision capacity immediately without reservation queues or quota approval processes. For technical setup and configuration options, see the Spheron documentation.
Egress is a slow leak. On any single day the charge is small. Over six months of production traffic, it compounds into thousands of dollars that showed up nowhere in the GPU pricing comparison you made before launching. The fix is architectural: zero-egress providers eliminate the problem entirely, and the techniques above address it within a hyperscaler if migration isn't on the table.
Egress fees are the most invisible line item in a hyperscaler GPU bill. Spheron charges no egress fees on GPU rentals, which can save hundreds to thousands of dollars per month for inference endpoints serving millions of tokens daily. See the full GPU pricing or start with an H100.
Quick Setup Guide
Pull 90 days of billing data from your cloud provider. In AWS, use the Cost Explorer with 'Usage Type' filter set to 'DataTransfer-Out-Bytes'. In GCP, filter the Billing export by SKU containing 'Network Egress'. Calculate your monthly egress GB and multiply by the per-GB rate. Add cross-AZ transfer costs (AWS charges $0.01/GB each direction within a region). Compare this total to your compute line item. For high-traffic inference endpoints, egress often represents 10-25% of total cloud spend.
For text LLM endpoints: measure average response size in bytes (curl -w '%{size_download}' against a sample of requests). Divide by the token count in that response to get bytes-per-token for your specific model and typical prompt mix. For streaming endpoints, measure the actual TCP payload size over a 5-minute capture (tcpdump or Wireshark on the inference container's NIC). For multimodal: image outputs average 50-500 KB/response, audio 20-200 KB/second, video 1-5 MB/second at 720p.
Rank workloads by monthly egress GB. Typical high-egress AI workloads in order: (1) video generation pipelines serving 720p/1080p outputs, (2) audio synthesis APIs returning wav/mp3, (3) multimodal RAG systems returning image-augmented responses, (4) high-volume streaming LLM endpoints, (5) embedding APIs returning float32 vectors (a 1536-dimension embedding = 6 KB per vector, so a batched response is many times that). For a given workload, calculate (requests/month) x (response bytes) to get total monthly egress volume.
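The ranking step is a sort over (requests/month) x (response bytes). The sketch below applies it to three workloads using volumes quoted earlier in this guide; the class and function names are illustrative, and you would replace the sample entries with figures from your billing export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    requests_per_month: int
    avg_response_bytes: int

    @property
    def monthly_egress_gb(self) -> float:
        return self.requests_per_month * self.avg_response_bytes / 1e9


def rank_by_egress(workloads: list) -> list:
    """Return workloads sorted from highest to lowest monthly egress."""
    return sorted(workloads, key=lambda w: w.monthly_egress_gb, reverse=True)


if __name__ == "__main__":
    candidates = [
        # 500,000 requests/day with ~50 KB of thumbnails each
        Workload("multimodal-rag", 500_000 * 30, 50_000),
        # 10,000 clips/day at 5 MB each
        Workload("video-generation", 10_000 * 30, 5_000_000),
        # 100,000 requests/day, 100 float32 vectors of 6,144 bytes each
        Workload("embeddings", 100_000 * 30, 100 * 6_144),
    ]
    for w in rank_by_egress(candidates):
        print(f"{w.name:<18} {w.monthly_egress_gb:8.1f} GB/month")
```

Ranking by measured volume rather than by workload type avoids surprises: with these inputs, the batched embedding API outranks the video pipeline.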
Containerize your inference stack (vLLM, TGI, or custom FastAPI). Externalize all state to S3-compatible object storage. Test the container locally, then provision an equivalent GPU on Spheron. Update your load balancer or reverse proxy to route traffic to the new endpoint. Run both endpoints in parallel for 48 hours to verify output parity. Then cut over and decommission the hyperscaler instance. The migration process for a stateless inference endpoint typically takes 2-4 hours of engineering time.
For streaming endpoints: enable HTTP/2 multiplexing to reduce per-stream TCP overhead. For repeated or cacheable responses: front with a CDN such as Cloudflare or an Nginx proxy cache with a short TTL. For batch workloads: consolidate small requests into larger batches before transmitting results. For RAG pipelines: store retrieved chunks in the same region as your inference GPU so only the final answer travels out. For model loading: keep weights in persistent volume storage on the inference node rather than downloading from S3 on each restart.
Frequently Asked Questions
AWS charges $0.09/GB for the first 10 TB/month of data transferred out to the internet, dropping to $0.085/GB for the next 40 TB and $0.07/GB for the next 100 TB. For GPU instances in us-east-1, this adds about $9 to every 100 GB served at the first-tier rate. GCP and Azure have similar structures at $0.08-$0.12/GB depending on tier and region. By contrast, several neoclouds charge zero or near-zero egress fees.
A typical LLM response token encodes to 3-5 bytes in UTF-8 (closer to 4 bytes on average for mixed English/code output). At 4 bytes/token, 100 million tokens/day = 400 MB/day = ~12 GB/month. At AWS egress rates ($0.09/GB), that is roughly $1.08/month in egress alone for a 100M token/day workload. Streaming responses amplify this because the TCP overhead per streamed chunk is much higher than batch delivery. A realistic streaming multiplier is 1.5-2x, pushing the monthly egress bill for 100M tokens/day to $1.60-$2.20/month. At 10 billion tokens/day, this becomes $160-$220/month in egress costs that appear nowhere in per-GPU-hour pricing comparisons.
Spheron, CoreWeave, Lambda Labs, and Vast.ai all charge zero or near-zero egress fees for most workloads. Hyperscalers (AWS, GCP, Azure) charge $0.05-$0.12/GB. The difference compounds at scale: at 10 TB/month egress, the hyperscaler bill adds $900/month that a neocloud user pays nothing for.
Start with your monthly request volume. Multiply by average tokens per response, then by 4 bytes/token. For streaming endpoints, apply a 1.5x overhead multiplier for SSE and TCP framing. Add any training data ingest (typically S3 egress if pulling from outside your region). Add model checkpoint downloads if you're loading weights from object storage each restart. Sum all of these, multiply by your provider's egress rate per GB, and that's your monthly egress bill. For multimodal workloads (image, audio, video), measure actual response sizes in bytes rather than estimating from token counts.
The most effective moves, in descending order of impact: (1) Switch to a provider with zero egress fees, which eliminates the cost entirely. (2) Front your inference endpoint with a CDN like Cloudflare to cache repeated responses and offload bandwidth. (3) Enable gzip or brotli compression on JSON responses; LLM text compresses 60-80%, reducing effective egress volume. (4) Batch non-streaming requests where latency tolerance allows. (5) Place your model in the same region as your largest user concentration to minimize cross-region data movement within a hyperscaler's network.
