The AWS EC2 G7 family puts NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs inside a managed EC2 envelope. The smallest single-GPU option, g7.2xlarge, lists at $2.52/hr on-demand in US East (Ohio), one of the two regions where G7 is available. The 8-GPU g7.48xlarge sits at $28.51/hr total, or roughly $3.56 per GPU per hour. That's before EBS volumes, data egress, and support charges appear on the bill. For the hidden billing surprises beyond the EC2 line item, see the guide to avoiding unexpected AWS costs. For a full cross-provider GPU cost view, see the GPU cloud pricing comparison 2026.
AWS launched G7 as generally available in June 2026, with availability in US East (Ohio) and US West (Oregon). It is the first AWS instance family to use Blackwell Server Edition GPUs at this tier. The comparison post on AWS vs GCP vs Azure GPU alternatives covers the broader hyperscaler cost structure, but G7 is worth its own breakdown because its inference-grade positioning and Blackwell FP4 support change the cost-per-token math in ways the AWS H100 P5 pricing guide does not address.
What Is the AWS EC2 G7 Instance Family
G7 instances run NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs with 32GB GDDR7 per GPU, the same per-GPU memory capacity as the RTX 5090. The "Server Edition" suffix matters: it indicates data-center drivers, ECC memory support, and passive cooling designed for rack-mounted density. G7 is not a training instance. There is no NVLink between GPUs, which limits tensor-parallel efficiency for very large models. It is built for inference, video processing, graphics rendering, and spatial computing workloads.
AWS claims 4.6x AI inference performance over G6 and 2.1x graphics performance. The 700 Gbps EFA-enabled networking (7x more than G6) supports multi-node inference and GPU-direct RDMA for scenarios where inference needs to span instances.
| Instance | GPUs | GPU VRAM | vCPUs | System RAM | Local NVMe | Network |
|---|---|---|---|---|---|---|
| g7.2xlarge | 1 | 32 GB | 8 | 32 GiB | 1 x 600 GB | Up to 60 Gbps |
| g7.4xlarge | 1 | 32 GB | 16 | 64 GiB | 1 x 600 GB | Up to 100 Gbps |
| g7.8xlarge | 1 | 32 GB | 32 | 128 GiB | 1 x 950 GB | Up to 100 Gbps |
| g7.12xlarge | 2 | 64 GB | 48 | 192 GiB | 1 x 1900 GB | 175 Gbps |
| g7.24xlarge | 4 | 128 GB | 96 | 384 GiB | 1 x 3800 GB | 350 Gbps |
| g7.48xlarge | 8 | 256 GB | 192 | 768 GiB | 2 x 3800 GB | 700 Gbps |
| g7.metal (coming soon) | 8 | 256 GB | 192 | 768 GiB | 2 x 3800 GB | 700 Gbps |
G7 is available in US East (Ohio) and US West (Oregon). Other regions are not listed as of the GA launch.
G7 vs G6: What the 4.6x Inference Uplift Means in Practice
G6 uses the NVIDIA L4 Ada Lovelace GPU: 24GB GDDR6, about 300 GB/s memory bandwidth, FP8 Tensor Cores, no FP4. G7 switches to the RTX PRO 4500 Blackwell Server Edition: 32GB GDDR7, roughly 800 GB/s memory bandwidth per NVIDIA specs (AWS states 2.45x more than L4, which implies ~735 GB/s from a 300 GB/s baseline), and Blackwell 5th-gen Tensor Cores with FP4 and FP8 support.
The 4.6x inference claim applies to FP4-quantized workloads on Blackwell, which represents the best case. For FP16 or FP8 serving, the performance difference is smaller, driven by the bandwidth uplift and architectural improvements rather than FP4 Tensor Core TOPS. Because LLM decoding is largely memory-bound, a realistic expectation for FP8 LLM inference on G7 vs G6 is roughly 2 to 2.5x higher throughput per GPU, in line with the ~2.45x bandwidth ratio AWS cites. The full 4.6x requires a model that benefits from FP4 quantization and a framework that supports Blackwell FP4 kernels.
For Llama 3.1 8B as a concrete example: on a G6 L4 instance, FP8 serving with vLLM runs around 250-400 tokens per second in aggregate across concurrent requests. For G7 with the RTX PRO 4500 at FP8, estimates of 800-1,200 tokens per second come from RTX 5090 benchmarks (same Blackwell generation and GDDR7 memory type), but the RTX 5090 has roughly twice the memory bandwidth (about 1.8 TB/s), so treat that range as an upper bound for G7. With FP4, throughput on Blackwell pushes higher, though batch-mode scaling depends on KV cache and concurrency patterns.
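One way to sanity-check throughput figures like these is a bandwidth roofline. In single-stream decoding, each generated token reads the full set of weights from GPU memory, so tokens per second cannot exceed memory bandwidth divided by weight size. The Python sketch below applies that bound with the bandwidth figures quoted above (about 300 GB/s for the L4 and about 800 GB/s for the RTX PRO 4500). The function names are illustrative, and the bound ignores KV-cache reads and kernel overhead, so measured single-stream speeds land below it.

```python
# Bandwidth roofline for single-stream (batch size 1) LLM decoding:
# every generated token streams all weights from GPU memory once, so
#   tokens/s <= memory bandwidth (GB/s) / weight size (GB).
# KV-cache reads and kernel overhead are ignored; real speeds are lower.


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def max_single_stream_tok_s(bandwidth_gb_s: float, params_billion: float,
                            bytes_per_param: float) -> float:
    """Upper bound on tokens/s for one request decoding on its own."""
    return bandwidth_gb_s / weight_gb(params_billion, bytes_per_param)


# Approximate memory bandwidth figures quoted in this article.
BANDWIDTH_GB_S = {"L4 (G6)": 300.0, "RTX PRO 4500 (G7)": 800.0}

for gpu, bw in BANDWIDTH_GB_S.items():
    fp8 = max_single_stream_tok_s(bw, 8, 1.0)   # Llama 3.1 8B at FP8
    fp16 = max_single_stream_tok_s(bw, 8, 2.0)  # Llama 3.1 8B at FP16
    print(f"{gpu}: <= {fp8:.0f} tok/s at FP8, <= {fp16:.0f} tok/s at FP16")
```

With these inputs the ceiling for Llama 3.1 8B at FP8 is about 38 tok/s on the L4 and about 100 tok/s on the RTX PRO 4500. Figures in the hundreds or thousands of tokens per second therefore describe aggregate throughput across batched, concurrent requests, and the ratio between the two ceilings (about 2.7x) approximates the uplift to expect when decoding is memory-bound.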
The 4.6x number does not mean G7 is always the right upgrade from G6. For workloads using small 8B models with minimal VRAM pressure, a G6 instance can serve the same load at a lower per-hour rate. G7 pays off when:
- Your model is large enough to benefit from 32GB vs 24GB VRAM (14B FP16, 30B AWQ, any model that OOM'd on L4)
- You are running FP4-compatible inference and your serving framework supports Blackwell FP4 kernels
- Memory bandwidth is your primary bottleneck, not Tensor Core TOPS
For Blackwell architecture analysis at a higher tier, the RTX 5090 vs RTX PRO 6000 Blackwell comparison covers GB202 die configuration, SM counts, and FP4 Tensor Core behavior in detail.
AWS G7 On-Demand Pricing: All Sizes
G7 on-demand pricing in US East (Ohio) as of June 2026. Two of the single-GPU sizes, g7.2xlarge and g7.4xlarge, come in below the roughly $3.57 per-GPU rate of the multi-GPU sizes, while g7.8xlarge costs more per GPU because it pairs one GPU with 32 vCPUs and 128 GiB of host RAM. The differences appear to track how much host CPU, memory, and networking is bundled with each GPU.
| Instance | GPUs | VRAM | On-Demand $/hr | Per-GPU $/hr |
|---|---|---|---|---|
| g7.2xlarge | 1 | 32 GB | $2.52 | $2.52 |
| g7.4xlarge | 1 | 32 GB | $3.04 | $3.04 |
| g7.8xlarge | 1 | 32 GB | $4.09 | $4.09 |
| g7.12xlarge | 2 | 64 GB | $7.13 | $3.57 |
| g7.24xlarge | 4 | 128 GB | $14.26 | $3.57 |
| g7.48xlarge | 8 | 256 GB | $28.51 | $3.56 |
The g7.2xlarge has the lowest per-GPU rate at $2.52/hr. If you need exactly one GPU and can tolerate 8 vCPUs and 32 GiB of host RAM, it is the cheapest entry point. The 2-GPU g7.12xlarge and 4-GPU g7.24xlarge come to $3.57/GPU on-demand ($7.13 / 2 = $3.565, rounds to $3.57; $14.26 / 4 = $3.565, rounds to $3.57). The 8-GPU g7.48xlarge lands slightly lower at $3.56/GPU ($28.51 / 8 = $3.56375, rounds to $3.56). AWS bundles more vCPUs, host memory, and networking into the larger sizes.
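The per-GPU figures above are simple division plus half-up rounding to the cent. A minimal Python sketch, assuming the 29 Jun 2026 list prices from the table, uses the standard `decimal` module so that values like $3.565 round the way the text describes; `per_gpu_rate` is an illustrative helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

# On-demand list prices, US East (Ohio), from the table above: (total $/hr, GPUs)
ON_DEMAND = {
    "g7.2xlarge": (Decimal("2.52"), 1),
    "g7.4xlarge": (Decimal("3.04"), 1),
    "g7.8xlarge": (Decimal("4.09"), 1),
    "g7.12xlarge": (Decimal("7.13"), 2),
    "g7.24xlarge": (Decimal("14.26"), 4),
    "g7.48xlarge": (Decimal("28.51"), 8),
}

CENT = Decimal("0.01")


def per_gpu_rate(hourly: Decimal, gpus: int) -> Decimal:
    """Per-GPU hourly rate, rounded half-up to the cent.

    Decimal avoids binary floating point: with floats, 7.13 / 2 is stored
    slightly below 3.565, so round(..., 2) can come out as 3.56.
    """
    return (hourly / gpus).quantize(CENT, rounding=ROUND_HALF_UP)


for name, (hourly, gpus) in ON_DEMAND.items():
    print(f"{name}: ${hourly}/hr total, ${per_gpu_rate(hourly, gpus)}/GPU")
```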
One practical note: the g7.8xlarge at $4.09/hr for a single GPU gives you 32 vCPUs and 128 GiB of host RAM. For inference pipelines that run heavy preprocessing on CPU before GPU inference, the extra host resources may justify the premium over g7.2xlarge.
AWS list prices change without notice. The prices above reflect list rates as of 29 Jun 2026 and may have changed. Check the AWS EC2 pricing page for current rates.
G7 Spot and Savings Plan Math
G7 spot discounts run significantly deeper than you typically see with P-family training instances. Based on the estimates below, G7 spot runs roughly 79-83% below on-demand, compared to roughly 40-50% off for P5 H100 spot when capacity shows up at all. The gap likely reflects G7's inference-grade positioning: demand for G7 is more elastic than for scarce H100 training capacity, which tends to leave more spare capacity for the spot market.
| Instance | On-Demand $/hr | Spot $/hr (est.) | Spot Savings | Per-GPU Spot (est.) |
|---|---|---|---|---|
| g7.2xlarge | $2.52 | $0.46 | ~82% | $0.46 |
| g7.4xlarge | $3.04 | $0.57 | ~81% | $0.57 |
| g7.8xlarge | $4.09 | $0.85 | ~79% | $0.85 |
| g7.12xlarge | $7.13 | $1.33 | ~81% | $0.67 |
| g7.24xlarge | $14.26 | $2.38 | ~83% | $0.60 |
| g7.48xlarge | $28.51 | $4.80 | ~83% | $0.60 |
Spot prices float with capacity and are not published as fixed list prices by AWS. The spot figures above are representative estimates based on observed rates; actual spot pricing at the time of your launch may differ.
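The Spot Savings column is 1 minus the spot-to-on-demand ratio. Recomputing it from the estimated spot rates in the table is a quick way to see the actual range; the sketch below is illustrative Python, and its spot inputs are the estimates above, not published AWS prices.

```python
# (on-demand $/hr, estimated spot $/hr) from the table above
RATES = {
    "g7.2xlarge": (2.52, 0.46),
    "g7.4xlarge": (3.04, 0.57),
    "g7.8xlarge": (4.09, 0.85),
    "g7.12xlarge": (7.13, 1.33),
    "g7.24xlarge": (14.26, 2.38),
    "g7.48xlarge": (28.51, 4.80),
}


def spot_discount_pct(on_demand: float, spot: float) -> float:
    """Percent saved by running on spot instead of on-demand."""
    return (1 - spot / on_demand) * 100


discounts = {
    name: spot_discount_pct(od, spot) for name, (od, spot) in RATES.items()
}
for name, pct in discounts.items():
    print(f"{name}: ~{pct:.0f}% off on-demand")
print(f"range: {min(discounts.values()):.0f}-{max(discounts.values()):.0f}%")
```

With these estimates, g7.8xlarge comes out near 79% and the largest sizes near 83%, so the spread across sizes is roughly 79-83%.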
G7 spot with interruption handling is viable for batch inference jobs, offline embedding pipelines, and scheduled workloads that can checkpoint to S3. It is not suitable for real-time inference APIs serving live traffic, where a 2-minute reclaim notice would cause downtime.
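For batch jobs on spot, the usual pattern is to watch for the two-minute interruption notice and checkpoint when it appears. EC2 exposes the notice through the instance metadata service at `spot/instance-action`, which returns 404 until an interruption is scheduled and then a small JSON document with `action` and `time`. The sketch below polls that path with IMDSv2 using only the Python standard library; `watch` and the `checkpoint` callback are hypothetical names, and what the callback does (for example, uploading partial results to S3) is up to your pipeline.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def imds_token(ttl_seconds: int = 300) -> str:
    """Fetch a short-lived IMDSv2 session token."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def parse_instance_action(body: str) -> Optional[dict]:
    """Return the notice if it schedules a stop, hibernate, or terminate."""
    notice = json.loads(body)
    if notice.get("action") in ("terminate", "stop", "hibernate"):
        return notice
    return None


def pending_interruption() -> Optional[dict]:
    """Query spot/instance-action; IMDS answers 404 when nothing is scheduled."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def watch(checkpoint: Callable[[dict], None], poll_seconds: float = 5.0) -> None:
    """Poll until an interruption is scheduled, then hand it to checkpoint()."""
    while True:
        notice = pending_interruption()
        if notice is not None:
            checkpoint(notice)
            return
        time.sleep(poll_seconds)
```

Running `watch(...)` in a background thread next to the batch job and polling every few seconds leaves most of the two-minute window for the checkpoint itself.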
Savings Plans: AWS Compute Savings Plans apply to G7 instances. Traditional Reserved Instances are not yet available for G7 as of the GA launch date. Compute Savings Plans typically offer 20-30% off on-demand for GPU instance families on a 1-year no-upfront commitment. Exact G7 rates are published at the AWS Compute Savings Plans pricing page and should be confirmed before committing, since rates vary by instance family and commitment term.
Monthly cost examples (720 hours at full utilization):
| Config | Billing Mode | Per-GPU $/hr | Monthly per GPU | 8-GPU Node Monthly |
|---|---|---|---|---|
| g7.2xlarge | On-demand | $2.52 | $1,814 | N/A (1-GPU instance) |
| g7.48xlarge | On-demand | $3.56 | $2,566 | $20,527 |
| g7.48xlarge | Spot | $0.60 | $432 | $3,456 |
| g7.48xlarge | Est. 1-yr Savings Plan | ~$2.50-2.85 | ~$1,800-2,052 | ~$14,400-16,416 |
Monthly per-GPU and node totals are derived from the full instance hourly price rather than the rounded per-GPU rate. For g7.48xlarge on-demand: $28.51 × 720 = $20,527.20 ≈ $20,527 node total, $20,527 / 8 = $2,566 per GPU. Multiplying the displayed per-GPU rate ($3.56) directly by 720 gives $2,563, a small difference from the rounding.
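The derivation in the paragraph above, starting from the full instance price and dividing by GPU count, can be reproduced in a few lines. A minimal Python sketch, assuming 720 hours of full utilization and the on-demand and estimated spot rates used in this section; `monthly` is an illustrative helper.

```python
from typing import Tuple

HOURS_PER_MONTH = 720  # full utilization, as in the table above


def monthly(hourly: float, gpus: int = 1) -> Tuple[float, float]:
    """Return (node total, per-GPU) monthly cost from the full instance price."""
    node = hourly * HOURS_PER_MONTH
    return node, node / gpus


CONFIGS = [
    ("g7.2xlarge on-demand", 2.52, 1),
    ("g7.48xlarge on-demand", 28.51, 8),
    ("g7.48xlarge spot (est.)", 4.80, 8),
]

for label, hourly, gpus in CONFIGS:
    node, per_gpu = monthly(hourly, gpus)
    print(f"{label}: ${node:,.0f}/month per node, ${per_gpu:,.0f}/month per GPU")

# Rounding to a per-GPU rate first, then multiplying, drifts by a few dollars.
print(f"$3.56 x {HOURS_PER_MONTH} h = ${3.56 * HOURS_PER_MONTH:,.2f}")
```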
The spot-to-on-demand ratio on G7 is notably attractive compared to P5. If your workload tolerates interruption, g7.48xlarge spot at $4.80/hr for 8 GPUs represents genuinely cheap Blackwell-generation inference capacity.
What Workloads the 32GB RTX PRO 4500 Fits
32GB GDDR7 covers a wide range of LLM inference and multimodal workloads. Here is what fits cleanly and what does not.
Fits on a single G7 GPU:
- LLM inference (mid-size): Llama 3.1 8B at FP16 (~16GB), 14B at FP8 (~14GB), 30B at AWQ 4-bit (~18-22GB), Mistral 22B at FP8 (~22GB), Qwen2.5 14B at FP16 (~28GB)
- ASR: Whisper large-v3 at ~3GB; multiple parallel ASR workers per GPU
- Vision models: CLIP ViT-L/14 (~1.7GB), Florence-2-large (~5GB), LLaVA-1.5 7B (~16GB), multiple vision transformers fit simultaneously
- Image generation: SDXL (~10GB), Flux.1 Schnell and Flux.1 Dev (~25GB each, fitting with some headroom)
- Embeddings and recommenders: GTE-large (~1.6GB), E5-mistral (~14GB), user/item towers in RecSys pipelines
Does not fit on a single G7 GPU:
- Llama 3.1 70B at any usable precision (~70GB of weights at FP8 and still ~35GB at 4-bit, both over 32GB VRAM)
- Any ~30B model at FP16 (~60GB)
- Qwen2.5 32B at FP16 (~64GB)
- Any 70B+ model at FP8 without tensor parallelism across multiple GPUs
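The sizes in both lists follow a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter (2 for FP16, 1 for FP8, about 0.5 for 4-bit formats such as AWQ, which add a little extra for quantization scales). The Python sketch below applies it against a 32GB G7 GPU. The 2GB headroom for KV cache and runtime overhead is an illustrative assumption; real deployments usually need more for long contexts or high concurrency.

```python
# Rough weight-only footprint: parameters x bytes per parameter.
# KV cache, activations, and runtime overhead come on top, so a model
# whose weights nearly fill 32 GB leaves little room for context.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}
G7_VRAM_GB = 32


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB for a given precision."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_on_one_g7(params_billion: float, precision: str,
                   headroom_gb: float = 2.0) -> bool:
    """True if weights plus an assumed minimal headroom fit in 32 GB."""
    return weights_gb(params_billion, precision) + headroom_gb <= G7_VRAM_GB


for params, precision in [(8, "fp16"), (14, "fp16"), (30, "int4"),
                          (32, "fp16"), (70, "fp8"), (70, "int4")]:
    gb = weights_gb(params, precision)
    verdict = "fits" if fits_on_one_g7(params, precision) else "does not fit"
    print(f"{params}B @ {precision}: ~{gb:.0f} GB weights -> {verdict}")
```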
Multi-GPU caveat: G7 instances support GPU direct P2P and EFA-backed RDMA, but without NVLink, tensor parallelism is constrained by PCIe Gen 5 bandwidth. For 70B training or high-concurrency 70B serving that requires tight GPU coupling, P5 H100 with NVLink is the right instance family. G7 is not designed to replace P5 for large-model distributed training.
Hidden AWS G7 Costs
The on-demand hourly rate is the starting point, not the total cost. Four cost categories reliably inflate G7 bills beyond the EC2 line item.
EBS root volume
G7 instances include local NVMe SSD storage (600GB to 7.6TB depending on size), which covers model weights and scratch data. That instance storage is ephemeral: its contents are lost when the instance stops or a spot instance is reclaimed, so weights have to be re-pulled after each restart. The OS root volume still runs on EBS. A 100 GB gp3 root volume costs $8/month at $0.08/GB/month. If you maintain custom AMIs or multiple checkpoint snapshots, storage costs add up. Most G7 inference setups get away with a modest root volume since model weights go on local NVMe, so EBS costs are lower than with P5 training instances, but they are not zero.
Data egress
AWS charges $0.09/GB for data leaving to the internet (first 10 TB/month). For inference applications that serve predictions externally, or for teams downloading model outputs, validation sets, or evaluation results, egress builds up. A model producing 10GB of output data per day runs $27/month in egress alone (10GB × 30 days × $0.09). Teams running distributed inference pipelines across AZs also pay for cross-AZ transfers: $0.01/GB charged on each side, or effectively $0.02 for every GB moved between AZs.
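A minimal Python sketch of the egress arithmetic, assuming the $0.09/GB internet rate quoted above, an effective $0.02 per GB moved between AZs, and a 30-day month. The helper names are illustrative, and the free allowance and higher-volume discount tiers are ignored.

```python
INTERNET_EGRESS_PER_GB = 0.09  # first 10 TB/month tier, as quoted above
CROSS_AZ_PER_GB = 0.02         # effective: $0.01 out + $0.01 in per GB
DAYS_PER_MONTH = 30


def monthly_internet_egress(gb_per_day: float) -> float:
    """Monthly cost of sending gb_per_day from EC2 to the internet."""
    return gb_per_day * DAYS_PER_MONTH * INTERNET_EGRESS_PER_GB


def monthly_cross_az(gb_per_day: float) -> float:
    """Monthly cost of moving gb_per_day between Availability Zones."""
    return gb_per_day * DAYS_PER_MONTH * CROSS_AZ_PER_GB


for gb in (10, 100):
    print(f"{gb} GB/day: internet ${monthly_internet_egress(gb):,.2f}/month, "
          f"cross-AZ ${monthly_cross_az(gb):,.2f}/month")
```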
Cross-AZ and cross-region data movement
G7 is only available in Ohio and Oregon as of the GA launch. If your dataset is in S3 in us-east-1 or your API frontend is in a different region, inter-region transfer charges apply. At typical US inter-region rates of $0.01-0.02/GB, pulling a 50GB dataset before inference costs roughly $0.50-$1.00 per pull. At scale with frequent re-fetches this adds up.
Support tier overhead
AWS Developer support costs the greater of $29/month or 3% of monthly AWS spend. Business support is the greater of $100/month or a tiered percentage of spend: 10% of the first $10K, 7% from $10K to $80K, 5% from $80K to $250K, and 3% above that. For teams that need fast response on quota increases or instance issues, Business support is effectively mandatory, and at GPU-scale spend the percentage quickly outgrows the $100 minimum. Budget this as operational overhead even if it does not appear on the GPU line item.
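To see how a percentage-based support fee scales with GPU spend, the Python sketch below applies a tiered Business support schedule to one g7.48xlarge running on-demand for a month. The schedule is an assumption based on AWS's long-published pricing (10% of the first $10K, 7% to $80K, 5% to $250K, 3% above, with a $100 minimum) and should be checked against the current AWS Support pricing page before budgeting; `business_support_fee` is an illustrative name.

```python
# Assumed Business support schedule: (upper bound of tier in $, rate).
TIERS = [(10_000, 0.10), (80_000, 0.07), (250_000, 0.05), (float("inf"), 0.03)]
MINIMUM_FEE = 100.0


def business_support_fee(monthly_spend: float) -> float:
    """Marginal-tier fee: each slice of spend is charged at its tier's rate."""
    fee, lower = 0.0, 0.0
    for upper, rate in TIERS:
        if monthly_spend <= lower:
            break
        fee += (min(monthly_spend, upper) - lower) * rate
        lower = upper
    return max(fee, MINIMUM_FEE)


# One g7.48xlarge on-demand for a full month: $28.51 x 720 hours.
spend = 28.51 * 720
fee = business_support_fee(spend)
print(f"EC2 ${spend:,.2f} -> support ${fee:,.2f} ({fee / spend:.1%} on top)")
```

Under those assumptions, a single node's $20,527.20 of EC2 spend adds about $1,737/month in support, roughly 8.5% on top of the instance cost.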
For a complete breakdown of hyperscaler hidden costs across all instance types, see avoiding unexpected AWS costs.
Side-by-Side Cost Comparison: AWS G7 vs Spheron
The RTX PRO 4500 Blackwell Server Edition is AWS-exclusive in the EC2 catalog as of June 2026. The closest Spheron equivalent by VRAM and generation is the RTX 5090 (32GB GDDR7, Blackwell). For larger VRAM requirements, Spheron also offers the L40S (48GB GDDR6, Ada Lovelace, data center drivers, ECC memory) and the RTX PRO 6000 (96GB GDDR7, Blackwell). None of these is the same GPU as the RTX PRO 4500: the RTX 5090 matches it on VRAM and generation, while the L40S and RTX PRO 6000 cover larger memory footprints.
| Provider | GPU | VRAM | Architecture | On-Demand $/hr | Spot $/hr | Egress |
|---|---|---|---|---|---|---|
| AWS G7 (g7.2xlarge) | RTX PRO 4500 | 32 GB GDDR7 | Blackwell | $2.52 | $0.46 | $0.09/GB |
| AWS G7 (g7.48xlarge/GPU) | RTX PRO 4500 | 32 GB GDDR7 | Blackwell | $3.56 | $0.60 | $0.09/GB |
| Spheron RTX 5090 | RTX 5090 | 32 GB GDDR7 | Blackwell | from $0.92* | varies | None |
| Spheron L40S | L40S | 48 GB GDDR6 | Ada Lovelace | from $1.90* | varies | None |
| Spheron RTX PRO 6000 | RTX PRO 6000 | 96 GB GDDR7 | Blackwell | from $2.40* | from $1.32 | None |
| Spheron H100 PCIe | H100 PCIe | 80 GB HBM2e | Hopper | $2.01 | N/A | None |
*Spheron prices sourced from live API as of 29 Jun 2026. Current rates: check /pricing/ since catalog and availability change.
Cost-per-million-tokens comparison (Llama 3.1 8B FP16, ~1,000 tok/s aggregate throughput estimate under concurrent load for a Blackwell-class 32GB GPU):
| Provider | GPU | $/hr | Estimated tok/s | $/1M tokens |
|---|---|---|---|---|
| AWS G7 g7.2xlarge | RTX PRO 4500 | $2.52 | ~1,000 | ~$0.70 |
| AWS G7 g7.8xlarge | RTX PRO 4500 | $4.09 | ~1,000 | ~$1.14 |
| Spheron RTX 5090 | RTX 5090 | $0.92 | ~1,000 | ~$0.26 |
| Spheron L40S | L40S | $1.90 | ~700 | ~$0.75 |
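Cost per 1M tokens = ($/hr) / (tokens_per_sec × 3600) × 1,000,000. Throughput is a representative aggregate estimate for 8B FP16 under concurrent load; single-stream decode is far slower because it is bounded by memory bandwidth, and higher concurrency raises throughput and lowers cost-per-token further. The table applies the same ~1,000 tok/s to the RTX PRO 4500 and the RTX 5090 for simplicity, although the RTX 5090's roughly 2x higher memory bandwidth would, if anything, favor it. G7 g7.8xlarge is $4.09/hr for the same single GPU; its extra vCPUs and host RAM are idle unless your preprocessing pipeline uses them.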
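The formula above translates directly into code. A minimal Python sketch using the hourly rates and throughput estimates from the table (the throughput values are the article's estimates, not measurements). It also computes the throughput a g7.2xlarge would need to match the RTX 5090's cost per token, which is a simple ratio of the hourly prices.

```python
def cost_per_million_tokens(dollars_per_hour: float,
                            tokens_per_sec: float) -> float:
    """$/1M tokens = ($/hr) / (tokens/s x 3600) x 1,000,000."""
    return dollars_per_hour / (tokens_per_sec * 3600) * 1_000_000


# (label, $/hr, estimated aggregate tokens/s) from the table above
ROWS = [
    ("AWS G7 g7.2xlarge", 2.52, 1000),
    ("AWS G7 g7.8xlarge", 4.09, 1000),
    ("Spheron RTX 5090", 0.92, 1000),
    ("Spheron L40S", 1.90, 700),
]

for label, rate, tps in ROWS:
    print(f"{label}: ${cost_per_million_tokens(rate, tps):.2f} per 1M tokens")

# Cost parity requires tps_g7 / tps_5090 = rate_g7 / rate_5090.
breakeven_tps = 1000 * 2.52 / 0.92
print(f"g7.2xlarge break-even: ~{breakeven_tps:,.0f} tok/s")
```

The break-even works out to roughly 2,740 tok/s, about 2.7x the shared estimate, which is why the per-token gap tracks the hourly price gap so closely.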
Egress effects at scale: if your inference pipeline downloads 10GB of results or model checkpoints daily, AWS adds $0.90/day ($27/month) to the bill. Spheron has no egress fee. At 100GB/day download volume, AWS egress adds $9/day ($270/month) on top of the GPU cost.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.
For 32GB Blackwell inference, check RTX 5090 availability on Spheron with per-minute billing. For workloads that need more than 32GB, RTX PRO 6000 GPU pricing on Spheron covers the 96GB Blackwell option.
When G7 Makes Sense vs When a Neocloud Is Cheaper
G7 makes sense when:
Your inference pipeline lives inside AWS. If your application calls SageMaker endpoints, stores data in S3, processes via Lambda, or feeds into Bedrock, extracting the GPU to a neocloud means crossing a network boundary for every request. The egress fees and latency from that cross-cloud path can offset or exceed the per-GPU savings, depending on request volume. For tightly AWS-integrated architectures, the convenience of staying inside the VPC can be worth more than the per-hour difference.
Compliance or data sovereignty requirements mandate AWS-certified infrastructure. Some industries require FedRAMP, HIPAA on AWS, or PCI-DSS on a specific hyperscaler. Neoclouds have data center partners globally with various certifications, but if your compliance team specifically requires workloads to run inside AWS, G7 may be the only option regardless of cost.
You have an Enterprise Discount Program (EDP) agreement that brings G7's effective cost below list price. Large AWS customers negotiate EDP discounts that apply to new instance families. If your account already has an EDP in place, G7 may cost less than the list prices in this post suggest.
You need multi-instance inference with EFA networking at scale across G7 nodes. G7 supports GPUDirect RDMA over EFA, which lets GPUs on different instances exchange data without staging it through host CPU memory. For large-batch distributed inference that must span multiple G7 instances with high-bandwidth GPU-to-GPU communication, the EFA networking is built in.
Neocloud (Spheron) is cheaper when:
Your inference serving has no AWS-native service dependencies. A standalone vLLM server, a Triton Inference Server endpoint, or a custom FastAPI inference API does not need AWS services. In that case, the GPU is the only requirement, and Spheron's RTX 5090 starting at $0.92/hr versus G7 at $2.52/hr for the same 32GB Blackwell tier makes the choice straightforward.
You need 48GB or 96GB VRAM per GPU. G7 tops out at 32GB per GPU. Spheron's L40S gives 48GB at $1.90/hr on-demand, and the RTX PRO 6000 gives 96GB at $2.40/hr (Jun 2026 reference). Neither of these is available in G7. If your models need more than 32GB per GPU, you need a different instance or a different cloud entirely.
You need short-burst capacity without commitment. Spheron bills per-minute with no minimum (see per-minute billing docs). EC2 is close on granularity: on-demand and spot Linux instances bill per second after a 60-second minimum, so a short G7 run is not rounded up to a full hour. For a 20-minute inference experiment or an overnight fine-tuning run on a small dataset, the real differences are the hourly rate itself and access to capacity: AWS does not guarantee spot availability, and on-demand G7 in a new account can first require a vCPU service quota increase.
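A minimal Python sketch of how billing granularity plays out for a short run, assuming a Linux AMI on EC2 (per-second billing after a 60-second minimum) and per-minute billing with no minimum on the Spheron side. `billed_cost` is an illustrative helper, and the 20.5-minute runtime is an arbitrary example.

```python
import math


def billed_cost(rate_per_hour: float, runtime_seconds: int,
                increment_seconds: int, minimum_seconds: int) -> float:
    """Cost when runtime is raised to the minimum, then rounded up to the increment."""
    billable = max(runtime_seconds, minimum_seconds)
    billable = math.ceil(billable / increment_seconds) * increment_seconds
    return rate_per_hour * billable / 3600


run = 20 * 60 + 30  # a 20.5-minute experiment, in seconds
# EC2 Linux on-demand: per-second billing with a 60-second minimum.
g7 = billed_cost(2.52, run, increment_seconds=1, minimum_seconds=60)
# Per-minute billing with no minimum, as described for Spheron above.
rtx5090 = billed_cost(0.92, run, increment_seconds=60, minimum_seconds=0)
print(f"g7.2xlarge: ${g7:.2f}, RTX 5090: ${rtx5090:.2f}")
```

For that run the sketch gives about $0.86 on g7.2xlarge and about $0.32 on the RTX 5090. Rounding adds at most a fraction of a minute on either side, so nearly all of the difference comes from the hourly rate.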
You want to test across GPU types before committing to a single SKU. Spheron's catalog spans RTX 5090, L40S, H100 PCIe, H100 SXM5, B200, RTX PRO 6000, and others under one account. Running the same inference benchmark across GPU types to pick the best cost-per-token for your specific model takes minutes, with no service quota requests or approval delays.
For the H100-tier equivalent analysis, see the AWS H100 P5 pricing guide. For Blackwell SXM6 cloud pricing at the B200 tier, see NVIDIA B200 cloud pricing 2026.
FAQ
What GPU does the AWS EC2 G7 instance use?
The AWS EC2 G7 instance family uses the NVIDIA RTX PRO 4500 Blackwell Server Edition with 32GB GDDR7 memory per GPU. It is available in six GA sizes from g7.2xlarge (1 GPU) to g7.48xlarge (8 GPUs, 256GB total GPU memory). A bare-metal variant is listed as coming soon. G7 is the Blackwell-generation upgrade to G6, which used the NVIDIA L4 Ada GPU with 24GB GDDR6.
How much does a G7 instance cost per hour on AWS?
As of June 2026, the g7.2xlarge (1x RTX PRO 4500) starts at $2.52/hr on-demand in US East (Ohio). The g7.48xlarge (8x RTX PRO 4500) costs $28.51/hr on-demand, or about $3.56 per GPU per hour. Estimated spot rates run roughly 79-83% below on-demand: about $0.46/hr for g7.2xlarge and $4.80/hr total for g7.48xlarge. Compute Savings Plans are available and typically offer 20-30% off on-demand for a 1-year no-upfront commitment; confirm exact G7 rates at the AWS Savings Plans pricing page.
How much cheaper is Spheron compared to AWS G7 for inference workloads?
For 32GB Blackwell inference, Spheron's RTX 5090 on-demand rate (from $0.92/hr as of Jun 2026) is roughly 63% cheaper than AWS G7 on-demand at g7.2xlarge ($2.52/hr). On cost-per-million-tokens for Llama 3.1 8B, assuming the same ~1,000 tok/s estimate on both GPUs, that translates to approximately $0.26 on Spheron versus $0.70 on AWS G7. Spheron also charges no data egress fees, which further widens the real total cost gap for any workload that regularly downloads inference results or model checkpoints.
What is the difference between G6 and G7 instances?
G6 uses the NVIDIA L4 Ada Lovelace GPU (24GB GDDR6, FP8 support, ~300 GB/s memory bandwidth). G7 upgrades to the NVIDIA RTX PRO 4500 Blackwell Server Edition (32GB GDDR7, FP4 plus FP8 support, ~800 GB/s memory bandwidth per NVIDIA specs). AWS reports a 4.6x inference throughput uplift on G7 versus G6. That figure applies to FP4-quantized workloads where Blackwell's 5th-generation Tensor Cores run at maximum efficiency. FP16 or FP8 inference sees a smaller improvement, driven primarily by the GDDR7 bandwidth and higher VRAM headroom.
Does Spheron offer the RTX PRO 4500 Blackwell?
Spheron does not currently stock the RTX PRO 4500 Blackwell Server Edition. The closest alternatives are the RTX 5090 (32GB GDDR7, Blackwell GB202-class, same VRAM tier as the RTX PRO 4500 Server Edition) and the L40S (48GB GDDR6, Ada Lovelace, inference-optimized with ECC memory and data center certified drivers). The RTX PRO 6000 (96GB GDDR7, Blackwell) is the higher-VRAM option for models that exceed 32GB. None of these are the RTX PRO 4500 specifically, but the RTX 5090 matches it on VRAM and Blackwell-generation Tensor Core capabilities.
AWS G7 costs adding up? Spheron offers 32GB Blackwell inference capacity on RTX 5090, with no egress fees, no Savings Plan lock-in, and per-minute billing. The RTX PRO 6000 at 96GB is available if your models outgrow 32GB.
Quick Setup Guide
Visit spheron.network/pricing/ to see current on-demand and spot rates for RTX 5090, L40S, RTX PRO 6000, and all other Spheron GPU models. Prices update in real time from the Spheron GPU marketplace.
Sign in at app.spheron.ai, select the GPU that fits your VRAM requirement (RTX 5090 for 32GB workloads, L40S for 48GB, RTX PRO 6000 for up to 96GB), choose on-demand or spot billing, and deploy. SSH access is available in under 2 minutes with per-minute billing and no minimum commitment.
