
Dedicated vs Shared GPU Memory: Why VRAM Matters for AI Workloads

Written by Spheron · Jan 18, 2026
Tags: GPU Cloud, GPU Memory, VRAM, AI Infrastructure, Bare Metal, Cloud Infrastructure, LLM Inference

GPU memory architecture is one of the most misunderstood factors in AI system design. Teams routinely deploy models on GPUs advertised with a certain memory capacity, only to discover that a significant portion of that memory is shared with the CPU, borrowed from system RAM over a slow bus. The result is unpredictable performance, latency spikes under load, and production outages that trace back to a simple architectural mismatch.

The difference is stark: dedicated VRAM on modern data center GPUs delivers 2,000–4,800 GB/s of memory bandwidth, while shared system RAM tops out at 50–100 GB/s. That's a 20–100x gap. For LLM inference at batch size 1, where performance is almost entirely memory-bandwidth-bound, this gap translates directly into tokens-per-second throughput.

This guide explains what dedicated and shared GPU memory actually are, how they work at the hardware level, why the distinction matters for AI workloads, and how to make the right choice for your deployment.

What Is Dedicated GPU Memory?

Dedicated GPU memory, commonly called VRAM (Video Random Access Memory), is memory physically mounted on the GPU board itself. It connects to the GPU's compute units through a wide, high-bandwidth memory bus designed specifically for the parallel access patterns that GPUs require.

There are two main types of dedicated VRAM used in modern GPUs:

GDDR (Graphics Double Data Rate)

GDDR6 and GDDR6X are used in consumer and mid-range data center GPUs like the RTX 4090, L40S, and L4. GDDR memory chips sit alongside the GPU die on the PCB, connected through a 256-bit or 384-bit memory bus. The RTX 4090's GDDR6X delivers approximately 1,008 GB/s of bandwidth, while the L40S achieves 864 GB/s.

GDDR is cost-effective and well-suited for inference workloads on models that fit within the available capacity (typically 24–48 GB). Its main limitation is capacity; GDDR packages are physically larger than HBM stacks, which limits how much memory can be placed on a single GPU board.

HBM (High Bandwidth Memory)

HBM2e, HBM3, and HBM3e are used in data center GPUs like the A100, H100, and H200. HBM stacks memory dies vertically using through-silicon vias (TSVs), creating an extremely wide memory interface (typically 4096-bit or wider) in a compact footprint. This vertical stacking delivers dramatically higher bandwidth. The A100's HBM2e achieves approximately 2,000 GB/s, the H100's HBM3 reaches 3,350 GB/s, and the H200's HBM3e delivers 4,800 GB/s.
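The bandwidth figures above follow directly from bus width and per-pin data rate. A minimal sketch of that arithmetic, assuming approximate per-pin rates (the RTX 4090's 384-bit GDDR6X at 21 Gbit/s is a published spec; the H100's HBM3 pin rate is back-calculated and approximate):

```python
def mem_bandwidth_gbps(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: (bus width in bits / 8) * per-pin data rate in Gbit/s."""
    return bus_width_bits / 8 * data_rate_gbps

# RTX 4090: 384-bit GDDR6X bus at 21 Gbit/s per pin
print(mem_bandwidth_gbps(384, 21.0))            # 1008.0 GB/s
# H100 SXM: 5120-bit HBM3 interface at ~5.23 Gbit/s per pin (approximate)
print(f"{mem_bandwidth_gbps(5120, 5.23):.0f}")  # 3347 GB/s, close to the 3,350 GB/s spec
```

The contrast in the two calls shows why stacking matters: HBM's wide interface (5120 bits vs 384) delivers more bandwidth even at a much lower per-pin rate.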

HBM's combination of high capacity (80–192 GB per GPU) and extreme bandwidth (2–5 TB/s) makes it the memory technology of choice for large-scale AI training and inference.

What Is Shared GPU Memory?

Shared GPU memory is system RAM (DDR4 or DDR5) that the GPU accesses over the system bus (typically PCIe) when its dedicated VRAM is insufficient. The operating system and GPU driver manage this transparently; when a model or dataset exceeds the GPU's onboard VRAM, pages are automatically moved to or accessed from CPU memory.

This mechanism exists in several forms:

Integrated GPUs (Intel UHD, AMD Radeon integrated, Apple Silicon) have no dedicated VRAM at all. They use a portion of system RAM as their primary memory pool, accessing it through the CPU's memory controller.

CUDA Unified Memory allows NVIDIA GPUs to transparently access system RAM when VRAM is exhausted. The driver handles page migration between GPU and CPU memory, but every access to a page that resides in system RAM triggers a page fault and a PCIe transfer. This is orders of magnitude slower than a local HBM read.

Virtualized and overcommitted cloud instances may present more "GPU memory" than physically exists as dedicated VRAM, using hypervisor-level memory management to borrow system RAM when needed. This is particularly common in multi-tenant GPU cloud environments.

The fundamental problem with shared memory is bandwidth. Even the fastest DDR5 systems deliver roughly 50–100 GB/s per CPU socket, while PCIe Gen5 x16 provides approximately 64 GB/s. Compare this to the 1,000–4,800 GB/s available from dedicated VRAM and the performance implications become clear.
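To make the gap concrete, a quick sketch of how long one full read of a 14 GB model (roughly a 7B model in FP16) takes at each bandwidth tier discussed above:

```python
def read_time_ms(model_gb: float, bandwidth_gbps: float) -> float:
    """Time to stream model_gb gigabytes once at the given bandwidth (GB/s), in ms."""
    return model_gb / bandwidth_gbps * 1000

model_gb = 14  # roughly a 7B model in FP16
for name, bw in [("DDR5 system RAM", 90), ("PCIe Gen5 x16", 64), ("HBM3 (H100)", 3350)]:
    # ~155.6 ms, ~218.8 ms, and ~4.2 ms respectively
    print(f"{name}: {read_time_ms(model_gb, bw):.1f} ms per full weight read")
```

Since batch-size-1 LLM inference reads the full weights once per token, these per-read times are effectively per-token floors: ~4 ms in HBM versus ~200 ms over PCIe.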

Memory Bandwidth Comparison

Memory bandwidth is the critical specification for AI workloads, particularly LLM inference, where the GPU must read the entire model's weights for every token generated. The following table shows the bandwidth gap across memory types commonly encountered in AI systems:

| Memory Type | Example Hardware | Bandwidth | Typical Use |
| --- | --- | --- | --- |
| DDR4 (dual-channel) | Standard server RAM | ~50 GB/s | CPU workloads, shared GPU memory fallback |
| DDR5 (quad-channel) | High-end server RAM | ~90 GB/s | CPU workloads, shared GPU memory fallback |
| PCIe Gen5 x16 | CPU-GPU interconnect | ~64 GB/s | Data transfer between CPU and GPU |
| GDDR6 | L4 (24 GB) | 300 GB/s | Efficient inference on small models |
| GDDR6 with ECC | L40S (48 GB) | 864 GB/s | Mid-range inference and multimodal AI |
| GDDR6X | RTX 4090 (24 GB) | 1,008 GB/s | Development, small-scale inference |
| Apple Unified | M4 Max (128 GB) | 546 GB/s | Local inference on Mac hardware |
| HBM2e | A100 80 GB | 2,039 GB/s | Training and inference (previous gen) |
| HBM3 | H100 80 GB | 3,350 GB/s | Production training and inference |
| HBM3e | H200 141 GB | 4,800 GB/s | Large model inference, long context |
| HBM3e | B200 192 GB | 8,000 GB/s | Next-gen training and inference |

The spread from system DDR5 (~90 GB/s) to the latest HBM3e (8,000 GB/s) is nearly 90x. When a model spills from dedicated VRAM into shared system RAM, it effectively drops from terabytes-per-second to tens of gigabytes-per-second. This is a performance cliff, not a gradual degradation.

How VRAM Spill Destroys LLM Performance

LLM inference at low batch sizes is almost entirely memory-bandwidth-bound. The GPU must read the model's full weight matrix for every token it generates, and the rate at which it can read those weights determines tokens-per-second throughput. When model weights spill from dedicated VRAM into shared system RAM, the impact is dramatic and measurable.

The Performance Cliff

Benchmark data shows a clear pattern: models that fit entirely in GPU VRAM perform at full speed (30–150+ tokens per second depending on model size), while models that exceed VRAM capacity and spill into system RAM experience catastrophic throughput drops, often falling to single-digit tokens per second.

This isn't a gradual degradation. It's a cliff. A 14B model that fits in 24 GB of VRAM might generate 40 tokens/second, while a 30B model that overflows the same GPU's VRAM and spills into system RAM might manage 3–5 tokens/second. The model is technically running, but at 10–15x lower throughput.
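A simple roofline estimate captures the cliff. At batch size 1, every generated token requires one full read of the weights, so memory bandwidth divided by model size gives an upper bound on tokens per second (real throughput lands below this ceiling due to compute and KV-cache reads; the model sizes here assume 8-bit weights for illustration):

```python
def tokens_per_sec_upper_bound(model_gb: float, bandwidth_gbps: float) -> float:
    """Bandwidth-bound ceiling at batch size 1: one full weight read per token."""
    return bandwidth_gbps / model_gb

# 14 GB of weights streaming from RTX 4090 GDDR6X (1,008 GB/s)
print(tokens_per_sec_upper_bound(14, 1008))  # 72.0 tok/s ceiling
# 30 GB of weights where every token must cross PCIe Gen5 (64 GB/s)
print(round(tokens_per_sec_upper_bound(30, 64), 1))  # 2.1 tok/s ceiling
```

Note that the second ceiling is below the observed 3–5 tokens/second cited above only because real spills keep part of the model in fast VRAM; the pure-PCIe case is the worst-case bound.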

Real-World Example: Apple Unified Memory vs Discrete GPU

Apple Silicon's unified memory architecture provides an interesting comparison point. On a Llama 3 70B model at 4-bit quantization (batch size 1), an M4 Max with 64 GB unified memory achieved 28 tokens/second with a time-to-first-token of 420ms. The same model on an RTX 4090 (24 GB VRAM) with 128 GB DDR5 system RAM achieved only 10 tokens/second with a 2.1-second TTFT because the model had to be split across VRAM and system RAM, with the PCIe bus becoming the bottleneck.

The M4 Max has roughly half the raw memory bandwidth of the RTX 4090 (546 GB/s vs 1,008 GB/s), yet delivered nearly 3x the throughput on this specific workload. The reason: unified memory eliminates the PCIe bottleneck entirely. Every byte of the model is accessible at full memory bandwidth, while the RTX 4090 was throttled by the ~64 GB/s PCIe link for the portion of the model stored in system RAM.
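A simplified model of a split deployment makes the PCIe throttling visible. This is a sketch under assumed numbers (a ~40 GB 4-bit 70B model, ~22 GB of usable 4090 VRAM after overhead), not the measured figures above, and it ignores compute time and caching effects:

```python
def split_tokens_per_sec(model_gb: float, vram_gb: float,
                         vram_bw: float, link_bw: float) -> float:
    """Per-token read time = (portion in VRAM)/VRAM bandwidth + (overflow)/link bandwidth."""
    in_vram = min(model_gb, vram_gb)
    spilled = max(0.0, model_gb - vram_gb)
    seconds_per_token = in_vram / vram_bw + spilled / link_bw
    return 1 / seconds_per_token

# RTX 4090: 22 GB usable at 1,008 GB/s, 18 GB spilled over ~64 GB/s PCIe
print(round(split_tokens_per_sec(40, 22, 1008, 64), 1))  # ~3.3 tok/s
# M4 Max: the whole 40 GB model lives in unified memory at 546 GB/s
print(round(split_tokens_per_sec(40, 64, 546, 64), 1))   # ~13.7 tok/s
```

Even in this crude model, the slower-but-uniform memory wins by roughly 4x: the 18 GB crossing PCIe dominates the 4090's per-token time, which is the bottleneck the unified design avoids.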

This illustrates the core principle: it is not about how fast your fastest memory is. It is about how fast the memory is where your model actually lives.

Impact on Training Workloads

Training is even more memory-intensive than inference because the GPU must simultaneously hold model parameters, activations, gradients, and optimizer states. A 70B parameter model in FP16 requires approximately 140 GB for weights alone, plus 2–3x additional memory for optimizer states (Adam stores two additional copies of the weights), gradients, and activation checkpoints.

Memory Budget Breakdown for Training

For a typical training setup on a 70B model:

  • Model weights (FP16): ~140 GB
  • Optimizer states (Adam FP32): ~280 GB
  • Gradients (FP16): ~140 GB
  • Activations (varies with batch size): 10–100+ GB

Total: 570–660+ GB, requiring multiple GPUs even with the largest HBM capacities. On GPUs with 80 GB HBM like the H100, teams use tensor parallelism and pipeline parallelism to distribute this memory budget across 8+ GPUs.
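The budget above can be sketched as a small calculator. It follows the same accounting: FP16 weights and gradients at 2 bytes per parameter, two additional weight-sized Adam buffers, plus a batch-size-dependent activation term (the 50 GB default here is an illustrative midpoint):

```python
def training_memory_gb(params_billion: float, activation_gb: float = 50) -> float:
    """Rough training footprint: FP16 weights + FP16 grads + 2x Adam states + activations."""
    weights_gb = params_billion * 2      # FP16: 2 bytes per parameter
    grads_gb = weights_gb                # FP16 gradients, same size as weights
    optimizer_gb = 2 * weights_gb        # Adam: two additional weight-sized buffers
    return weights_gb + grads_gb + optimizer_gb + activation_gb

print(training_memory_gb(70))  # 610.0 GB -> roughly eight 80 GB H100s
```

Dividing the total by per-GPU HBM capacity gives a first-order estimate of the tensor/pipeline-parallel GPU count before communication overhead is considered.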

If any portion of this memory budget spills into shared system RAM, the training pipeline becomes I/O-bound. Batch sizes must shrink, gradient accumulation steps increase, and wall-clock training time can stretch by 2–5x. At $2–$4/hr per H100, that memory spill translates directly into wasted budget.

Dedicated VRAM Advantages for Training

When all training state fits in dedicated HBM, the GPU can stream weights, compute gradients, and update optimizer states at full bandwidth without any system bus contention. This enables larger batch sizes (better GPU utilization), faster gradient communication between GPUs (NVLink operates independently from the system bus), and predictable training throughput that scales linearly with GPU count.

Impact on Production Inference

Production inference SLAs live or die on tail latency: the P95 and P99 response times, not the median. Shared GPU memory introduces variance that specifically targets tail latency.

Why Shared Memory Creates Latency Spikes

When a model partially resides in shared system RAM, individual inference requests may trigger GPU page faults. The GPU requests a memory page that is not in VRAM, causing a stall while the page is fetched from system RAM over PCIe. These faults are unpredictable because system RAM is shared with CPU workloads, networking stacks, disk I/O, and other processes. Under load, contention for system RAM bandwidth increases, making page fault latency even more variable.

The result is a bimodal latency distribution: most requests complete quickly (hitting VRAM), but a fraction experience 10–100x higher latency (hitting shared memory). For a production chatbot or API, this means most responses feel instant while some users experience multi-second delays. This is exactly the kind of inconsistency that erodes user trust.
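A toy simulation shows how a small fault rate dominates the tail. The mixture parameters here are hypothetical (95% of requests served from VRAM at ~50 ms, 5% hitting shared memory at ~2 s), chosen only to illustrate the bimodal shape:

```python
import random

random.seed(0)
# Hypothetical bimodal latency: VRAM hits ~50 ms, shared-memory faults ~2,000 ms
latencies = [random.gauss(50, 5) if random.random() < 0.95 else random.gauss(2000, 300)
             for _ in range(10_000)]
latencies.sort()
p50 = latencies[len(latencies) // 2]
p99 = latencies[int(len(latencies) * 0.99)]
print(f"P50 {p50:.0f} ms, P99 {p99:.0f} ms")  # median looks healthy; P99 is ~40x worse
```

The median is indistinguishable from a healthy deployment, which is exactly why monitoring only average latency hides this failure mode; the fault rate only needs to exceed 1% for P99 to land in the slow mode.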

Dedicated VRAM Eliminates Variance

With dedicated VRAM, every memory access from the GPU follows the same high-bandwidth, low-latency path. There are no page faults, no bus contention, and no interference from CPU workloads. The latency distribution is tight and predictable, exactly what production SLAs require.

Cloud GPU Memory: Hidden Traps

Cloud platforms abstract hardware specifications behind simple numbers: "16 GB GPU memory" or "24 GB GPU." But the implementation details vary significantly between providers, and the difference between dedicated and shared memory is often not disclosed clearly.

Common Cloud Memory Pitfalls

Overcommitted instances: Some providers use GPU virtualization (MIG, vGPU, or hypervisor-level partitioning) to share a single GPU across multiple tenants. The "GPU memory" each tenant sees may include a shared pool that degrades under contention.

Inflated memory numbers: A listing that says "16 GB GPU memory" might include 8 GB of dedicated VRAM plus 8 GB of shared system RAM. The model fits in "16 GB" on paper but hits the bandwidth cliff the moment it touches the shared portion.

Background process overhead: Even on dedicated GPU instances, the hypervisor, monitoring agents, and driver overhead can consume 500 MB–2 GB of VRAM that appears available but isn't usable by your workload.

What to Look For

When evaluating cloud GPU providers for AI workloads, two questions matter more than the headline VRAM number. First, how much memory is true on-board VRAM versus shared or borrowed system memory? Second, is the instance bare-metal or virtualized, and what is the actual contention pattern under load?

Providers that offer bare-metal GPU access with full dedicated VRAM, where the hardware you see is the hardware your model gets, eliminate the shared memory cliff entirely.

Choosing the Right Memory Architecture by Workload

Different workloads have different tolerance for shared memory. Here's a practical guide:

| Workload | Memory Need | Shared Memory OK? | Recommended Setup |
| --- | --- | --- | --- |
| Local prototyping (7B models) | 8–16 GB | Yes, if model fits in VRAM | RTX 4090 (24 GB), Mac M-series |
| Fine-tuning with LoRA (7B–13B) | 16–32 GB | No: training needs consistent bandwidth | A100 40 GB, L40S 48 GB |
| Production inference (7B–13B) | 16–48 GB | No: tail latency matters | L40S 48 GB, H100 80 GB |
| Production inference (70B+) | 80–141 GB | No: must fit entirely in HBM | H100 80 GB, H200 141 GB |
| Large-scale training (70B+) | 500+ GB distributed | No: bandwidth-critical | Multi-GPU H100/H200 with NVLink |
| Development and experimentation | Varies | Acceptable for small models | RTX 4090, A100 40 GB |
The general rule: if your workload touches real users or carries an SLA, dedicated VRAM is not optional. Shared memory is acceptable only for prototyping and experimentation where performance inconsistency doesn't matter.

Practical Workarounds and Their Limits

Teams use several techniques to work within VRAM constraints. These help, but they cannot turn shared memory into dedicated VRAM.

Quantization (INT8, INT4, GPTQ, AWQ) reduces model memory footprint by 2–4x with minimal quality loss for most inference applications. A 70B model at INT4 fits in ~40 GB, bringing it within reach of a single A100 80 GB or L40S 48 GB. However, quantization does not help if the remaining model still spills into shared memory.
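The footprint arithmetic behind these quantization claims is simple enough to sketch (weights only; runtime KV-cache and buffers add on top):

```python
def model_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight footprint only: params (billions) * bits / 8 = GB."""
    return params_billion * bits_per_param / 8

print(model_size_gb(70, 16))  # 140.0 GB: FP16, needs multiple GPUs
print(model_size_gb(70, 8))   # 70.0 GB: INT8, fits one 80 GB A100/H100
print(model_size_gb(70, 4))   # 35.0 GB: INT4, fits one 48 GB L40S with headroom
```

Each halving of bits halves both the capacity requirement and, since inference is bandwidth-bound, the per-token read time, which is why quantization often improves throughput even when the model already fit.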

Model parallelism splits models across multiple GPUs, each with its own dedicated VRAM. This works well when GPUs are connected via high-bandwidth NVLink (900 GB/s), but performs poorly if individual GPUs are already starved for bandwidth from shared memory access.

Mixed precision training (FP16/BF16/FP8) cuts memory footprint and often boosts throughput, but the benefit depends entirely on having fast dedicated VRAM to realize the bandwidth gains.

Gradient accumulation simulates large batches using multiple smaller forward passes, reducing peak VRAM usage. It helps with capacity constraints but increases wall-clock time proportionally. This is a workaround, not a solution.
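The accumulation pattern can be shown with a toy example, stripped of any ML framework: a one-parameter linear model where four micro-batches of two samples stand in for one batch of eight, with a single weight update per accumulated step:

```python
# Toy model y = w*x fit by MSE gradient descent with gradient accumulation.
data = [(x, 3.0 * x) for x in range(8)]   # the target weight is 3.0
w, lr, accum_steps, micro = 0.0, 0.01, 4, 2

for step in range(200):
    grad_sum = 0.0
    for i in range(accum_steps):                      # accumulate over micro-batches
        batch = data[i * micro:(i + 1) * micro]
        grad_sum += sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad_sum / accum_steps                  # one optimizer update per 4 passes
print(round(w, 3))  # 3.0
```

Peak memory at any moment covers only one micro-batch of activations, but the loop body runs four times per update, which is exactly the wall-clock cost the paragraph above describes.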

These techniques are multipliers on good hardware. They amplify the benefits of dedicated VRAM but cannot compensate for the fundamental bandwidth gap of shared memory.

Monitoring GPU Memory in Production

Teams that avoid memory-related outages treat VRAM utilization as a first-class operational metric. The key signals to monitor:

High memory bandwidth utilization with low compute utilization indicates a memory-bound workload. The GPU is waiting for data, not processing it. This is expected for LLM inference but becomes problematic if bandwidth utilization correlates with latency spikes.

Frequent host-to-device transfers (visible in nvidia-smi or Nsight Systems profiling) indicate that the workload is spilling into shared memory. Any significant PCIe transfer volume during steady-state inference should raise an alert.

GPU page fault counters directly measure VRAM overflow events. A non-zero page fault rate during production inference means the model does not fully fit in dedicated VRAM and is experiencing the shared memory performance cliff.

The goal is to detect "VRAM nearly full, bandwidth saturated, compute idle" patterns. These are the classic signature of shared memory pain before they translate into user-facing latency spikes.
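That signature can be checked mechanically. The sketch below parses one line of `nvidia-smi --query-gpu=memory.used,memory.total,utilization.memory,utilization.gpu --format=csv,noheader,nounits` output; the sample readings and alert thresholds are hypothetical and should be tuned per workload:

```python
def vram_pressure_alert(csv_line: str, used_threshold: float = 0.95) -> bool:
    """Flag the 'VRAM nearly full, bandwidth saturated, compute idle' signature
    from one CSV line: memory.used (MiB), memory.total (MiB), mem util %, gpu util %."""
    used_mb, total_mb, mem_util, gpu_util = (float(v) for v in csv_line.split(","))
    vram_nearly_full = used_mb / total_mb > used_threshold
    bandwidth_saturated = mem_util > 80     # memory controller mostly busy
    compute_idle = gpu_util < 30            # SMs mostly waiting on memory
    return vram_nearly_full and bandwidth_saturated and compute_idle

# Hypothetical readings from an 81 GB H100:
print(vram_pressure_alert("79000, 81000, 92, 18"))  # True: classic spill signature
print(vram_pressure_alert("40000, 81000, 60, 85"))  # False: healthy utilization
```

In a real deployment this check would run against live `nvidia-smi` output on an interval and feed an alerting pipeline rather than print.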

Deploy on Dedicated GPU Infrastructure with Spheron

Spheron is built around dedicated VRAM and bare-metal GPU deployments. When you deploy on Spheron, the GPU memory you see is the memory your model actually gets. There is no silent borrowing from system RAM, no overcommitted instances, and no shared memory surprises under load.

Deploy on H100, H200, A100, and RTX 4090 GPUs with full dedicated VRAM, transparent pricing, and no long-term contracts.

Explore GPU options on Spheron →

Frequently Asked Questions

What happens when a model exceeds GPU VRAM?

When a model's memory requirements exceed dedicated VRAM, the GPU driver automatically pages data to system RAM via the PCIe bus. This causes a dramatic performance drop, often 10–15x lower throughput, because system RAM bandwidth (50–100 GB/s) is a fraction of dedicated VRAM bandwidth (1,000–4,800 GB/s). The model still runs, but at a fraction of its potential speed with unpredictable latency spikes.

Is Apple Silicon's unified memory the same as shared GPU memory?

Not exactly. Apple Silicon's unified memory provides a single physical memory pool that both CPU and GPU access at full bandwidth (400–546 GB/s on M4-series chips). Traditional shared GPU memory forces the GPU to access system RAM over the much slower PCIe bus (~64 GB/s). Unified memory is faster than PCIe-based sharing but still slower than dedicated HBM (2,000–4,800 GB/s), making it suitable for development and local inference but not competitive with data center GPUs for production serving purposes.

How much dedicated VRAM do I need for LLM inference?

For a 7B model at FP16, you need approximately 14–16 GB of dedicated VRAM. For 70B at INT8, approximately 70–80 GB. For 70B at INT4, approximately 35–40 GB. Always add 20–30% overhead for KV-cache, framework buffers, and activation memory. The key rule: the entire model plus inference overhead should fit within dedicated VRAM without any spill to system RAM.
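The sizing rule above reduces to a one-line formula. A minimal sketch, using a 25% overhead factor as an illustrative midpoint of the 20–30% range:

```python
def inference_vram_gb(params_billion: float, bits_per_param: int,
                      overhead: float = 0.25) -> float:
    """Weights plus 20-30% headroom for KV-cache, framework buffers, activations."""
    return params_billion * bits_per_param / 8 * (1 + overhead)

print(inference_vram_gb(7, 16))   # 17.5 GB -> a 24 GB GPU is comfortable
print(inference_vram_gb(70, 4))   # 43.75 GB -> a 48 GB L40S fits; 40 GB does not
```

Long-context serving pushes the KV-cache well past this flat overhead factor, so treat the result as a floor rather than a precise requirement.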

Can I use shared memory for AI training?

Only for very small models during prototyping. Training requires simultaneous storage of model weights, optimizer states, gradients, and activations, typically 3–5x the model's weight size. Any spill into shared memory during training causes 2–5x slowdowns and unpredictable stalls. For any serious training workload, dedicated HBM-based GPUs (A100, H100, H200) are essential.

How do I check if my GPU is using shared memory?

Run nvidia-smi and compare "Memory-Usage" against the GPU's physical VRAM capacity. If your workload's memory usage approaches or exceeds the physical VRAM, the driver may be paging to system RAM. For more detailed analysis, NVIDIA's Nsight Systems profiler can show host-to-device transfer volumes and GPU page fault events. Any significant page fault activity during steady-state inference indicates shared memory spill.

Why do some cloud GPUs perform worse than their specs suggest?

Virtualized cloud GPU instances may share physical VRAM across tenants, reserve memory for hypervisor overhead, or present inflated memory numbers that include system RAM. Bare-metal GPU instances guarantee that you get the full physical VRAM with no sharing or overhead. When evaluating providers, ask specifically about bare-metal versus virtualized instances, actual dedicated VRAM capacity, and whether any memory is shared or overcommitted.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.