Alibaba's model marketplace needed 1,192 GPUs to serve its catalog of LLMs. Aegaeon, described at SOSP '25, brought that down to 213 - an 82% cut - without touching the latency SLA. The architectural shift was simple to state and hard to execute: stop assigning GPUs to models. Schedule tokens across a shared pool instead. If you want the hardware-partitioning approach first, the MIG and time-slicing guide covers that. For disaggregating prefill and decode into separate GPU pools, see the prefill-decode disaggregation guide.
Why Hardware Partitioning Caps at 2-3 Models
MIG and time-slicing both assign a GPU slice to each model at configuration time. The assignment is static. If a model handles one request per minute, its MIG instance sits idle 98% of the time. That 98% is wasted. You paid for it, the VRAM is reserved, and no other model can use it.
| Approach | Isolation | Practical models/GPU | Utilization ceiling | 2026 tooling |
|---|---|---|---|---|
| MIG (H100 80GB) | Hardware | 3 (2g.20gb) | Per-slice QPS | nvidia-smi, GPU Operator |
| Time-slicing | None | 4-6 | Shared pool QPS | NVIDIA device plugin |
| Token-level pooling | Scheduler-enforced | 20-100+ per cluster | Near 100% | vLLM + llm-d + router |
The fundamental ceiling problem: each MIG instance is a fixed VRAM bucket. A 7B model in BF16 needs about 14GB. On an H100 80GB, raw VRAM arithmetic allows at most 5 such models (80 / 14), but MIG slices come in fixed sizes: a 14GB model does not fit a 1g.10gb slice, and the 2g.20gb profile allows only 3 instances per GPU, which leaves each model about 6GB of headroom for KV cache. Time-slicing solves the fixed-partition problem but not the per-model idle problem: models that handle sporadic traffic leave the shared compute mostly unused between requests.
Token-level pooling breaks this by making allocation dynamic. The scheduler decides which models get compute on every forward pass, based on which ones have active token batches right now. A model that gets zero requests in a 10-second window gets zero GPU time in that window.
For the full partitioning spectrum from vGPU to MPS, the fractional GPU inference guide covers where each approach fits.
How Token-Level Scheduling Works
Three mechanisms from the Aegaeon paper explain how this becomes possible at scale.
Per-token auto-scaling. The scheduler decides batch composition per forward pass, not per request or per model. It maintains a priority queue of pending token batches across all active models. At each scheduling step, it selects which batches go into the next GPU kernel launch. Models with queued tokens get included. Idle models get nothing. This means a GPU running 50 models is doing useful work as long as any subset of those models has active traffic.
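The per-pass selection described above can be sketched as a priority queue drained against a token budget. This is a minimal illustration under stated assumptions, not Aegaeon's actual scheduler: the names `TokenBatch`, `TokenBatchScheduler`, and `max_tokens_per_pass` are hypothetical, and a single token budget stands in for real GPU capacity limits.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class TokenBatch:
    priority: float                        # lower value is scheduled sooner
    seq: int                               # FIFO tie-breaker among equal priorities
    model_id: str = field(compare=False)
    num_tokens: int = field(compare=False)


class TokenBatchScheduler:
    """Selects pending token batches for one forward pass at a time."""

    def __init__(self, max_tokens_per_pass: int):
        self.max_tokens_per_pass = max_tokens_per_pass  # hypothetical capacity knob
        self._heap: list[TokenBatch] = []
        self._counter = itertools.count()

    def submit(self, model_id: str, num_tokens: int, priority: float = 0.0) -> None:
        batch = TokenBatch(priority, next(self._counter), model_id, num_tokens)
        heapq.heappush(self._heap, batch)

    def next_pass(self) -> list[TokenBatch]:
        """Pick batches in priority order until the token budget is used up."""
        selected, deferred = [], []
        budget = self.max_tokens_per_pass
        while self._heap and budget > 0:
            batch = heapq.heappop(self._heap)
            if batch.num_tokens <= budget:
                selected.append(batch)
                budget -= batch.num_tokens
            else:
                deferred.append(batch)     # too large for this pass, retry next pass
        for batch in deferred:
            heapq.heappush(self._heap, batch)
        return selected


scheduler = TokenBatchScheduler(max_tokens_per_pass=100)
scheduler.submit("model-a", 60, priority=0)
scheduler.submit("model-b", 50, priority=1)
scheduler.submit("model-c", 30, priority=2)
first = [b.model_id for b in scheduler.next_pass()]
second = [b.model_id for b in scheduler.next_pass()]
third = [b.model_id for b in scheduler.next_pass()]
```

A model with nothing queued never appears in a pass, which is the property the paragraph above relies on: idle models consume no GPU time.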
Explicit KV-cache memory management. PagedAttention allocates KV cache blocks dynamically per sequence. Aegaeon instead pre-reserves KV cache pools per active session and evicts by priority when capacity is reached. The explicit management allows the scheduler to reason about memory usage deterministically and avoid the unbounded fragmentation that dynamic block allocation can cause under high concurrency.
Component reuse. Many model marketplace entries share a base architecture - Llama 3.1 70B variants, Qwen3 fine-tunes, Mistral derivatives. When models share the same base weights, the scheduler can route their token batches to the same loaded instance. Swapping adapters between requests costs microseconds, not seconds. The LoRA multi-adapter serving guide covers the vLLM implementation of this in detail.
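To make the reuse idea concrete, the sketch below groups catalog aliases by base model to count how many weight sets actually need to be resident. The alias and adapter names are illustrative, and `example-org/other-base` is a hypothetical placeholder for a model with a different base.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class ModelEntry(NamedTuple):
    alias: str
    base_model: str
    adapter: Optional[str]   # None means the alias serves the base weights as-is


def group_by_base(entries: list[ModelEntry]) -> dict[str, list[str]]:
    """Map each base model to the aliases that can share one loaded instance."""
    groups: dict[str, list[str]] = defaultdict(list)
    for entry in entries:
        groups[entry.base_model].append(entry.alias)
    return dict(groups)


LLAMA_70B = "meta-llama/Llama-3.1-70B-Instruct"
catalog = [
    ModelEntry("llama-3.1-70b", LLAMA_70B, None),
    ModelEntry("customer-a-fine-tune", LLAMA_70B, "customer-a-adapter"),
    ModelEntry("customer-b-fine-tune", LLAMA_70B, "customer-b-adapter"),
    ModelEntry("example-other-model", "example-org/other-base", None),  # hypothetical
]

groups = group_by_base(catalog)
loaded_instances = len(groups)   # weight sets to keep resident, not one per alias
```

Four aliases collapse to two resident weight sets here; in a marketplace dominated by fine-tunes of a few bases, the ratio is what drives the GPU savings.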
The data flow looks like this:
```
   Requests from 100+ model aliases
                  |
                  v
          Unified Scheduler
       (token-batch dispatcher)
                  |
           +------+------+
           |             |
           v             v
     Prefill pool   Decode pool
     (H200/B200)    (H200 SXM5)
           |             |
           +------+------+
                  |
          KV Cache Manager
      (HBM pool, LRU eviction)
                  |
                  v
         Response to client
```

Architecture Blueprint
A production token-level gateway has four distinct components, each with a specific responsibility.
Unified scheduler. The token-batch dispatcher. It maintains a priority queue of pending token batches per model and schedules batches to prefill or decode slots on each forward pass cycle. It enforces per-model token quotas via token-bucket rate limiting. Critically, it does not need to know model weights - it routes to whichever node already has the model loaded, not to a fixed assignment.
Prefill pool. Two nodes with compute-dense GPUs (B200 or H200 SXM5). These run the full prompt forward pass and produce the KV cache for each new session. Once the prompt is processed, the KV cache is transferred to the decode pool via NIXL (NVIDIA Inference Xfer Library). Prefill is compute-bound, so the choice here is the GPU with the highest FLOP throughput.
Decode pool. Six nodes with memory-bandwidth-optimized GPUs (H200 SXM5 preferred at 4.8 TB/s HBM3e bandwidth). Autoregressive generation is memory-bound - each decode step reads the full KV cache for all active sequences. Higher bandwidth means higher token throughput per second. Decode nodes hold the KV cache for all in-flight sessions.
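The memory-bound claim can be turned into a rough upper bound on decode throughput. This is a roofline-style sketch under simplifying assumptions: each decode step reads the weights once plus the KV cache of every active sequence, and compute time, kernel overhead, and interconnect traffic are ignored. The function name and the 1 GB per-sequence KV figure are illustrative, not measurements.

```python
def decode_tokens_per_second(
    weight_bytes: float,
    kv_bytes_per_seq: float,
    batch_size: int,
    hbm_bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on decode throughput when HBM reads are the only cost."""
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    step_seconds = bytes_per_step / hbm_bandwidth_bytes_per_s
    return batch_size / step_seconds   # one token per sequence per step


H200_BANDWIDTH = 4.8e12      # 4.8 TB/s HBM3e, as quoted above
WEIGHTS_7B_BF16 = 14e9       # about 14 GB

single = decode_tokens_per_second(WEIGHTS_7B_BF16, 0.0, 1, H200_BANDWIDTH)
batched = decode_tokens_per_second(WEIGHTS_7B_BF16, 1e9, 32, H200_BANDWIDTH)
slower_gpu = decode_tokens_per_second(WEIGHTS_7B_BF16, 1e9, 32, H200_BANDWIDTH / 2)
```

Batching amortizes the weight read across sequences, so the bound rises with batch size until KV reads dominate; halving the bandwidth halves the bound, which is why the decode tier favors bandwidth over FLOPs.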
KV cache manager. Explicit HBM allocation per active session. Eviction priority is scored as: (request_age × token_value_estimate) / session_cost. Cold sessions with low value and high cost evict first. The manager signals the scheduler to pause or reject new sessions when HBM utilization crosses the admission threshold.
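The eviction score above can be sketched as follows. This assumes the lowest score is evicted first (matching "low value and high cost evict first") and that `session_cost` is measured in GB of HBM held; both readings, along with the `Session` and `pick_victims` names, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    request_age: float            # as defined by the manager's own clock
    token_value_estimate: float   # estimated value of keeping the session warm
    session_cost: float           # assumed here: GB of HBM held by the session


def eviction_score(s: Session) -> float:
    return (s.request_age * s.token_value_estimate) / s.session_cost


def pick_victims(sessions: list[Session], gb_needed: float) -> list[str]:
    """Evict lowest-scoring sessions until enough HBM is freed."""
    victims, freed = [], 0.0
    for s in sorted(sessions, key=eviction_score):
        if freed >= gb_needed:
            break
        victims.append(s.session_id)
        freed += s.session_cost
    return victims


sessions = [
    Session("a", request_age=10, token_value_estimate=1.0, session_cost=8.0),  # score 1.25
    Session("b", request_age=10, token_value_estimate=4.0, session_cost=2.0),  # score 20.0
    Session("c", request_age=2, token_value_estimate=1.0, session_cost=4.0),   # score 0.5
]
victims = pick_victims(sessions, gb_needed=10.0)
```

Sorting once per eviction round is fine at the scale of tens of sessions; a heap keyed on the score would be the natural replacement if the session count grows.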
The prefill/decode split here maps directly onto what llm-d on Kubernetes implements with InferencePool CRDs. The difference is that llm-d routes at the request level, while an Aegaeon-style scheduler routes at the token-batch level - a finer granularity that allows interleaving from more models per GPU.
Open-Source Building Blocks in 2026
No single project fully replicates Aegaeon's scheduler. The ecosystem in mid-2026 gives you the pieces to assemble something close.
| Tool | What it covers | Gap vs Aegaeon |
|---|---|---|
| vLLM multi-LoRA | Base model reuse, adapter hot-swap, LRU adapter cache | No cross-model token-batch scheduler |
| llm-d v0.5 | Kubernetes prefill/decode disaggregation, cache-aware routing | Request-level routing, not token-level |
| SGLang RadixAttention | Prefix cache sharing across requests and models | No multi-model scheduler |
| KServe + ModelMesh | Multi-model Kubernetes routing, model versioning | No token-batch interleaving |
The vLLM production guide and SGLang production guide cover the individual serving engines. The llm-d guide covers the Kubernetes disaggregation layer.
What follows is a practical architecture using these tools as building blocks.
Step-by-Step: 100-Model Gateway on Spheron H200/B200
Step 1: Provision the GPU Pool
Rent an 8x H200 SXM5 cluster from Spheron's H200 instances. Alternatively, use B200 SXM6 nodes for the compute-dense prefill tier. Bare-metal root access is required for the NIXL transport layer.
Verify NVLink topology once you're in:
```bash
nvidia-smi topo -m
```

Expected output on an 8x H200 NVLink cluster:

```
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7
GPU0     X     NV18   NV18   NV18   NV18   NV18   NV18   NV18
GPU1    NV18    X     NV18   NV18   NV18   NV18   NV18   NV18
...
```

All GPUs should show NV18 interconnect. If you see PHB (PCIe Host Bridge) between any pair, KV transfer between those GPUs leaves NVLink and falls back to a slower path (PCIe, or TCP across hosts), adding latency. NVLink matters here because it provides up to 900 GB/s of GPU-to-GPU bandwidth per H200, which is what makes KV cache transfer fast enough to be practical. The 4.8 TB/s figure quoted elsewhere in this guide is the H200's HBM3e memory bandwidth, not its interconnect bandwidth.
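The topology check can be automated by parsing the matrix. The sketch below is a hypothetical helper, not part of any NVIDIA tooling: it assumes the whitespace-separated layout shown above, treats any cell starting with `NV` as NVLink, and skips rows that do not start with `GPU` (such as NIC rows or the legend).

```python
def non_nvlink_pairs(topo_text: str) -> list[tuple[str, str]]:
    """Return GPU pairs whose link is anything other than NVLink."""
    rows = [line.split() for line in topo_text.strip().splitlines() if line.strip()]
    header = [cell for cell in rows[0] if cell.startswith("GPU")]
    bad = []
    for row in rows[1:]:
        src = row[0]
        if src not in header:
            continue                       # NIC rows, legend lines, truncation markers
        cells = row[1:1 + len(header)]
        for dst, link in zip(header, cells):
            # Report each pair once (upper triangle) and skip the "X" diagonal.
            if header.index(dst) > header.index(src) and not link.startswith("NV"):
                bad.append((src, dst))
    return bad


sample = """
        GPU0   GPU1   GPU2
GPU0     X     NV18   PHB
GPU1    NV18    X     NV18
GPU2    PHB    NV18    X
"""
problem_pairs = non_nvlink_pairs(sample)
```

In practice the input would come from `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout`; an empty result means every GPU pair is NVLink-connected.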
Step 2: Stand Up Prefill and Decode Pools
Partition the cluster: 2 prefill nodes and 6 decode nodes.
Create kv-transfer-config-prefill.yaml for prefill nodes:

```yaml
kv_transfer_config:
  kv_connector: NixlConnector
  kv_role: kv_producer
  kv_rank: 0
  kv_parallel_size: 2
  kv_buffer_device: cuda
  kv_port: 14579
```

And kv-transfer-config-decode.yaml for decode nodes:

```yaml
kv_transfer_config:
  kv_connector: NixlConnector
  kv_role: kv_consumer
  kv_rank: 0
  kv_parallel_size: 6
  kv_buffer_device: cuda
  kv_port: 14579
```

Launch vLLM 0.8+ on each prefill node:

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config kv-transfer-config-prefill.yaml \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-loras 8 \
  --max-cpu-loras 64 \
  --port 8100
```

And on each decode node (same flags, different KV config):

```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config kv-transfer-config-decode.yaml \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-loras 8 \
  --max-cpu-loras 64 \
  --port 8200
```

vLLM's CLI documents --kv-transfer-config as an inline JSON string, so if your version rejects a file path, pass the same fields as JSON instead.

Set --max-model-len uniformly across all nodes. Mismatched values cause routing failures when a session migrates between nodes.
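If the KV-transfer settings need to be passed inline, the launch commands can be generated so the prefill and decode pools cannot drift apart. This sketch assumes `--kv-transfer-config` accepts a JSON string (as vLLM's CLI reference describes); the helper names are hypothetical, the field values are copied from the configs above, and the LoRA flags are omitted for brevity.

```python
import json
import shlex


def kv_transfer_json(role: str, rank: int, parallel_size: int, port: int = 14579) -> str:
    """Serialize the KV-transfer fields as compact JSON for the command line."""
    return json.dumps(
        {
            "kv_connector": "NixlConnector",
            "kv_role": role,
            "kv_rank": rank,
            "kv_parallel_size": parallel_size,
            "kv_buffer_device": "cuda",
            "kv_port": port,
        },
        separators=(",", ":"),
    )


def vllm_command(kv_json: str, port: int, max_model_len: int = 32768) -> str:
    """Build a shell-safe launch command with a shared --max-model-len."""
    args = [
        "vllm", "serve", "meta-llama/Llama-3.1-70B-Instruct",
        "--kv-transfer-config", kv_json,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", "0.90",
        "--port", str(port),
    ]
    return shlex.join(args)   # quotes the JSON so the shell passes it as one argument


prefill_cmd = vllm_command(kv_transfer_json("kv_producer", 0, 2), port=8100)
decode_cmd = vllm_command(kv_transfer_json("kv_consumer", 0, 6), port=8200)
print(prefill_cmd)
print(decode_cmd)
```

Generating both commands from one function keeps `--max-model-len` identical across pools, which is the uniformity requirement stated above.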
Step 3: Deploy the Unified Model Router
A LiteLLM proxy works as the OpenAI-compatible router layer. Here is a litellm-config.yaml mapping three model aliases to backend nodes; extend the same pattern to the rest of the catalog:

```yaml
model_list:
  - model_name: llama-3.1-70b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-70B-Instruct
      api_base: http://decode-node-1:8200/v1
      api_key: none
  - model_name: customer-a-fine-tune
    litellm_params:
      # vLLM selects a LoRA adapter by its registered name in the model field
      model: openai/customer-a-adapter
      api_base: http://decode-node-1:8200/v1
      api_key: none
  - model_name: customer-b-fine-tune
    litellm_params:
      model: openai/customer-b-adapter
      api_base: http://decode-node-2:8200/v1
      api_key: none
  # ... up to 100+ aliases across decode nodes
router_settings:
  routing_strategy: least-busy
  num_retries: 2
  timeout: 30
litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]
```

Each adapter name must be registered with the vLLM server it routes to (for example via --lora-modules at launch); otherwise requests for that alias fail or fall through to the base model.

For Kubernetes, replace LiteLLM with llm-d's Gateway API Inference Extension using InferencePool CRDs for cache-aware routing that tracks which decode node has which model adapter loaded.
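Writing 100+ alias entries by hand is error-prone, so the model list can be generated. The sketch below is one possible approach, with assumptions labeled: adapters are spread round-robin across six decode nodes named `decode-node-1` to `decode-node-6`, each adapter is selected through the `model` field (the convention vLLM's OpenAI-compatible server uses for registered LoRA adapters), and the alias and adapter names are hypothetical. It emits JSON, which most YAML parsers accept.

```python
import json


def build_model_list(
    adapters: dict[str, str], decode_nodes: list[str], port: int = 8200
) -> list[dict]:
    """Create one LiteLLM model_list entry per alias, round-robin over nodes."""
    entries = []
    for i, alias in enumerate(sorted(adapters)):
        node = decode_nodes[i % len(decode_nodes)]
        entries.append(
            {
                "model_name": alias,
                "litellm_params": {
                    "model": f"openai/{adapters[alias]}",
                    "api_base": f"http://{node}:{port}/v1",
                    "api_key": "none",
                },
            }
        )
    return entries


nodes = [f"decode-node-{n}" for n in range(1, 7)]
# Hypothetical catalog: alias -> adapter name registered with vLLM.
adapters = {f"customer-{i:03d}-fine-tune": f"customer-{i:03d}-adapter" for i in range(100)}
config = {"model_list": build_model_list(adapters, nodes)}
print(json.dumps(config, indent=2)[:300])
```

Round-robin placement is the simplest choice; placing adapters by expected traffic would balance load better but needs request statistics that are not available before launch.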
Step 4: Configure the Warm Model Pool and KV Cache Eviction
Calculate warm pool capacity before launch:
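```
warm_pool_size = floor((total_hbm_bytes × 0.90) / avg_kv_footprint_per_model)
```

For 6x H200 SXM5 decode nodes (141 GB HBM3e each):
- Total HBM: 6 × 141 GB = 846 GB
- Usable at 90%: ~761 GB
- KV footprint per active session at 32k context, 70B model, BF16: ~10.7 GB (for Llama 3.1 70B: 80 layers × 8 KV heads × 128 head dim × 2 for K and V × 2 bytes ≈ 0.33 MB per token)
- Max concurrent active sessions: ~70 (an upper bound, since model weights also occupy HBM)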
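The per-session footprint can be derived from the model shape rather than estimated. The sketch below assumes Llama 3.1 70B's published configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and BF16 cache entries; under those assumptions a full 32k-token session needs about 10.7 GB. It also ignores the HBM taken by model weights, so the session count is an upper bound.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_sessions(total_hbm_bytes: float, utilization: float, kv_bytes_per_session: int) -> int:
    """Sessions that fit if the usable HBM held nothing but KV cache."""
    return int(total_hbm_bytes * utilization // kv_bytes_per_session)


per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_session = per_token * 32768            # full 32k-token context
total_hbm = 6 * 141e9                      # six H200 decode GPUs
sessions = max_sessions(total_hbm, 0.90, per_session)

print(f"{per_token} bytes/token, {per_session / 1e9:.1f} GB/session, {sessions} sessions")
```

Sessions rarely fill the whole context window, so real concurrency is usually higher than this worst case; the calculation is most useful for setting the admission threshold conservatively.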
Enable prefix caching to reuse KV cache across requests with shared prefixes:
```bash
# Add to vLLM launch command
--enable-prefix-caching
```

SGLang's RadixAttention handles this more aggressively by building a radix tree of cached prefix chunks. If you switch to SGLang as the serving engine, the warm pool arithmetic remains the same but eviction is managed by the RadixAttention runtime rather than vLLM's block manager.
Step 5: Add Per-Model Token Quotas and Admission Control
A token-bucket rate limiter in the router layer prevents one model from monopolizing decode capacity. Here is a minimal Python implementation for the FastAPI router layer:
```python
import asyncio
import time
from collections import defaultdict


class TokenBucket:
    def __init__(self, burst: int, rate: float):
        self.burst = burst
        self.rate = rate
        self.tokens = burst
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def consume(self, n: int) -> bool:
        if n > self.burst:
            # Request exceeds burst capacity and can never succeed; fail immediately
            return False
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def wait_time(self, n: int) -> float:
        """Seconds until the bucket can serve n tokens based on the current deficit."""
        self._refill()
        deficit = n - self.tokens
        return max(0.0, deficit / self.rate)


buckets: dict[str, TokenBucket] = defaultdict(
    lambda: TokenBucket(burst=32768, rate=200)
)

_MAX_WAIT = 30.0  # drop the request if it cannot be served within this window


async def route_request(model_id: str, token_count: int) -> bool:
    bucket = buckets[model_id]
    if token_count > bucket.burst:
        # Request exceeds burst capacity and can never succeed; fail immediately
        return False
    deadline = time.monotonic() + _MAX_WAIT
    while not bucket.consume(token_count):
        wait = bucket.wait_time(token_count)
        if time.monotonic() + wait > deadline:
            return False
        # Sleep for the deficit-based refill time, then retry. We loop rather
        # than doing a single retry because concurrent coroutines may consume
        # tokens while we sleep, so one wake-up is not guaranteed to succeed.
        await asyncio.sleep(wait)
    return True
```

Monitor per-model decode queue depth. When any model's queue consistently exceeds 10 seconds of wait time, add a decode replica for that model's base architecture.
Step 6: Validate with a Multi-Model Load Test
Use vLLM's benchmark_serving.py to drive load through the gateway. A single run exercises one alias, so launch one run per alias in parallel (or script the rotation with locust) to cover 50-100 aliases simultaneously:

```bash
# Option 1: install locust to script a test that rotates across model aliases
pip install locust

# Option 2: vLLM's benchmark script, run from a checkout of the vLLM repository
# (recent releases expose the same tool as `vllm bench serve`); one run per alias
python benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://your-gateway:4000 \
  --model llama-3.1-70b \
  --dataset-name random \
  --num-prompts 1000 \
  --request-rate 10 \
  --max-concurrency 50
```

Run the test for at least 30 minutes to capture cold-start events, KV eviction spikes, and noisy-neighbor behavior. Compare TTFT and decode tokens/s per model against a baseline of dedicated vLLM instances at matched total cost. A well-tuned pool should show 70-80% GPU utilization across the cluster while keeping p99 TTFT within 2x of the dedicated baseline.
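The "p99 TTFT within 2x of the dedicated baseline" criterion is easy to check once per-request TTFT samples are exported from both runs. This is a minimal sketch using the nearest-rank percentile definition; the function names and the sample values are illustrative, not benchmark results.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_slo(pooled_ttft: list[float], dedicated_ttft: list[float], max_ratio: float = 2.0) -> bool:
    """True when pooled p99 TTFT is at most max_ratio times the dedicated p99."""
    return percentile(pooled_ttft, 99) <= max_ratio * percentile(dedicated_ttft, 99)


# Illustrative samples in seconds (99 typical requests plus one outlier each).
dedicated = [0.2] * 99 + [1.0]
pooled_ok = [0.3] * 99 + [5.0]
pooled_bad = [0.5] * 99 + [5.0]

print(within_slo(pooled_ok, dedicated), within_slo(pooled_bad, dedicated))
```

Run the check per model alias rather than on the pooled aggregate, since a starved low-traffic model can hide behind a healthy cluster-wide p99.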
Goodput Math: When Pooling Beats Dedicated GPUs
Goodput is the number of useful output tokens delivered per GPU-hour paid for. A dedicated GPU serving a model that handles 0.1 requests per second has low goodput, because most of its GPU-hours are spent idle between requests.
The break-even point is approximately 35% per-model GPU utilization. Below that, pooling wins. Above 70%, dedicated instances are more predictable.
| Scenario | Per-model avg QPS | Dedicated cost (50 models) | Pooled cost (8x H200) | Winner |
|---|---|---|---|---|
| Very low (0.1 QPS/model) | 0.1 | 50 × $1.14/hr = $57.00/hr | $36.96/hr | Pooled |
| Low (0.5 QPS/model) | 0.5 | 50 × $1.14/hr = $57.00/hr | $36.96/hr | Pooled |
| Medium (2 QPS/model) | 2.0 | 50 × $1.14/hr = $57.00/hr | $36.96/hr | Pooled |
| High (10 QPS/model) | 10.0 | 50 × $1.14/hr = $57.00/hr | $36.96/hr | Dedicated* |
*At 10 QPS per model across 50 models, the pool saturates and you need more than 8 GPUs anyway. Dedicated instances become cheaper per-model when each model can fill a GPU on its own.
Dedicated cost uses A100 80G PCIe at $1.14/hr as the comparison: the cheapest GPU that can serve a 7B model in BF16. A pool of 8x H200 SXM5 costs $36.96/hr total. At very low to medium traffic, the pool costs 35% less for the same model catalog coverage.
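The comparison reduces to a few lines of arithmetic. The sketch below reuses the rates quoted above; the function name is hypothetical, and the H200 per-GPU rate of $4.62/hr is the pool total divided by eight.

```python
def compare_costs(
    dedicated_rate: float, num_models: int, pool_gpu_rate: float, pool_gpus: int
) -> dict[str, float]:
    """Hourly cost of one dedicated GPU per model versus a shared pool."""
    dedicated = dedicated_rate * num_models
    pooled = pool_gpu_rate * pool_gpus
    return {
        "dedicated_per_hr": dedicated,
        "pooled_per_hr": pooled,
        "pooled_per_model_per_hr": pooled / num_models,
        "savings_fraction": 1 - pooled / dedicated,
    }


result = compare_costs(dedicated_rate=1.14, num_models=50, pool_gpu_rate=4.62, pool_gpus=8)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Because the pool cost is fixed, the saving grows with catalog size: the same eight GPUs serving 100 models would halve the per-model cost, as long as aggregate traffic stays within the pool's capacity.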
Pricing fluctuates based on GPU availability. The prices above are based on 19 May 2026 and may have changed. Check current GPU pricing → for live rates.
Failure Modes and Mitigations
Cold-Start Latency
Loading a 7B model in BF16 from object storage takes 10-60 seconds depending on network throughput. With the 30-second router timeout set in Step 3, a request arriving for a cold model can time out before loading completes.
Mitigations: size the warm pool to cover the top-N models by 24-hour request count (N = warm pool capacity computed in Step 4). Implement predictive pre-warming triggered by request-rate spike detection - if a model's request rate doubles in a 5-minute window, pre-load its next logical neighbor in the priority list. Set client-side retry with exponential backoff (base 2s, max 30s, 3 retries) so cold-start timeouts are transparent to end users.
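Two of these mitigations are small enough to sketch directly: the retry schedule and the rate-doubling trigger. The function names are hypothetical, and the trigger compares request counts from two consecutive 5-minute windows, which is one simple reading of "doubles in a 5-minute window".

```python
def backoff_delays(base: float = 2.0, cap: float = 30.0, retries: int = 3) -> list[float]:
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def should_prewarm(prev_window_requests: int, current_window_requests: int) -> bool:
    """Trigger pre-warming when the request count at least doubles between windows."""
    return prev_window_requests > 0 and current_window_requests >= 2 * prev_window_requests


print(backoff_delays())            # delays for the 3 retries described above
print(backoff_delays(retries=6))   # shows the 30-second cap taking effect
print(should_prewarm(40, 85))
```

With the stated parameters the three retries wait 2, 4, and 8 seconds, 14 seconds in total, so they only mask cold starts that finish within that budget; slower loads still need the warm pool or pre-warming.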
KV-Cache Thrash
When the total active session KV footprint exceeds the pool HBM, the manager evicts live sessions. Eviction forces recompute (the prefill node re-runs the full prompt), which spikes TTFT for the evicted session and consumes prefill capacity that could serve new requests.
Mitigations: implement admission control that rejects new sessions when HBM utilization crosses 85%. Enforce a uniform --max-model-len across all nodes so the KV footprint per session is predictable. Use HBM utilization as the primary autoscaling signal - add decode replicas before reaching 85%, not after.
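The admission rule can be expressed as a single check. This sketch makes one design choice beyond the text: it tests the projected utilization after admitting the new session, so a large session cannot push the pool over the threshold. The function name and the example byte counts are illustrative.

```python
def admit_session(
    used_hbm_bytes: float,
    new_session_bytes: float,
    total_hbm_bytes: float,
    threshold: float = 0.85,
) -> bool:
    """Admit only if HBM utilization stays at or below the threshold afterwards."""
    projected = (used_hbm_bytes + new_session_bytes) / total_hbm_bytes
    return projected <= threshold


TOTAL_HBM = 6 * 141e9   # six H200 decode GPUs

print(admit_session(700e9, 10e9, TOTAL_HBM))   # about 84% projected
print(admit_session(715e9, 10e9, TOTAL_HBM))   # about 86% projected
```

Rejected sessions should receive a retryable error so the client-side backoff can absorb short bursts while the autoscaler adds decode capacity.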
Noisy-Neighbor Starvation
One model with a burst of long-context requests fills all decode slots with its active sequences. Other models queue behind it and miss their latency SLAs.
Mitigations: the per-model token-bucket quota in the router layer is the first line of defense - bursty models get queued, not given unbounded access. Add weighted fair queuing in the scheduler (weight by SLA tier: premium models get 2x weight over standard). For extreme cases, implement preemption: the scheduler can suspend low-priority batches mid-sequence and resume them later. Preemption adds TTFT overhead for the preempted session but protects high-priority models from starvation.
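Weighted fair queuing can be approximated with virtual finish times. The sketch below uses self-clocked fair queuing, a common simplification of WFQ, and is not taken from Aegaeon or any of the tools above; the class name and the `premium`/`standard` model IDs are hypothetical.

```python
import heapq
import itertools
from typing import Optional


class WeightedFairQueue:
    """Serves batches in order of virtual finish time = start + tokens / weight."""

    def __init__(self, weights: dict[str, float]):
        self.weights = weights
        self._last_finish: dict[str, float] = {}
        self._virtual_time = 0.0
        self._heap: list[tuple[float, int, str, int]] = []
        self._seq = itertools.count()

    def enqueue(self, model_id: str, num_tokens: int) -> None:
        start = max(self._virtual_time, self._last_finish.get(model_id, 0.0))
        finish = start + num_tokens / self.weights.get(model_id, 1.0)
        self._last_finish[model_id] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), model_id, num_tokens))

    def dequeue(self) -> Optional[tuple[str, int]]:
        if not self._heap:
            return None
        finish, _, model_id, num_tokens = heapq.heappop(self._heap)
        self._virtual_time = finish   # self-clocking: advance to the served batch's tag
        return model_id, num_tokens


queue = WeightedFairQueue({"premium": 2.0, "standard": 1.0})
for _ in range(4):
    queue.enqueue("standard", 100)
for _ in range(4):
    queue.enqueue("premium", 100)

order = []
while (item := queue.dequeue()) is not None:
    order.append(item[0])
```

While both models are backlogged, the premium model is served twice as often, yet the standard model is never starved: its batches are interleaved rather than pushed to the back of the queue.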
Pricing: 8-GPU H200 Pool for 50 Models
Using live-fetched rates from the Spheron GPU pricing API (19 May 2026):
| Allocation | GPU | Count | On-demand ($/hr/GPU) | Spot ($/hr/GPU) | Pool cost/hr (on-demand) |
|---|---|---|---|---|---|
| Prefill nodes | H200 SXM5 | 2 | $4.62 | $1.19 | $9.24 |
| Decode nodes | H200 SXM5 | 6 | $4.62 | $1.19 | $27.72 |
| Total pool | H200 SXM5 | 8 | | | $36.96 |
| Per model (50 models) | | | | | $0.74 |
| Alt: prefill (compute-dense) | B200 | 2 | $7.21 | $3.81 | $14.42 |
Compare against 50 dedicated A100 80G PCIe instances at $1.14/hr each: $57.00/hr for the same model catalog. The pool costs 35% less while maintaining the same catalog coverage and similar latency profiles at low to medium per-model traffic.
Decode nodes are good candidates for Spheron spot instances since preemption of a decode session is recoverable: the prefill node recomputes the KV cache and resubmits it on the next scheduled slot. At the listed spot rate, running all six decode nodes on spot would cut the pool cost from $36.96/hr to $16.38/hr, about 56%; with partial spot availability, expect a reduction closer to 20-40%.
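The effect of spot pricing on the pool can be worked out directly from the table's rates. This sketch assumes all six decode GPUs run on spot for the full hour, which is a best case; the function name is hypothetical.

```python
def pool_cost(prefill_gpus: int, decode_gpus: int, prefill_rate: float, decode_rate: float) -> float:
    """Hourly pool cost given per-GPU rates for each tier."""
    return prefill_gpus * prefill_rate + decode_gpus * decode_rate


on_demand = pool_cost(2, 6, prefill_rate=4.62, decode_rate=4.62)
decode_on_spot = pool_cost(2, 6, prefill_rate=4.62, decode_rate=1.19)
reduction = 1 - decode_on_spot / on_demand

print(f"on-demand ${on_demand:.2f}/hr, decode on spot ${decode_on_spot:.2f}/hr, "
      f"reduction {reduction:.1%}")
```

Preempted decode sessions cost prefill recompute, so the realized saving is lower than this figure whenever spot capacity is reclaimed frequently.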
Spheron's H200 and B200 bare-metal clusters include root access, NVLink topology, hourly billing, and spot pricing - the right substrate for a token-level multi-model gateway. Rent individual nodes to prototype the architecture, or contact us for a dedicated 8-GPU cluster.
Quick Setup Guide
Rent an 8x H200 SXM5 or 8x B200 bare-metal cluster on Spheron. Verify NVLink topology with nvidia-smi topo -m. Bare-metal root access is required for the NIXL transport layer used for KV cache transfer between prefill and decode nodes.
Partition the cluster: 2 prefill nodes (compute-dense, H200 or B200) and 6 decode nodes (memory-bandwidth-optimized, H200 SXM5 preferred). Launch vLLM 0.8+ on each with --kv-transfer-config specifying role kv_producer on prefill nodes and kv_consumer on decode nodes. Set a uniform --max-model-len (e.g., 32768) across all nodes.
Start an OpenAI-compatible router - LiteLLM proxy or a FastAPI service - that maps incoming model IDs to vLLM backend instances. Maintain a registry of loaded models per node. For Kubernetes, use llm-d's Gateway API Inference Extension with InferencePool CRDs for cache-aware routing.
Calculate warm pool capacity: (total_hbm × 0.90) / avg_model_kv_footprint. Set --gpu-memory-utilization to 0.90. Enable prefix caching with vLLM's built-in prefix cache or SGLang RadixAttention. Use LRU eviction for models beyond the warm pool, tracking request frequency per model over a 24-hour rolling window.
Implement a token-bucket rate limiter per model in the router layer (burst capacity: 32,768 tokens, sustained refill: 200 tokens/s as starting values). Requests that exceed the current balance wait for refill, up to 30 seconds; requests larger than the burst capacity are rejected immediately. Monitor per-model decode queue depth; scale decode replicas horizontally when any model's queue consistently exceeds 10 seconds.
Use vLLM's benchmark_serving.py or locust to drive 50-100 concurrent model aliases through the gateway simultaneously. Measure TTFT and decode tokens/s per model, total GPU utilization across the pool, and cold-start frequency over 30 minutes. Compare against a baseline of dedicated vLLM instances at matched total cost.
Frequently Asked Questions
What is token-level GPU pooling?
Token-level GPU pooling virtualizes GPU access at individual token batches rather than assigning a full GPU (or GPU slice) to each model. A scheduler interleaves token batches from dozens of active LLMs within the same GPU forward pass. The Aegaeon system (SOSP '25) used this approach to reduce Alibaba's model marketplace from 1,192 GPUs to 213 - an 82% reduction - without changing latency SLAs.

How does it differ from MIG and time-slicing?
MIG and time-slicing use static partitioning: each GPU slice hosts exactly one model, so utilization is bounded by that model's per-slice traffic. Token-level scheduling is dynamic: the scheduler packs token batches from whichever models have active requests into the same kernel launch. Fifty models sharing an 8-GPU cluster can keep GPUs nearly fully occupied even when each model averages only a few requests per second.

Is there an open-source implementation of Aegaeon?
No single project fully replicates Aegaeon as of mid-2026. The building blocks are: vLLM multi-LoRA for base-model reuse and adapter hot-swap, llm-d for Kubernetes-native prefill/decode disaggregation, SGLang RadixAttention for prefix cache sharing across requests, and KServe + ModelMesh for multi-model routing on Kubernetes. A production gateway combines these with a router layer and explicit KV cache management.

When does pooling beat dedicated GPUs?
Pooling wins when per-model average utilization on a dedicated GPU falls below roughly 30-40%. This is the standard pattern in model marketplaces where dozens of models each handle sporadic traffic. Above 70% steady-state utilization per model, dedicated instances are more predictable and easier to scale independently.

What are the main failure modes?
Three dominate: (1) Cold-start latency - a model outside the warm pool takes 10-60 seconds to load, causing request timeouts. (2) KV-cache thrash - concurrent active sessions exhaust total HBM across the pool, forcing evictions that spike TTFT. (3) Noisy-neighbor - one bursty model monopolizes decode capacity and starves others. Mitigations are warm pool sizing, admission control, and per-model token-bucket quotas.
