Token-Level GPU Pooling for Multi-LLM Marketplace Inference (2026 Guide)

Alibaba's model marketplace needed 1,192 GPUs to serve its catalog of LLMs. Aegaeon, described at SOSP '25, brought that down to 213 - an 82% cut - without touching the latency SLA. The architectural shift was simple to state and hard to execute: stop assigning GPUs to models. Schedule tokens across a shared pool instead. If you want the hardware-partitioning approach first, the MIG and time-slicing guide covers that. For disaggregating prefill and decode into separate GPU pools, see the prefill-decode disaggregation guide.

Why Hardware Partitioning Caps at 2-3 Models

MIG and time-slicing both assign a GPU slice to each model at configuration time. The assignment is static. If a model handles one request per minute and each request occupies the GPU for about a second, its MIG instance sits idle roughly 98% of the time. That idle time is wasted. You paid for it, the VRAM is reserved, and no other model can use it.
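The idle figure follows from simple arithmetic. The sketch below assumes about 1.2 seconds of GPU time per request, an illustrative number rather than a measurement, and `idle_fraction` is a hypothetical helper name.

```python
def idle_fraction(requests_per_minute: float, busy_seconds_per_request: float) -> float:
    """Fraction of wall-clock time a dedicated slice spends idle."""
    busy_seconds = requests_per_minute * busy_seconds_per_request
    return max(0.0, 1.0 - busy_seconds / 60.0)


# One request per minute at an assumed ~1.2 s of GPU time each.
print(f"idle: {idle_fraction(1, 1.2):.0%}")
```

Longer requests or higher request rates shrink the idle share, which is why the per-model traffic profile matters more than the model count.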

Approach	Isolation	Practical models/GPU	Utilization ceiling	2026 tooling
MIG (H100 80GB)	Hardware	3 (2g.20gb)	Per-slice QPS	nvidia-smi, GPU Operator
Time-slicing	None	4-6	Shared pool QPS	NVIDIA device plugin
Token-level pooling	Scheduler-enforced	20-100+ per cluster	Near 100%	vLLM + llm-d + router

The fundamental ceiling problem: each MIG instance is a fixed VRAM bucket. A 7B model in BF16 needs about 14GB for weights alone, so it does not fit in a 1g.10gb slice. On an H100 80GB, the smallest profiles that hold it are the 20GB ones: at most 4 instances with 1g.20gb, and in practice closer to 3 (2g.20gb), because you need headroom for KV cache and a usable share of compute per slice. Time-slicing solves the fixed-partition problem but not the per-model idle problem: models that handle sporadic traffic leave the shared compute mostly unused between requests.
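A quick way to sanity-check the slice math is to test a model's footprint against the nominal MIG profile sizes. This is a sketch under stated assumptions: the profile list uses nominal H100 80GB sizes (usable memory per slice is slightly lower in practice), and `max_models_per_gpu` is a hypothetical helper.

```python
# Nominal H100 80GB MIG profiles: (name, memory in GB, max instances per GPU).
H100_80GB_PROFILES = [
    ("1g.10gb", 10, 7),
    ("1g.20gb", 20, 4),
    ("2g.20gb", 20, 3),
    ("3g.40gb", 40, 2),
    ("7g.80gb", 80, 1),
]


def max_models_per_gpu(weights_gb: float, kv_headroom_gb: float = 0.0) -> tuple[str, int]:
    """Return the (profile, instance count) that hosts the most copies of one model."""
    need = weights_gb + kv_headroom_gb
    fitting = [(count, name) for name, mem, count in H100_80GB_PROFILES if mem >= need]
    if not fitting:
        raise ValueError("model does not fit in any MIG profile")
    count, name = max(fitting)  # highest instance count wins
    return name, count


# 7B model in BF16: ~14 GB of weights plus an assumed 4 GB of KV headroom.
print(max_models_per_gpu(14, kv_headroom_gb=4))
```

The memory-only answer is four 1g.20gb slices. Dropping to three 2g.20gb slices trades one model for twice the compute per slice, which is the practical ceiling the table above reflects.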

Token-level pooling breaks this by making allocation dynamic. The scheduler decides which models get compute on every forward pass, based on which ones have active token batches right now. A model that gets zero requests in a 10-second window gets zero GPU time in that window.

For the full partitioning spectrum from vGPU to MPS, the fractional GPU inference guide covers where each approach fits.

How Token-Level Scheduling Works

Three mechanisms from the Aegaeon paper explain how this becomes possible at scale.

Per-token auto-scaling. The scheduler decides batch composition per forward pass, not per request or per model. It maintains a priority queue of pending token batches across all active models. At each scheduling step, it selects which batches go into the next GPU kernel launch. Models with queued tokens get included. Idle models get nothing. This means a GPU running 50 models is doing useful work as long as any subset of those models has active traffic.
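The per-step selection described above can be sketched as a priority queue drained against a token budget. This is an illustrative toy, not Aegaeon's scheduler: `TokenBatchScheduler`, the `max_tokens_per_step` budget, and the lower-is-sooner priority convention are all assumptions made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class TokenBatch:
    priority: float                        # lower value is scheduled sooner
    seq: int                               # FIFO tie-breaker among equal priorities
    model_id: str = field(compare=False)
    num_tokens: int = field(compare=False)


class TokenBatchScheduler:
    def __init__(self, max_tokens_per_step: int):
        self.max_tokens_per_step = max_tokens_per_step
        self._queue: list[TokenBatch] = []
        self._counter = itertools.count()

    def submit(self, model_id: str, num_tokens: int, priority: float = 0.0) -> None:
        if num_tokens > self.max_tokens_per_step:
            raise ValueError("batch can never fit in a single step")
        heapq.heappush(
            self._queue, TokenBatch(priority, next(self._counter), model_id, num_tokens)
        )

    def next_step(self) -> list[TokenBatch]:
        """Select batches for one scheduling step without exceeding the token budget."""
        selected, deferred = [], []
        budget = self.max_tokens_per_step
        while self._queue and budget > 0:
            batch = heapq.heappop(self._queue)
            if batch.num_tokens <= budget:
                selected.append(batch)
                budget -= batch.num_tokens
            else:
                deferred.append(batch)     # does not fit now; retry next step
        for batch in deferred:
            heapq.heappush(self._queue, batch)
        return selected


sched = TokenBatchScheduler(max_tokens_per_step=4096)
sched.submit("model-a", 3000)
sched.submit("model-b", 2000)
sched.submit("model-c", 1000, priority=-1.0)
first = [b.model_id for b in sched.next_step()]
second = [b.model_id for b in sched.next_step()]
print(first, second)
```

Models with nothing queued never appear in a step, which is the property that lets idle models consume zero GPU time.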

Explicit KV-cache memory management. PagedAttention allocates KV cache blocks dynamically per sequence. Aegaeon instead pre-reserves KV cache pools per active session and evicts by priority when capacity is reached. The explicit management allows the scheduler to reason about memory usage deterministically and avoid the unbounded fragmentation that dynamic block allocation can cause under high concurrency.

Component reuse. Many model marketplace entries share a base architecture - Llama 3.1 70B variants, Qwen3 fine-tunes, Mistral derivatives. When models share the same base weights, the scheduler can route their token batches to the same loaded instance. Swapping adapters between requests costs microseconds, not seconds. The LoRA multi-adapter serving guide covers the vLLM implementation of this in detail.

The data flow looks like this:

Requests from 100+ model aliases
         |
         v
   Unified Scheduler
   (token-batch dispatcher)
         |
    +----|----+
    |         |
    v         v
Prefill pool  Decode pool
(H200/B200)   (H200 SXM5)
    |         |
    +----+----+
         |
   KV Cache Manager
   (HBM pool, LRU eviction)
         |
         v
   Response to client

Architecture Blueprint

A production token-level gateway has four distinct components, each with a specific responsibility.

Unified scheduler. The token-batch dispatcher. It maintains a priority queue of pending token batches per model and schedules batches to prefill or decode slots on each forward pass cycle. It enforces per-model token quotas via token-bucket rate limiting. Critically, it does not need to know model weights - it routes to whichever node already has the model loaded, not to a fixed assignment.

Prefill pool. Two nodes with compute-dense GPUs (B200 or H200 SXM5). These run the full prompt forward pass and produce the KV cache for each new session. Once the prompt is processed, the KV cache is transferred to the decode pool via NIXL (NVIDIA Inference Xfer Library). Prefill is compute-bound, so the choice here is the GPU with the highest FLOP throughput.

Decode pool. Six nodes with memory-bandwidth-optimized GPUs (H200 SXM5 preferred at 4.8 TB/s HBM3e bandwidth). Autoregressive generation is memory-bound - each decode step reads the full KV cache for all active sequences. Higher bandwidth means higher token throughput per second. Decode nodes hold the KV cache for all in-flight sessions.

KV cache manager. Explicit HBM allocation per active session. Eviction priority is scored as: (request_age × token_value_estimate) / session_cost. Cold sessions with low value and high cost evict first. The manager signals the scheduler to pause or reject new sessions when HBM utilization crosses the admission threshold.
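The eviction score can be sketched directly from the formula. The example assumes that the lowest score evicts first, which matches "low value and high cost evict first", and that `session_cost` is measured in GB of HBM held. Note that under the formula as written, a larger `request_age` raises the score and so protects a session, so treat the age term as a tunable assumption. `Session` and `pick_evictions` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    request_age_s: float
    token_value_estimate: float
    session_cost_gb: float  # HBM held by this session; must be > 0


def eviction_score(s: Session) -> float:
    """(request_age x token_value_estimate) / session_cost, as given in the text."""
    return (s.request_age_s * s.token_value_estimate) / s.session_cost_gb


def pick_evictions(sessions: list[Session], gb_to_free: float) -> list[str]:
    """Evict lowest-scoring sessions until at least gb_to_free is reclaimed."""
    evicted, freed = [], 0.0
    for s in sorted(sessions, key=eviction_score):
        if freed >= gb_to_free:
            break
        evicted.append(s.session_id)
        freed += s.session_cost_gb
    return evicted


sessions = [
    Session("a", request_age_s=10, token_value_estimate=1.0, session_cost_gb=8),
    Session("b", request_age_s=10, token_value_estimate=5.0, session_cost_gb=2),
    Session("c", request_age_s=30, token_value_estimate=0.5, session_cost_gb=10),
]
print(pick_evictions(sessions, gb_to_free=12))
```

Session "b" survives because it is cheap to keep and valuable to serve, while "a" and "c" free the most memory for the least lost value.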

The prefill/decode split here maps directly onto what llm-d on Kubernetes implements with InferencePool CRDs. The difference is that llm-d routes at the request level, while an Aegaeon-style scheduler routes at the token-batch level - a finer granularity that allows interleaving from more models per GPU.

Open-Source Building Blocks in 2026

No single project fully replicates Aegaeon's scheduler. The ecosystem in mid-2026 gives you the pieces to assemble something close.

Tool	What it covers	Gap vs Aegaeon
vLLM multi-LoRA	Base model reuse, adapter hot-swap, LRU adapter cache	No cross-model token-batch scheduler
llm-d v0.5	Kubernetes prefill/decode disaggregation, cache-aware routing	Request-level routing, not token-level
SGLang RadixAttention	Prefix cache sharing across requests and models	No multi-model scheduler
KServe + ModelMesh	Multi-model Kubernetes routing, model versioning	No token-batch interleaving

The vLLM production guide and SGLang production guide cover the individual serving engines. The llm-d guide covers the Kubernetes disaggregation layer.

What follows is a practical architecture using these tools as building blocks.

Step-by-Step: 100-Model Gateway on Spheron H200/B200

Step 1: Provision the GPU Pool

Rent an 8x H200 SXM5 cluster from Spheron's H200 instances. Alternatively, use B200 SXM6 nodes for the compute-dense prefill tier. Bare-metal root access is required for the NIXL transport layer.

Verify NVLink topology once you're in:

bash

nvidia-smi topo -m

Expected output on an 8x H200 NVLink cluster:

        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7
GPU0     X    NV18   NV18   NV18   NV18   NV18   NV18   NV18
GPU1    NV18   X     NV18   NV18   NV18   NV18   NV18   NV18
...

All GPUs should show NV18 interconnect. If you see PHB (PCIe Host Bridge) between any pair, KV transfer between those GPUs leaves NVLink and takes a slower path, adding latency. NVLink on H200 SXM5 provides up to 900 GB/s of GPU-to-GPU bandwidth per GPU (the 4.8 TB/s figure is HBM3e memory bandwidth, not interconnect bandwidth), and that is what makes KV cache transfer fast enough to be practical.

Step 2: Stand Up Prefill and Decode Pools

Partition the cluster: 2 prefill nodes and 6 decode nodes.

Create kv-transfer-config-prefill.yaml for the prefill nodes:

yaml

kv_transfer_config:
  kv_connector: NixlConnector
  kv_role: kv_producer
  kv_rank: 0
  kv_parallel_size: 2
  kv_buffer_device: cuda
  kv_port: 14579

And kv-transfer-config-decode.yaml for the decode nodes:

yaml

kv_transfer_config:
  kv_connector: NixlConnector
  kv_role: kv_consumer
  kv_rank: 0
  kv_parallel_size: 6
  kv_buffer_device: cuda
  kv_port: 14579

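Launch vLLM 0.8+ on each prefill node. Two caveats before you run it. First, vLLM generally parses --kv-transfer-config as an inline JSON string, so if your version rejects a file path, pass the same fields as JSON. Second, Llama-3.1-70B in BF16 is roughly 140 GB of weights, so a single 141 GB H200 needs an FP8 checkpoint or --tensor-parallel-size across several GPUs to leave room for KV cache.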

bash

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config kv-transfer-config-prefill.yaml \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-loras 8 \
  --max-cpu-loras 64 \
  --port 8100

And on each decode node (same flags apart from the KV config and port):

bash

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config kv-transfer-config-decode.yaml \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-loras 8 \
  --max-cpu-loras 64 \
  --port 8200

Set --max-model-len uniformly across all nodes. Mismatched values cause routing failures when a session migrates between nodes.

Step 3: Deploy the Unified Model Router

A LiteLLM proxy works as the OpenAI-compatible router layer. Here is a litellm-config.yaml mapping three example model aliases across your backend nodes:

yaml

model_list:
  - model_name: llama-3.1-70b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-70B-Instruct
      api_base: http://decode-node-1:8200/v1
      api_key: none
  - model_name: customer-a-fine-tune
    litellm_params:
      # vLLM selects a LoRA adapter by its served name, so the adapter must be
      # registered at launch with --lora-modules customer-a-adapter=<path>
      model: openai/customer-a-adapter
      api_base: http://decode-node-1:8200/v1
      api_key: none
  - model_name: customer-b-fine-tune
    litellm_params:
      # Registered on this node with --lora-modules customer-b-adapter=<path>
      model: openai/customer-b-adapter
      api_base: http://decode-node-2:8200/v1
      api_key: none
  # ... up to 100+ aliases across decode nodes

router_settings:
  routing_strategy: least-busy
  num_retries: 2
  timeout: 30

litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]

For Kubernetes, replace LiteLLM with llm-d's Gateway API Inference Extension using InferencePool CRDs for cache-aware routing that tracks which decode node has which model adapter loaded.

Step 4: Configure the Warm Model Pool and KV Cache Eviction

Calculate warm pool capacity before launch:

warm_pool_size = floor((total_hbm_bytes × 0.90) / avg_kv_footprint_per_model)

For 6x H200 SXM5 decode nodes (141 GB HBM3e each):

Total HBM: 6 × 141 GB = 846 GB
Usable at 90%: ~761 GB
KV footprint per active session at 32k context, 70B model, BF16: ~8 GB
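Max concurrent active sessions: ~95

Treat ~95 as an upper bound. The 761 GB figure does not subtract the HBM that model weights occupy on each node, and the formula counts concurrent sessions rather than distinct models. A completely full 32k-token window on Llama 3.1 70B (80 layers, 8 KV heads, head dimension 128, BF16) is also closer to 10.7 GB than 8 GB, so ~8 GB assumes sessions that average somewhat less than the maximum context.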
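The capacity arithmetic can be reproduced in a few lines. This is a sketch: the per-token KV size uses the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128), while `weights_gb` is a parameter you would set from your own deployment; it is left at zero here only to reproduce the figures above. Both function names are hypothetical.

```python
import math


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """K and V tensors for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_sessions(total_hbm_gb: float, utilization: float,
                 weights_gb: float, kv_gb_per_session: float) -> int:
    """Sessions that fit after reserving headroom and model weights."""
    usable_gb = total_hbm_gb * utilization - weights_gb
    return max(0, math.floor(usable_gb / kv_gb_per_session))


per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
full_window_gb = per_token * 32768 / 1e9
print(f"KV per token: {per_token} bytes, full 32k window: {full_window_gb:.1f} GB")

# Reproduces the estimate in the text (weights not subtracted).
print(max_sessions(total_hbm_gb=846, utilization=0.90, weights_gb=0, kv_gb_per_session=8))
```

Passing a non-zero `weights_gb` and the full-window KV size gives a more conservative ceiling, which is the safer number to feed into admission control.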

Enable prefix caching to reuse KV cache across requests with shared prefixes:

bash

# Add to vLLM launch command
--enable-prefix-caching

SGLang's RadixAttention handles this more aggressively by building a radix tree of cached prefix chunks. If you switch to SGLang as the serving engine, the warm pool arithmetic remains the same but eviction is managed by the RadixAttention runtime rather than vLLM's block manager.

Step 5: Add Per-Model Token Quotas and Admission Control

A token-bucket rate limiter in the router layer prevents one model from monopolizing decode capacity. Here is a minimal Python implementation for a custom FastAPI router layer, if you use one in place of LiteLLM:

python

import asyncio
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, burst: int, rate: float):
        self.burst = burst
        self.rate = rate
        self.tokens = burst
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def consume(self, n: int) -> bool:
        if n > self.burst:
            # Request exceeds burst capacity and can never succeed; fail immediately
            return False
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def wait_time(self, n: int) -> float:
        """Seconds until the bucket can serve n tokens based on the current deficit."""
        self._refill()
        deficit = n - self.tokens
        return max(0.0, deficit / self.rate)

buckets: dict[str, TokenBucket] = defaultdict(
    lambda: TokenBucket(burst=32768, rate=200)
)

_MAX_WAIT = 30.0  # drop the request if it cannot be served within this window

async def route_request(model_id: str, token_count: int) -> bool:
    bucket = buckets[model_id]
    if token_count > bucket.burst:
        # Request exceeds burst capacity and can never succeed; fail immediately
        return False
    deadline = time.monotonic() + _MAX_WAIT
    while not bucket.consume(token_count):
        wait = bucket.wait_time(token_count)
        if time.monotonic() + wait > deadline:
            return False
        # Sleep for the deficit-based refill time, then retry. We loop rather
        # than doing a single retry because concurrent coroutines may consume
        # tokens while we sleep, so one wake-up is not guaranteed to succeed.
        await asyncio.sleep(wait)
    return True

Monitor per-model decode queue depth. When any model's queue consistently exceeds 10 seconds of wait time, add a decode replica for that model's base architecture.

Step 6: Validate with a Multi-Model Load Test

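Use vLLM's benchmark_serving.py to drive load through the gateway. The script targets one model alias per invocation, so cover 50-100 aliases by running it in parallel once per alias, or use locust to rotate aliases within a single test:

bash

# Install locust if you want a single test that rotates model aliases
pip install locust

# Or run vLLM's benchmark script from a checkout of the vLLM repo, once per alias.
# Newer vLLM releases expose the same benchmark as `vllm bench serve`.
# --tokenizer is needed because the gateway alias is not a Hugging Face model ID.
python benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://your-gateway:4000 \
  --model llama-3.1-70b \
  --tokenizer meta-llama/Llama-3.1-70B-Instruct \
  --dataset-name random \
  --num-prompts 1000 \
  --request-rate 10 \
  --max-concurrency 50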

Run the test for at least 30 minutes to capture cold-start events, KV eviction spikes, and noisy-neighbor behavior. Compare TTFT and decode tokens/s per model against a baseline of dedicated vLLM instances at matched total cost. A well-tuned pool should show 70-80% GPU utilization across the cluster while keeping p99 TTFT within 2x of the dedicated baseline.
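The pass criterion above can be checked mechanically once you have per-request TTFT samples from both runs. The sketch below uses a nearest-rank percentile; the function names and the sample values are illustrative, not output from a real benchmark.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0-100) of a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_ttft_budget(pooled_ms: list[float], dedicated_ms: list[float],
                       max_ratio: float = 2.0) -> bool:
    """True if pooled p99 TTFT stays within max_ratio of the dedicated baseline."""
    return percentile(pooled_ms, 99) <= max_ratio * percentile(dedicated_ms, 99)


# Illustrative samples in milliseconds.
pooled = [180.0] * 95 + [520.0] * 5
dedicated = [150.0] * 95 + [300.0] * 5
print(within_ttft_budget(pooled, dedicated))
```

Run the check per model alias rather than on the pooled aggregate, since a single starved model can hide behind a healthy cluster-wide p99.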

Goodput Math: When Pooling Beats Dedicated GPUs

Goodput is the number of useful output tokens delivered per GPU-hour paid for. A dedicated GPU serving a model that handles 0.1 requests per second has low goodput - most of the GPU-hours are spent idle between requests.

The break-even point is approximately 35% per-model GPU utilization. Below that, pooling wins. Above 70%, dedicated instances are more predictable.

Scenario	Per-model avg QPS	Dedicated cost (50 models)	Pooled cost (8x H200)	Winner
Very low (0.1 QPS/model)	0.1	50 × $1.14/hr = $57.00/hr	$36.96/hr	Pooled
Low (0.5 QPS/model)	0.5	50 × $1.14/hr = $57.00/hr	$36.96/hr	Pooled
Medium (2 QPS/model)	2.0	50 × $1.14/hr = $57.00/hr	$36.96/hr	Pooled
High (10 QPS/model)	10.0	50 × $1.14/hr = $57.00/hr	$36.96/hr	Dedicated*

*At 10 QPS per model across 50 models, the pool saturates and you need more than 8 GPUs anyway. Dedicated instances become cheaper per-model when each model can fill a GPU on its own.

Dedicated cost uses A100 80G PCIe at $1.14/hr as the comparison: the cheapest GPU that can serve a 7B model in BF16. A pool of 8x H200 SXM5 costs $36.96/hr total. At very low to medium traffic, the pool costs 35% less for the same model catalog coverage.
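The cost comparison is easy to recompute when prices change. This sketch uses the hourly rates quoted above; `compare_costs` is a hypothetical helper name.

```python
def compare_costs(dedicated_count: int, dedicated_rate: float,
                  pool_gpus: int, pool_rate: float) -> tuple[float, float, float]:
    """Return (dedicated $/hr, pooled $/hr, fractional saving from pooling)."""
    dedicated = dedicated_count * dedicated_rate
    pooled = pool_gpus * pool_rate
    return dedicated, pooled, 1.0 - pooled / dedicated


dedicated, pooled, saving = compare_costs(50, 1.14, 8, 4.62)
print(f"dedicated ${dedicated:.2f}/hr, pooled ${pooled:.2f}/hr, saving {saving:.0%}")
```

A negative saving means the dedicated fleet is cheaper, which happens once the pool has to grow to absorb high per-model traffic.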

Pricing fluctuates based on GPU availability. The prices above are based on 19 May 2026 and may have changed. Check current GPU pricing → for live rates.

Failure Modes and Mitigations

Cold-Start Latency

Loading a 7B model in BF16 from object storage takes 10-60 seconds depending on network throughput. A request arriving for a cold model will time out whenever the load takes longer than the router timeout (30 seconds in the Step 3 configuration).

Mitigations: size the warm pool to cover the top-N models by 24-hour request count (N = warm pool capacity computed in Step 4). Implement predictive pre-warming triggered by request-rate spike detection - if a model's request rate doubles in a 5-minute window, pre-load its next logical neighbor in the priority list. Set client-side retry with exponential backoff (base 2s, max 30s, 3 retries) so cold-start timeouts are transparent to end users.
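The retry policy above (base 2s, max 30s, 3 retries) can be sketched as follows. The helper names are hypothetical, and the example retries only on `TimeoutError`; a real client would map its HTTP library's timeout exception to the same path and would usually add jitter.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(base_s: float = 2.0, max_s: float = 30.0, retries: int = 3) -> list[float]:
    """Exponential delays, capped at max_s: 2, 4, 8, ... seconds."""
    return [min(max_s, base_s * 2 ** attempt) for attempt in range(retries)]


def call_with_retry(fn: Callable[[], T],
                    sleep: Callable[[float], None] = time.sleep,
                    base_s: float = 2.0, max_s: float = 30.0, retries: int = 3) -> T:
    """Call fn, retrying on TimeoutError with exponential backoff."""
    for delay in backoff_delays(base_s, max_s, retries):
        try:
            return fn()
        except TimeoutError:
            sleep(delay)
    return fn()  # final attempt; any exception propagates to the caller


print(backoff_delays())
```

The `sleep` parameter is injectable so the policy can be unit-tested without waiting, for example by passing `list.append` on a recording list.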

KV-Cache Thrash

When the total active session KV footprint exceeds the pool HBM, the manager evicts live sessions. Eviction forces recompute (the prefill node re-runs the full prompt), which spikes TTFT for the evicted session and consumes prefill capacity that could serve new requests.

Mitigations: implement admission control that rejects new sessions when HBM utilization crosses 85%. Enforce a uniform --max-model-len across all nodes so the KV footprint per session is predictable. Use HBM utilization as the primary autoscaling signal - add decode replicas before reaching 85%, not after.
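A minimal admission check might look like the sketch below. It assumes the decision is made on projected utilization, meaning current usage plus the new session's expected KV footprint, which is slightly stricter than checking current utilization alone. `admit_session` is a hypothetical name.

```python
def admit_session(hbm_used_gb: float, hbm_total_gb: float,
                  new_session_gb: float, threshold: float = 0.85) -> bool:
    """Admit only if the pool stays at or below the threshold after admission."""
    projected = (hbm_used_gb + new_session_gb) / hbm_total_gb
    return projected <= threshold


# 846 GB decode pool, ~8 GB per new session (figures from Step 4).
print(admit_session(700, 846, 8))   # projected ~83.7%
print(admit_session(715, 846, 8))   # projected ~85.5%
```

Because --max-model-len is uniform, `new_session_gb` can be a fixed worst-case value, which keeps the check cheap enough to run on every request.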

Noisy-Neighbor Starvation

One model with a burst of long-context requests fills all decode slots with its active sequences. Other models queue behind it and miss their latency SLAs.

Mitigations: the per-model token-bucket quota in the router layer is the first line of defense - bursty models get queued, not given unbounded access. Add weighted fair queuing in the scheduler (weight by SLA tier: premium models get 2x weight over standard). For extreme cases, implement preemption: the scheduler can suspend low-priority batches mid-sequence and resume them later. Preemption adds TTFT overhead for the preempted session but protects high-priority models from starvation.

Pricing: 8-GPU H200 Pool for 50 Models

Using live-fetched rates from the Spheron GPU pricing API (19 May 2026):

Allocation	GPU	Count	On-demand ($/hr/GPU)	Spot ($/hr/GPU)	Pool cost/hr (on-demand)
Prefill nodes	H200 SXM5	2	$4.62	$1.19	$9.24
Decode nodes	H200 SXM5	6	$4.62	$1.19	$27.72
Total pool	H200 SXM5	8	-	-	$36.96
Per model (50 models)	-	-	-	-	$0.74
Alt: prefill (compute-dense)	B200	2	$7.21	$3.81	$14.42

Compare against 50 dedicated A100 80G PCIe instances at $1.14/hr each: $57.00/hr for the same model catalog. The pool costs 35% less while maintaining the same catalog coverage and similar latency profiles at low to medium per-model traffic.

Decode nodes are good candidates for Spheron spot instances since preemption of a decode session is recoverable - the prefill node re-submits the KV cache on the next scheduled slot. At the listed rates, each decode node moved to spot saves $3.43/hr, so running two to four of the six decode nodes on spot cuts the pool cost by roughly 19-37%. Moving all six brings the pool to $16.38/hr (about 56% lower), depending on availability.


Spheron's H200 and B200 bare-metal clusters include root access, NVLink topology, hourly billing, and spot pricing - the right substrate for a token-level multi-model gateway. Rent individual nodes to prototype the architecture, or contact us for a dedicated 8-GPU cluster.


Quick Setup Guide

Step 1: Provision the GPU pool on Spheron
Rent an 8x H200 SXM5 or 8x B200 bare-metal cluster on Spheron. Verify NVLink topology with nvidia-smi topo -m. Bare-metal root access is required for the NIXL transport layer used for KV cache transfer between prefill and decode nodes.

Step 2: Stand up prefill and decode node pools
Partition the cluster: 2 prefill nodes (compute-dense, H200 or B200) and 6 decode nodes (memory-bandwidth-optimized, H200 SXM5 preferred). Launch vLLM 0.8+ on each with --kv-transfer-config specifying role kv_producer on prefill nodes and kv_consumer on decode nodes. Set a uniform --max-model-len (e.g., 32768) across all nodes.

Step 3: Deploy the unified model router
Start an OpenAI-compatible router - LiteLLM proxy or a FastAPI service - that maps incoming model IDs to vLLM backend instances. Maintain a registry of loaded models per node. For Kubernetes, use llm-d's Gateway API Inference Extension with InferencePool CRDs for cache-aware routing.

Step 4: Configure the warm model pool and KV cache eviction
Calculate warm pool capacity: (total_hbm × 0.90) / avg_kv_footprint_per_model. Set --gpu-memory-utilization to 0.90. Enable prefix caching with SGLang RadixAttention or vLLM's built-in prefix cache. Use LRU eviction for models beyond the warm pool, tracking request frequency per model over a 24-hour rolling window.

Step 5: Add per-model token quotas and admission control
Implement a token-bucket rate limiter per model in the router layer (burst capacity: 32,768 tokens, sustained refill: 200 tokens/s as starting values). Requests that exceed the currently available tokens wait for a refill for up to 30 seconds; requests larger than the burst capacity are rejected immediately. Monitor per-model decode queue depth; scale decode replicas horizontally when any model's queue consistently exceeds 10 seconds.

Step 6: Validate with a multi-model load test
Use vLLM's benchmark_serving.py or locust to drive 50-100 concurrent model aliases through the gateway simultaneously. Measure TTFT and decode tokens/s per model, total GPU utilization across the pool, and cold-start frequency over 30 minutes. Compare against a baseline of dedicated vLLM instances at matched total cost.


Frequently Asked Questions

What is token-level GPU pooling for multi-LLM serving?

Token-level GPU pooling virtualizes GPU access at the level of individual token batches rather than assigning a full GPU (or GPU slice) to each model. A scheduler interleaves token batches from dozens of active LLMs within the same GPU forward pass. The Aegaeon system (SOSP '25) used this approach to reduce Alibaba's model marketplace from 1,192 GPUs to 213 - an 82% reduction - without changing latency SLAs.

How does Aegaeon-style scheduling differ from MIG and time-slicing?

MIG and time-slicing use static partitioning: each GPU slice hosts exactly one model, so utilization is bounded by that model's per-slice traffic. Token-level scheduling is dynamic: the scheduler packs token batches from whichever models have active requests into the same kernel launch. Fifty models sharing an 8-GPU cluster can keep GPUs nearly fully occupied even when each model averages only a few requests per second.

What open-source tools implement token-level multi-model scheduling in 2026?

No single project fully replicates Aegaeon as of mid-2026. The building blocks are: vLLM multi-LoRA for base-model reuse and adapter hot-swap, llm-d for Kubernetes-native prefill/decode disaggregation, SGLang RadixAttention for prefix cache sharing across requests, and KServe + ModelMesh for multi-model routing on Kubernetes. A production gateway combines these with a router layer and explicit KV cache management.

When does token-level pooling beat per-tenant dedicated GPUs?

Pooling wins when per-model average utilization on a dedicated GPU falls below roughly 30-40%. This is the standard pattern in model marketplaces where dozens of models each handle sporadic traffic. Above 70% steady-state utilization per model, dedicated instances are more predictable and easier to scale independently.

What are the main failure modes in a shared multi-LLM gateway?

Three dominate: (1) Cold-start latency - a model outside the warm pool takes 10-60 seconds to load, causing request timeouts. (2) KV-cache thrash - concurrent active sessions exhaust total HBM across the pool, forcing evictions that spike TTFT. (3) Noisy-neighbor - one bursty model monopolizes decode capacity and starves others. Mitigations are warm pool sizing, admission control, and per-model token-bucket quotas.