Teams running document summarization or classification pipelines at scale are consistently over-spending on GPU compute. When you have 10 million contracts to summarize, a legal document archive to classify, or a product catalog to embed, you don't need a low-latency serving endpoint. You need a batch pipeline that processes everything overnight and shuts down when done. The cost difference is not marginal. Online serving endpoints running at 20-30% average utilization cost 5-10x more per processed token than a well-structured offline batch job. See the GPU cost optimization playbook for the full picture of where GPU budgets go, and serverless vs on-demand vs reserved billing explained for the billing model comparison.
Online serving vs offline batch: why the economics differ
An always-on serving endpoint bills you for every hour the GPU is allocated, regardless of whether requests are coming in. At 25% average utilization, you're paying for four GPUs to do the work of one. Add streaming overhead, connection management, and the memory footprint of keeping KV caches warm for concurrent connections, and you're burning GPU budget on infrastructure glue rather than actual inference.
A batch job is fundamentally different. The GPU is only provisioned during active compute. There's no idle billing between inference runs. You can pack requests into much larger effective batch sizes because you're not constrained by per-request latency. And since the job is asynchronous and interruptible, it's eligible for spot pricing.
| Dimension | Online Serving | Offline Batch |
|---|---|---|
| GPU utilization | 20-40% average | 70-90% sustained |
| Spot eligible | No (latency SLA) | Yes |
| Latency requirement | <500ms TTFT | None |
| Throughput optimization | Per-request | Per-shard |
| Typical cost per 1M tokens | $0.40-1.20 | $0.04-0.15 |
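The utilization row drives most of the cost row. To see why, a quick back-of-the-envelope calculation helps. The sketch below uses illustrative numbers only (the hourly rate and peak throughput are hypothetical placeholders, not Spheron prices): it divides a GPU's hourly rate by the tokens it actually processes at a given utilization level.

```python
# Illustrative only: effective cost per 1M tokens as a function of utilization.
# hourly_rate and peak_tps are hypothetical placeholders, not quoted prices.
def cost_per_million_tokens(hourly_rate: float, peak_tps: float, utilization: float) -> float:
    tokens_per_hour = peak_tps * utilization * 3600
    return hourly_rate / tokens_per_hour * 1_000_000

# Same GPU, same hourly rate -- only utilization differs.
online = cost_per_million_tokens(hourly_rate=2.00, peak_tps=2500, utilization=0.25)
batch = cost_per_million_tokens(hourly_rate=2.00, peak_tps=2500, utilization=0.85)
print(f"online: ${online:.2f}/1M tok, batch: ${batch:.2f}/1M tok")
# ~3.4x gap from utilization alone; spot pricing widens it further
```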
The cost gap widens further when you factor in spot pricing. Spot-eligible GPU models on Spheron run at 50-70% below on-demand pricing. For a batch job running 150 hours on 8 GPUs, that discount cuts the compute bill by well over half. The worked example below breaks down the exact numbers.
Architectural patterns for reliable batch pipelines
Three patterns cover most offline batch use cases. Which one you need depends on scale and fault-tolerance requirements.
Pattern 1: Simple sequential. One process reads prompts from a file, calls the model, writes outputs. Fine for jobs under 100K documents that you can monitor manually. No fault tolerance.
Pattern 2: Sharded parallel. Split the corpus into N fixed-size shards (10,000-50,000 documents each). Each shard is an independent job unit. Workers pick up shards from a queue or a file list. A completion checkpoint is written after each shard finishes. On a failure or preemption, only the in-progress shard needs to be re-run. This is the right pattern for most production batch pipelines.
Pattern 3: Distributed cluster. Ray Data or similar distributes shards across multiple nodes. Handles petabyte-scale corpora and heterogeneous GPU pools. Adds operational complexity. Most teams don't need this until they're above 500M documents/run.
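If you do reach Pattern 3 scale, a Ray Data pipeline has roughly the shape sketched below. This is a minimal illustration, not a tuned deployment: the S3 paths and the "prompt" column name are hypothetical, the actor pool size is an assumption, and the concurrency keyword assumes a recent Ray release.

```python
import numpy as np
import ray
from vllm import LLM, SamplingParams

class Summarizer:
    def __init__(self):
        # One vLLM engine per actor; each actor is pinned to one GPU below.
        self.llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
        self.params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch: dict) -> dict:
        outputs = self.llm.generate([str(p) for p in batch["prompt"]], self.params)
        batch["summary"] = np.array([o.outputs[0].text for o in outputs])
        return batch

ds = ray.data.read_json("s3://corpus/shards/")  # hypothetical input location
ds = ds.map_batches(
    Summarizer,
    batch_size=256,   # documents per actor call
    num_gpus=1,       # one GPU per actor
    concurrency=8,    # actor pool size; requires a recent Ray release
)
ds.write_json("s3://corpus/outputs/")           # hypothetical output location
```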
The core infrastructure for Pattern 2 is simpler than it looks:
```
Input corpus (10M docs)
        |
Shard 0..N (10K docs/shard)
        |
Worker pool (N workers, 1 per GPU or per node)
        |
Checkpoint store (completed shard IDs)
        |
Output store (shard_000.jsonl, shard_001.jsonl, ...)
```

Workers check the checkpoint store on startup and skip already-completed shards. A shard is only marked complete after its output file is fully written and synced. On spot preemption, the job restarts and resumes from the last checkpoint with no re-work beyond the current shard. See the spot GPU training case study for a detailed checkpoint implementation that adapts directly to batch inference pipelines.
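A minimal worker loop for this pattern fits in a few dozen lines. The sketch below is illustrative rather than production-ready: process_shard and the checkpoint filename are assumptions, and a real pipeline would add locking or a work queue so two workers never claim the same shard.

```python
import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical; use durable storage in production

def load_completed() -> set:
    # Completed shard IDs survive restarts and preemptions.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return set(json.load(f)["completed_shards"])
    return set()

def mark_complete(completed: set, shard_id: int) -> None:
    completed.add(shard_id)
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"completed_shards": sorted(completed)}, f)
    os.replace(tmp, CHECKPOINT_PATH)  # atomic: never leaves a half-written checkpoint

def run(shard_ids, process_shard):
    completed = load_completed()
    for shard_id in shard_ids:
        if shard_id in completed:
            continue  # finished in a previous run; skip
        process_shard(shard_id)  # inference + write shard_{id}.jsonl, then fsync
        mark_complete(completed, shard_id)
```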
Framework comparison for batch LLM inference
| Framework | Batch mode | Cluster support | Fault tolerance | Best for |
|---|---|---|---|---|
| vLLM offline (vllm.LLM) | Single-node | None built-in | Manual checkpointing | Single-GPU or tensor-parallel batch on one node |
| Ray Data LLM | Distributed | Native (Ray cluster) | Ray lineage + retry | Multi-node, heterogeneous GPU pools |
| SkyPilot | Job-level scheduling | Cloud-agnostic | Spot failover + retry | Cloud-portable batch with automatic preemption retry |
| BentoML batch | Endpoint-based | Limited | None built-in | Hybrid online+batch with a single deployment |
vLLM's offline mode is the right starting point for most teams. The API is minimal: instantiate vllm.LLM, call .generate() with a list of prompts, get a list of completions back. vLLM handles KV cache allocation, continuous batching, and tensor parallelism internally.
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=8,   # number of GPUs
    max_num_seqs=256,         # max batch size in flight
    enable_prefix_caching=True,
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Process one shard (e.g., 10K documents);
# `prompts` is the shard's list of prompt strings, loaded by your pipeline
outputs = llm.generate(prompts, sampling_params)

# Write outputs and checkpoint
for output in outputs:
    result = output.outputs[0].text
    # write to output file...
```

The key parameters for throughput tuning are tensor_parallel_size (GPUs per node), max_num_seqs (controls KV cache allocation and throughput), and enable_prefix_caching (covered below).
Spot and preemptible GPU strategies on Spheron
Batch jobs are the ideal use case for spot instances. They're asynchronous, not user-facing, and naturally interruptible. Spot instances on Spheron run at 50-70% below on-demand pricing for eligible GPU models. For current spot pricing, see GPU pricing on Spheron.
The fault-tolerant design for spot batch jobs has three layers:
Checkpoint at shard granularity. Write a JSON state file after each shard completes. On restart, load the state file and skip completed shards. This keeps re-work to at most one shard (10K-50K documents), even if the instance is preempted mid-job.
Detect SIGTERM before preemption. Cloud providers typically send SIGTERM 30-90 seconds before reclaiming a spot instance. Register a SIGTERM handler that flushes the current shard's output buffer and writes a partial checkpoint, then exits cleanly.
Restart automatically. A simple launch script requests a new spot instance on failure and re-runs the batch script from the checkpoint. With a properly sharded job, a preemption event costs 2-5 minutes of re-provisioning plus the time to reprocess the current shard.
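Here is a sketch of the SIGTERM layer, assuming the shard-loop structure above. The two hooks are hypothetical stand-ins for your pipeline's own flush and checkpoint logic:

```python
import signal
import sys

def flush_current_shard():
    # Hypothetical hook: fsync the in-progress shard's output file here.
    pass

def write_partial_checkpoint():
    # Hypothetical hook: record how far into the current shard we got.
    pass

def handle_sigterm(signum, frame):
    # Providers typically send SIGTERM 30-90s before reclaiming a spot instance.
    flush_current_shard()
    write_partial_checkpoint()
    sys.exit(0)  # exit cleanly so the relaunch script can resume from the checkpoint

signal.signal(signal.SIGTERM, handle_sigterm)
```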
The same checkpoint-and-resume pattern from the spot GPU training case study transfers directly to batch inference. The difference is that shards are the unit of fault recovery instead of training steps.
Teams running offline document processing pipelines combine spot instances with the sharded pattern described above for the lowest possible cost-per-token, as covered in the billing model comparison.
Throughput tuning for offline batch jobs
--max-num-seqs and --max-model-len: These two parameters control how much of the GPU's KV cache is allocated. A larger max_num_seqs increases throughput by processing more requests in parallel but consumes more VRAM. For a 13B model on an A100 80GB, max_num_seqs=256 is a reasonable starting point. If OOM errors appear, drop to 128. If GPU utilization stays below 70%, increase it.
Prefix caching (--enable-prefix-caching): When every prompt in your batch shares a system prompt or fixed preamble, vLLM can compute the KV cache for that shared prefix once and reuse it across all requests. For a summarization pipeline with a fixed 200-token system prompt and 500-token documents, the shared prefix is roughly 200 of every 700 input tokens, so prefix caching reduces compute by 25-35% and shortens total job time proportionally.
Speculative decoding: For models where most output tokens are predictable, a smaller draft model generates candidates and the main model verifies them in parallel. This increases throughput 1.5-2.5x at the cost of a second model in memory. See the continuous batching and paged attention guide for setup details.
FP8 quantization (--quantization fp8): Quantizing weights to FP8 shrinks the model's memory footprint, and pairing it with an FP8 KV cache (--kv-cache-dtype fp8) halves the KV cache footprint, effectively doubling the batch size you can run within the same VRAM budget. For batch jobs where accuracy requirements allow it (most classification and summarization tasks), FP8 is a free throughput multiplier.
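In the offline API, both settings are constructor arguments. A minimal sketch, assuming a GPU with hardware FP8 support and a model compatible with on-the-fly FP8 quantization in your vLLM version:

```python
from vllm import LLM

# Sketch: FP8 weights plus an FP8 KV cache. The half-size KV cache is what
# lets roughly twice as many sequences stay in flight within the same VRAM.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",
    max_num_seqs=512,  # illustrative: raised from 256 once the KV cache shrinks
)
```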
Cost worked example: 10M document summarization
Setup: 10M documents, average 500 input tokens + 150 output tokens = 650 tokens per document, 6.5 billion total tokens.
Throughput assumption: 12,000 tokens/sec for 8 GPUs at 70% utilization in batch mode with a 13B model. Actual throughput varies with model size, sequence length, and quantization settings.
Estimated job time: 6,500,000,000 / 12,000 = 541,667 seconds = ~150.5 hours.
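The same arithmetic as a reusable sketch, so you can plug in your own corpus size, throughput, and rates. The calls below mirror the assumptions above and the RTX PRO 6000 PCIe rates from the table further down:

```python
def batch_job_cost(docs, tokens_per_doc, tokens_per_sec, gpus, rate_per_gpu_hr, overhead=0.0):
    total_tokens = docs * tokens_per_doc
    hours = total_tokens / tokens_per_sec / 3600 * (1 + overhead)
    return hours, hours * gpus * rate_per_gpu_hr

# On-demand: 10M docs x 650 tokens at 12,000 tok/s on 8 GPUs
hours, cost = batch_job_cost(10_000_000, 650, 12_000, 8, 1.70)
print(f"{hours:.1f} hr, ${cost:,.0f}")  # 150.5 hr, $2,046 (the table rounds to ~$2,047)

# Spot rate with 10% re-run overhead for preemptions
hours, cost = batch_job_cost(10_000_000, 650, 12_000, 8, 0.59, overhead=0.10)
print(f"{hours:.1f} hr, ${cost:,.0f}")  # 165.5 hr, $781 (table: ~165.6 hr, ~$782)
```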
H100 SXM5: on-demand
For workloads that need Hopper-class memory bandwidth, see on-demand H100 SXM5 availability on Spheron. H100 SXM5 spot is not currently available on Spheron.
| GPU | Configuration | Hourly rate | Est. hours | Total cost |
|---|---|---|---|---|
| H100 SXM5 | 8x @ $4.34/hr each | $34.72/hr | ~150.5 hr | ~$5,225 |
H100 SXM5 achieves higher throughput in practice (20,000+ tokens/sec for a 13B model across 8 GPUs), so actual job time and total cost would be lower. The table above uses the same 12,000 tokens/sec baseline to keep the comparison consistent.
RTX PRO 6000 PCIe: on-demand vs spot
RTX PRO 6000 PCIe has 96GB VRAM, handles 13B models without quantization, and is available at spot pricing on Spheron at 65% below on-demand.
| Approach | GPU config | Hourly rate | Est. hours | Total cost |
|---|---|---|---|---|
| On-demand | 8x RTX PRO 6000 PCIe @ $1.70/hr each | $13.60/hr | 150.5 hr | ~$2,047 |
| Spot | 8x RTX PRO 6000 PCIe @ $0.59/hr each | $4.72/hr | 165.6 hr (+10% re-run overhead) | ~$782 |
| Spot savings | | | | ~$1,265 (62% reduction) |
The 10% overhead on spot accounts for occasional preemptions that require partial shard re-runs. With a properly sharded job (10K docs/shard), each preemption costs 15-30 minutes of re-work at most.
Pricing fluctuates based on GPU availability. The prices above were captured on 24 Apr 2026 and may have changed since. Check current GPU pricing → for live rates.
Monitoring and progress tracking for long-running batch runs
Long batch jobs need lightweight observability. Three metrics matter:
Tokens per second (rolling average): Log this every 60 seconds. A sudden drop indicates an I/O bottleneck (slow storage reads), KV cache pressure (OOM approaching), or a hung worker. Baseline first and alert on >20% degradation.
GPU utilization: Use nvidia-smi dmon -s u for per-GPU utilization polling, or DCGM exporter with Prometheus for multi-node setups. If utilization stays below 70% throughout the job, increase max_num_seqs. If it spikes to 100% and stays there with OOM warnings, reduce it or add FP8 quantization.
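For an in-process alternative to shelling out to nvidia-smi, NVIDIA's NVML bindings work too. A minimal sketch using the pynvml module (an extra dependency, installed separately from the pipeline above):

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

while True:
    # One utilization sample per GPU; log it or push to your metrics store.
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print(f"gpu_util={utils}")
    time.sleep(60)
```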
Shard completion rate and ETA: Track this with a simple JSON progress file:
```python
import json, time

def write_progress(completed, total, start_time):
    elapsed = time.time() - start_time
    if elapsed <= 0:
        return
    rate = completed / elapsed  # shards per second
    eta_seconds = (total - completed) / rate if rate > 0 else None
    with open("progress.json", "w") as f:
        json.dump({
            "completed": completed,
            "total": total,
            "elapsed_hours": elapsed / 3600,
            "eta_hours": eta_seconds / 3600 if eta_seconds is not None else None,
        }, f)
```

Read progress.json at any point to get the current ETA without interrupting the job.
When batch inference is the wrong choice
Batch inference is appropriate only when latency doesn't matter. Do not use it for:
Chatbots and interactive assistants. Users are waiting. Queuing their message for batch processing is not an option.
Real-time API products with SLAs. If your API contract promises 95th-percentile latency under 2 seconds, batch mode cannot deliver that.
Streaming outputs. vLLM offline mode returns full completions only. If you need token-by-token streaming for progressive rendering, you need an online serving endpoint.
Short-burst workloads under 100K documents. The overhead of setting up the batch infrastructure, waiting for GPU provisioning, and processing shards dominates the total time. For small jobs, just call an API.
Any user-facing feature. If a human is waiting for the result, it's an online workload.
Batch inference on spot GPU instances cuts compute costs by 60-70% for fault-tolerant offline workloads. Spheron has on-demand H100 SXM5 for maximum throughput and RTX PRO 6000 PCIe spot for the lowest cost-per-token on document processing pipelines.
Rent H100 on Spheron → | View spot pricing → | Get started →
