Teams running document summarization or classification pipelines at scale are consistently over-spending on GPU compute. When you have 10 million contracts to summarize, a legal document archive to classify, or a product catalog to embed, you don't need a low-latency serving endpoint. You need a batch pipeline that processes everything overnight and shuts down when done. The cost difference is not marginal. Online serving endpoints running at 20-30% average utilization cost 5-10x more per processed token than a well-structured offline batch job. See the GPU cost optimization playbook for the full picture of where GPU budgets go, and serverless vs on-demand vs reserved billing explained for the billing model comparison.
Online serving vs offline batch: why the economics differ
An always-on serving endpoint bills you for every hour the GPU is allocated, regardless of whether requests are coming in. At 25% average utilization, you're paying for four GPUs to do the work of one. Add streaming overhead, connection management, and the memory footprint of keeping KV caches warm for concurrent connections, and you're burning GPU budget on infrastructure glue rather than actual inference.
A batch job is fundamentally different. The GPU is only provisioned during active compute. There's no idle billing between inference runs. You can pack requests into much larger effective batch sizes because you're not constrained by per-request latency. And since the job is asynchronous and interruptible, it's eligible for spot pricing.
| Dimension | Online Serving | Offline Batch |
|---|---|---|
| GPU utilization | 20-40% average | 70-90% sustained |
| Spot eligible | No (latency SLA) | Yes |
| Latency requirement | <500ms TTFT | None |
| Throughput optimization | Per-request | Per-shard |
| Typical cost per 1M tokens | $0.40-1.20 | $0.04-0.15 |
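The utilization row drives most of the cost row. To see why, a quick back-of-the-envelope calculation helps. The sketch below uses illustrative numbers only (the hourly rate and peak throughput are hypothetical placeholders, not Spheron prices): it divides a GPU's hourly rate by the tokens it actually processes at a given utilization level.

```python
# Illustrative only: effective cost per 1M tokens as a function of utilization.
# hourly_rate and peak_tps are hypothetical placeholders, not quoted prices.
def cost_per_million_tokens(hourly_rate: float, peak_tps: float, utilization: float) -> float:
    tokens_per_hour = peak_tps * utilization * 3600
    return hourly_rate / tokens_per_hour * 1_000_000

# Same GPU, same hourly rate -- only utilization differs.
online = cost_per_million_tokens(hourly_rate=2.00, peak_tps=2500, utilization=0.25)
batch = cost_per_million_tokens(hourly_rate=2.00, peak_tps=2500, utilization=0.85)
print(f"online: ${online:.2f}/1M tok, batch: ${batch:.2f}/1M tok")
# ~3.4x gap from utilization alone; spot pricing widens it further
```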
The cost gap widens further when you factor in spot pricing. Spot-eligible GPU models on Spheron run at 50-70% below on-demand pricing. For a batch job running 150 hours on 8 GPUs, that discount cuts the compute bill by well over half. The worked example below breaks down the exact numbers.
Architectural patterns for reliable batch pipelines
Three patterns cover most offline batch use cases. Which one you need depends on scale and fault-tolerance requirements.
Pattern 1: Simple sequential. One process reads prompts from a file, calls the model, writes outputs. Fine for jobs under 100K documents that you can monitor manually. No fault tolerance.
Pattern 2: Sharded parallel. Split the corpus into N fixed-size shards (10,000-50,000 documents each). Each shard is an independent job unit. Workers pick up shards from a queue or a file list. A completion checkpoint is written after each shard finishes. On a failure or preemption, only the in-progress shard needs to be re-run. This is the right pattern for most production batch pipelines.
Pattern 3: Distributed cluster. Ray Data or similar distributes shards across multiple nodes. Handles petabyte-scale corpora and heterogeneous GPU pools. Adds operational complexity. Most teams don't need this until they're above 500M documents/run.
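If you do reach Pattern 3 scale, a Ray Data pipeline has roughly the shape sketched below. This is a minimal illustration, not a tuned deployment: the S3 paths and the "prompt" column name are hypothetical, the actor pool size is an assumption, and the concurrency keyword assumes a recent Ray release.

```python
import numpy as np
import ray
from vllm import LLM, SamplingParams

class Summarizer:
    def __init__(self):
        # One vLLM engine per actor; each actor is pinned to one GPU below.
        self.llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
        self.params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch: dict) -> dict:
        outputs = self.llm.generate([str(p) for p in batch["prompt"]], self.params)
        batch["summary"] = np.array([o.outputs[0].text for o in outputs])
        return batch

ds = ray.data.read_json("s3://corpus/shards/")  # hypothetical input location
ds = ds.map_batches(
    Summarizer,
    batch_size=256,   # documents per actor call
    num_gpus=1,       # one GPU per actor
    concurrency=8,    # actor pool size; requires a recent Ray release
)
ds.write_json("s3://corpus/outputs/")           # hypothetical output location
```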
The core infrastructure for Pattern 2 is simpler than it looks:
```
Input corpus (10M docs)
        |
Shard 0..N (10K docs/shard)
        |
Worker pool (N workers, 1 per GPU or per node)
        |
Checkpoint store (completed shard IDs)
        |
Output store (shard_000.jsonl, shard_001.jsonl, ...)
```

Workers check the checkpoint store on startup and skip already-completed shards. A shard is only marked complete after its output file is fully written and synced. On spot preemption, the job restarts and resumes from the last checkpoint with no re-work beyond the current shard. See the spot GPU training case study for a detailed checkpoint implementation that adapts directly to batch inference pipelines.
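A minimal worker loop for this pattern fits in a few dozen lines. The sketch below is illustrative rather than production-ready: process_shard and the checkpoint filename are assumptions, and a real pipeline would add locking or a work queue so two workers never claim the same shard.

```python
import json
import os

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical; use durable storage in production

def load_completed() -> set:
    # Completed shard IDs survive restarts and preemptions.
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return set(json.load(f)["completed_shards"])
    return set()

def mark_complete(completed: set, shard_id: int) -> None:
    completed.add(shard_id)
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"completed_shards": sorted(completed)}, f)
    os.replace(tmp, CHECKPOINT_PATH)  # atomic: never leaves a half-written checkpoint

def run(shard_ids, process_shard):
    completed = load_completed()
    for shard_id in shard_ids:
        if shard_id in completed:
            continue  # finished in a previous run; skip
        process_shard(shard_id)  # inference + write shard_{id}.jsonl, then fsync
        mark_complete(completed, shard_id)
```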
Framework comparison for batch LLM inference
| Framework | Batch mode | Cluster support | Fault tolerance | Best for |
|---|---|---|---|---|
| vLLM offline (vllm.LLM) | Single-node | None built-in | Manual checkpointing | Single-GPU or tensor-parallel batch on one node |
| Ray Data LLM | Distributed | Native (Ray cluster) | Ray lineage + retry | Multi-node, heterogeneous GPU pools |
| SkyPilot | Job-level scheduling | Cloud-agnostic | Spot failover + retry | Cloud-portable batch with automatic preemption retry |
| BentoML batch | Endpoint-based | Limited | None built-in | Hybrid online+batch with a single deployment |
vLLM's offline mode is the right starting point for most teams. The API is minimal: instantiate vllm.LLM, call .generate() with a list of prompts, get a list of completions back. vLLM handles KV cache allocation, continuous batching, and tensor parallelism internally.
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=8,   # number of GPUs
    max_num_seqs=256,         # max batch size in flight
    enable_prefix_caching=True,
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# Process one shard (e.g., 10K documents);
# `prompts` is the shard's list of prompt strings, loaded by your pipeline
outputs = llm.generate(prompts, sampling_params)

# Write outputs and checkpoint
for output in outputs:
    result = output.outputs[0].text
    # write to output file...
```

The key parameters for throughput tuning are tensor_parallel_size (GPUs per node), max_num_seqs (controls KV cache allocation and throughput), and enable_prefix_caching (covered below).
Spot and preemptible GPU strategies on Spheron
Batch jobs are the ideal use case for spot instances. They're asynchronous, not user-facing, and naturally interruptible. Spot instances on Spheron run at 50-70% below on-demand pricing for eligible GPU models. For current spot pricing, see GPU pricing on Spheron.
The fault-tolerant design for spot batch jobs has three layers:
Checkpoint at shard granularity. Write a JSON state file after each shard completes. On restart, load the state file and skip completed shards. This keeps re-work to at most one shard (10K-50K documents), even if the instance is preempted mid-job.
Detect SIGTERM before preemption. Cloud providers typically send SIGTERM 30-90 seconds before reclaiming a spot instance. Register a SIGTERM handler that flushes the current shard's output buffer and writes a partial checkpoint, then exits cleanly.
Restart automatically. A simple launch script requests a new spot instance on failure and re-runs the batch script from the checkpoint. With a properly sharded job, a preemption event costs 2-5 minutes of re-provisioning plus the time to reprocess the current shard.
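Here is a sketch of the SIGTERM layer, assuming the shard-loop structure above. The two hooks are hypothetical stand-ins for your pipeline's own flush and checkpoint logic:

```python
import signal
import sys

def flush_current_shard():
    # Hypothetical hook: fsync the in-progress shard's output file here.
    pass

def write_partial_checkpoint():
    # Hypothetical hook: record how far into the current shard we got.
    pass

def handle_sigterm(signum, frame):
    # Providers typically send SIGTERM 30-90s before reclaiming a spot instance.
    flush_current_shard()
    write_partial_checkpoint()
    sys.exit(0)  # exit cleanly so the relaunch script can resume from the checkpoint

signal.signal(signal.SIGTERM, handle_sigterm)
```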
The same checkpoint-and-resume pattern from the spot GPU training case study transfers directly to batch inference. The difference is that shards are the unit of fault recovery instead of training steps.
Teams running offline document processing pipelines combine spot instances with the sharded pattern described above for the lowest possible cost-per-token, as covered in the billing model comparison.
Throughput tuning for offline batch jobs
--max-num-seqs and --max-model-len: These two parameters control how much of the GPU's KV cache is allocated. A larger max_num_seqs increases throughput by processing more requests in parallel but consumes more VRAM. For a 13B model on an A100 80GB, max_num_seqs=256 is a reasonable starting point. If OOM errors appear, drop to 128. If GPU utilization stays below 70%, increase it.
Prefix caching (--enable-prefix-caching): When every prompt in your batch shares a system prompt or fixed preamble, vLLM can compute the KV cache for that shared prefix once and reuse it across all requests. For a summarization pipeline with a fixed 200-token system prompt and 500-token documents, the shared prefix is roughly 200 of every 700 input tokens, so prefix caching reduces compute by 25-35% and shortens total job time proportionally.
Speculative decoding: For models where most output tokens are predictable, a smaller draft model generates candidates and the main model verifies them in parallel. This increases throughput 1.5-2.5x at the cost of a second model in memory. See the continuous batching and paged attention guide for setup details.
FP8 quantization (--quantization fp8): Quantizing weights to FP8 shrinks the model's memory footprint, and pairing it with an FP8 KV cache (--kv-cache-dtype fp8) halves the KV cache footprint, effectively doubling the batch size you can run within the same VRAM budget. For batch jobs where accuracy requirements allow it (most classification and summarization tasks), FP8 is a free throughput multiplier.
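In the offline API, both settings are constructor arguments. A minimal sketch, assuming a GPU with hardware FP8 support and a model compatible with on-the-fly FP8 quantization in your vLLM version:

```python
from vllm import LLM

# Sketch: FP8 weights plus an FP8 KV cache. The half-size KV cache is what
# lets roughly twice as many sequences stay in flight within the same VRAM.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",
    max_num_seqs=512,  # illustrative: raised from 256 once the KV cache shrinks
)
```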
Cost worked example: 10M document summarization
Setup: 10M documents, average 500 input tokens + 150 output tokens = 650 tokens per document, 6.5 billion total tokens.
Throughput assumption: 12,000 tokens/sec for 8 GPUs at 70% utilization in batch mode with a 13B model. Actual throughput varies with model size, sequence length, and quantization settings.
Estimated job time: 6,500,000,000 / 12,000 = 541,667 seconds = ~150.5 hours.
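The same arithmetic as a reusable sketch, so you can plug in your own corpus size, throughput, and rates. The calls below mirror the assumptions above and the RTX PRO 6000 PCIe rates from the table further down:

```python
def batch_job_cost(docs, tokens_per_doc, tokens_per_sec, gpus, rate_per_gpu_hr, overhead=0.0):
    total_tokens = docs * tokens_per_doc
    hours = total_tokens / tokens_per_sec / 3600 * (1 + overhead)
    return hours, hours * gpus * rate_per_gpu_hr

# On-demand: 10M docs x 650 tokens at 12,000 tok/s on 8 GPUs
hours, cost = batch_job_cost(10_000_000, 650, 12_000, 8, 1.70)
print(f"{hours:.1f} hr, ${cost:,.0f}")  # 150.5 hr, $2,046 (the table rounds to ~$2,047)

# Spot rate with 10% re-run overhead for preemptions
hours, cost = batch_job_cost(10_000_000, 650, 12_000, 8, 0.59, overhead=0.10)
print(f"{hours:.1f} hr, ${cost:,.0f}")  # 165.5 hr, $781 (table: ~165.6 hr, ~$782)
```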
H100 SXM5: on-demand
For workloads that need Hopper-class memory bandwidth, see on-demand H100 SXM5 availability on Spheron. H100 SXM5 spot is not currently available on Spheron.
| GPU | Configuration | Hourly rate | Est. hours | Total cost |
|---|---|---|---|---|
| H100 SXM5 | 8x @ $4.34/hr each | $34.72/hr | ~150.5 hr | ~$5,225 |
H100 SXM5 achieves higher throughput in practice (20,000+ tokens/sec for a 13B model across 8 GPUs), so actual job time and total cost would be lower. The table above uses the same 12,000 tokens/sec baseline to keep the comparison consistent.
RTX PRO 6000 PCIe: on-demand vs spot
RTX PRO 6000 PCIe has 96GB VRAM, handles 13B models without quantization, and is available at spot pricing on Spheron at 65% below on-demand.
| Approach | GPU config | Hourly rate | Est. hours | Total cost |
|---|---|---|---|---|
| On-demand | 8x RTX PRO 6000 PCIe @ $1.70/hr each | $13.60/hr | 150.5 hr | ~$2,047 |
| Spot | 8x RTX PRO 6000 PCIe @ $0.59/hr each | $4.72/hr | 165.6 hr (+10% re-run overhead) | ~$782 |
| Spot savings | | | | ~$1,265 (62% reduction) |
The 10% overhead on spot accounts for occasional preemptions that require partial shard re-runs. With a properly sharded job (10K docs/shard), each preemption costs 15-30 minutes of re-work at most.
Pricing fluctuates based on GPU availability. The prices above were captured on 24 Apr 2026 and may have changed since. Check current GPU pricing → for live rates.
Monitoring and progress tracking for long-running batch runs
Long batch jobs need lightweight observability. Three metrics matter:
Tokens per second (rolling average): Log this every 60 seconds. A sudden drop indicates an I/O bottleneck (slow storage reads), KV cache pressure (OOM approaching), or a hung worker. Baseline first and alert on >20% degradation.
GPU utilization: Use nvidia-smi dmon -s u for per-GPU utilization polling, or DCGM exporter with Prometheus for multi-node setups. If utilization stays below 70% throughout the job, increase max_num_seqs. If it spikes to 100% and stays there with OOM warnings, reduce it or add FP8 quantization.
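For an in-process alternative to shelling out to nvidia-smi, NVIDIA's NVML bindings work too. A minimal sketch using the pynvml module (an extra dependency, installed separately from the pipeline above):

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

while True:
    # One utilization sample per GPU; log it or push to your metrics store.
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print(f"gpu_util={utils}")
    time.sleep(60)
```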
Shard completion rate and ETA: Track this with a simple JSON progress file:
```python
import json, time

def write_progress(completed, total, start_time):
    elapsed = time.time() - start_time
    if elapsed <= 0:
        return
    rate = completed / elapsed  # shards per second
    eta_seconds = (total - completed) / rate if rate > 0 else None
    with open("progress.json", "w") as f:
        json.dump({
            "completed": completed,
            "total": total,
            "elapsed_hours": elapsed / 3600,
            "eta_hours": eta_seconds / 3600 if eta_seconds is not None else None,
        }, f)
```

Read progress.json at any point to get the current ETA without interrupting the job.
When batch inference is the wrong choice
Batch inference is appropriate only when latency doesn't matter. Do not use it for:
Chatbots and interactive assistants. Users are waiting. Queuing their message for batch processing is not an option.
Real-time API products with SLAs. If your API contract promises 95th-percentile latency under 2 seconds, batch mode cannot deliver that.
Streaming outputs. vLLM offline mode returns full completions only. If you need token-by-token streaming for progressive rendering, you need an online serving endpoint.
Short-burst workloads under 100K documents. The overhead of setting up the batch infrastructure, waiting for GPU provisioning, and processing shards dominates the total time. For small jobs, just call an API.
Any user-facing feature. If a human is waiting for the result, it's an online workload.
Batch inference on spot GPU instances cuts compute costs by 60-70% for fault-tolerant offline workloads. Spheron has on-demand H100 SXM5 for maximum throughput and RTX PRO 6000 PCIe spot for the lowest cost-per-token on document processing pipelines.
Rent H100 on Spheron → | View spot pricing → | Get started →
