Engineering

From Prototype to Production: A Complete LLM Deployment Guide

Written by Mitrasish, Co-founder | Mar 27, 2026

Most LLM deployments fail at the same choke points: developers get a model working locally, spin up a cloud instance, and then spend weeks firefighting latency spikes, OOM crashes, and unbounded GPU bills. This guide maps the full path from first ollama run to a load-balanced, monitored, auto-recovering production service.

TL;DR

| Phase | What You Do | Tool | Cost Ballpark |
|---|---|---|---|
| Prototype | Run model locally | Ollama | $0 (local GPU/CPU) |
| Validate | Cloud GPU, real traffic test | vLLM + Spheron | ~$2/hr (H100 on-demand) |
| Optimize | Benchmark engines, tune flags | vLLM, llama.cpp | Same instance, no extra cost |
| Production | Systemd, health checks, monitoring | systemd + Prometheus | +$0 tooling on existing instance |
| Scale | Multi-instance, spot, sharding | Nginx LB + spot GPUs | 60-70% cost reduction vs on-demand |

Phase 1: Prototype Locally with Ollama

Before spending on cloud compute, validate that your chosen model actually meets your quality requirements. This phase costs nothing if you have a local GPU or CPU with enough RAM.

For a deeper look at running models locally, see our guide on running LLMs locally with Ollama.

Pick Your Model

Model size determines VRAM requirements and inference speed. Here is the practical decision table. If you are targeting newer large models like Llama 4, see deploying Llama 4 on GPU cloud for model-specific setup.

| Model Size | VRAM Required (FP16) | VRAM Required (Q4) | Best For |
|---|---|---|---|
| 7B | ~14GB | ~5GB | Fast iteration, cost-sensitive API |
| 13B | ~26GB | ~8GB | Better quality, still single-GPU |
| 30B | ~60GB | ~18GB | High-quality outputs, multi-GPU or Q4 |
| 70B | ~140GB | ~35-40GB | Near-GPT-4 quality, needs H100 or multi-GPU |

If you are targeting a 7B model, a local machine with a 24GB GPU (RTX 3090 or 4090) handles it in FP16; a 13B at ~26GB in FP16 needs light quantization to fit in 24GB. For 30B+ you are looking at quantization locally, or moving to cloud for unquantized serving.
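The VRAM figures in the table follow a simple rule of thumb: parameter count times bytes per parameter, plus some fixed overhead. A minimal sketch, assuming roughly 2 bytes/param for FP16 and 0.55-0.7 bytes/param for 4-bit quantization (these byte counts are approximations; real usage also grows with KV cache size):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead_gb: float = 1.0) -> float:
    """Rough weight-memory estimate: parameters * bytes each, plus a fixed
    allowance for CUDA context. Real usage adds KV cache, which scales with
    context length and concurrency."""
    return params_billion * bytes_per_param + overhead_gb

# ~2.0 bytes/param for FP16, ~0.55 for 4-bit quantization
print(f"7B  FP16: ~{estimate_vram_gb(7, 2.0):.0f} GB")
print(f"13B FP16: ~{estimate_vram_gb(13, 2.0):.0f} GB")
print(f"70B Q4:   ~{estimate_vram_gb(70, 0.55):.0f} GB")
```

The estimates land close to the table; treat them as a sizing starting point, not a guarantee.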

Run It with Ollama

bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (e.g. Llama 3.1 8B)
ollama pull llama3.1:8b

# Run interactively
ollama run llama3.1:8b

# Or test via API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain how KV cache works in transformer inference.",
  "stream": false
}'

What to Test in Phase 1

Work through this checklist before moving on:

  • Output quality: Does the model answer your specific domain questions correctly? Run 20-30 prompts representative of your actual workload.
  • Latency on your hardware: Note time-to-first-token (TTFT) and tokens/second. This is your baseline before GPU cloud.
  • Context window limits: Test with your longest expected prompts. Does quality degrade at 4K tokens? 8K? 16K?
  • Model behavior with your prompts: Does it follow your system prompt? Does it hallucinate on your domain?
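To make the quality and latency checks repeatable, you can script your 20-30 representative prompts against the Ollama API. A minimal sketch using only the standard library (the prompt list and pass criteria are yours to fill in; `eval_count` and `eval_duration` are fields Ollama returns in its `/api/generate` response):

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def tokens_per_sec(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_count (generated tokens) and eval_duration (ns)."""
    return eval_count / max(eval_duration_ns / 1e9, 1e-9)

def run_prompt(prompt: str, model: str = "llama3.1:8b") -> dict:
    """Send one prompt and report latency plus generation speed."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.loads(resp.read())
    return {"response": body.get("response", ""),
            "seconds": time.time() - start,
            "tok_per_sec": tokens_per_sec(body.get("eval_count", 0),
                                          body.get("eval_duration", 1))}

# Example (requires a running Ollama server):
#   r = run_prompt("Explain how KV cache works in transformer inference.")
#   print(f"{r['seconds']:.1f}s | {r['tok_per_sec']:.0f} tok/s")
```

Run the same script again in Phase 2 against the cloud instance to compare like for like.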

When to Move On

Move to Phase 2 when any of these are true:

  • Your p50 TTFT exceeds 1 second under single-user load.
  • You need to handle more than 5 concurrent users.
  • Your target model requires more VRAM than your local GPU has.
  • You need 24/7 availability without tying up your laptop.

Phase 2: Validate on a Cloud GPU with vLLM

Phase 2 is about getting a realistic read on throughput and cost before committing to production architecture. Pick a GPU, deploy vLLM, and run a real load test.

For a full breakdown of inference GPUs by cost-per-token, see the AI inference GPU comparison.

Choose Your GPU

Match your model size to the right GPU. Overpaying for a bigger GPU than you need is the most common Phase 2 mistake.

| Model Size | Recommended GPU | VRAM | On-Demand Price | Spot Price |
|---|---|---|---|---|
| 7B (FP16) | L40S | 48GB | $1.80/hr | ~$0.30-0.35/hr |
| 13B (FP16) | L40S | 48GB | $1.80/hr | ~$0.30-0.35/hr |
| 30B (FP16) | A100 SXM4 80GB | 80GB | $1.05/hr | ~$0.40-0.55/hr |
| 70B (FP8) | H100 SXM5 80GB | 80GB | $2.40/hr | ~$0.80-1.00/hr |
| 70B (FP16) | 2x H100 SXM5 80GB | 160GB | $4.80/hr | ~$1.60-2.00/hr |

Prices as of March 2026 on Spheron GPU rental. See current GPU pricing for live rates.


Spin Up a Spheron Instance

  1. Log in to app.spheron.ai and select your GPU model from the catalog. See the Spheron GPU pricing for a full list of available GPU instances and regions.
  2. Choose your region and provider, then launch the instance.
  3. SSH into the instance and verify the GPU is visible:
bash
nvidia-smi

You should see your GPU with the expected VRAM (on the order of 80,000 MiB reported for an H100 80GB). If you provisioned multiple GPUs, all should appear.

  4. Confirm Docker and the NVIDIA Container Toolkit are installed:
bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Deploy with vLLM

For a complete guide on replacing OpenAI API calls with your own vLLM server, see self-hosted OpenAI-compatible API with vLLM.

Run the vLLM OpenAI-compatible server. This single command covers the common case for a 7B model on an L40S:

bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256

Key flags:

  • --ipc=host: required for shared memory between GPU processes. Skipping causes CUDA errors under load.
  • --gpu-memory-utilization 0.90: leaves 10% headroom for CUDA overhead. Go to 0.92-0.95 if you need more KV cache space.
  • --max-num-seqs 256: maximum concurrent sequences in a batch. Raise this for high-throughput workloads.
  • --max-model-len 8192: context window limit. Lower this to reduce KV cache memory pressure if you do not need long contexts.

For a 70B model in FP8 on H100:

bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 4096 \
  --max-num-seqs 64

Test the deployment:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, are you running?"}]
  }'

Load Test with Real Traffic

Single-request latency does not predict production performance. Run a load test to measure throughput under concurrency:

bash
# Install locust
pip install locust

# locustfile.py
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def chat(self):
        self.client.post("/v1/chat/completions", json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "Write a two-sentence summary of transformer architecture."}],
            "max_tokens": 100
        }, timeout=120)

# Run with 50 concurrent users
locust -f locustfile.py --headless -u 50 -r 5 --host http://localhost:8000 --run-time 60s

Capture these numbers before moving to Phase 3:

  • Throughput: requests/second and tokens/second at steady state
  • p50 and p95 TTFT: time-to-first-token at median and 95th percentile
  • p95 end-to-end latency: total request time including generation
  • Error rate: any 429 (queue full) or 500 errors at your target concurrency

Pass criteria: if your p95 TTFT is under your SLA at your target concurrency, you have the right GPU. If not, either increase --max-num-seqs, upgrade the GPU, or add a second instance.

Phase 2 Cost Snapshot

| Configuration | Tokens/sec | On-Demand $/hr | Cost per 1M Tokens |
|---|---|---|---|
| 7B on L40S | 2,500-4,000 | $1.80 | ~$0.13-0.20 |
| 30B on A100 80GB | 800-1,500 | $1.05 | ~$0.19-0.37 |
| 70B (FP8) on H100 | 400-700 | $2.40 | ~$0.95-1.67 |
| 70B (FP16) on 2x H100 | 600-1,000 | $4.80 | ~$1.33-2.22 |
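The cost-per-token column follows directly from the hourly price and sustained throughput, so you can reproduce it for your own load test numbers:

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Hourly price spread over the tokens generated in that hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# 7B on L40S: $1.80/hr at 2,500-4,000 tok/s sustained
low = cost_per_million_tokens(1.80, 4000)
high = cost_per_million_tokens(1.80, 2500)
print(f"~${low:.2f}-{high:.2f} per 1M tokens")
```

Plug in your measured steady-state tokens/second, not the single-request figure; batching is what makes these numbers achievable.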

Phase 3: Optimize Your Inference Engine

Phase 2 confirmed your GPU size and baseline throughput. Phase 3 is about getting more out of that hardware before you harden it for production.

For a detailed comparison of inference frameworks including benchmark numbers, see the full vLLM production deployment guide.

vLLM vs llama.cpp vs TGI: A Practical Comparison

| Factor | vLLM | llama.cpp | TGI (Text Generation Inference) |
|---|---|---|---|
| Throughput (concurrent) | Highest | Medium | High |
| p50 latency | Low | Low | Low |
| Multi-GPU support | Yes (tensor + pipeline) | Limited | Yes |
| Quantization | FP8, INT4, GPTQ, AWQ | GGUF (Q2-Q8) | GPTQ, AWQ, FP8 |
| Ops complexity | Medium | Low | Medium |
| Best for | Production batch/API serving | Single-user, CPU inference | Hugging Face-native stack |

The right answer for most production deployments is vLLM. llama.cpp is worth considering for CPU-only or very low-concurrency use cases where its GGUF quantization formats are a better fit. TGI is a viable alternative if you are already on the Hugging Face stack and want tighter integration with their ecosystem.

Tuning vLLM for Your Workload

Start with the defaults, then tune based on your Phase 2 load test results:

bash
# High-throughput batch workload
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 512 \
  --max-model-len 4096 \
  --performance-mode throughput

# Low-latency interactive workload
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --max-model-len 8192 \
  --performance-mode interactivity

Key flags to tune:

  • --gpu-memory-utilization: higher value means more KV cache space and higher concurrency ceiling, at the risk of OOM. Tune in 0.02 increments.
  • --max-num-seqs: directly caps concurrent sequences. Set to 2-3x your expected peak concurrency.
  • --dtype fp8: use on H100 and Blackwell GPUs. Gives ~1.5-2x throughput improvement. Not available on older GPU architectures.
  • --performance-mode: new in vLLM v0.17.0. throughput favors batching, interactivity favors TTFT, balanced is the default.
  • --kv-cache-dtype fp8: also store the KV cache in FP8 on H100. Saves additional VRAM, enabling longer context windows or more concurrent sequences.

Quantization Trade-offs

| Precision | VRAM for 70B | Throughput (relative) | Quality Impact |
|---|---|---|---|
| FP16 | ~140GB | 1x baseline | No loss |
| FP8 | ~70GB | 1.5-2x | Less than 1-2% on benchmarks |
| INT4 (AWQ) | ~40GB | 1.2-1.5x | 2-5% on benchmarks, varies by model |
| GGUF Q4_K_M | ~38GB | 0.6-1x | 3-6% on benchmarks, CPU-friendly |

FP8 is the practical choice on H100 and Blackwell. The throughput gain is real and the quality loss is marginal for most use cases. INT4 is worth considering if you need a 70B model on hardware with less than 70GB VRAM (e.g., A100 80GB with tight fit).

Benchmark Your Setup

After tuning, capture baseline numbers you can compare against later:

bash
# GPU-level metrics (run in a separate terminal while vLLM is under load)
nvidia-smi dmon -s pucvmet -d 5

# vLLM internal metrics (raw Prometheus format)
curl http://localhost:8000/metrics | grep -E "num_requests_waiting|gpu_cache_usage|time_to_first_token"

Numbers to record before Phase 4:

  • GPU compute utilization at peak load
  • GPU memory bandwidth utilization at peak load
  • vllm:gpu_cache_usage_perc at peak load
  • vllm:time_to_first_token_seconds p50 and p95
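For ad-hoc snapshots without a full Prometheus stack, a short script can scrape the `/metrics` endpoint and pull out the gauges listed above. A sketch using only the standard library (the metric names are the ones this guide references; verify them against your vLLM version's `/metrics` output):

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # vLLM's Prometheus endpoint

def parse_metric(text: str, name: str) -> list[float]:
    """Pull every sample value for one metric out of Prometheus text format,
    with or without a {label="..."} set after the name."""
    pattern = re.compile(
        rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)\s*$",
        re.MULTILINE)
    return [float(v) for v in pattern.findall(text)]

def snapshot() -> dict:
    """Fetch /metrics once and return the gauges worth recording."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        text = resp.read().decode()
    return {name: parse_metric(text, name) for name in
            ("vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc")}

# Example (requires a running vLLM server):
#   print(snapshot())
```

Run it in a loop during your load test and keep the peak values alongside the `nvidia-smi dmon` output.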

Phase 4: Harden for Production

You have a working, tuned vLLM deployment. Phase 4 is about making it survive a long weekend without manual intervention.

For architecture patterns that apply to any GPU production workload, see our production GPU cloud architecture patterns guide.

Systemd Service Unit

Wrap the Docker container in a systemd service so it restarts automatically on crashes, reboots, and OOM kills:

ini
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI-compatible inference server
After=docker.service
Requires=docker.service

[Service]
Type=simple
Restart=always
RestartSec=10
ExecStartPre=-/usr/bin/docker stop vllm-server
ExecStartPre=-/usr/bin/docker rm vllm-server
ExecStart=/usr/bin/docker run \
  --name vllm-server \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --max-model-len 8192
ExecStop=/usr/bin/docker stop vllm-server

[Install]
WantedBy=multi-user.target

Enable and start:

bash
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo systemctl status vllm

Health Checks

vLLM exposes a /health endpoint. Poll it and trigger a service restart if it stops responding:

bash
#!/bin/bash
# /usr/local/bin/vllm-healthcheck.sh

# Skip restart if the service has been active for less than 15 minutes (model loading grace period)
active_since=$(systemctl show vllm --property=ActiveEnterTimestamp --value 2>/dev/null)
if [ -n "$active_since" ]; then
  start_epoch=$(date -d "$active_since" +%s 2>/dev/null); [ -z "$start_epoch" ] && exit 0
  now_epoch=$(date +%s)
  uptime_seconds=$((now_epoch - start_epoch))
  if [ "$uptime_seconds" -lt 900 ]; then
    echo "vLLM has been up for ${uptime_seconds}s, within grace period, skipping health check"
    exit 0
  fi
fi

# Also skip if service is still in activating state (i.e., loading)
service_state=$(systemctl is-active vllm 2>/dev/null)
if [ "$service_state" = "activating" ]; then
  echo "vLLM is still activating, skipping health check"
  exit 0
fi

response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 --connect-timeout 5 http://localhost:8000/health)
if [ "$response" != "200" ]; then
  echo "vLLM health check failed (HTTP $response), restarting service"
  systemctl restart vllm
fi

Add a cron job to run this every minute:

bash
chmod +x /usr/local/bin/vllm-healthcheck.sh
echo "* * * * * root /usr/local/bin/vllm-healthcheck.sh >> /var/log/vllm-healthcheck.log 2>&1" \
  | sudo tee /etc/cron.d/vllm-healthcheck

Monitoring Setup

For GPU monitoring details including DCGM and Grafana setup, see GPU monitoring with Prometheus and Grafana.

Add a Prometheus scrape job for vLLM metrics:

yaml
# prometheus.yml (add to scrape_configs)
scrape_configs:
  - job_name: 'vllm'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

Three alerts to configure from the start:

yaml
# vllm_alerts.yml
groups:
  - name: vllm
    rules:
      - alert: VLLMQueueDepth
        expr: vllm:num_requests_waiting > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "vLLM request queue is backing up"

      - alert: VLLMKVCachePressure
        expr: vllm:gpu_cache_usage_perc > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM KV cache above 90%, consider scaling"

      - alert: VLLMHighTTFT
        expr: histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2.0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "vLLM p95 TTFT above 2s SLA"

Load Balancing Two Instances

For redundancy or throughput beyond a single GPU, add an Nginx upstream block:

nginx
# /etc/nginx/sites-available/vllm
upstream vllm_backend {
    least_conn;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_buffering off;
    }
}

least_conn is the right load balancing strategy for LLM inference because request duration varies significantly. Round-robin can pile long requests onto one backend while the other sits idle.

Phase 4 Architecture Diagram

Client
  |
  v
Nginx (least_conn load balancer, port 80)
  |           |
  v           v
vLLM-1      vLLM-2
(port 8000) (port 8000)
  |           |
  v           v
GPU-1       GPU-2

Prometheus scrapes /metrics on both vLLM instances every 10s

Phase 5: Scale

Single-instance is fine for development and low-traffic production. Phase 5 is for when your traffic outgrows it.

For GPU cost reduction strategies that apply across all phases, see GPU cost optimization strategies.

Horizontal Scaling: When and How

Calculate how many instances you need:

instances_needed = ceil(peak_rps / single_instance_rps)

Example: if your load test showed a single L40S handles 12 requests/second at p95 TTFT under 500ms, and your peak traffic is 60 requests/second, you need:

ceil(60 / 12) = 5 L40S instances

Add 20-30% buffer for traffic spikes: plan for 6-7 instances.
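The sizing math above, with the spike buffer folded in:

```python
import math

def instances_needed(peak_rps: float, per_instance_rps: float,
                     buffer: float = 0.25) -> int:
    """ceil(peak / per-instance capacity), then a headroom buffer on top."""
    base = math.ceil(peak_rps / per_instance_rps)
    return math.ceil(base * (1 + buffer))

print(instances_needed(60, 12, buffer=0.0))   # 5: exact peak, no headroom
print(instances_needed(60, 12, buffer=0.25))  # 7: with a 25% spike buffer
```

A 20% buffer on the same numbers gives 6, which is where the 6-7 instance plan comes from.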

For stateless LLM inference (no session affinity), horizontal scaling is straightforward. Each instance runs an independent vLLM server. Nginx distributes load across all of them. No shared state to coordinate.

Model Sharding for 70B+ Models

When a model does not fit on a single GPU even in FP8, use tensor parallelism to split it across multiple GPUs on the same host:

bash
# 70B model in FP16 across 2x H100 80GB
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 128

When to use tensor parallelism vs adding more single-GPU instances:

  • Tensor parallelism: when the model does not fit on one GPU, or when you want to reduce TTFT (prefill is parallelized across GPUs).
  • More instances: when throughput is the bottleneck and the model fits on one GPU. Multiple independent instances scale throughput linearly with no NVLink overhead.

For 140B+ models, consider pipeline parallelism (--pipeline-parallel-size) in addition to tensor parallelism. Pipeline parallelism assigns different transformer layers to different GPUs, which works better when NVLink bandwidth is a bottleneck.

Spot Instances for Cost Reduction

For production inference on a stable traffic pattern, spot instances cut costs significantly. On Spheron, spot pricing for the H100 and L40S varies by availability, but savings typically run 50-70% versus on-demand:

| GPU | On-Demand $/hr | Spot $/hr (approx.) | Monthly Savings (1 instance) |
|---|---|---|---|
| H100 SXM5 80GB | $2.40 | ~$0.80-1.00 | ~$1,008-1,152 |
| L40S 48GB | $1.80 | ~$0.30-0.35 | ~$1,044-1,080 |
| A100 SXM4 80GB | $1.05 | ~$0.40-0.55 | ~$360-468 |
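The savings column is just the hourly price gap over a 720-hour month of 24/7 operation:

```python
HOURS_PER_MONTH = 720  # 30-day month, running 24/7

def monthly_savings(on_demand_hr: float, spot_hr: float,
                    hours: int = HOURS_PER_MONTH) -> float:
    """Savings from running one instance on spot instead of on-demand."""
    return (on_demand_hr - spot_hr) * hours

# H100 SXM5: $2.40 on-demand vs ~$0.80-1.00 spot
print(f"${monthly_savings(2.40, 1.00):,.0f}-${monthly_savings(2.40, 0.80):,.0f}")
# matches the ~$1,008-1,152 row in the table above
```

For a spot blend, weight the calculation by the fraction of the pool on each pricing tier.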

To use spot with vLLM safely, the service must handle interruptions gracefully. LLM inference is stateless between requests, so the only real concern is in-flight requests at the moment of preemption. A reasonable approach:

  1. Run spot instances behind the Nginx load balancer.
  2. Set a short connection drain timeout (10-30 seconds) so Nginx stops sending new requests to a preempting instance.
  3. Keep at least one on-demand instance in the pool for reliability. Blend spot and on-demand based on your availability tolerance.

Cost at Scale: Projections

| Tier | Configuration | On-Demand $/month | Spot $/month (approx.) |
|---|---|---|---|
| Dev / Low traffic | 1x L40S | $1,296 | ~$216-252 |
| Small production | 2x H100 SXM5 | $3,456 | ~$1,152-1,440 |
| Scale production | 4x H100 SXM5 (spot blend) | $6,912 | ~$2,304-2,880 |

See H100 GPU rental and View all GPU pricing for current rates before budgeting.




Every phase in this guide runs on Spheron, from a single L40S for vLLM validation to multi-GPU H100 configurations for production. Per-minute billing means you only pay for the phases you are in, not idle capacity between them.

Rent H100 → | Rent L40S → | View all GPU pricing → | Get started on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.