Most LangGraph tutorials end at "run the graph locally." Production means a self-hosted server, a real model backend, durable state, and real users watching the Studio UI as nodes execute. This guide covers the full stack: from GPU provisioning on Spheron H100 instances through vLLM deployment, Postgres checkpointing, and observability.
What LangGraph Studio Actually Is
LangGraph Studio is a visual IDE for building and debugging stateful agent workflows. It connects to a running LangGraph server and shows your graph topology, lets you trigger invocations, and streams node execution state in real time as the graph runs.
Three components are in play, and conflating them causes confusion:
- Studio UI: the desktop app (Mac/Linux) or web interface. This is just a visualization and control layer. It makes HTTP calls to a LangGraph server.
- LangGraph server: the Python process that actually executes your graph. You can run it locally with `langgraph dev`, self-host it on your own infrastructure, or use LangGraph Cloud.
- LangGraph Cloud: LangChain's managed hosting for the LangGraph server. You push your code, they run the server. You still pay for the underlying model API calls separately.
Three deployment modes exist:
| Mode | When to use |
|---|---|
| Local dev (langgraph dev) | Building and testing graphs on your machine |
| Self-hosted server | Production, cost control, custom model backends, data residency requirements |
| LangGraph Cloud | When you want zero infrastructure management and are fine with OpenAI/Anthropic API pricing |
Self-hosting is the only mode where you can point your graph nodes at a locally running model server. The LangGraph server and vLLM process communicate over localhost with single-digit-millisecond round trips. With LangGraph Cloud, every node invocation goes out to an external model API over the public internet.
Why GPU Cloud Changes the LangGraph Math
The LangGraph server itself is CPU-bound. It runs the graph routing logic, manages state serialization to Postgres, and handles the Studio WebSocket connection. None of that touches a GPU.
The GPU work happens in the model server your graph nodes call. Every llm.invoke() call in a node is an HTTP request to your vLLM instance, and vLLM's throughput and latency are entirely GPU-bound.
Two GPU failure modes show up specifically in LangGraph workflows:
VRAM OOM (static): you loaded a model that doesn't fit on your GPU. The vLLM process crashes or rejects requests. This is caught before production if you test with the right model-to-GPU pairing.
KV cache starvation (dynamic): this is the one that bites in production. The KV cache fills up under concurrent load. Each active inference request holds KV cache proportional to its context length. In multi-agent graph patterns where a supervisor fans out to 5 worker nodes simultaneously, those 5 nodes each hold KV cache slots while they wait for their model responses. Peak KV cache demand = (context length per request) x (number of concurrent inflight requests). Underestimate this and you get queue buildup that cascades into timeout chains across the graph.
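The peak-demand formula above is easy to turn into a back-of-envelope sizing check. The sketch below assumes an FP8 KV cache and illustrative architecture numbers (48 layers, 8 KV heads, head dimension 128); substitute the real values from your model's config:

```python
# Back-of-envelope KV cache sizing. Layer/head counts are illustrative
# assumptions -- read the real values from your model's config.json.
def kv_cache_gb(context_tokens: int, concurrent_requests: int,
                layers: int = 48, kv_heads: int = 8, head_dim: int = 128,
                bytes_per_value: int = 1) -> float:  # 1 byte per value at FP8
    # Each token stores a key vector and a value vector per layer per KV head
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * concurrent_requests * per_token / 1e9

# A supervisor fanning out to 5 workers, each holding 32k of context:
print(f"{kv_cache_gb(32_768, 5):.1f} GB")  # 16.1 GB of KV cache at peak
```

If that number plus model weights exceeds VRAM times your GPU memory utilization setting, requests queue and graph-level timeouts follow.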
The spot instance argument is worth making here. LangGraph's Postgres checkpointer means a spot eviction is a graph pause, not a restart. The graph resumes from the last completed node when a new instance comes up. This is unlike frameworks without native checkpointing, where a spot interruption means starting the entire graph from scratch. For the full picture on GPU autoscaling in agent systems, see the multi-agent GPU infrastructure guide.
Architecture Overview
The full self-hosted stack has four layers:
```
+-------------------------------------------------------------+
| Layer 1: LangGraph Studio UI                                 |
| (desktop app / browser)                                      |
+-------------------------------------------------------------+
                               |
                               | HTTP + WebSocket
                               v
+-------------------------------------------------------------+
| Layer 2: LangGraph Server                                    |
| graph execution . state routing . Studio API                 |
| checkpoint management . thread history                       |
+------------------------------+------------------------------+
                               |
             +-----------------+---------------+
             |                                 |
  HTTP inference calls                  reads / writes
             |                                 |
             v                                 v
+-------------------------+      +---------------------------+
| Layer 3a: vLLM/SGLang   |      | Layer 3b: State Storage   |
| model server            |      | Postgres (checkpoints)    |
| (GPU-backed)            |      | Redis (in-session)        |
|                         |      | Vector DB (optional)      |
+-------------------------+      +---------------------------+
             |
             | CUDA
             v
+-------------------------------------------------------------+
| Layer 4: GPU (H100 / L40S / B200)                            |
| model weights + KV cache                                     |
+-------------------------------------------------------------+
```

| Layer | Component | Role |
|---|---|---|
| 1 | LangGraph Studio UI | Visualization and control. Sends HTTP/WebSocket calls to the LangGraph server. |
| 2 | LangGraph Server | Executes graph nodes, manages state routing, serves Studio API, writes checkpoints. |
| 3a | vLLM / SGLang | Handles all inference calls from graph nodes. GPU-backed; throughput and latency are entirely GPU-bound. |
| 3b | State Persistence | Postgres for durable checkpoints, Redis for short-term in-session state, optional vector DB for long-term memory. |
| 4 | GPU | Runs the model server. VRAM holds model weights and KV cache. |
The LangGraph server and vLLM can run on the same host (ideal: Unix socket or localhost) or on separate hosts with a load balancer between them. Co-location is simpler and cuts inter-process latency to under 1ms.
State persistence runs across three tiers with different durability tradeoffs:
| Tier | Storage | Durability | Use case |
|---|---|---|---|
| Durable | Postgres | Survives restarts and spot evictions | Full state snapshot after every completed node |
| In-session | Redis | Lost on restart (TTL-managed) | Tool call results, scratchpad, fast in-session lookups |
| Long-term | Vector DB | Persistent across sessions | Agent memory, user preferences, cross-run context |
For vector DB setup with Mem0 or Zep, see the agent memory guide.
Step-by-Step Deployment on Spheron H100
1. Provision the GPU Instance
Select your GPU based on the model you plan to serve. For MoE models, the binding constraint is total parameter count (every expert's weights must be resident in VRAM), not the active parameter count per token.
| Model | VRAM Required (FP8) | Recommended GPU | On-Demand $/hr | Spot $/hr |
|---|---|---|---|---|
| Qwen 3 30B A3B | ~16GB | L40S (48GB) | $0.72 | N/A |
| Llama 4 Scout 17B/16E | ~55-60GB | H100 SXM5 (80GB) | N/A | $0.80 |
| Llama 4 Maverick 17B/128E | ~55-60GB | H100 SXM5 (80GB) | N/A | $0.80 |
| 70B-class (FP8) | ~70GB | H100 SXM5 (80GB) | N/A | $0.80 |
| 70B-class (parallel fan-out) | ~140GB | 2x H100 SXM5 | N/A | $1.60 |
Pricing fluctuates based on GPU availability. The prices above are based on 01 May 2026 and may have changed. Check current GPU pricing for live rates.
Once you have the instance, open ports 8000 (vLLM) and 8123 (LangGraph server) in the Spheron firewall settings. SSH in and confirm GPU access:
```shell
nvidia-smi
# Should show GPU name, VRAM total, and driver version
```

2. Deploy the vLLM Backend
Run the vLLM OpenAI-compatible server:
```shell
docker run --gpus all --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 64
```

Key flags explained:
- `--dtype fp8`: enables FP8 Tensor Cores on H100 for ~50% VRAM reduction and ~1.5x throughput gain
- `--gpu-memory-utilization 0.92`: leaves 8% VRAM headroom for CUDA context overhead
- `--max-model-len 32768`: caps context length to control KV cache footprint; increase if your graph nodes use long context windows
- `--max-num-seqs 64`: maximum concurrent sequences; size to your expected graph fan-out degree plus headroom
- `--ipc=host`: required for CUDA multi-process shared memory
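As a quick sanity check on these flags, verify that model weights plus expected peak KV cache fit inside the VRAM budget the utilization flag allows. The weight and cache figures below are illustrative assumptions, not measured values:

```python
# Does (weights + peak KV cache) fit under vram * --gpu-memory-utilization?
def fits(vram_gb: float, gpu_memory_utilization: float,
         weights_gb: float, kv_cache_gb: float) -> bool:
    return weights_gb + kv_cache_gb <= vram_gb * gpu_memory_utilization

# 80 GB H100 at utilization 0.92 -> 73.6 GB budget (illustrative numbers):
print(fits(80, 0.92, 58, 14))  # True  -- fits with room to spare
print(fits(80, 0.92, 58, 20))  # False -- reduce --max-model-len or --max-num-seqs
```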
For Qwen 3 30B A3B (smaller model, higher throughput per GPU):
```shell
docker run --gpus all --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-30B-A3B \
  --dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 128
```

For the complete vLLM production configuration including multi-GPU tensor parallelism, prefix caching, and monitoring setup, see the full vLLM production guide.
For workloads that require guaranteed structured outputs (JSON tool calls from every graph node), SGLang's constrained decoding is often faster than vLLM's guided decoding, per SGLang's own benchmarks. The SGLang deployment guide covers the same deployment steps for that engine.
Verify the server is running:
```shell
curl http://localhost:8000/v1/models
# Should return JSON with your model name
```

3. Configure the LangGraph Agent
Here is a working 2-node supervisor-worker graph that uses the vLLM backend:
```python
import os
import operator
from typing import Annotated, TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver  # pip install langgraph-checkpoint-postgres
from psycopg import Connection
from psycopg.rows import dict_row

# Point at your vLLM instance
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",  # vLLM doesn't require auth by default
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    temperature=0,
)

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    next: str

def supervisor_node(state: AgentState):
    response = llm.invoke(state["messages"])
    # Route to worker or finish based on response content
    if "DONE" in response.content:
        return {"next": "end", "messages": [response]}
    return {"next": "worker", "messages": [response]}

def worker_node(state: AgentState):
    response = llm.invoke(state["messages"])
    return {"messages": [response], "next": "supervisor"}

builder = StateGraph(AgentState)
builder.add_node("supervisor", supervisor_node)
builder.add_node("worker", worker_node)
builder.set_entry_point("supervisor")
builder.add_conditional_edges(
    "supervisor",
    lambda s: s["next"],
    {"worker": "worker", "end": END},
)
builder.add_edge("worker", "supervisor")

# Postgres checkpointer for durable state. Note: PostgresSaver.from_conn_string()
# returns a context manager, so for a long-lived server process construct the
# saver from a persistent connection instead.
conn = Connection.connect(
    os.environ["POSTGRES_URL"],
    autocommit=True,
    prepare_threshold=0,
    row_factory=dict_row,
)
checkpointer = PostgresSaver(conn)
checkpointer.setup()  # Creates the checkpoint tables and schema on first run

graph = builder.compile(checkpointer=checkpointer)
```

Environment variables needed in `.env`:

```
POSTGRES_URL=postgresql://user:pass@localhost:5432/langgraph
REDIS_URL=redis://localhost:6379
LANGCHAIN_API_KEY=your-key-here  # optional, for LangSmith traces
```

4. Start the LangGraph Server
Create langgraph.json in your project root (LangGraph CLI 0.2.x format as of April 2026):
```json
{
  "graphs": {
    "agent": "./agent.py:graph"
  },
  "env": ".env",
  "python_version": "3.11"
}
```

Start the development server:

```shell
pip install langgraph-cli
langgraph dev --host 0.0.0.0 --port 8123
```

Note: `langgraph dev` defaults to port 2024. The `--port 8123` flag used here matches the `langgraph up` default, which makes switching between the two easier. If you omit the flag, use port 2024 in your Studio connection URL instead.
For production, use the Docker-based deployment instead of langgraph dev:
```shell
langgraph up --host 0.0.0.0 --port 8123
```

`langgraph dev` runs a lightweight server with auto-reload. `langgraph up` builds a Docker image from your project and runs it with production settings (Docker Compose-based flow). For the latest production path, LangChain shipped `langgraph deploy` in March 2026, which supersedes `langgraph up` for cloud deployments; `langgraph up` still works for the older Docker Compose flow. Use `dev` for Studio testing, `deploy` or `up` for actual production traffic.
With the server running, open LangGraph Studio and click "Connect to self-hosted server." Enter your server's public IP and port 8123. Studio will load your graph topology, show all nodes and edges, and let you trigger invocations with custom input state.
5. Connect LangGraph Studio
In the Studio app:
- Click the server selector in the top-left dropdown
- Choose "Self-hosted server"
- Enter `http://<your-instance-ip>:8123`
- Studio loads your graph topology from the server's `/info` endpoint
What Studio shows:
- Graph panel: nodes and edges drawn from your `StateGraph` definition
- Thread history: all past invocations with their `thread_id`, grouped by run
- State diff per node: each node execution shows the before/after state diff, so you can see exactly what changed at each step
For HTTPS with a self-signed cert, export the cert chain and add it to Studio's trusted cert list in preferences. Using HTTP is simpler during development if your Studio and server are on a private network.
State Persistence: Postgres, Redis, and Vector Store
Postgres checkpointer
Every time a graph node completes, the PostgresSaver writes a full state snapshot to the checkpoints table. The snapshot includes all messages, tool call results, and any custom state fields. The table schema (created by calling checkpointer.setup() once before the graph is compiled) stores by thread_id and checkpoint_ns, which together uniquely identify a point in the graph execution.
If a spot instance is reclaimed mid-graph, the graph resumes from the last completed checkpoint when a new instance comes up. The node that was executing at eviction time re-runs from scratch; all prior nodes are skipped.
Schedule a cleanup job to prevent unbounded table growth. The exact pruning query depends on your LangGraph version's schema. Run \d checkpoints in psql to confirm available columns before writing any cleanup job.
The standard schema created by PostgresSaver.setup() does not include an updated_at column. A reliable approach is to track thread activity at the application level (e.g., record thread_id and last-used timestamp in a separate table) and prune by those IDs:
```sql
DELETE FROM checkpoints
WHERE thread_id IN (
    SELECT thread_id FROM thread_activity
    WHERE last_used_at < NOW() - INTERVAL '7 days'
);
```

If you need a schema-only approach without a separate tracking table, the `checkpoint_id` column is a UUIDv1, which encodes the creation timestamp. You can extract it and filter on it:
```sql
DELETE FROM checkpoints
WHERE (uuid_send(checkpoint_id::uuid)::bytea IS NOT NULL)
  AND (
    to_timestamp((
        ((('x' || encode(substring(uuid_send(checkpoint_id::uuid) from 7 for 2), 'hex'))::bit(16)::integer::bigint & 4095) << 48)
      | (('x' || encode(substring(uuid_send(checkpoint_id::uuid) from 5 for 2), 'hex'))::bit(16)::integer::bigint << 32)
      | (('x' || encode(substring(uuid_send(checkpoint_id::uuid) from 1 for 4), 'hex'))::bit(32)::bigint & 4294967295)
    )::double precision / 1e7 - 12219292800) < NOW() - INTERVAL '7 days'
  );
```

This UUIDv1 approach is verbose and brittle across PostgreSQL versions. Tracking thread activity at the application layer is the more maintainable option.
Redis store
For short-term within-session state (tool results, scratchpad, intermediate computations that don't need durable storage), add a Redis store:
```python
from langgraph.store.redis import RedisStore  # pip install langgraph-checkpoint-redis

store = RedisStore.from_conn_string(os.environ["REDIS_URL"])
graph = builder.compile(checkpointer=checkpointer, store=store)
```

Set TTLs on Redis keys to auto-expire stale session data. For sessions longer than your Redis TTL, fall through to Postgres for state reconstruction.
Long-term vector memory
For cross-session agent memory (facts the agent learned in past runs, user preferences, long-term context), plug in a vector store as a LangGraph store backend. Mem0 and Zep both provide LangGraph-compatible store implementations. See the agent memory guide for the full Mem0 and Zep setup on GPU cloud.
Multi-Agent Patterns and Their GPU Implications
Supervisor Pattern
One orchestrator node routes tasks to worker nodes. Workers can run sequentially (one at a time) or in parallel fan-out (multiple workers triggered simultaneously).
```python
# Parallel fan-out: supervisor sends to all workers at once
builder.add_conditional_edges(
    "supervisor",
    lambda s: s["next_workers"],  # returns a list
    {
        "researcher": "researcher_node",
        "coder": "coder_node",
        "reviewer": "reviewer_node",
    },
)
```

GPU concurrency: if 3 workers fire simultaneously, that's 3 concurrent inference calls to vLLM. Peak KV cache demand = (3 workers) x (context length per worker). Set `--max-num-seqs` in vLLM to at least your maximum fan-out degree plus 20% headroom.
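The sizing rule can be made concrete. The per-token KV byte count below is an illustrative assumption (FP8 cache, 48 layers, 8 KV heads, head dimension 128); the 20% headroom mirrors the guidance above:

```python
import math

# How many concurrent sequences fit in the VRAM left over for KV cache?
# per_token_kv_bytes = 2 * layers * kv_heads * head_dim * 1 byte (FP8),
# here 2 * 48 * 8 * 128 = 98,304 -- an illustrative assumption.
def kv_slots(free_kv_gb: float, context_tokens: int,
             per_token_kv_bytes: int = 98_304) -> int:
    return int(free_kv_gb * 1e9 // (context_tokens * per_token_kv_bytes))

# Fan-out degree plus 20% headroom, rounded up:
def max_num_seqs(fan_out: int, headroom: float = 0.2) -> int:
    return math.ceil(fan_out * (1 + headroom))

print(kv_slots(20, 32_768))  # 6  -- cap supervisor fan-out at or below this
print(max_num_seqs(3))       # 4  -- a floor for --max-num-seqs on the 3-worker graph
```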
When to use: heterogeneous task routing where different tasks need different tools or prompting styles. The supervisor model can be small (routing decisions only), while worker models handle the actual inference.
Swarm Pattern
No central orchestrator. Agents hand off to each other via shared message state. Agent A finishes, reads the state, decides Agent B should continue, and updates the routing field.
GPU implication: lower peak concurrency than supervisor fan-out since only one agent runs at a time. But total token chains are longer, as each agent sees the entire accumulated message history. KV cache per request grows with each handoff.
Use case: self-correcting pipelines where agents catch each other's errors. Researcher agent produces a draft; fact-checker agent finds errors; researcher corrects them. The swarm pattern handles this naturally without an explicit routing node.
Hierarchical Multi-Agent
A supervisor that routes to sub-supervisors, each managing their own worker pools. This is the deepest graph structure in LangGraph.
GPU sizing insight: the top-level orchestrator model only needs to make routing decisions, not heavy inference. You can run it on a smaller, cheaper GPU (A100 at $1.64/hr) while worker nodes that do the heavy inference run on H100. Worker models access a shared vLLM pool. The orchestrator calls a separate, smaller model endpoint.
For production fleet autoscaling patterns that extend this architecture to 100+ concurrent agents, see the guide on scaling AI agent fleets.
Observability: Langfuse and Arize Phoenix
Both Langfuse and Arize Phoenix integrate via LangChain callback handlers, which LangGraph passes through to every model call in your graph nodes.
Langfuse integration:
```python
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler()

result = graph.invoke(
    {"messages": [("user", "research quantum computing trends")]},
    config={
        "callbacks": [langfuse_handler],
        "configurable": {"thread_id": "session-123"},
    },
)
```

Each graph invocation creates a trace in Langfuse with per-node spans. You see token counts, latency per node, and the full LLM input/output for each `llm.invoke()` call inside your nodes.
Arize Phoenix integration:
```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Configure Phoenix to export traces to a remote Phoenix instance or any OTLP backend
register(endpoint="http://your-phoenix-host:6006/v1/traces")
LangChainInstrumentor().instrument()

# Phoenix patches LangChain globally; still pass thread_id for the checkpointer
result = graph.invoke(
    {"messages": [("user", "your input")]},
    config={"configurable": {"thread_id": "your-session-id"}},
)
```

Phoenix captures the same data as Langfuse but uses OpenTelemetry spans natively, which makes it easier to export traces to Jaeger, Tempo, or any OTLP-compatible backend.
Cost per agent run:
To calculate cost per graph invocation, join your trace data with Spheron billing:
```python
# GPU cost per invocation = (GPU hourly rate / 3600) * execution_seconds
# For H100 SXM5 spot at $0.80/hr and a 45-second graph run:
cost_per_run = (0.80 / 3600) * 45  # = ~$0.010
```

Add a custom Langfuse tag with the GPU instance type and the thread_id, then join against your hourly GPU cost to get per-run cost dashboards.
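That join can be sketched in a few lines. The trace records and tag values below are illustrative stand-ins, not Langfuse's actual export format:

```python
# Per-run GPU cost from trace duration and the instance's hourly rate.
GPU_HOURLY_USD = {"h100-sxm5-spot": 0.80, "h100-pcie-ondemand": 2.01}

# Stand-ins for rows exported from your tracing backend:
traces = [
    {"thread_id": "session-123", "gpu": "h100-sxm5-spot", "seconds": 45},
    {"thread_id": "session-124", "gpu": "h100-pcie-ondemand", "seconds": 60},
]

costs = {
    t["thread_id"]: round(GPU_HOURLY_USD[t["gpu"]] / 3600 * t["seconds"], 4)
    for t in traces
}
print(costs)  # session-123 comes out to $0.01, matching the calculation above
```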
For the complete Langfuse and Arize Phoenix setup with production dashboards, see the LLM observability guide.
Production Hardening
Rate limiting: put nginx or Caddy in front of the LangGraph server. Limit requests per minute per IP to prevent a runaway Studio session from flooding the graph executor.
```nginx
limit_req_zone $binary_remote_addr zone=langgraph:10m rate=20r/m;

server {
    location / {
        limit_req zone=langgraph burst=10 nodelay;
        proxy_pass http://localhost:8123;
    }
}
```

Auth: the LangGraph server has no built-in authentication in the open-source CLI. Add JWT validation at the proxy layer, or use LangGraph Cloud if you need auth without building it yourself. For self-hosted, a simple approach is an API gateway (Kong, Traefik) that validates a bearer token before forwarding to the LangGraph server.
CORS: when Studio connects to a remote server, the LangGraph server must allow the Studio app's origin. During development: langgraph dev --cors-allow-origins '*'. In production, restrict to your Studio domain.
Secret management: inject API keys via environment variables or Docker secrets. Never pass secrets through graph state, as state is logged to Postgres and visible in Studio's thread history panel.
Retry policy: configure max_concurrency and retry_on_exception per node:
from langgraph.graph import StateGraph
from langgraph.pregel import RetryPolicy
builder.add_node(
"worker",
worker_node,
retry=RetryPolicy(max_attempts=3, retry_on=Exception)
)Tool timeouts: unhandled slow tool calls block the graph executor thread. Wrap all tool functions with explicit timeouts:
```python
import asyncio

async def web_search_tool(query: str) -> str:
    async with asyncio.timeout(30):  # 30-second hard limit (Python 3.11+)
        return await search_api(query)
```

Cost Comparison: Self-Hosted vs LangGraph Cloud + OpenAI API
At 30M tokens per day, the cost difference between self-hosted and managed is significant.

| Metric | LangGraph Cloud + GPT-4o | Self-Hosted + Llama 4 Scout (H100 PCIe On-Demand) | Self-Hosted + Llama 4 Scout (H100 SXM5 Spot) |
|---|---|---|---|
| Per-token inference cost | ~$5/1M input, ~$15/1M output | ~$0 (included in GPU hourly) | ~$0 (included in GPU hourly) |
| Monthly GPU/infra cost at 30M tokens/day | ~$4,500-9,000 (API only) | ~$1,447 (H100 PCIe on-demand 24/7 at $2.01/hr) | ~$576 (spot, preemptible at $0.80/hr) |
| P50 inference latency | 500-2000ms (API) | 50-200ms (local vLLM) | 50-200ms (local vLLM) |
| Data residency | OpenAI data centers | Your Spheron instance | Your Spheron instance |
| Model customizability | None (closed weights) | Full (fine-tune, LoRA, quantization) | Full |
| Spot eviction risk | None | None | Yes (mitigated by Postgres checkpointing) |
For on-demand H100 access sized for multi-agent workloads, see H100 on Spheron. For Blackwell-class performance at significantly lower spot pricing, B200 GPU rental offers 2.25x the FP8 throughput of H100. Check current GPU pricing for live rates.
Pricing fluctuates based on GPU availability. The prices above are based on 01 May 2026 and may have changed.
Common Pitfalls
Tool execution timeouts: Python's default asyncio behavior lets a slow tool call block a graph thread indefinitely. A web search tool that takes 3 minutes prevents the graph node from completing, which means the Postgres checkpoint never writes for that node. Wrap all tool calls with asyncio.timeout().
State explosion in long graphs: by default, state fields accumulate all outputs across node executions. For messages fields, this means the list grows with every turn. Use annotated reducers and trim explicitly:
```python
from typing import Annotated, TypedDict
import operator

class AgentState(TypedDict):
    # This grows unboundedly without trimming
    messages: Annotated[list, operator.add]
```

Add a message trimmer node that runs every N turns and removes old messages beyond your context budget.
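One lightweight alternative to a dedicated trimmer node, given the operator.add-style state above: swap the reducer for one that appends and then keeps only the newest messages. The 40-message budget is an illustrative number; size it to your model's context window:

```python
from typing import Annotated, TypedDict

MAX_MESSAGES = 40  # illustrative budget, not a recommended value

def add_and_trim(existing: list, new: list) -> list:
    # Append like operator.add, then drop everything beyond the budget
    return (existing + new)[-MAX_MESSAGES:]

class AgentState(TypedDict):
    messages: Annotated[list, add_and_trim]
```

The reducer approach guarantees the bound holds after every state update, which also keeps checkpoint rows in Postgres from growing without limit.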
GPU OOM under fan-out: a supervisor spawning 8 workers simultaneously is 8 concurrent vLLM requests, each holding a KV cache slot. If your GPU has headroom for 6 concurrent sequences at your context length, the 7th and 8th requests queue. Cap max_concurrency in the supervisor node to match available KV cache slots.
Checkpoint table bloat: every completed node writes to Postgres. A graph with 20 nodes that runs 10,000 times per day produces 200,000 checkpoint rows per day. Without a pruning job, the table will accumulate millions of rows. Schedule a daily cleanup for threads older than your retention window.
Studio CORS errors: if Studio can't connect to your remote server, check that the LangGraph server's CORS config includes the Studio app's origin. For the Mac Studio app, the origin is typically app://langgraph-studio or null (Electron). Add --cors-allow-origins '*' during development only.
Mixing spot and on-demand nodes incorrectly: only nodes that run unattended (background research, batch processing, overnight data pipelines) should be behind a spot-backed vLLM instance. Any node where a user is actively watching execution in Studio should be on on-demand. A spot eviction mid-session forces the user to wait for a new instance to come up before the graph can resume.
LangGraph Studio with a self-hosted vLLM backend on Spheron H100 gives you full control over the inference stack at a fraction of LangGraph Cloud + OpenAI API costs. For on-demand H100 and B200 instances sized for multi-agent workloads, start with the links below.
