Most LangGraph tutorials end at "run the graph locally." Production means a self-hosted server, a real model backend, durable state, and real users watching the Studio UI as nodes execute. This guide covers the full stack: from GPU provisioning on Spheron H100 instances through vLLM deployment, Postgres checkpointing, and observability.
What LangGraph Studio Actually Is
LangGraph Studio is a visual IDE for building and debugging stateful agent workflows. It connects to a running LangGraph server and shows your graph topology, lets you trigger invocations, and streams node execution state in real time as the graph runs.
Three components are in play, and conflating them causes confusion:
- Studio UI: the desktop app (Mac/Linux) or web interface. This is just a visualization and control layer. It makes HTTP calls to a LangGraph server.
- LangGraph server: the Python process that actually executes your graph. You can run it locally with `langgraph dev`, self-host it on your own infrastructure, or use LangGraph Cloud.
- LangGraph Cloud: LangChain's managed hosting for the LangGraph server. You push your code, they run the server. You still pay for the underlying model API calls separately.
Three deployment modes exist:
| Mode | When to use |
|---|---|
| Local dev (langgraph dev) | Building and testing graphs on your machine |
| Self-hosted server | Production, cost control, custom model backends, data residency requirements |
| LangGraph Cloud | When you want zero infrastructure management and are fine with OpenAI/Anthropic API pricing |
Self-hosting is the only mode where you can point your graph nodes at a locally running model server. The LangGraph server and vLLM process communicate over localhost with single-digit-millisecond round trips. With LangGraph Cloud, every node invocation goes out to an external model API over the public internet.
Why GPU Cloud Changes the LangGraph Math
The LangGraph server itself is CPU-bound. It runs the graph routing logic, manages state serialization to Postgres, and handles the Studio WebSocket connection. None of that touches a GPU.
The GPU work happens in the model server your graph nodes call. Every llm.invoke() call in a node is an HTTP request to your vLLM instance, and vLLM's throughput and latency are entirely GPU-bound.
Two GPU failure modes show up specifically in LangGraph workflows:
VRAM OOM (static): you loaded a model that doesn't fit on your GPU. The vLLM process crashes or rejects requests. This is caught before production if you test with the right model-to-GPU pairing.
KV cache starvation (dynamic): this is the one that bites in production. The KV cache fills up under concurrent load. Each active inference request holds KV cache proportional to its context length. In multi-agent graph patterns where a supervisor fans out to 5 worker nodes simultaneously, those 5 nodes each hold KV cache slots while they wait for their model responses. Peak KV cache demand = (context length per request) x (number of concurrent inflight requests). Underestimate this and you get queue buildup that cascades into timeout chains across the graph.
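The peak-demand formula above is easy to turn into a back-of-envelope sizing check. The sketch below assumes an FP8 KV cache and illustrative architecture numbers (48 layers, 8 KV heads, head dimension 128); substitute the real values from your model's config:

```python
# Back-of-envelope KV cache sizing. Layer/head counts are illustrative
# assumptions -- read the real values from your model's config.json.
def kv_cache_gb(context_tokens: int, concurrent_requests: int,
                layers: int = 48, kv_heads: int = 8, head_dim: int = 128,
                bytes_per_value: int = 1) -> float:  # 1 byte per value at FP8
    # Each token stores a key vector and a value vector per layer per KV head
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * concurrent_requests * per_token / 1e9

# A supervisor fanning out to 5 workers, each holding 32k of context:
print(f"{kv_cache_gb(32_768, 5):.1f} GB")  # 16.1 GB of KV cache at peak
```

If that number plus model weights exceeds VRAM times your GPU memory utilization setting, requests queue and graph-level timeouts follow.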
The spot instance argument is worth making here. LangGraph's Postgres checkpointer means a spot eviction is a graph pause, not a restart. The graph resumes from the last completed node when a new instance comes up. This is unlike frameworks without native checkpointing, where a spot interruption means starting the entire graph from scratch. For the full picture on GPU autoscaling in agent systems, see the multi-agent GPU infrastructure guide.
Architecture Overview
The full self-hosted stack has four layers:
```
+-------------------------------------------------------------+
| Layer 1: LangGraph Studio UI                                 |
| (desktop app / browser)                                      |
+-------------------------------------------------------------+
                               |
                               | HTTP + WebSocket
                               v
+-------------------------------------------------------------+
| Layer 2: LangGraph Server                                    |
| graph execution . state routing . Studio API                 |
| checkpoint management . thread history                       |
+------------------------------+------------------------------+
                               |
             +-----------------+---------------+
             |                                 |
  HTTP inference calls                  reads / writes
             |                                 |
             v                                 v
+-------------------------+      +---------------------------+
| Layer 3a: vLLM/SGLang   |      | Layer 3b: State Storage   |
| model server            |      | Postgres (checkpoints)    |
| (GPU-backed)            |      | Redis (in-session)        |
|                         |      | Vector DB (optional)      |
+-------------------------+      +---------------------------+
             |
             | CUDA
             v
+-------------------------------------------------------------+
| Layer 4: GPU (H100 / L40S / B200)                            |
| model weights + KV cache                                     |
+-------------------------------------------------------------+
```

| Layer | Component | Role |
|---|---|---|
| 1 | LangGraph Studio UI | Visualization and control. Sends HTTP/WebSocket calls to the LangGraph server. |
| 2 | LangGraph Server | Executes graph nodes, manages state routing, serves Studio API, writes checkpoints. |
| 3a | vLLM / SGLang | Handles all inference calls from graph nodes. GPU-backed; throughput and latency are entirely GPU-bound. |
| 3b | State Persistence | Postgres for durable checkpoints, Redis for short-term in-session state, optional vector DB for long-term memory. |
| 4 | GPU | Runs the model server. VRAM holds model weights and KV cache. |
The LangGraph server and vLLM can run on the same host (ideal: Unix socket or localhost) or on separate hosts with a load balancer between them. Co-location is simpler and cuts inter-process latency to under 1ms.
State persistence runs across three tiers with different durability tradeoffs:
| Tier | Storage | Durability | Use case |
|---|---|---|---|
| Durable | Postgres | Survives restarts and spot evictions | Full state snapshot after every completed node |
| In-session | Redis | Lost on restart (TTL-managed) | Tool call results, scratchpad, fast in-session lookups |
| Long-term | Vector DB | Persistent across sessions | Agent memory, user preferences, cross-run context |
For vector DB setup with Mem0 or Zep, see the agent memory guide.
Step-by-Step Deployment on Spheron H100
1. Provision the GPU Instance
Select your GPU based on the model you plan to serve. For MoE models, the binding constraint is total parameter count (every expert's weights must be resident in VRAM), not the active parameter count per token.
| Model | VRAM Required (FP8) | Recommended GPU | On-Demand $/hr | Spot $/hr |
|---|---|---|---|---|
| Qwen 3 30B A3B | ~16GB | L40S (48GB) | $0.72 | N/A |
| Llama 4 Scout 17B/16E | ~55-60GB | H100 SXM5 (80GB) | N/A | $0.80 |
| Llama 4 Maverick 17B/128E | ~55-60GB | H100 SXM5 (80GB) | N/A | $0.80 |
| 70B-class (FP8) | ~70GB | H100 SXM5 (80GB) | N/A | $0.80 |
| 70B-class (parallel fan-out) | ~140GB | 2x H100 SXM5 | N/A | $1.60 |
Pricing fluctuates based on GPU availability. The prices above are based on 01 May 2026 and may have changed. Check current GPU pricing for live rates.
Once you have the instance, open ports 8000 (vLLM) and 8123 (LangGraph server) in the Spheron firewall settings. SSH in and confirm GPU access:
```shell
nvidia-smi
# Should show GPU name, VRAM total, and driver version
```

2. Deploy the vLLM Backend
Run the vLLM OpenAI-compatible server:
```shell
docker run --gpus all --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 64
```

Key flags explained:
- `--dtype fp8`: enables FP8 Tensor Cores on H100 for ~50% VRAM reduction and ~1.5x throughput gain
- `--gpu-memory-utilization 0.92`: leaves 8% VRAM headroom for CUDA context overhead
- `--max-model-len 32768`: caps context length to control KV cache footprint; increase if your graph nodes use long context windows
- `--max-num-seqs 64`: maximum concurrent sequences; size to your expected graph fan-out degree plus headroom
- `--ipc=host`: required for CUDA multi-process shared memory
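As a quick sanity check on these flags, verify that model weights plus expected peak KV cache fit inside the VRAM budget the utilization flag allows. The weight and cache figures below are illustrative assumptions, not measured values:

```python
# Does (weights + peak KV cache) fit under vram * --gpu-memory-utilization?
def fits(vram_gb: float, gpu_memory_utilization: float,
         weights_gb: float, kv_cache_gb: float) -> bool:
    return weights_gb + kv_cache_gb <= vram_gb * gpu_memory_utilization

# 80 GB H100 at utilization 0.92 -> 73.6 GB budget (illustrative numbers):
print(fits(80, 0.92, 58, 14))  # True  -- fits with room to spare
print(fits(80, 0.92, 58, 20))  # False -- reduce --max-model-len or --max-num-seqs
```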
For Qwen 3 30B A3B (smaller model, higher throughput per GPU):
```shell
docker run --gpus all --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-30B-A3B \
  --dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --max-num-seqs 128
```

For the complete vLLM production configuration including multi-GPU tensor parallelism, prefix caching, and monitoring setup, see the full vLLM production guide.
For workloads that require guaranteed structured outputs (JSON tool calls from every graph node), SGLang's constrained decoding is often faster than vLLM's guided decoding, per SGLang's own benchmarks. The SGLang deployment guide covers the same deployment steps for that engine.
Verify the server is running:
```shell
curl http://localhost:8000/v1/models
# Should return JSON with your model name
```

3. Configure the LangGraph Agent
Here is a working 2-node supervisor-worker graph that uses the vLLM backend:
```python
import os
import operator
from typing import Annotated, TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver  # pip install langgraph-checkpoint-postgres
from psycopg import Connection
from psycopg.rows import dict_row

# Point at your vLLM instance
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",  # vLLM doesn't require auth by default
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    temperature=0,
)

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    next: str

def supervisor_node(state: AgentState):
    response = llm.invoke(state["messages"])
    # Route to worker or finish based on response content
    if "DONE" in response.content:
        return {"next": "end", "messages": [response]}
    return {"next": "worker", "messages": [response]}

def worker_node(state: AgentState):
    response = llm.invoke(state["messages"])
    return {"messages": [response], "next": "supervisor"}

builder = StateGraph(AgentState)
builder.add_node("supervisor", supervisor_node)
builder.add_node("worker", worker_node)
builder.set_entry_point("supervisor")
builder.add_conditional_edges(
    "supervisor",
    lambda s: s["next"],
    {"worker": "worker", "end": END},
)
builder.add_edge("worker", "supervisor")

# Postgres checkpointer for durable state. Note: PostgresSaver.from_conn_string()
# returns a context manager, so for a long-lived server process construct the
# saver from a persistent connection instead.
conn = Connection.connect(
    os.environ["POSTGRES_URL"],
    autocommit=True,
    prepare_threshold=0,
    row_factory=dict_row,
)
checkpointer = PostgresSaver(conn)
checkpointer.setup()  # Creates the checkpoint tables and schema on first run

graph = builder.compile(checkpointer=checkpointer)
```

Environment variables needed in `.env`:

```
POSTGRES_URL=postgresql://user:pass@localhost:5432/langgraph
REDIS_URL=redis://localhost:6379
LANGCHAIN_API_KEY=your-key-here  # optional, for LangSmith traces
```

4. Start the LangGraph Server
Create langgraph.json in your project root (LangGraph CLI 0.2.x format as of April 2026):
```json
{
  "graphs": {
    "agent": "./agent.py:graph"
  },
  "env": ".env",
  "python_version": "3.11"
}
```

Start the development server:

```shell
pip install langgraph-cli
langgraph dev --host 0.0.0.0 --port 8123
```

Note: `langgraph dev` defaults to port 2024. The `--port 8123` flag used here matches the `langgraph up` default, which makes switching between the two easier. If you omit the flag, use port 2024 in your Studio connection URL instead.
For production, use the Docker-based deployment instead of langgraph dev:
```shell
langgraph up --host 0.0.0.0 --port 8123
```

`langgraph dev` runs a lightweight server with auto-reload. `langgraph up` builds a Docker image from your project and runs it with production settings (Docker Compose-based flow). For the latest production path, LangChain shipped `langgraph deploy` in March 2026, which supersedes `langgraph up` for cloud deployments; `langgraph up` still works for the older Docker Compose flow. Use `dev` for Studio testing, `deploy` or `up` for actual production traffic.
With the server running, open LangGraph Studio and click "Connect to self-hosted server." Enter your server's public IP and port 8123. Studio will load your graph topology, show all nodes and edges, and let you trigger invocations with custom input state.
5. Connect LangGraph Studio
In the Studio app:
- Click the server selector in the top-left dropdown
- Choose "Self-hosted server"
- Enter `http://<your-instance-ip>:8123`
- Studio loads your graph topology from the server's `/info` endpoint
What Studio shows:
- Graph panel: nodes and edges drawn from your `StateGraph` definition
- Thread history: all past invocations with their `thread_id`, grouped by run
- State diff per node: each node execution shows the before/after state diff, so you can see exactly what changed at each step
For HTTPS with a self-signed cert, export the cert chain and add it to Studio's trusted cert list in preferences. Using HTTP is simpler during development if your Studio and server are on a private network.
State Persistence: Postgres, Redis, and Vector Store
Postgres checkpointer
Every time a graph node completes, the PostgresSaver writes a full state snapshot to the checkpoints table. The snapshot includes all messages, tool call results, and any custom state fields. The table schema (created by calling checkpointer.setup() once before the graph is compiled) stores by thread_id and checkpoint_ns, which together uniquely identify a point in the graph execution.
If a spot instance is reclaimed mid-graph, the graph resumes from the last completed checkpoint when a new instance comes up. The node that was executing at eviction time re-runs from scratch; all prior nodes are skipped.
Schedule a cleanup job to prevent unbounded table growth. The exact pruning query depends on your LangGraph version's schema. Run \d checkpoints in psql to confirm available columns before writing any cleanup job.
The standard schema created by PostgresSaver.setup() does not include an updated_at column. A reliable approach is to track thread activity at the application level (e.g., record thread_id and last-used timestamp in a separate table) and prune by those IDs:
```sql
DELETE FROM checkpoints
WHERE thread_id IN (
    SELECT thread_id FROM thread_activity
    WHERE last_used_at < NOW() - INTERVAL '7 days'
);
```

If you need a schema-only approach without a separate tracking table, the `checkpoint_id` column is a UUIDv1, which encodes the creation timestamp. You can extract it and filter on it:
```sql
DELETE FROM checkpoints
WHERE (uuid_send(checkpoint_id::uuid)::bytea IS NOT NULL)
  AND (
    to_timestamp((
        ((('x' || encode(substring(uuid_send(checkpoint_id::uuid) from 7 for 2), 'hex'))::bit(16)::integer::bigint & 4095) << 48)
      | (('x' || encode(substring(uuid_send(checkpoint_id::uuid) from 5 for 2), 'hex'))::bit(16)::integer::bigint << 32)
      | (('x' || encode(substring(uuid_send(checkpoint_id::uuid) from 1 for 4), 'hex'))::bit(32)::bigint & 4294967295)
    )::double precision / 1e7 - 12219292800) < NOW() - INTERVAL '7 days'
  );
```

This UUIDv1 approach is verbose and brittle across PostgreSQL versions. Tracking thread activity at the application layer is the more maintainable option.
Redis store
For short-term within-session state (tool results, scratchpad, intermediate computations that don't need durable storage), add a Redis store:
```python
from langgraph.store.redis import RedisStore  # pip install langgraph-checkpoint-redis

store = RedisStore.from_conn_string(os.environ["REDIS_URL"])
graph = builder.compile(checkpointer=checkpointer, store=store)
```

Set TTLs on Redis keys to auto-expire stale session data. For sessions longer than your Redis TTL, fall through to Postgres for state reconstruction.
Long-term vector memory
For cross-session agent memory (facts the agent learned in past runs, user preferences, long-term context), plug in a vector store as a LangGraph store backend. Mem0 and Zep both provide LangGraph-compatible store implementations. See the agent memory guide for the full Mem0 and Zep setup on GPU cloud.
Multi-Agent Patterns and Their GPU Implications
Supervisor Pattern
One orchestrator node routes tasks to worker nodes. Workers can run sequentially (one at a time) or in parallel fan-out (multiple workers triggered simultaneously).
```python
# Parallel fan-out: supervisor sends to all workers at once
builder.add_conditional_edges(
    "supervisor",
    lambda s: s["next_workers"],  # returns a list
    {
        "researcher": "researcher_node",
        "coder": "coder_node",
        "reviewer": "reviewer_node",
    },
)
```

GPU concurrency: if 3 workers fire simultaneously, that's 3 concurrent inference calls to vLLM. Peak KV cache demand = (3 workers) x (context length per worker). Set `--max-num-seqs` in vLLM to at least your maximum fan-out degree plus 20% headroom.
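The sizing rule can be made concrete. The per-token KV byte count below is an illustrative assumption (FP8 cache, 48 layers, 8 KV heads, head dimension 128); the 20% headroom mirrors the guidance above:

```python
import math

# How many concurrent sequences fit in the VRAM left over for KV cache?
# per_token_kv_bytes = 2 * layers * kv_heads * head_dim * 1 byte (FP8),
# here 2 * 48 * 8 * 128 = 98,304 -- an illustrative assumption.
def kv_slots(free_kv_gb: float, context_tokens: int,
             per_token_kv_bytes: int = 98_304) -> int:
    return int(free_kv_gb * 1e9 // (context_tokens * per_token_kv_bytes))

# Fan-out degree plus 20% headroom, rounded up:
def max_num_seqs(fan_out: int, headroom: float = 0.2) -> int:
    return math.ceil(fan_out * (1 + headroom))

print(kv_slots(20, 32_768))  # 6  -- cap supervisor fan-out at or below this
print(max_num_seqs(3))       # 4  -- a floor for --max-num-seqs on the 3-worker graph
```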
When to use: heterogeneous task routing where different tasks need different tools or prompting styles. The supervisor model can be small (routing decisions only), while worker models handle the actual inference.
Swarm Pattern
No central orchestrator. Agents hand off to each other via shared message state. Agent A finishes, reads the state, decides Agent B should continue, and updates the routing field.
GPU implication: lower peak concurrency than supervisor fan-out since only one agent runs at a time. But total token chains are longer, as each agent sees the entire accumulated message history. KV cache per request grows with each handoff.
Use case: self-correcting pipelines where agents catch each other's errors. Researcher agent produces a draft; fact-checker agent finds errors; researcher corrects them. The swarm pattern handles this naturally without an explicit routing node.
Hierarchical Multi-Agent
A supervisor that routes to sub-supervisors, each managing their own worker pools. This is the deepest graph structure in LangGraph.
GPU sizing insight: the top-level orchestrator model only needs to make routing decisions, not heavy inference. You can run it on a smaller, cheaper GPU (A100 at $1.64/hr) while worker nodes that do the heavy inference run on H100. Worker models access a shared vLLM pool. The orchestrator calls a separate, smaller model endpoint.
For production fleet autoscaling patterns that extend this architecture to 100+ concurrent agents, see the guide on scaling AI agent fleets.
Observability: Langfuse and Arize Phoenix
Both Langfuse and Arize Phoenix integrate via LangChain callback handlers, which LangGraph passes through to every model call in your graph nodes.
Langfuse integration:
```python
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler()

result = graph.invoke(
    {"messages": [("user", "research quantum computing trends")]},
    config={
        "callbacks": [langfuse_handler],
        "configurable": {"thread_id": "session-123"},
    },
)
```

Each graph invocation creates a trace in Langfuse with per-node spans. You see token counts, latency per node, and the full LLM input/output for each `llm.invoke()` call inside your nodes.
Arize Phoenix integration:
```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Configure Phoenix to export traces to a remote Phoenix instance or any OTLP backend
register(endpoint="http://your-phoenix-host:6006/v1/traces")
LangChainInstrumentor().instrument()

# Phoenix patches LangChain globally; still pass thread_id for the checkpointer
result = graph.invoke(
    {"messages": [("user", "your input")]},
    config={"configurable": {"thread_id": "your-session-id"}},
)
```

Phoenix captures the same data as Langfuse but uses OpenTelemetry spans natively, which makes it easier to export traces to Jaeger, Tempo, or any OTLP-compatible backend.
Cost per agent run:
To calculate cost per graph invocation, join your trace data with Spheron billing:
```python
# GPU cost per invocation = (GPU hourly rate / 3600) * execution_seconds
# For H100 SXM5 spot at $0.80/hr and a 45-second graph run:
cost_per_run = (0.80 / 3600) * 45  # = ~$0.010
```

Add a custom Langfuse tag with the GPU instance type and the thread_id, then join against your hourly GPU cost to get per-run cost dashboards.
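That join can be sketched in a few lines. The trace records and tag values below are illustrative stand-ins, not Langfuse's actual export format:

```python
# Per-run GPU cost from trace duration and the instance's hourly rate.
GPU_HOURLY_USD = {"h100-sxm5-spot": 0.80, "h100-pcie-ondemand": 2.01}

# Stand-ins for rows exported from your tracing backend:
traces = [
    {"thread_id": "session-123", "gpu": "h100-sxm5-spot", "seconds": 45},
    {"thread_id": "session-124", "gpu": "h100-pcie-ondemand", "seconds": 60},
]

costs = {
    t["thread_id"]: round(GPU_HOURLY_USD[t["gpu"]] / 3600 * t["seconds"], 4)
    for t in traces
}
print(costs)  # session-123 comes out to $0.01, matching the calculation above
```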
For the complete Langfuse and Arize Phoenix setup with production dashboards, see the LLM observability guide.
Production Hardening
Rate limiting: put nginx or Caddy in front of the LangGraph server. Limit requests per minute per IP to prevent a runaway Studio session from flooding the graph executor.
```nginx
limit_req_zone $binary_remote_addr zone=langgraph:10m rate=20r/m;

server {
    location / {
        limit_req zone=langgraph burst=10 nodelay;
        proxy_pass http://localhost:8123;
    }
}
```

Auth: the LangGraph server has no built-in authentication in the open-source CLI. Add JWT validation at the proxy layer, or use LangGraph Cloud if you need auth without building it yourself. For self-hosted, a simple approach is an API gateway (Kong, Traefik) that validates a bearer token before forwarding to the LangGraph server.
CORS: when Studio connects to a remote server, the LangGraph server must allow the Studio app's origin. During development: langgraph dev --cors-allow-origins '*'. In production, restrict to your Studio domain.
Secret management: inject API keys via environment variables or Docker secrets. Never pass secrets through graph state, as state is logged to Postgres and visible in Studio's thread history panel.
Retry policy: configure max_concurrency and retry_on_exception per node:
from langgraph.graph import StateGraph
from langgraph.pregel import RetryPolicy
builder.add_node(
"worker",
worker_node,
retry=RetryPolicy(max_attempts=3, retry_on=Exception)
)Tool timeouts: unhandled slow tool calls block the graph executor thread. Wrap all tool functions with explicit timeouts:
```python
import asyncio

async def web_search_tool(query: str) -> str:
    async with asyncio.timeout(30):  # 30-second hard limit (Python 3.11+)
        return await search_api(query)
```

Cost Comparison: Self-Hosted vs LangGraph Cloud + OpenAI API
At 30M tokens per day, the cost difference between self-hosted and managed is significant.

| Metric | LangGraph Cloud + GPT-4o | Self-Hosted + Llama 4 Scout (H100 PCIe On-Demand) | Self-Hosted + Llama 4 Scout (H100 SXM5 Spot) |
|---|---|---|---|
| Per-token inference cost | ~$5/1M input, ~$15/1M output | ~$0 (included in GPU hourly) | ~$0 (included in GPU hourly) |
| Monthly GPU/infra cost at 30M tokens/day | ~$4,500-9,000 (API only) | ~$1,447 (H100 PCIe on-demand 24/7 at $2.01/hr) | ~$576 (spot, preemptible at $0.80/hr) |
| P50 inference latency | 500-2000ms (API) | 50-200ms (local vLLM) | 50-200ms (local vLLM) |
| Data residency | OpenAI data centers | Your Spheron instance | Your Spheron instance |
| Model customizability | None (closed weights) | Full (fine-tune, LoRA, quantization) | Full |
| Spot eviction risk | None | None | Yes (mitigated by Postgres checkpointing) |
For on-demand H100 access sized for multi-agent workloads, see H100 on Spheron. For Blackwell-class performance at significantly lower spot pricing, B200 GPU rental offers 2.25x the FP8 throughput of H100. Check current GPU pricing for live rates.
Pricing fluctuates based on GPU availability. The prices above are based on 01 May 2026 and may have changed.
Common Pitfalls
Tool execution timeouts: Python's default asyncio behavior lets a slow tool call block a graph thread indefinitely. A web search tool that takes 3 minutes prevents the graph node from completing, which means the Postgres checkpoint never writes for that node. Wrap all tool calls with asyncio.timeout().
State explosion in long graphs: by default, state fields accumulate all outputs across node executions. For messages fields, this means the list grows with every turn. Use annotated reducers and trim explicitly:
```python
from typing import Annotated, TypedDict
import operator

class AgentState(TypedDict):
    # This grows unboundedly without trimming
    messages: Annotated[list, operator.add]
```

Add a message trimmer node that runs every N turns and removes old messages beyond your context budget.
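One lightweight alternative to a dedicated trimmer node, given the operator.add-style state above: swap the reducer for one that appends and then keeps only the newest messages. The 40-message budget is an illustrative number; size it to your model's context window:

```python
from typing import Annotated, TypedDict

MAX_MESSAGES = 40  # illustrative budget, not a recommended value

def add_and_trim(existing: list, new: list) -> list:
    # Append like operator.add, then drop everything beyond the budget
    return (existing + new)[-MAX_MESSAGES:]

class AgentState(TypedDict):
    messages: Annotated[list, add_and_trim]
```

The reducer approach guarantees the bound holds after every state update, which also keeps checkpoint rows in Postgres from growing without limit.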
GPU OOM under fan-out: a supervisor spawning 8 workers simultaneously is 8 concurrent vLLM requests, each holding a KV cache slot. If your GPU has headroom for 6 concurrent sequences at your context length, the 7th and 8th requests queue. Cap max_concurrency in the supervisor node to match available KV cache slots.
Checkpoint table bloat: every completed node writes to Postgres. A graph with 20 nodes that runs 10,000 times per day produces 200,000 checkpoint rows per day. Without a pruning job, the table will accumulate millions of rows. Schedule a daily cleanup for threads older than your retention window.
Studio CORS errors: if Studio can't connect to your remote server, check that the LangGraph server's CORS config includes the Studio app's origin. For the Mac Studio app, the origin is typically app://langgraph-studio or null (Electron). Add --cors-allow-origins '*' during development only.
Mixing spot and on-demand nodes incorrectly: only nodes that run unattended (background research, batch processing, overnight data pipelines) should be behind a spot-backed vLLM instance. Any node where a user is actively watching execution in Studio should be on on-demand. A spot eviction mid-session forces the user to wait for a new instance to come up before the graph can resume.
LangGraph Studio with a self-hosted vLLM backend on Spheron H100 gives you full control over the inference stack at a fraction of LangGraph Cloud + OpenAI API costs. For on-demand H100 and B200 instances sized for multi-agent workloads, start with the links below.
