Teams shipping agents in 2026 keep getting stuck on the same question: LangGraph or LangChain? The framing is wrong. LangGraph is built on top of LangChain. The real question is whether your agent needs what LangGraph adds: stateful graph execution, checkpointing, time-travel debugging, and interruption support.
If you have a linear pipeline with no branching, LangChain is sufficient. If your agent needs to loop, branch, resume after failure, or wait for human approval mid-execution, you need LangGraph. Most production agents end up in the second category.
TL;DR Decision Matrix
| Scenario | Use LangChain | Use LangGraph | Use Both |
|---|---|---|---|
| Simple RAG pipeline (retrieve, generate, return) | Yes | No | Optional |
| Linear chatbot with no memory across sessions | Yes | No | Optional |
| Multi-step tool use with fixed sequence | Yes | No | Optional |
| Agent with conditional branches or retries | No | Yes | Yes |
| Human-in-the-loop approval gate | No | Yes | Yes |
| Long-running session that must resume | No | Yes | Yes |
| Multi-agent supervisor with specialized workers | No | Yes | Yes |
| Complex workflow with state replay/debugging | No | Yes | Yes |
What LangChain Actually Is (and Isn't)
LangChain is a composition toolkit. It gives you building blocks: retrievers that connect to vector stores, tool definitions, prompt templates, output parsers, and LCEL (LangChain Expression Language) for wiring them into pipelines. The ecosystem is the real moat. Hundreds of integrations with vector stores, document loaders, embedding models, and third-party APIs exist out of the box.
LCEL makes composition readable:
chain = prompt | llm | output_parser
result = chain.invoke({"question": "What is the capital of France?"})
That is clean, testable, and easy to understand. For fixed-flow pipelines, it's hard to beat.
The weakness shows up with AgentExecutor. LangChain's built-in agent loop handles tool calling, but it's a black box. No native state persistence. No way to interrupt and resume. No branching or conditional routing. If you call agent_executor.invoke() and it fails halfway through a long chain of tool calls, you restart from scratch. For short, low-stakes tasks, this is fine. For production agents running 5-15 tool calls on expensive context, it's a problem.
What LangGraph Adds
LangGraph gives you a directed graph where nodes are Python functions and edges are transitions between them. The state is explicit: a typed Python dict (TypedDict) that every node reads and writes to.
from langgraph.graph import StateGraph, END
from typing import TypedDict
class AgentState(TypedDict):
    messages: list
    tool_calls_remaining: int
    last_tool_output: str

def call_llm(state: AgentState) -> AgentState:
    # call LLM, return partial state update
    ...

def call_tool(state: AgentState) -> AgentState:
    # execute tool, update state
    ...

def should_continue(state: AgentState) -> str:
    if state["tool_calls_remaining"] > 0:
        return "tool"
    return END
graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tool", call_tool)
graph.add_conditional_edges("llm", should_continue)
graph.add_edge("tool", "llm")
graph.set_entry_point("llm")
app = graph.compile()
The graph is explicit, inspectable, and testable. You can visualize it, step through it in a debugger, and replay any state from any checkpoint.
Checkpointing
Checkpointing is the feature that changes what's possible in production. After every node execution, LangGraph serializes the full state. In development, you use MemorySaver. In production, you use AsyncPostgresSaver or AsyncRedisSaver.
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
async with AsyncPostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    await checkpointer.setup()  # creates the checkpoint tables on first run
    app = graph.compile(checkpointer=checkpointer)
    result = await app.ainvoke(input, config={"configurable": {"thread_id": "session-123"}})
With PostgreSQL checkpointing, a spot instance interruption means resuming from the last completed node, not restarting from scratch. For a 10-node graph, that's the difference between losing 9 LLM calls and losing 0.
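Resuming is a matter of re-attaching to the same thread_id. A minimal sketch, assuming the same graph and DATABASE_URL as above and a run that was interrupted mid-graph (the helper name is illustrative):

```python
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

async def resume_session(graph, database_url: str):
    async with AsyncPostgresSaver.from_conn_string(database_url) as checkpointer:
        app = graph.compile(checkpointer=checkpointer)
        config = {"configurable": {"thread_id": "session-123"}}

        # Inspect the last persisted state for this thread.
        snapshot = await app.aget_state(config)
        print("resuming at node(s):", snapshot.next)

        # Invoking with None as the input continues from the latest checkpoint
        # instead of starting the graph over from the entry point.
        return await app.ainvoke(None, config=config)
```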
Human-in-the-loop
interrupt_before and interrupt_after let you pause graph execution at specific nodes, send state to a human reviewer, and resume:
app = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["execute_code"],  # pause before code execution for review
)
LangChain's AgentExecutor has no equivalent primitive.
Time-Travel Debugging
LangGraph Studio (the visual debugger) lets you replay any prior state, branch off a new execution path from any point, and compare outcomes. For debugging complex multi-turn agent failures, this saves hours.
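The same replay mechanism is exposed programmatically through the checkpointer, so you don't need Studio to use it. A rough sketch, assuming a compiled app with a checkpointer and an existing thread (the thread_id and the choice of snapshot are illustrative):

```python
config = {"configurable": {"thread_id": "session-123"}}

# Every checkpoint for the thread, newest first.
history = list(app.get_state_history(config))
for snapshot in history:
    print(snapshot.config["configurable"]["checkpoint_id"], snapshot.next)

# Re-run from an earlier checkpoint: pass its config (which carries the
# checkpoint_id) back into invoke. Execution branches from that point and
# the original history is preserved.
replay_config = history[3].config  # pick any prior snapshot
result = app.invoke(None, config=replay_config)
```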
State Management Deep Dive
LangChain handles memory through objects like ConversationBufferMemory and RedisChatMessageHistory. These work well for chatbots that need the last N messages. They fall apart for agents that need to track structured state across turns.
Compare:
LangChain memory (conversation history only):
from langchain.memory import RedisChatMessageHistory, ConversationBufferWindowMemory
history = RedisChatMessageHistory(session_id="user-123", url=REDIS_URL)
memory = ConversationBufferWindowMemory(chat_memory=history, k=10)
LangGraph state (structured, typed, persisted):
class ResearchState(TypedDict):
    query: str
    sources_found: list[str]
    drafts: list[str]
    approval_status: str
    token_budget_remaining: int
    user_id: str
LangGraph's state is explicit about everything your agent tracks. There's no implicit message buffer that you hope contains the right context. Every field is visible, reducible (you can define how fields merge on updates), and checkpointed.
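A minimal sketch of that merge behavior, using the standard Annotated-reducer pattern on a trimmed version of ResearchState (field names as above):

```python
import operator
from typing import Annotated, TypedDict

class ResearchState(TypedDict):
    query: str
    sources_found: Annotated[list[str], operator.add]  # merged by concatenation
    approval_status: str                               # no reducer: last write wins

def search_node(state: ResearchState) -> dict:
    # Nodes return partial updates; the reducer appends instead of overwriting.
    return {"sources_found": ["https://example.com/paper-1"]}
```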
When checkpoints outperform naive memory:
- Multi-session resume: a user comes back after three days. LangGraph can restore the exact state from where they left off. LangChain's RedisChatMessageHistory stores messages but not the full execution state.
- Parallel branch evaluation: you can fork a graph at checkpoint N, run two different paths, and compare outcomes for A/B testing or debugging.
- Compliance audit trails: regulated industries need a complete record of every agent decision. LangGraph's checkpoint history provides this out of the box.
For long-term cross-session memory that goes beyond graph state (embedding-based recall of facts across sessions), LangGraph checkpoints and vector memory serve different purposes. The guide on persistent agent memory with Mem0 and Zep covers how embedding-based memory sits alongside LangGraph checkpoints as a separate retrieval layer.
Multi-Agent Orchestration: Where LangGraph Wins Decisively
LangChain's AgentExecutor has no native multi-agent primitive. You can chain two agents together, but managing state across them, routing between them conditionally, or running them in parallel requires custom code.
LangGraph handles this with the supervisor pattern:
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from typing import TypedDict, Annotated
class SupervisorState(TypedDict):
    messages: Annotated[list, add_messages]
    next_agent: str

def router(state: SupervisorState) -> str:
    return state["next_agent"]
# Build supervisor graph
supervisor = StateGraph(SupervisorState)
supervisor.add_node("supervisor", supervisor_node)
supervisor.add_node("researcher", researcher_subgraph)
supervisor.add_node("writer", writer_subgraph)
supervisor.add_conditional_edges("supervisor", router)
Parallel subgraph execution via the Send API runs multiple agents simultaneously:
from langgraph.types import Send
def spawn_parallel_agents(state):
    return [
        Send("researcher", {"query": q})
        for q in state["queries"]
    ]
Each sub-agent is its own graph with its own state. The parent graph manages routing, aggregation, and final output. This maps directly to GPU batching at the inference layer: parallel branches issue concurrent requests that a single vLLM inference pool can batch together efficiently.
For the infrastructure side of scaling these multi-agent topologies, the guide on scaling agent fleets with MCP orchestration covers autoscaling patterns, GPU tiering, and cost modeling for fleets of 1,000+ concurrent agents.
Streaming, Interruptions, and Replay
graph.astream_events() gives you per-node streaming. You get a stream of events as each node starts, runs, and completes. This lets you show users incremental progress rather than waiting for the full graph to complete:
async for event in app.astream_events(input, version="v2"):
    if event["event"] == "on_chat_model_stream":
        print(event["data"]["chunk"].content, end="", flush=True)
LangChain's streaming works at the chain level. LangGraph's streaming works at the graph level, with visibility into which node is running.
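A small sketch of that graph-level view: stream_mode="updates" yields one dict per node as it finishes, keyed by node name, so a UI can report which step the agent is on (input shape and thread_id here are illustrative, assuming a messages-based state):

```python
async def stream_progress(app, user_input: str):
    async for update in app.astream(
        {"messages": [("user", user_input)]},
        config={"configurable": {"thread_id": "thread-1"}},
        stream_mode="updates",
    ):
        for node_name, node_output in update.items():
            print(f"[{node_name}] finished")
```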
LangGraph Studio is the visual debugger. It shows the graph structure, lets you inspect state at any node, replay runs from any checkpoint, and branch off new execution paths from any point. For debugging complex multi-turn agent failures, this is the tool that saves hours.
Production interruption patterns: placing interrupt_before on dangerous nodes (code execution, database writes, external API calls) gives you a human approval gate without changing the agent logic. The graph pauses, the state is checkpointed, a notification is sent, and execution resumes when a human approves.
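A sketch of that approval loop, assuming the graph was compiled with interrupt_before=["execute_code"] and a persistent checkpointer (the thread_id and messages are illustrative):

```python
config = {"configurable": {"thread_id": "ticket-42"}}

# 1. Run until the graph pauses in front of the gated node.
app.invoke({"messages": [("user", "Fix the failing migration")]}, config=config)

# 2. Surface the pending state to a reviewer.
snapshot = app.get_state(config)
print("waiting on:", snapshot.next)        # e.g. ("execute_code",)
print("proposed action:", snapshot.values)  # full state for review

# 3a. Approve: invoke with None to continue from the checkpoint.
app.invoke(None, config=config)

# 3b. Or edit the state first, then resume.
# app.update_state(config, {"messages": [("user", "Use a dry run instead")]})
# app.invoke(None, config=config)
```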
Production Observability
Langfuse is the most straightforward observability layer for LangGraph. The callback handler traces every node execution:
from langfuse.callback import CallbackHandler
langfuse_handler = CallbackHandler()
result = app.invoke(
    input,
    config={"callbacks": [langfuse_handler]}
)
Every node execution shows up as a span with token counts, latency, and model parameters. For cost tracking, Langfuse calculates cost per trace automatically using its model pricing table.
Helicone works similarly for cost tracking across inference backends. If you're routing LangGraph nodes to different models (e.g., a cheap 8B model for routing decisions, an expensive 70B model for final output), Helicone gives you a unified cost view.
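That per-node model split is plain LangGraph: each node holds its own client pointed at a different backend. A sketch, with placeholder endpoints and model names and node functions written against the SupervisorState shape from earlier:

```python
from langchain_openai import ChatOpenAI

router_llm = ChatOpenAI(                       # cheap, fast routing model
    model="Qwen/Qwen3-8B",
    base_url="http://router-backend:8000/v1",
    api_key="router-key",
)
writer_llm = ChatOpenAI(                       # expensive, high-quality output model
    model="meta-llama/Llama-3.3-70B-Instruct",
    base_url="http://writer-backend:8000/v1",
    api_key="writer-key",
)

def router_node(state):
    decision = router_llm.invoke(state["messages"])
    return {"next_agent": decision.content.strip()}

def writer_node(state):
    answer = writer_llm.invoke(state["messages"])
    return {"messages": [answer]}
```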
For a full observability setup covering OpenTelemetry instrumentation, DCGM metric correlation, and compliance requirements, the LLM observability guide covering Langfuse, Arize Phoenix, and Helicone covers the complete stack.
Migration Guide: LangChain AgentExecutor to LangGraph
The migration is more structural than a simple API swap. You're not changing the tools, the prompt, or the model. You're changing the execution loop.
Before (LangChain AgentExecutor):
from langchain import hub
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI
from langchain.tools import tool

@tool
def search_web(query: str) -> str:
    """Search the web for information."""
    return web_search(query)

llm = ChatOpenAI(model="gpt-4o")
tools = [search_web]
prompt = hub.pull("hwchase17/openai-functions-agent")  # standard prompt for this agent type
agent = create_openai_functions_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Research the latest LLM benchmarks"})
After (LangGraph StateGraph):
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict, Annotated
from langchain_core.messages import BaseMessage, HumanMessage
from langgraph.graph.message import add_messages
class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]

llm_with_tools = llm.bind_tools(tools)

def agent_node(state: AgentState):
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: AgentState):
    last = state["messages"][-1]
    if last.tool_calls:
        return "tools"
    return END
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")
app = graph.compile(checkpointer=MemorySaver())
result = app.invoke({"messages": [HumanMessage("Research the latest LLM benchmarks")]}, config={"configurable": {"thread_id": "thread-1"}})The tool list, prompt template, and LLM are unchanged. Only the execution loop changes. The key translation: AgentExecutor.invoke() becomes graph.invoke() with a {"messages": [...]} state dict.
What you gain in the migration: full state visibility, checkpointing, streaming at the node level, and the ability to add human-in-the-loop gates without restructuring.
GPU Infrastructure for Both Frameworks
Both LangChain and LangGraph are inference-backend-agnostic. They call LLMs through HTTP APIs. The GPU layer determines your production latency and cost, not the orchestration framework.
vLLM exposes an OpenAI-compatible API that either framework connects to with a single line change:
# Before: OpenAI API
llm = ChatOpenAI(model="gpt-4o")
# After: Self-hosted vLLM on Spheron
llm = ChatOpenAI(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    base_url="http://<spheron-instance-ip>:8000/v1",
    api_key="your-vllm-key"
)
That change works for both LangChain and LangGraph nodes. The orchestration layer never needs to know whether the model is hosted at OpenAI or on a bare-metal H100 in a Spheron data center.
For VRAM sizing, throughput estimates, and latency budgets for specific models, the GPU infrastructure requirements for AI agents guide covers the math. The short version: for a 70B agent model serving under 20 concurrent sessions at 8K context, an H100 PCIe 80GB is the practical minimum.
Reference Architecture: LangGraph + vLLM on Spheron H100
Here is a concrete production setup for a hierarchical multi-agent LangGraph workflow:
LangGraph Supervisor Graph
|
├── Router Node (8B model, fast, routing decisions)
| calls vLLM: Qwen3-8B on H100 PCIe
|
├── Researcher Sub-graph (17B active model, deep analysis)
| calls vLLM: Llama-4-Scout-17B-16E-Instruct on H100 PCIe
|
├── Code Writer Sub-graph (7B model, code generation)
| calls vLLM: Qwen2.5-Coder-7B-Instruct on H100 PCIe
|
└── Synthesizer Node (32B model, final output)
calls vLLM: DeepSeek-R1-Distill-Qwen-32B on H100 PCIe
PostgreSQL checkpointer (RDS or self-hosted)
Langfuse callback handler (traces all nodes)
Redis for session affinity across vLLM instances
Start the vLLM backend on Spheron's H100 instances:
# H100 PCIe for router + worker nodes (starts at $2.01/hr)
docker run --gpus all --ipc=host -p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-8B \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--max-model-len 32768 \
--max-num-seqs 64
# H100 PCIe for the researcher node (starts at $2.01/hr)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--host 0.0.0.0 \
--port 8001 \
--tensor-parallel-size 1 \
--max-model-len 65536
Note: Spheron's H100 instances are available as on-demand PCIe instances. For a 17B-active MoE model at the researcher tier, a single H100 PCIe 80GB handles the load with enough VRAM headroom for 65K context.
GPU cost for this reference architecture:
| GPU | Use Case | On-Demand Price | Est. Tokens/hr at 70% Util |
|---|---|---|---|
| H100 PCIe | Router / Researcher / Code Writer / Synthesizer | $2.01/hr | ~2.5M tokens (8B), ~1.4M tokens (17B active) |
Pricing fluctuates based on GPU availability. The prices above are as of 01 May 2026 and may have changed. Check current GPU pricing for live rates.
For a typical production setup with four H100 PCIe instances (router, researcher, code writer, synthesizer at $2.01/hr each), the baseline cost is roughly $8.04/hr. At 70% utilization across all nodes, this architecture handles approximately 4-5M tokens per hour across the entire graph.
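A quick sanity check on those numbers, using only the quoted rate and the article's throughput estimate:

```python
# Back-of-the-envelope cost per token for the reference architecture.
instances = 4
hourly_rate = 2.01                             # $/hr per H100 PCIe (quoted rate)
baseline_per_hour = instances * hourly_rate    # $8.04/hr

for tokens_per_hour in (4_000_000, 5_000_000):  # estimated graph-wide throughput
    cost_per_million = baseline_per_hour / (tokens_per_hour / 1_000_000)
    print(f"{tokens_per_hour:,} tok/hr -> ${cost_per_million:.2f} per 1M tokens")
# Roughly $1.61-$2.01 per million tokens at the assumed 70% utilization.
```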
When the Framework Choice Doesn't Matter
Most teams that argue about LangGraph vs LangChain are optimizing the wrong layer. The bottleneck is almost never the orchestration framework. It's TTFT (time to first token), throughput, and cost per token from the inference layer.
A well-tuned vLLM backend on a bare-metal GPU delivers TTFT under 200 ms for 8B models at moderate concurrency. A poorly provisioned managed API serving the same model can sit at 1-3 seconds under load. That latency gap swamps any efficiency difference between LangGraph and LangChain.
The framework choice matters for:
- Control flow complexity: LangGraph for anything beyond linear.
- State management: LangGraph for checkpointed, resumable state.
- Multi-agent routing: LangGraph, full stop.
- Team onboarding speed: LangChain wins if your agents are simple.
- Debugging capability: LangGraph Studio is significantly better.
The framework choice does not matter for:
- Throughput and latency: entirely determined by inference backend.
- Cost per token: entirely determined by GPU type and utilization.
- Model quality: entirely determined by model selection and prompting.
If your agents run slowly or cost too much, the fix is usually a better vLLM configuration or a more efficient GPU provisioning strategy, not switching orchestration frameworks.
Both LangGraph and LangChain run faster and cheaper when the inference layer is bare metal. Spheron's H100 instances start at $2.01/hr for PCIe with per-second billing and no seat licenses.
Rent H100 on Spheron → | View all GPU pricing → | Get started →