Teams shipping agents in 2026 keep getting stuck on the same question: LangGraph or LangChain? The framing is wrong. LangGraph is built on top of LangChain. The real question is whether your agent needs what LangGraph adds: stateful graph execution, checkpointing, time-travel debugging, and interruption support.
If you have a linear pipeline with no branching, LangChain is sufficient. If your agent needs to loop, branch, resume after failure, or wait for human approval mid-execution, you need LangGraph. Most production agents end up in the second category.
TL;DR Decision Matrix
| Scenario | Use LangChain | Use LangGraph | Use Both |
|---|---|---|---|
| Simple RAG pipeline (retrieve, generate, return) | Yes | No | Optional |
| Linear chatbot with no memory across sessions | Yes | No | Optional |
| Multi-step tool use with fixed sequence | Yes | No | Optional |
| Agent with conditional branches or retries | No | Yes | Yes |
| Human-in-the-loop approval gate | No | Yes | Yes |
| Long-running session that must resume | No | Yes | Yes |
| Multi-agent supervisor with specialized workers | No | Yes | Yes |
| Complex workflow with state replay/debugging | No | Yes | Yes |
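The matrix collapses to one rule: any requirement beyond a fixed linear flow points to LangGraph. A minimal sketch of that rule follows. The `AgentRequirements` fields and the `recommend` function are hypothetical names introduced for illustration and are not part of either library.

```python
from dataclasses import dataclass


@dataclass
class AgentRequirements:
    """Hypothetical checklist mirroring the LangGraph rows of the matrix."""
    conditional_branches_or_retries: bool = False
    human_approval_gate: bool = False
    must_resume_long_sessions: bool = False
    multi_agent_supervisor: bool = False
    state_replay_or_debugging: bool = False


def recommend(req: AgentRequirements) -> str:
    """Return 'LangGraph' if any graph-only requirement is set, else 'LangChain'."""
    needs_graph = any(vars(req).values())
    return "LangGraph" if needs_graph else "LangChain"


print(recommend(AgentRequirements()))                          # simple RAG pipeline
print(recommend(AgentRequirements(human_approval_gate=True)))  # approval gate
```

Because LangChain components still plug into LangGraph nodes, a "LangGraph" answer here corresponds to the "Use Both" column in practice.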
What LangChain Actually Is (and Isn't)
LangChain is a composition toolkit. It gives you building blocks: retrievers that connect to vector stores, tool definitions, prompt templates, output parsers, and LCEL (LangChain Expression Language) for wiring them into pipelines. The ecosystem is the real moat. Hundreds of integrations with vector stores, document loaders, embedding models, and third-party APIs exist out of the box.
LCEL makes composition readable:
chain = prompt | llm | output_parser
result = chain.invoke({"question": "What is the capital of France?"})
That is clean, testable, and easy to understand. For fixed-flow pipelines, it's hard to beat.
The weakness shows up with AgentExecutor. LangChain's built-in agent loop handles tool calling, but it's a black box. No native state persistence. No way to interrupt and resume. No branching or conditional routing. If you call agent_executor.invoke() and it fails halfway through a long chain of tool calls, you restart from scratch. For short, low-stakes tasks, this is fine. For production agents running 5-15 tool calls on expensive context, it's a problem.
What LangGraph Adds
LangGraph gives you a directed graph where nodes are Python functions and edges are transitions between them. The state is explicit: a typed Python dict (TypedDict) that every node reads and writes to.
from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgentState(TypedDict):
    messages: list
    tool_calls_remaining: int
    last_tool_output: str

def call_llm(state: AgentState) -> AgentState:
    # call LLM, return partial state update
    ...

def call_tool(state: AgentState) -> AgentState:
    # execute tool, update state
    ...

def should_continue(state: AgentState) -> str:
    if state["tool_calls_remaining"] > 0:
        return "tool"
    return END

graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tool", call_tool)
graph.add_conditional_edges("llm", should_continue)
graph.add_edge("tool", "llm")
graph.set_entry_point("llm")
app = graph.compile()
The graph is explicit, inspectable, and testable. You can visualize it, step through it in a debugger, and replay any state from any checkpoint.
Checkpointing
Checkpointing is the feature that changes what's possible in production. After every node execution, LangGraph serializes the full state. In development, you use MemorySaver. In production, you use AsyncPostgresSaver or AsyncRedisSaver.
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

# Run inside an async function; input and DATABASE_URL are defined elsewhere.
async with AsyncPostgresSaver.from_conn_string(DATABASE_URL) as checkpointer:
    await checkpointer.setup()  # first run only: creates the checkpoint tables
    app = graph.compile(checkpointer=checkpointer)
    result = await app.ainvoke(input, config={"configurable": {"thread_id": "session-123"}})
With PostgreSQL checkpointing, a spot instance interruption means resuming from the last completed node, not restarting from scratch. For a 10-node graph, that's the difference between losing 9 LLM calls and losing 0.
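To make the resume behaviour concrete, here is a framework-free sketch of checkpoint-and-resume for a 10-node sequence. It is not LangGraph's implementation: the `checkpoints` dict stands in for a database table, and `make_node` and `run` are hypothetical helpers that simulate an interruption at the tenth node.

```python
from typing import Callable

# Hypothetical in-memory stand-in for a checkpoint table, keyed by thread_id.
checkpoints: dict[str, dict] = {}
calls_made: list[str] = []


def make_node(name: str, fail_once: set[str]) -> Callable[[dict], dict]:
    """Build a node that records its work and fails once if listed in fail_once."""
    def node(state: dict) -> dict:
        if name in fail_once:
            fail_once.discard(name)
            raise RuntimeError(f"{name} interrupted")
        calls_made.append(name)
        return {**state, "done": state["done"] + [name]}
    return node


def run(thread_id: str, nodes: list[tuple[str, Callable[[dict], dict]]]) -> dict:
    """Run nodes in order, starting after the last checkpointed node."""
    saved = checkpoints.get(thread_id, {"next_index": 0, "state": {"done": []}})
    state, start = saved["state"], saved["next_index"]
    for index in range(start, len(nodes)):
        _, fn = nodes[index]
        state = fn(state)
        # Persist after every completed node.
        checkpoints[thread_id] = {"next_index": index + 1, "state": state}
    return state


fail_once = {"node_9"}
nodes = [(f"node_{i}", make_node(f"node_{i}", fail_once)) for i in range(10)]

try:
    run("session-123", nodes)       # interrupted at the tenth node (node_9)
except RuntimeError:
    pass
final = run("session-123", nodes)   # resumes at node_9, not node_0
print(len(calls_made))              # every node did its work exactly once
```

Without the checkpoint lookup at the top of `run`, the second call would repeat the first nine nodes before reaching the one that failed.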
Human-in-the-loop
interrupt_before and interrupt_after let you pause graph execution at specific nodes, send state to a human reviewer, and resume:
app = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["execute_code"]  # pause before code execution for review
)
LangChain's AgentExecutor has no equivalent primitive.
Time-Travel Debugging
LangGraph Studio (the visual debugger) lets you replay any prior state, branch off a new execution path from any point, and compare outcomes. For debugging complex multi-turn agent failures, this saves hours.
State Management Deep Dive
LangChain handles memory through objects like ConversationBufferMemory and RedisChatMessageHistory. These work well for chatbots that need the last N messages. They fall apart for agents that need to track structured state across turns.
Compare:
LangChain memory (conversation history only):
from langchain.memory import ConversationBufferWindowMemory
from langchain_community.chat_message_histories import RedisChatMessageHistory

history = RedisChatMessageHistory(session_id="user-123", url=REDIS_URL)
memory = ConversationBufferWindowMemory(chat_memory=history, k=10)
LangGraph state (structured, typed, persisted):
class ResearchState(TypedDict):
    query: str
    sources_found: list[str]
    drafts: list[str]
    approval_status: str
    token_budget_remaining: int
    user_id: str
LangGraph's state is explicit about everything your agent tracks. There's no implicit message buffer that you hope contains the right context. Every field is visible, reducible (you can define how fields merge on updates), and checkpointed.
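Defining how fields merge is done by attaching a reducer to the field's type annotation, for example `Annotated[list[str], operator.add]`. The sketch below shows the merge semantics without the framework. `apply_update` is a hypothetical helper written for illustration, not LangGraph's internal code, and the trimmed `ResearchState` keeps only three fields.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class ResearchState(TypedDict):
    query: str
    # The second Annotated argument is the reducer used to merge updates.
    sources_found: Annotated[list[str], operator.add]
    token_budget_remaining: int


def apply_update(state: dict, update: dict, schema: type) -> dict:
    """Merge a partial update: reducer fields combine, plain fields overwrite."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"query": "llm benchmarks", "sources_found": ["a.com"], "token_budget_remaining": 1000}
state = apply_update(
    state,
    {"sources_found": ["b.org"], "token_budget_remaining": 800},
    ResearchState,
)
print(state["sources_found"])           # ['a.com', 'b.org']
print(state["token_budget_remaining"])  # 800
```

The list field accumulates across node updates while the budget field is simply replaced. That is why a node can return only the keys it changed.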
When checkpoints outperform naive memory:
- Multi-session resume: a user comes back after three days. LangGraph can restore the exact state from where they left off. LangChain's RedisChatMessageHistory stores messages but not the full execution state.
- Parallel branch evaluation: you can fork a graph at checkpoint N, run two different paths, and compare outcomes for A/B testing or debugging.
- Compliance audit trails: regulated industries need a complete record of every agent decision. LangGraph's checkpoint history provides this out of the box.
For long-term cross-session memory that goes beyond graph state (embedding-based recall of facts across sessions), LangGraph checkpoints and vector memory serve different purposes. The guide on persistent agent memory with Mem0 and Zep covers how embedding-based memory sits alongside LangGraph checkpoints as a separate retrieval layer.
Multi-Agent Orchestration: Where LangGraph Wins Decisively
LangChain's AgentExecutor has no native multi-agent primitive. You can chain two agents together, but managing state across them, routing between them conditionally, or running them in parallel requires custom code.
LangGraph handles this with the supervisor pattern:
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from typing import TypedDict, Annotated

class SupervisorState(TypedDict):
    messages: Annotated[list, add_messages]
    next_agent: str

def router(state: SupervisorState) -> str:
    return state["next_agent"]

# Build supervisor graph
supervisor = StateGraph(SupervisorState)
supervisor.add_node("supervisor", supervisor_node)
supervisor.add_node("researcher", researcher_subgraph)
supervisor.add_node("writer", writer_subgraph)
supervisor.add_conditional_edges("supervisor", router)
The parallel subgraph execution via the Send API runs multiple agents simultaneously:
from langgraph.types import Send

def spawn_parallel_agents(state):
    return [
        Send("researcher", {"query": q})
        for q in state["queries"]
    ]
Each sub-agent is its own graph with its own state. The parent graph manages routing, aggregation, and final output. This maps directly to GPU batching at the inference layer: if your graph has parallel tool call branches, those branches can share a single GPU inference pool efficiently.
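The fan-out and aggregation pattern can be sketched with plain `asyncio`, independent of LangGraph. The `researcher` coroutine and the `findings` field are hypothetical stand-ins for a sub-agent and its output, not part of the Send API.

```python
import asyncio


async def researcher(payload: dict) -> dict:
    """Hypothetical worker: stands in for one researcher sub-agent."""
    await asyncio.sleep(0)  # placeholder for a real LLM or tool call
    return {"findings": [f"notes on {payload['query']}"]}


async def fan_out(state: dict) -> dict:
    """Run one worker per query concurrently, then merge their findings."""
    payloads = [{"query": q} for q in state["queries"]]
    results = await asyncio.gather(*(researcher(p) for p in payloads))
    merged = [item for result in results for item in result["findings"]]
    return {**state, "findings": state.get("findings", []) + merged}


final_state = asyncio.run(fan_out({"queries": ["vLLM", "checkpointing"]}))
print(final_state["findings"])
```

Each worker receives only its own payload, and the parent merges results in a fixed order. That mirrors how a parent graph aggregates sub-agent output into shared state.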
For the infrastructure side of scaling these multi-agent topologies, the guide on scaling agent fleets with MCP orchestration covers autoscaling patterns, GPU tiering, and cost modeling for fleets of 1,000+ concurrent agents.
Streaming, Interruptions, and Replay
Calling astream_events() on the compiled graph gives you per-node streaming. You get a stream of events as each node starts, runs, and completes. This lets you show users incremental progress rather than waiting for the full graph to complete:
async for event in app.astream_events(input, version="v2"):
    if event["event"] == "on_chat_model_stream":
        print(event["data"]["chunk"].content, end="", flush=True)
LangChain's streaming works at the chain level. LangGraph's streaming works at the graph level, with visibility into which node is running.
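Node-level visibility is useful when only one node's tokens should reach the user. The sketch below filters a stream down to a single node. It assumes each event carries the emitting node's name under `metadata["langgraph_node"]`, so verify that key against your installed version. `token_from_event` is a hypothetical helper, and the events are hand-built stand-ins so the example runs without a model.

```python
from types import SimpleNamespace
from typing import Optional


def token_from_event(event: dict, node: str) -> Optional[str]:
    """Return the streamed text if the event is a model token from `node`."""
    if event.get("event") != "on_chat_model_stream":
        return None
    # Assumption: the node name is exposed under metadata["langgraph_node"].
    if event.get("metadata", {}).get("langgraph_node") != node:
        return None
    return event["data"]["chunk"].content


# Hand-built events standing in for what app.astream_events(...) would yield.
events = [
    {"event": "on_chain_start", "metadata": {"langgraph_node": "agent"}, "data": {}},
    {"event": "on_chat_model_stream", "metadata": {"langgraph_node": "agent"},
     "data": {"chunk": SimpleNamespace(content="Hel")}},
    {"event": "on_chat_model_stream", "metadata": {"langgraph_node": "tools"},
     "data": {"chunk": SimpleNamespace(content="ignored")}},
    {"event": "on_chat_model_stream", "metadata": {"langgraph_node": "agent"},
     "data": {"chunk": SimpleNamespace(content="lo")}},
]
tokens = [t for e in events if (t := token_from_event(e, "agent")) is not None]
print("".join(tokens))
```

In a real run the same function would be applied inside the `async for` loop over the event stream.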
LangGraph Studio is the visual debugger. It shows the graph structure, lets you inspect state at any node, replay runs from any checkpoint, and branch off new execution paths from any point. For debugging complex multi-turn agent failures, this is the tool that saves hours.
Production interruption patterns: placing interrupt_before on dangerous nodes (code execution, database writes, external API calls) gives you a human approval gate without changing the agent logic. The graph pauses, the state is checkpointed, a notification is sent, and execution resumes when a human approves.
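The pause, checkpoint, and resume sequence can be sketched without the framework. The code below is a simplified stand-in, not LangGraph's implementation. `run_until_gate` and the `paused` store are hypothetical, and the notification step is left out.

```python
from typing import Callable

Node = Callable[[dict], dict]
paused: dict[str, dict] = {}  # hypothetical checkpoint store keyed by thread_id


def run_until_gate(thread_id: str, nodes: list[tuple[str, Node]],
                   interrupt_before: set[str], state: dict,
                   approved: bool = False) -> dict:
    """Run nodes in order; pause before gated nodes unless resuming with approval."""
    saved = paused.pop(thread_id, None)
    start = saved["index"] if saved else 0
    state = saved["state"] if saved else state
    for index in range(start, len(nodes)):
        name, fn = nodes[index]
        resuming_here = approved and saved is not None and index == start
        if name in interrupt_before and not resuming_here:
            # Checkpoint the state and stop; a human decides what happens next.
            paused[thread_id] = {"index": index, "state": state}
            return {**state, "status": f"waiting for approval: {name}"}
        state = fn(state)
    return {**state, "status": "finished"}


nodes = [
    ("plan", lambda s: {**s, "log": s["log"] + ["plan"]}),
    ("execute_code", lambda s: {**s, "log": s["log"] + ["execute_code"]}),
]
first = run_until_gate("t1", nodes, {"execute_code"}, {"log": []})
second = run_until_gate("t1", nodes, {"execute_code"}, {}, approved=True)
print(first["status"])
print(second["log"])
```

The node functions never check for approval themselves. The gate lives entirely in the runner, which is why an approval step can be added without touching agent logic.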
Production Observability
Langfuse is the most straightforward observability layer for LangGraph. The callback handler traces every node execution:
from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler()
result = app.invoke(
    input,
    config={"callbacks": [langfuse_handler]}
)
Every node execution shows up as a span with token counts, latency, and model parameters. For cost tracking, Langfuse calculates cost per trace automatically using its model pricing table.
Helicone works similarly for cost tracking across inference backends. If you're routing LangGraph nodes to different models (e.g., a cheap 8B model for routing decisions, an expensive 70B model for final output), Helicone gives you a unified cost view.
For a full observability setup, including OpenTelemetry instrumentation, DCGM metric correlation, and compliance requirements, see the LLM observability guide covering Langfuse, Arize Phoenix, and Helicone.
Migration Guide: LangChain AgentExecutor to LangGraph
The migration is more structural than a simple API swap. You're not changing the tools, the prompt, or the model. You're changing the execution loop.
Before (LangChain AgentExecutor):
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI
from langchain.tools import tool

@tool
def search_web(query: str) -> str:
    """Search the web for information."""
    return web_search(query)

llm = ChatOpenAI(model="gpt-4o")
tools = [search_web]
agent = create_openai_functions_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Research the latest LLM benchmarks"})
After (LangGraph StateGraph):
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict, Annotated
from langchain_core.messages import BaseMessage, HumanMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]

# llm and tools are the same objects defined in the AgentExecutor version
llm_with_tools = llm.bind_tools(tools)

def agent_node(state: AgentState):
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

def should_continue(state: AgentState):
    last = state["messages"][-1]
    if last.tool_calls:
        return "tools"
    return END

graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")
app = graph.compile(checkpointer=MemorySaver())
result = app.invoke(
    {"messages": [HumanMessage("Research the latest LLM benchmarks")]},
    config={"configurable": {"thread_id": "thread-1"}},
)
The tool list and LLM are unchanged, and any system prompt travels as a message in the state. Only the execution loop changes. The key translation: AgentExecutor.invoke() becomes app.invoke() on the compiled graph, called with a {"messages": [...]} state dict.
What you gain in the migration: full state visibility, checkpointing, streaming at the node level, and the ability to add human-in-the-loop gates without restructuring.
GPU Infrastructure for Both Frameworks
Both LangChain and LangGraph are inference-backend-agnostic. They call LLMs through HTTP APIs. The GPU layer determines your production latency and cost, not the orchestration framework.
vLLM exposes an OpenAI-compatible API that either framework connects to with a single line change:
# Before: OpenAI API
llm = ChatOpenAI(model="gpt-4o")

# After: Self-hosted vLLM on Spheron
llm = ChatOpenAI(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    base_url="http://<spheron-instance-ip>:8000/v1",
    api_key="your-vllm-key"
)
That change works for both LangChain and LangGraph nodes. The orchestration layer never needs to know whether the model is hosted at OpenAI or on a bare-metal H100 in a Spheron data center.
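Before pointing an agent at a self-hosted endpoint, it helps to confirm the server answers on its OpenAI-compatible model-listing route. The sketch below builds that request with the standard library only. `build_models_request` is a hypothetical helper, the `127.0.0.1` address is a placeholder for your instance IP, and the network call is left commented out so the snippet runs without a server.

```python
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the server's model list."""
    url = base_url.rstrip("/") + "/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


request = build_models_request("http://127.0.0.1:8000/v1", "your-vllm-key")
print(request.full_url)

# To run the check against a live server:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.read().decode())
```

The key is sent as a bearer token, the same header the OpenAI client sends when `api_key` is set on `ChatOpenAI`.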
For VRAM sizing, throughput estimates, and latency budgets for specific models, the GPU infrastructure requirements for AI agents guide covers the math. The short version: for a 70B agent model serving under 20 concurrent sessions at 8K context, an H100 PCIe 80GB is the practical minimum, and only with quantized weights, since 70B parameters in BF16 take roughly 140GB.
Reference Architecture: LangGraph + vLLM on Spheron H100
Here is a concrete production setup for a hierarchical multi-agent LangGraph workflow:
LangGraph Supervisor Graph
|
├── Router Node (8B model, fast, routing decisions)
| calls vLLM: Qwen3-8B on H100 PCIe
|
├── Researcher Sub-graph (17B active model, deep analysis)
| calls vLLM: Llama-4-Scout-17B-16E-Instruct on H100 PCIe
|
├── Code Writer Sub-graph (7B model, code generation)
| calls vLLM: Qwen2.5-Coder-7B-Instruct on H100 PCIe
|
└── Synthesizer Node (32B model, final output)
calls vLLM: DeepSeek-R1-Distill-Qwen-32B on H100 PCIe
PostgreSQL checkpointer (RDS or self-hosted)
Langfuse callback handler (traces all nodes)
Redis for session affinity across vLLM instances
Start the vLLM backend on Spheron's H100 instances:
# H100 PCIe for router + worker nodes (starts at $2.01/hr)
docker run --gpus all --ipc=host -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen3-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-key \
    --max-model-len 32768 \
    --max-num-seqs 64

# H100 PCIe for the researcher node (starts at $2.01/hr)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --host 0.0.0.0 \
    --port 8001 \
    --tensor-parallel-size 1 \
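    --max-model-len 65536
Note: Spheron's H100 instances are available as on-demand PCIe instances. Llama 4 Scout activates 17B parameters per token, but all of its roughly 109B total parameters must be resident in VRAM. Serving the researcher tier on a single H100 PCIe 80GB therefore assumes a quantized checkpoint (for example INT4); unquantized BF16 weights need tensor parallelism across several GPUs. How much of the 65K context fits depends on the VRAM left over for the KV cache.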
GPU cost for this reference architecture:
| GPU | Use Case | On-Demand Price | Est. Tokens/hr at 70% Util |
|---|---|---|---|
| H100 PCIe | Router / Researcher / Code Writer / Synthesizer | $2.01/hr | ~2.5M tokens (8B), ~1.4M tokens (17B active) |
Pricing fluctuates based on GPU availability. The prices above are as of 01 May 2026 and may have changed. Check current GPU pricing for live rates.
For a typical production setup with four H100 PCIe instances (router, researcher, code writer, synthesizer at $2.01/hr each), the baseline cost is roughly $8.04/hr. At 70% utilization across all nodes, this architecture handles approximately 4-5M tokens per hour across the entire graph.
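The same figures give a rough cost per million tokens. This is a worked calculation using only the numbers quoted above; the token range is the article's estimate, not a measured benchmark.

```python
HOURLY_RATE_USD = 2.01                     # H100 PCIe on-demand price quoted above
INSTANCES = 4                              # router, researcher, code writer, synthesizer
TOKENS_PER_HOUR = (4_000_000, 5_000_000)   # estimated range at 70% utilization

hourly_cost = HOURLY_RATE_USD * INSTANCES
cost_per_million = [hourly_cost / (tokens / 1_000_000) for tokens in TOKENS_PER_HOUR]

print(f"hourly cost: ${hourly_cost:.2f}")
for tokens, cost in zip(TOKENS_PER_HOUR, cost_per_million):
    print(f"{tokens / 1_000_000:.0f}M tokens/hr -> ${cost:.2f} per million tokens")
```

At these estimates the blended cost lands between roughly $1.61 and $2.01 per million tokens. It rises quickly if utilization drops, because the hourly bill stays fixed.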
When the Framework Choice Doesn't Matter
Most teams that argue about LangGraph vs LangChain are optimizing the wrong layer. The bottleneck is almost never the orchestration framework. It's TTFT (time to first token), throughput, and cost per token from the inference layer.
A well-tuned vLLM backend on bare-metal GPU serves TTFT under 200ms for 8B models at moderate concurrency. A poorly-provisioned managed API serving the same model can sit at 1-3 seconds under load. That latency gap swamps any efficiency difference between LangGraph and LangChain.
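The gap compounds because agent steps run sequentially. A back-of-the-envelope sketch follows. It assumes 10 sequential LLM calls, the midpoint of the 5-15 tool-call range cited earlier, and counts only time to first token, not generation time.

```python
def cumulative_ttft_seconds(sequential_llm_calls: int, ttft_seconds: float) -> float:
    """Total time spent waiting for first tokens across sequential calls."""
    return sequential_llm_calls * ttft_seconds


calls = 10  # assumed: midpoint of the 5-15 tool-call range
tuned = cumulative_ttft_seconds(calls, 0.2)       # 200ms TTFT
slow_low = cumulative_ttft_seconds(calls, 1.0)    # 1s TTFT under load
slow_high = cumulative_ttft_seconds(calls, 3.0)   # 3s TTFT under load
print(f"tuned vLLM: {tuned:.1f}s, loaded managed API: {slow_low:.1f}-{slow_high:.1f}s")
```

Under these assumptions a single agent run spends about 2 seconds waiting on a tuned backend and 10 to 30 seconds on a loaded one, regardless of which framework drives the loop.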
The framework choice matters for:
- Control flow complexity: LangGraph for anything beyond linear.
- State management: LangGraph for checkpointed, resumable state.
- Multi-agent routing: LangGraph, full stop.
- Team onboarding speed: LangChain wins if your agents are simple.
- Debugging capability: LangGraph Studio is significantly better.
The framework choice does not matter for:
- Throughput and latency: entirely determined by inference backend.
- Cost per token: entirely determined by GPU type and utilization.
- Model quality: entirely determined by model selection and prompting.
If your agents run slowly or cost too much, the fix is usually a better vLLM configuration or a more efficient GPU provisioning strategy, not switching orchestration frameworks.
If your workflow does not need stateful graph execution at all and you want a single agent that writes and runs Python instead of emitting JSON tool calls, see how to deploy SmolAgents on GPU cloud as a lighter alternative to both frameworks.
Both LangGraph and LangChain run faster and cheaper when the inference layer is bare metal. Spheron's H100 instances start at $2.01/hr for PCIe with per-second billing and no seat licenses.
Quick Setup Guide
Draw your agent's execution flow on paper. If it is a linear sequence of steps with no branches, retries, or human approval gates, LangChain chains or a simple LCEL pipeline is sufficient. If the flow has conditional branches, loops, parallel subgraphs, or checkpointed interruptions, you need LangGraph. The decision is almost always driven by control flow, not model choice.
In LangGraph, each node is a Python function that takes a State object and returns a partial State update. Your existing LangChain retrievers, tool call wrappers, and output parsers become the body of those node functions. Define a TypedDict State class with all fields your agent needs to track across turns. This is the key structural difference: state is explicit and typed in LangGraph, implicit and scattered in LangChain agent loops.
Replace the in-memory checkpointer with AsyncPostgresSaver from langgraph-checkpoint-postgres. Pass your DATABASE_URL connection string. This gives your graph automatic persistence, resume-on-failure, and time-travel debugging without any code changes to the graph itself. Create the checkpoint tables once during schema migration: the saver's setup() method does this for you, or you can run CREATE TABLE IF NOT EXISTS checkpoints ... yourself.
Provision an H100 PCIe instance on Spheron. Run vLLM with your chosen model: vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --host 0.0.0.0 --port 8000 --api-key your-key. Point the ChatOpenAI client used by your LangGraph nodes to the vLLM OpenAI-compatible endpoint: base_url=http://<spheron-ip>:8000/v1. LangGraph is inference-backend-agnostic; swapping from a managed OpenAI-style API to self-hosted vLLM requires only a base_url and api_key change.
Wrap your LangGraph graph with the Langfuse callback handler. Set LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY. Every node execution, token count, and latency is now traced per graph run. For cost tracking across models, add model_params to each ChatOpenAI call; Langfuse calculates cost per trace automatically using its model pricing table.
Frequently Asked Questions
LangChain provides a toolkit of abstractions - chains, retrievers, tools, and memory objects - for composing LLM calls. LangGraph is built on top of LangChain and adds a stateful directed-graph execution model with checkpointing, time-travel debugging, and built-in support for human-in-the-loop interruptions. LangChain is the right choice for simple pipelines with a defined linear flow. LangGraph is the right choice when your agent needs to branch, loop, retry, or persist state across sessions.
Yes - and most production teams do. LangGraph handles the orchestration layer: defining the graph, managing state, checkpointing progress. LangChain components - retrievers, tool definitions, prompt templates, output parsers - plug into LangGraph nodes as callables. You get LangChain's ecosystem of integrations with LangGraph's production-grade state machine execution.
Yes, by about a week of effort. LangGraph requires you to think in terms of graph nodes, edges, and state reducers rather than linear chains. The State object definition is the steepest part of the learning curve. Teams comfortable with finite state machines or workflow engines (Prefect, Airflow) adapt fastest. If your agent logic fits in a linear sequence with no branching, LangChain is simpler to ship and maintain.
It depends on the model behind the LangGraph nodes. For a single 70B reasoning agent (e.g. DeepSeek-R1-Distill-Llama-70B) serving under 20 concurrent sessions at 8K context, an H100 PCIe 80GB with quantized weights is the minimum. For a hierarchical multi-agent LangGraph workflow with a 70B orchestrator and multiple 8B worker agents, plan for at least two H100 SXM5 instances or a single B200 to keep all models VRAM-resident simultaneously.
LangGraph checkpoints serialize the entire graph state after each node execution. In production, this state is persisted to an external store - PostgreSQL, Redis, or a custom backend - so graph execution can resume after interruption, rollback to any prior state, or branch into parallel timelines for evaluation. Checkpointing adds one small write per node per run, which is negligible compared to LLM inference latency.
