LangGraph Tutorial: Production Human-in-the-Loop Pattern (2026)
Autonomous AI agents are useful right up until the day one of them does something expensive that it should not have. For any agent that touches real money, real customers, or real production data, Human-in-the-Loop (HITL) is not optional - it is survival. This tutorial walks through the exact LangGraph pattern I use in production for a customer-support agent that auto-resolves 80% of tickets but pauses for human approval on anything risky.
What you will build in this tutorial
- A LangGraph state machine with classifier, action planner, and responder nodes
- An interrupt_before hook that pauses the graph mid-execution when risk exceeds a threshold
- A durable checkpointer (Postgres) so state survives a server restart
- A clean resume flow that picks up from the exact pause point with no context replay
Why Human-in-the-Loop Is Non-Negotiable
The first multi-agent system most engineers ship is fully autonomous. It works in the demo. Then, in week two of production, it issues a refund to the wrong customer, sends a sensitive email to the wrong inbox, or cancels an order it should have kept. Each of those incidents can cost more than the entire engineering effort that built the agent.
HITL is the single highest-leverage safety primitive in agent design. A good HITL implementation does three things at once: it catches the bad action, it generates training data for future automation, and it gives the business confidence to grant the agent broader autonomy over time.
The Pattern in One Diagram
Three nodes on a state graph:
- Classify - figures out user intent
- Plan Action - proposes a write action and a risk score
- Respond - executes the action and writes the customer-facing message
A conditional edge after Plan Action either routes to Respond directly (low risk) or pauses the graph for human approval (high risk). When the human approves, the graph resumes from the exact checkpoint.
Step 1: Define the State
Keep your state schema small and explicit. Large states make checkpoint debugging painful and inflate persistence costs.
from typing import TypedDict, Optional, Literal
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver

class SupportState(TypedDict):
    ticket_id: str
    user_message: str
    intent: Optional[str]
    proposed_action: Optional[dict]
    risk_score: Optional[float]
    human_approved: Optional[bool]
    human_notes: Optional[str]
    final_response: Optional[str]
Step 2: Define the Nodes
Each node is a pure function: takes State, returns a State delta. No side effects outside the State return value - this is what makes checkpointing reliable.
def classify(state: SupportState) -> dict:
    intent = llm_classify(state["user_message"])
    return {"intent": intent}

def plan_action(state: SupportState) -> dict:
    action, risk = llm_plan(state["intent"], state["user_message"])
    return {"proposed_action": action, "risk_score": risk}

def respond(state: SupportState) -> dict:
    if state.get("human_approved") is False:
        return {"final_response": "Escalating to a human agent now."}
    response = llm_respond(state["proposed_action"])
    return {"final_response": response}
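The nodes above call llm_classify, llm_plan, and llm_respond, which this tutorial does not define. A minimal sketch, assuming keyword-based stand-ins (hypothetical, not real model calls) and a plain dict in place of SupportState, shows how each node returns a delta that is merged into the state. It makes the nodes testable without an LLM or a graph; run_nodes is an illustrative helper that mimics a simple overwrite merge, not a LangGraph function.

```python
from typing import Callable

State = dict  # stands in for SupportState so the sketch is self-contained


def llm_classify(message: str) -> str:
    # Hypothetical keyword rule in place of a model call
    return "refund_request" if "refund" in message.lower() else "general_question"


def llm_plan(intent: str, message: str) -> tuple[dict, float]:
    # Hypothetical: money-moving actions get a high risk score
    if intent == "refund_request":
        return {"type": "issue_refund"}, 0.9
    return {"type": "send_answer"}, 0.1


def llm_respond(action: dict) -> str:
    # Hypothetical: a real version would draft the customer-facing message
    return f"Completed action: {action['type']}"


def classify(state: State) -> dict:
    return {"intent": llm_classify(state["user_message"])}


def plan_action(state: State) -> dict:
    action, risk = llm_plan(state["intent"], state["user_message"])
    return {"proposed_action": action, "risk_score": risk}


def respond(state: State) -> dict:
    if state.get("human_approved") is False:
        return {"final_response": "Escalating to a human agent now."}
    return {"final_response": llm_respond(state["proposed_action"])}


def run_nodes(state: State, nodes: list[Callable[[State], dict]]) -> State:
    """Apply nodes in order, overwriting keys with each returned delta."""
    for node in nodes:
        state = {**state, **node(state)}
    return state
```

Calling run_nodes with [classify, plan_action] on a refund message yields a state whose risk_score is 0.9, which is the value the router in the next step would compare against its threshold.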
Step 3: The Interrupt Edge
This is the key move: a conditional edge after plan_action checks the risk score and either continues to respond or pauses for human approval.
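def needs_approval(state: SupportState) -> Literal["respond", "await_human"]:
    return "await_human" if state["risk_score"] > 0.7 else "respond"

def await_human(state: SupportState) -> dict:
    # Pass-through approval gate: the graph pauses before this node runs
    return {"human_approved": state.get("human_approved")}

workflow = StateGraph(SupportState)
workflow.add_node("classify", classify)
workflow.add_node("plan", plan_action)
workflow.add_node("await_human", await_human)
workflow.add_node("respond", respond)
workflow.set_entry_point("classify")
workflow.add_edge("classify", "plan")
workflow.add_conditional_edges("plan", needs_approval, {
    "respond": "respond",          # low risk: continue without review
    "await_human": "await_human",  # high risk: route to the approval gate
})
# Routing high risk to END would finish the run, leaving nothing to resume
workflow.add_edge("await_human", "respond")
workflow.add_edge("respond", END)

# Postgres-backed durable checkpointer for production.
# In current langgraph-checkpoint-postgres releases, from_conn_string is a
# context manager, so enter it once and keep it open for the process lifetime.
from contextlib import ExitStack
stack = ExitStack()
checkpointer = stack.enter_context(PostgresSaver.from_conn_string(POSTGRES_URL))
checkpointer.setup()  # creates the checkpoint tables on first run

app = workflow.compile(
    checkpointer=checkpointer,
    interrupt_before=["await_human"]  # pause only on the high-risk path
)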
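The router above assumes risk_score is always a valid float. If the planner ever returns None or an out-of-range value, the comparison either raises or silently lets the action through. A defensive variant, sketched here under the assumption that scores live in the range 0 to 1, fails closed by sending anything suspicious to a human. The name needs_approval_safe is illustrative, not part of LangGraph.

```python
from typing import Literal

RISK_THRESHOLD = 0.7  # same cutoff as the tutorial's router


def needs_approval_safe(state: dict) -> Literal["respond", "await_human"]:
    """Route to a human when the score is missing, malformed, or above the cutoff."""
    risk = state.get("risk_score")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(risk, bool) or not isinstance(risk, (int, float)):
        return "await_human"
    # Out-of-range values (and NaN, which fails every comparison) fail closed
    if not 0.0 <= risk <= 1.0:
        return "await_human"
    return "await_human" if risk > RISK_THRESHOLD else "respond"
```

Failing closed costs some reviewer time when the planner misbehaves, which is usually the cheaper error for an agent that can move money.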
Step 4: The Resume Flow
When a human approves, you resume by invoking the graph with None as input and the same thread_id. LangGraph picks up from the saved checkpoint: no state reconstruction, no token waste.
# Initial run - graph pauses at high risk
config = {"configurable": {"thread_id": ticket_id}}
result = app.invoke({
    "ticket_id": ticket_id,
    "user_message": user_msg
}, config=config)

# Human approves via your admin UI - update state and resume
app.update_state(config, {"human_approved": True})
final = app.invoke(None, config=config)
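The snippet above shows only the approve path. A small wrapper, sketched here as one possible shape for the handler behind an admin UI, covers rejection and reviewer notes too. It uses the same update_state and invoke calls. The function name resume_after_review and the RecordingApp test double are assumptions for illustration, and the double only records calls so the sketch runs without a database.

```python
from typing import Any, Optional


def resume_after_review(app: Any, ticket_id: str, approved: bool,
                        notes: Optional[str] = None) -> Any:
    """Record the reviewer's decision on the paused thread, then resume it."""
    config = {"configurable": {"thread_id": ticket_id}}
    # human_approved and human_notes are fields already defined in SupportState
    app.update_state(config, {"human_approved": approved, "human_notes": notes})
    # None as input tells the graph to continue from its checkpoint
    return app.invoke(None, config=config)


class RecordingApp:
    """Test double standing in for the compiled graph (not a LangGraph class)."""

    def __init__(self) -> None:
        self.calls: list = []

    def update_state(self, config: dict, values: dict) -> None:
        self.calls.append(("update_state", config, values))

    def invoke(self, graph_input: Any, config: dict) -> dict:
        self.calls.append(("invoke", graph_input, config))
        return {"final_response": "stubbed"}
```

With the real graph, a rejection sets human_approved to False, so the respond node returns its escalation message instead of executing the proposed action.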
Production Hardening Checklist
- Use PostgresSaver, not MemorySaver. MemorySaver loses state on restart. Postgres survives deploys.
- Set TTL on pending approvals. If no human acts in 24 hours, auto-escalate or auto-reject based on policy.
- Log full state at every interrupt. When something weird happens three weeks in, this trace is the only thing that will save you.
- Version the state schema. Migrating a versioned TypedDict is easier than working out why an old checkpoint has an extra field.
- Rate-limit the approval queue. One bad batch can flood human reviewers. Cap at N pending per reviewer.
- Track approval-to-action lag. If P95 lag exceeds 4 hours, your queue is a bottleneck.
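Two of the checklist items reduce to a few lines of arithmetic. The sketch below assumes you keep a mapping of ticket IDs to the time each one paused and a list of approval-to-action lags in hours. Both data shapes and all function names are hypothetical. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
from datetime import datetime, timedelta, timezone

APPROVAL_TTL = timedelta(hours=24)  # the 24-hour policy from the checklist
LAG_LIMIT_HOURS = 4.0               # the P95 bottleneck threshold from the checklist


def expired_approvals(pending: dict[str, datetime], now: datetime) -> list[str]:
    """Return ticket IDs that have waited longer than the TTL."""
    return sorted(t for t, paused_at in pending.items() if now - paused_at > APPROVAL_TTL)


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def queue_is_bottleneck(lags_hours: list[float], limit: float = LAG_LIMIT_HOURS) -> bool:
    """True when the P95 approval-to-action lag exceeds the limit."""
    return p95(lags_hours) > limit
```

A scheduled job could call expired_approvals every few minutes and apply whatever policy fits each ticket, either auto-escalating or auto-rejecting.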
Real-World Numbers from Production
I deployed this exact pattern for an e-commerce client in late 2024. Here is what the first 90 days looked like:
That 94% approval rate matters. It means the agent's risk classifier is well calibrated: humans usually agree with the proposed action, but they want the chance to say no on the edge cases.
The TJ HITL Calibration Framework
- Start with a low risk threshold (0.4). Send most actions for approval in week one. You will learn faster.
- Track agreement rate weekly. If humans approve more than 90% of paused actions, the threshold is too low - raise it by 0.1.
- Track regret events. Count cases where a human approves and the action turns out wrong. If above 3% in a week, lower the threshold by 0.1.
- Promote actions to auto-execute only after 4 consecutive weeks of greater-than-95% agreement at the current threshold.
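The weekly rules above can be written down as two small functions. This is a sketch under one assumption the framework does not state: when both the agreement rule and the regret rule fire in the same week, regret wins and the threshold goes down. The function names are illustrative.

```python
def next_threshold(current: float, agreement_rate: float, regret_rate: float,
                   step: float = 0.1) -> float:
    """Apply one weekly calibration step to the risk threshold."""
    if regret_rate > 0.03:
        proposed = current - step   # too many approved-but-wrong actions
    elif agreement_rate > 0.90:
        proposed = current + step   # reviewers almost always agree
    else:
        proposed = current
    # Keep the threshold inside the valid score range
    return round(min(max(proposed, 0.0), 1.0), 2)


def ready_to_promote(weekly_agreement: list[float], weeks: int = 4,
                     bar: float = 0.95) -> bool:
    """True after `weeks` consecutive weeks above the agreement bar."""
    recent = weekly_agreement[-weeks:]
    return len(recent) == weeks and all(rate > bar for rate in recent)
```

Starting at 0.4 with 93% agreement and 1% regret, next_threshold returns 0.5. A single week at 94% agreement resets the four-week promotion clock.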
5 Common HITL Mistakes
- Putting humans on every action. You will burn out reviewers and never reach autonomy. Tier by risk.
- Hiding the agent's reasoning from the reviewer. Show the proposed action, the risk score, and the LLM's rationale - not just "approve?"
- No timeout on pending approvals. Tickets sit forever, customers wait, CSAT drops.
- One reviewer for everything. Different action types need different reviewer skill sets - refunds vs. data changes vs. account merges.
- Skipping the "why approved" capture. The reviewer's notes are training data. Save them.
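Two of these mistakes, hiding the reasoning and skipping the notes, can be designed out at the data-structure level. The sketch below is one possible shape, not a LangGraph feature. ReviewRequest carries what the reviewer needs to see, and record_decision refuses to store a decision without notes. The rationale field is an assumption, since the tutorial's SupportState has no such key and the planner would need to return it.

```python
from dataclasses import asdict, dataclass


@dataclass
class ReviewRequest:
    ticket_id: str
    proposed_action: dict
    risk_score: float
    rationale: str  # hypothetical: the planner's explanation for the action


@dataclass
class ReviewDecision:
    ticket_id: str
    approved: bool
    notes: str


def build_review_request(state: dict, rationale: str) -> ReviewRequest:
    """Collect everything the reviewer should see from the paused state."""
    return ReviewRequest(
        ticket_id=state["ticket_id"],
        proposed_action=state["proposed_action"],
        risk_score=state["risk_score"],
        rationale=rationale,
    )


def record_decision(request: ReviewRequest, approved: bool, notes: str) -> dict:
    """Return one training row; empty notes are rejected on purpose."""
    if not notes.strip():
        raise ValueError("reviewer notes are required")
    decision = ReviewDecision(request.ticket_id, approved, notes.strip())
    return {"request": asdict(request), "decision": asdict(decision)}
```

Requiring notes adds a few seconds per review, so some teams may prefer to require them only on rejections.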
Frequently Asked Questions
What is the difference between interrupt_before and interrupt_after in LangGraph?
interrupt_before pauses before a node executes - useful when you want to approve the proposed action. interrupt_after pauses after a node finishes - useful when you want a human to review the agent's output before it leaves the system. Most production HITL flows use interrupt_before.
How do I scale HITL across hundreds of reviewers?
Build a queue with priority based on risk score and customer tier, route by reviewer skill (refunds vs. data ops), and track per-reviewer agreement rate to surface drift. Standard ticketing tools like Linear or a custom React dashboard work fine.
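A minimal in-memory sketch of that queue, assuming priority is the risk score multiplied by a tier weight and that each ticket is tagged with the single skill it requires. The tier names, weights, and class name are all hypothetical. A production version would sit on a database or message broker instead of a Python heap.

```python
import heapq
from itertools import count
from typing import Optional

TIER_WEIGHT = {"enterprise": 2.0, "pro": 1.5, "free": 1.0}  # hypothetical tiers


class ApprovalQueue:
    """One max-priority heap per reviewer skill."""

    def __init__(self) -> None:
        self._heaps: dict[str, list] = {}
        self._seq = count()  # tie-breaker so equal priorities stay first-in, first-out

    def push(self, ticket_id: str, skill: str, risk_score: float, tier: str) -> None:
        priority = risk_score * TIER_WEIGHT.get(tier, 1.0)
        # heapq is a min-heap, so store the negated priority
        heapq.heappush(self._heaps.setdefault(skill, []),
                       (-priority, next(self._seq), ticket_id))

    def pop_for(self, reviewer_skills: set[str]) -> Optional[str]:
        """Return the highest-priority ticket this reviewer is qualified for."""
        heads = [(self._heaps[s][0], s) for s in reviewer_skills if self._heaps.get(s)]
        if not heads:
            return None
        _, skill = min(heads)
        return heapq.heappop(self._heaps[skill])[2]
```

A reviewer with both the refunds and data_ops skills is handed whichever queue head has the higher priority, which keeps specialists from starving the riskiest tickets.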
Can the LLM agent learn from human approvals?
Yes. Every approve/reject event with reviewer notes becomes a training row. The simplest pattern: build a few-shot example pool from the last 200 approvals and inject the most-similar 3-5 examples into the planner's prompt at inference time.
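The selection step can be sketched in a few lines. This version assumes each stored approval is a dict with a user_message key and ranks by token overlap (Jaccard similarity) purely to stay dependency-free. In practice, embedding similarity is the more likely choice and would replace the jaccard function.

```python
import re


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_examples(query: str, approvals: list[dict], k: int = 3,
                    pool_size: int = 200) -> list[dict]:
    """Pick the k most similar examples from the most recent pool_size approvals."""
    pool = approvals[-pool_size:]
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex["user_message"]), reverse=True)
    return ranked[:k]
```

The selected rows, including the reviewer's notes, would then be formatted into the planner's prompt ahead of the new ticket.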
How do I handle a server restart while a graph is paused?
Use PostgresSaver instead of MemorySaver. State persists in your database, and any worker can resume any thread by thread_id. Make sure your worker pool reads from the same Postgres instance.
Is HITL slow for the user?
Only on paused tickets. The 60-80% that auto-resolve are sub-5-second responses. For paused tickets, the user gets an immediate "we are looking at this" reply, and the human approval typically lands within minutes for live queues.
Conclusion
HITL is the bridge between "demo agent" and "production agent." LangGraph makes the pattern trivial to implement and durable enough to survive deploys, restarts, and weird edge cases. Start with a low risk threshold, calibrate weekly, and promote actions to autonomy only when the data justifies it.
If you want a deeper look at the framework choices behind this approach, my LangChain vs LangGraph vs CrewAI guide covers the trade-offs.