AI Agent Evaluation: The 2026 Stack

Your agent worked perfectly in the demo. It has been silently degrading for three weeks. The customer found it before you did. This is the eval stack that catches the regression before the screenshot lands in your inbox - the same one I implement with paying clients running production AI agents on LangGraph, CrewAI, and custom orchestrators.

TL;DR - Key Takeaways

Agent evals are not LLM evals. You need to evaluate trajectories (the path), not just outputs (the answer).
Production agents need four eval layers: unit, trajectory, outcome, and online (live traffic sampling).
LangSmith wins for LangChain/LangGraph teams. Langfuse wins for self-hosted and OSS-first teams. Braintrust wins for eval-heavy teams shipping weekly.
The biggest mistake: evaluating on prompts that look like your training set instead of real, messy production traffic.
A working eval stack catches a $40K/month regression in hours, not weeks. Without it you find out from a customer.

Why Agent Evals Are Different from LLM Evals

An LLM eval asks: did the model give the right answer to this prompt? An agent eval has to ask three more questions. Did it call the right tools? Did it call them in the right order? Did it recover when a tool failed? An LLM produces a string; an agent produces a trajectory - a sequence of decisions, tool calls, and intermediate states. A correct final answer can hide a broken trajectory that costs ten times more or fails on the next prompt.

This is why teams that bolt LLM-style evals onto agents miss regressions. Their eval suite reports green while the token bill silently doubles because the agent now retries every tool call twice. The fix is to evaluate the path, not just the destination.
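To make the difference concrete, here is a minimal sketch of an output-only check next to a trajectory check. The `Step` type, the tool names, and the golden sequence are all hypothetical; a real agent framework will hand you its own trace objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str        # name of the tool the agent called (hypothetical names below)
    ok: bool = True  # whether the call succeeded


def output_eval(answer: str, expected: str) -> bool:
    """LLM-style check: only the final string is compared."""
    return answer.strip().lower() == expected.strip().lower()


def trajectory_eval(steps: list[Step], golden_tools: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the path matches the golden run."""
    problems = []
    called = [s.tool for s in steps]
    if called != golden_tools:
        problems.append(f"tool sequence {called} != golden {golden_tools}")
    # Flag a tool that is called again right after it already succeeded.
    for prev, cur in zip(steps, steps[1:]):
        if prev.tool == cur.tool and prev.ok:
            problems.append(f"redundant repeat of {cur.tool}")
    return problems


run = [Step("lookup_invoice"), Step("lookup_invoice"), Step("format_reply")]
golden = ["lookup_invoice", "format_reply"]

print(output_eval("Your balance is $40.", "your balance is $40."))  # answer looks fine
for problem in trajectory_eval(run, golden):
    print(problem)                                                  # the path does not
```

The output check passes while the trajectory check reports both a sequence mismatch and a redundant repeat, which is exactly the kind of regression an answer-only eval hides.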

The Four Eval Layers Every Production Agent Needs

Every agent I have shipped to production in the last 18 months runs all four of these. Skip a layer and you create a blind spot.

Layer 1 - Unit evals. Test individual components: a single tool call, a single classifier, a single extraction prompt. Fast (under 5 seconds), runs on every commit, catches prompt-template regressions.

Layer 2 - Trajectory evals. Test the full agent on a fixed dataset of 50-200 real prompts. Compare the trajectory (tools called, order, intermediate state) against a golden run. Catches routing regressions and unnecessary tool retries.

Layer 3 - Outcome evals. Test whether the final output is correct using LLM-as-judge or ground-truth labels. The most expensive layer. Run nightly, not on every commit.

Layer 4 - Online evals. Sample 1-5% of live production traffic and grade it asynchronously. The only layer that catches drift, prompt injection, and edge cases your golden dataset missed.
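For the online layer, one common way to pick the sampled slice is to hash a request ID, so the same request is always in or out of the sample regardless of which worker handles it. This is a sketch under that assumption, not the mechanism any particular tool uses; `in_sample` and the `req-N` IDs are made up for illustration.

```python
import hashlib


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


ids = [f"req-{i}" for i in range(100_000)]  # hypothetical request IDs
sampled = sum(in_sample(r, 0.03) for r in ids)
print(f"sampled {sampled} of {len(ids)} requests")
```

Sampled requests would then be queued for asynchronous grading so the judge never sits on the user's request path.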

The TJ Agent Eval Pyramid

Build your eval stack bottom-up. Each layer assumes the one below it is in place.

1. Unit (foundation). 50-200 cheap, fast tests. Run on every commit. Block merges on regressions.

2. Trajectory. 50-200 full agent runs against golden trajectories. Run on every PR. Investigate any new tool-call sequence.

3. Outcome. LLM-as-judge or human grading on a curated 100-prompt set. Run nightly. Track accuracy delta over time.

4. Online. Async sampling of live traffic with LLM-as-judge. Run continuously. Alert on accuracy drops over a 24-hour window.

Most teams stop at layer 2 and wonder why production keeps surprising them. The top layer is where real users live.
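The layer 4 alert on accuracy drops over a 24-hour window can be sketched as a rolling window of judge grades. The class name, the 5-point drop threshold, and the minimum sample count are assumptions for illustration; tune them to your own traffic.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Optional


class AccuracyWindow:
    """Rolling accuracy over the most recent `window` of online grades."""

    def __init__(self, baseline: float, max_drop: float = 0.05,
                 window: timedelta = timedelta(hours=24), min_samples: int = 50):
        self.baseline = baseline        # accuracy you expect, e.g. from nightly evals
        self.max_drop = max_drop        # alert when accuracy falls this far below it
        self.window = window
        self.min_samples = min_samples  # avoid alerting on a handful of grades
        self._grades = deque()          # (timestamp, passed) pairs, oldest first

    def record(self, ts: datetime, passed: bool) -> None:
        """Add one grade; assumes timestamps arrive in increasing order."""
        self._grades.append((ts, passed))
        cutoff = ts - self.window
        while self._grades and self._grades[0][0] < cutoff:
            self._grades.popleft()

    def accuracy(self) -> Optional[float]:
        if len(self._grades) < self.min_samples:
            return None
        return sum(passed for _, passed in self._grades) / len(self._grades)

    def should_alert(self) -> bool:
        acc = self.accuracy()
        return acc is not None and (self.baseline - acc) > self.max_drop


start = datetime(2026, 1, 1)
w = AccuracyWindow(baseline=0.90, min_samples=10)
for i in range(10):
    w.record(start + timedelta(minutes=i), passed=(i < 7))  # 7 of 10 pass
print(w.accuracy(), w.should_alert())
```

The minimum sample count matters: without it, two bad grades at 3 a.m. page someone for no reason.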

LangSmith vs Langfuse vs Braintrust vs Custom

Tool	Best For	Trace Capture	Eval Runner	Self-Host	Pricing Pattern
LangSmith	LangChain/LangGraph teams	Native, zero-config	Built-in	Enterprise only	Per-trace, scales fast
Langfuse	OSS-first, self-hosted, multi-framework	SDK for any stack	Built-in (basic)	Free, MIT-licensed	Cloud tiers, free self-host
Braintrust	Teams shipping weekly with heavy eval needs	SDK	Best-in-class, parallelized	No	Per-eval-run, predictable
Custom (Postgres + LLM judge)	Strict data residency, unusual stacks	You build it	You build it	Yes	Eng time, not SaaS fees

What a Working Eval Stack Actually Catches

6 hrs - to catch a $40K/month regression, vs 3 weeks without evals
87% - of regressions caught before merge with trajectory evals in CI
3-5% - live traffic sampling rate that is enough to catch drift within 24 hours
$0.02 - average cost per LLM-judge eval at GPT-4o-mini pricing

Real-World: Catching the $40K Regression

A B2B client running a customer-support agent on LangGraph started getting complaints about wrong answers in their billing flow. The team had no trajectory evals. They spent four engineering days bisecting recent prompt changes, found nothing, and were ready to roll back two weeks of work. We added a 60-prompt trajectory eval that took 90 minutes to wire up. The first run flagged that the billing-tool node was being called twice on 41% of trajectories. A routing change two weeks earlier had introduced a loop that doubled the tool's latency and cost and occasionally returned stale data on the second call. We fixed the routing in 20 minutes. Total time from starting on the eval to having a fix: 6 hours. Without trajectory evals it would have been three weeks.
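The check that surfaces this kind of loop is simple once trajectories are recorded. A minimal sketch, assuming each trajectory is stored as a list of tool names; the five trajectories and the tool names are invented for illustration and are not the client's data.

```python
from collections import Counter


def repeat_rate(trajectories: list[list[str]], tool: str) -> float:
    """Fraction of trajectories in which `tool` is called more than once."""
    if not trajectories:
        return 0.0
    repeated = sum(1 for calls in trajectories if Counter(calls)[tool] > 1)
    return repeated / len(trajectories)


# Hypothetical recorded runs: each inner list is the tool-call order of one run.
runs = [
    ["classify", "billing_tool", "reply"],
    ["classify", "billing_tool", "billing_tool", "reply"],
    ["classify", "kb_search", "reply"],
    ["classify", "billing_tool", "billing_tool", "reply"],
    ["classify", "billing_tool", "reply"],
]

rate = repeat_rate(runs, "billing_tool")
print(f"billing_tool repeated in {rate:.0%} of trajectories")
```

Tracking this number per tool on every eval run turns a silent loop into a visible jump.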

5 Common Mistakes Teams Make

1. Evaluating on prompts that look like your training set. Real users send messy, malformed, ambiguous prompts. Your eval set must too. Pull eval prompts from production traffic, not from your imagination.

2. No golden dataset. You cannot measure regression without a baseline. Curate 50-100 prompts with verified correct trajectories before you ship v1, not after.

3. Only evaluating final outputs. The trajectory matters. A correct answer reached via 12 tool calls when it should take 3 is still a regression - it costs 4x as much and is more likely to fail on the next prompt.

4. Running evals only locally. Evals belong in CI. If they do not block merges, they get skipped under deadline pressure.

5. No online sampling. Pre-deployment evals catch known failure modes. Online sampling catches the unknown ones - which is most of them in the first six months.
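For mistake 4, blocking a merge comes down to turning eval results into a non-zero exit code. A sketch of that gate, assuming your eval runner can give you a name-to-pass/fail mapping; the `gate` function, the case names, and the 95% threshold are hypothetical.

```python
def gate(results: dict[str, bool], min_pass_rate: float = 1.0) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    if not results:
        print("no eval results found")
        return 1  # an empty suite should not pass silently
    failed = sorted(name for name, ok in results.items() if not ok)
    pass_rate = (len(results) - len(failed)) / len(results)
    for name in failed:
        print(f"FAIL {name}")
    print(f"pass rate {pass_rate:.0%} (required {min_pass_rate:.0%})")
    return 0 if pass_rate >= min_pass_rate else 1


# Hypothetical results from a trajectory eval run.
demo = {
    "billing_refund_path": True,
    "billing_invoice_lookup": False,
    "password_reset_path": True,
    "plan_upgrade_path": True,
}
exit_code = gate(demo, min_pass_rate=0.95)
print("exit code:", exit_code)
# In a CI entry point you would finish with: raise SystemExit(exit_code)
```

Any CI system that fails a job on a non-zero exit code will then block the merge without extra configuration.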

Frequently Asked Questions

Do I need LangSmith if I am not using LangChain?

No. Langfuse and Braintrust both support arbitrary frameworks via their SDKs. LangSmith works best when you are already on the LangChain stack because trace capture is automatic. For non-LangChain agents, Langfuse is the better default.

How often should evals run?

Unit evals on every commit. Trajectory evals on every pull request to main. Outcome evals nightly. Online sampling continuously. If your eval suite takes more than 10 minutes for the PR-level layers, parallelize it before it slows the team down.

How big should my eval dataset be?

Start with 50 prompts that cover your top 5-10 user intents. Grow to 200 over the first three months. Past 500 prompts you hit diminishing returns - quality of curation matters more than count.

Can I use GPT-4o as a judge to evaluate GPT-4o outputs?

Yes, with one rule: the judge prompt must be different from the agent prompt and must include a rubric. Self-judging without a rubric inflates scores by 10-15%. With a rubric, agreement with human grading runs around 85%, which is good enough for trend detection.
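Here is a rough sketch of what a rubric-based judge and an agreement check might look like. The rubric wording, the PASS/FAIL convention, and the function names are assumptions for illustration. The model call itself is left out, so plug in whichever client you already use.

```python
RUBRIC = """\
Grade the ANSWER to the QUESTION against each criterion:
1. Factually consistent with the provided context.
2. Fully addresses what the user asked.
3. Contains no invented account or billing details.
Explain briefly, then finish with a final line that is exactly PASS or FAIL."""


def build_judge_prompt(question: str, answer: str, rubric: str = RUBRIC) -> str:
    """Assemble a judge prompt that is separate from the agent's own prompt."""
    return f"{rubric}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n"


def parse_verdict(judge_reply: str) -> bool:
    """Read the verdict from the last non-empty line of the judge's reply."""
    lines = [ln.strip() for ln in judge_reply.splitlines() if ln.strip()]
    if not lines or lines[-1].upper() not in {"PASS", "FAIL"}:
        raise ValueError("judge reply did not end with PASS or FAIL")
    return lines[-1].upper() == "PASS"


def agreement(judge: list[bool], human: list[bool]) -> float:
    """Share of items where the judge and the human grader gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


print(parse_verdict("The answer matches the invoice data.\nPASS"))
print(agreement([True, True, False, True], [True, False, False, True]))
```

Re-run the agreement check against a fresh batch of human grades whenever you change the rubric or the judge model, so you know the trend line is still comparable.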

What does this cost to run at scale?

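For a 100-prompt nightly eval suite using GPT-4o-mini as judge, expect roughly $2 per night, or $60 per month. Online sampling scales with traffic, so budget it per graded request: at the same $0.02 per judgment, about $90 per month covers roughly 150 graded requests a day. Grading a full 3% of 100K daily requests (3,000 a day) at that rate would run closer to $1,800 per month, so at that volume cap the graded sample or use a cheaper judge. With a capped sample, the all-in eval bill for a mid-size production agent typically lands under $300 per month - cheap insurance against silent regressions.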
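The arithmetic is worth keeping in a small script so you can rerun it when traffic or judge pricing changes. A sketch using the $0.02 per-judgment figure from this article and an assumed 30-day month; the function names are made up, and amounts are kept in cents to avoid floating-point rounding.

```python
COST_CENTS_PER_JUDGMENT = 2  # $0.02 average per LLM-judge eval
DAYS_PER_MONTH = 30          # assumption for monthly estimates


def monthly_cost_usd(judgments_per_day: int,
                     cost_cents: int = COST_CENTS_PER_JUDGMENT,
                     days: int = DAYS_PER_MONTH) -> float:
    """Monthly judge spend in dollars for a fixed number of judgments per day."""
    return judgments_per_day * days * cost_cents / 100


def judgments_per_day_within(budget_usd: int,
                             cost_cents: int = COST_CENTS_PER_JUDGMENT,
                             days: int = DAYS_PER_MONTH) -> int:
    """How many judgments per day a monthly budget (whole dollars) can cover."""
    return (budget_usd * 100) // (cost_cents * days)


print(monthly_cost_usd(100))         # the 100-prompt nightly suite
print(judgments_per_day_within(90))  # daily graded requests a $90 budget buys
```

Working backward from a budget to a daily sample size is usually the safer direction: you decide what you are willing to spend, then set the sampling rate to match.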

Conclusion

You cannot ship an AI agent and hope for the best. The teams that win in 2026 are not the ones with the cleverest prompts; they are the ones with the eval stack that catches a regression in hours instead of weeks. Pick the layer you are missing, wire it up this week, and you will sleep better next month.

Need help wiring an eval stack onto an existing production agent? Happy to scope it in a free 30-minute call.

