LLM Routing: Cut LLM Costs 60%

Your $0.03 GPT-4o call could have been a $0.0006 GPT-4o-mini call - 50x cheaper, same answer. The catch is knowing which queries can take the cheap path and which cannot. That is the entire job of an LLM router. Done right, it is the cheapest 60% you will ever ship. Done wrong, your accuracy quietly collapses and your CSAT drops before your dashboards notice.

TL;DR - Key Takeaways

Most production workloads are dominated by easy queries: 70-85% of traffic can be served by a small model with no quality loss.
Three routing strategies: static (rules-based), classifier-based (a small LLM picks), cascading (try cheap first, escalate on low confidence).
The cheapest model that gets the job done wins - but only if you have evals to prove the answer is still correct.
Real-world: a SaaS team cut their $14K/month bill to $5.4K with classifier-based routing and no measurable accuracy regression on their eval set.
The biggest mistake: routing without an eval set. You will save money and never notice you also lost 8 points of accuracy.

What an LLM Router Actually Does

An LLM router is a small piece of software that sits between your application and your model providers. It looks at the incoming query and decides which model to send it to - sometimes based on a static rule, sometimes based on a classifier, sometimes by trying a cheap model first and escalating if the answer looks weak. The point is the same: spend frontier-model money only on the queries that need a frontier model.

The reason this works is unglamorous. In every production workload I have audited, the query distribution is wildly uneven. A handful of intents (greeting, simple status check, common Q&A) make up the majority of traffic and can be answered by a $0.0006-per-1K-token model. A small fraction of queries (multi-step reasoning, code generation, ambiguous policy questions) actually need a $0.03-per-1K-token model. Without routing, you pay frontier prices for all of it, including the easy majority.
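To make the arithmetic concrete, here is a minimal sketch of the blended price, using the two per-1K-token prices quoted above. The function names are hypothetical, and the model assumes easy and hard queries use similar token counts and ignores the router's own cost and any misroutes, so treat the result as an upper bound on savings rather than a forecast.

```python
CHEAP_PRICE = 0.0006     # USD per 1K tokens, small model (figure from the text)
FRONTIER_PRICE = 0.03    # USD per 1K tokens, frontier model (figure from the text)


def blended_cost_per_1k(easy_fraction: float,
                        cheap_price: float = CHEAP_PRICE,
                        frontier_price: float = FRONTIER_PRICE) -> float:
    """Average price per 1K tokens when `easy_fraction` of traffic goes to the cheap model."""
    if not 0.0 <= easy_fraction <= 1.0:
        raise ValueError("easy_fraction must be between 0 and 1")
    return easy_fraction * cheap_price + (1.0 - easy_fraction) * frontier_price


def savings_vs_frontier_only(easy_fraction: float,
                             cheap_price: float = CHEAP_PRICE,
                             frontier_price: float = FRONTIER_PRICE) -> float:
    """Fraction of the bill saved compared with sending everything to the frontier model."""
    blended = blended_cost_per_1k(easy_fraction, cheap_price, frontier_price)
    return 1.0 - blended / frontier_price
```

Under these idealized assumptions, `savings_vs_frontier_only(0.70)` comes out at about 0.69 and `savings_vs_frontier_only(0.85)` at about 0.83. Real bills land lower once the classifier call and escalations are counted.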

Static vs Classifier-Based vs Cascading Routing

Static routing uses hand-written rules. "If the user is on the free tier, route to GPT-4o-mini. If the query mentions 'code', route to Claude." Fast, cheap, predictable. Hits a ceiling because rules cannot capture intent nuance.
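The two example rules above fit in a few lines. This is a sketch only: the `Request` type, the tier labels, the model name strings, and the default of falling through to GPT-4o are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    query: str
    user_tier: str  # hypothetical labels, e.g. "free" or "paid"


def route_static(req: Request) -> str:
    """Hand-written rules, checked in order; the first match wins."""
    if req.user_tier == "free":
        return "gpt-4o-mini"
    if "code" in req.query.lower():
        return "claude"
    return "gpt-4o"  # assumed default when no rule matches
```

The ceiling shows up quickly: `route_static(Request("Can you decode this invoice line?", "paid"))` returns `"claude"` because "decode" contains "code", even though the query has nothing to do with programming.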

Classifier-based routing uses a small LLM (or a fine-tuned classifier) to predict the right model for the query. Better accuracy, slightly more latency (one extra small LLM call), more flexible. The default for most production systems in 2026.
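A classifier router can be sketched as a label-to-model lookup around whatever classifier you use. Here `classify` is a stand-in for the small LLM or fine-tuned classifier (it is injected so the sketch runs without any provider SDK), and the tier labels and model name strings are assumptions. Unknown or malformed labels fall back to the strongest tier, which trades a little cost for safety.

```python
from typing import Callable, Mapping

# Hypothetical tier labels mapped to placeholder model names.
TIER_TO_MODEL = {
    "tier1": "gpt-4o-mini",
    "tier2": "claude-haiku",
    "tier3": "gpt-4o",
}


def route_with_classifier(query: str,
                          classify: Callable[[str], str],
                          tier_to_model: Mapping[str, str] = TIER_TO_MODEL,
                          fallback_tier: str = "tier3") -> str:
    """Ask the classifier for a tier label and map it to a model name."""
    tier = classify(query).strip().lower()
    if tier not in tier_to_model:
        # The classifier returned something unexpected: fail toward quality.
        tier = fallback_tier
    return tier_to_model[tier]


def keyword_stub(query: str) -> str:
    """Toy classifier for demonstration only; a real one would be a model call."""
    words = query.lower().split()
    if len(words) <= 3:
        return "tier1"
    if "why" in words or "refactor" in words:
        return "tier3"
    return "tier2"
```

For example, `route_with_classifier("hi there", keyword_stub)` picks the tier-1 model, while a classifier that returns an unparseable label sends the query to tier 3.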

Cascading routing sends every query to the cheap model first, then escalates to the expensive model if the response confidence is low or a self-grader rejects the answer. Maximizes savings on easy queries, doubles latency on hard ones. Best for batch or async workloads where latency is forgiving.
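The escalation logic can be sketched as follows. Both models are passed in as callables that return an answer and a confidence score between 0 and 1; where that score comes from (token log-probabilities, a self-grader, a heuristic) is left open, and the 0.7 threshold is an arbitrary assumption to tune against your own evals.

```python
from typing import Callable, Tuple

# (answer text, confidence in [0, 1]); the confidence source is an assumption.
ModelCall = Callable[[str], Tuple[str, float]]


def cascade(query: str,
            cheap: ModelCall,
            expensive: ModelCall,
            min_confidence: float = 0.7) -> Tuple[str, str]:
    """Try the cheap model first; escalate once if its confidence is too low.

    Returns (answer, which_model_answered).
    """
    answer, confidence = cheap(query)
    if confidence >= min_confidence:
        return answer, "cheap"
    # Escalated queries pay for both calls, in cost and in latency.
    answer, _ = expensive(query)
    return answer, "expensive"
```

Logging the second element of the return value gives you the escalation rate, which is the number that decides whether the cascade is actually saving money.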

The TJ LLM Routing Decision Matrix

Pick your routing strategy based on two axes: query complexity variance and cost sensitivity.

Low variance + low cost sensitivity: No router needed. Pick one model, ship.

Low variance + high cost sensitivity: Static routing. One rule based on user tier or feature flag.

High variance + low cost sensitivity: Classifier routing. Pay the small classifier-call latency to get the right model per query.

High variance + high cost sensitivity: Cascading routing. Try cheap, escalate on low confidence. Maximum savings, accept the latency tradeoff.

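Most B2B SaaS workloads are high-variance and cost-sensitive, which puts them in the cascading quadrant. Teams that cannot absorb the extra latency settle for classifier routing instead, which is why those two strategies are winning in 2026.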
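The matrix is small enough to write down as a lookup table. The function name and the boolean framing of the two axes are illustrative choices; in practice both axes are judgment calls rather than clean yes/no values.

```python
def pick_strategy(high_variance: bool, high_cost_sensitivity: bool) -> str:
    """Map the two axes of the decision matrix to a routing strategy."""
    matrix = {
        (False, False): "single model",  # no router needed
        (False, True): "static",
        (True, False): "classifier",
        (True, True): "cascading",
    }
    return matrix[(high_variance, high_cost_sensitivity)]
```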

What Routing Actually Saves

60%: typical cost reduction with classifier-based routing

+12%: accuracy gain when paired with model-specific prompt tuning

70-85%: share of queries that can be served by a small model with no quality loss

100-300ms: added latency for the classifier hop (can be offset by faster small models)

OpenRouter vs Portkey vs LangChain vs Custom

Tool	Routing Style	Best For	Cost	Lock-in
OpenRouter	Multi-provider gateway with optional auto-routing	Teams that want one API for many models	~5% markup on tokens	Low
Portkey	Gateway with rules + fallback + caching	Production teams needing observability + routing	SaaS tiers	Medium
LangChain RouterChain	In-process classifier routing	LangChain teams already in the stack	Free (your model calls)	High (LangChain)
Custom (small classifier + provider SDKs)	Whatever you build	Teams with eng capacity and unusual requirements	Eng time	None

Real-World: $14K to $5.4K in 5 Weeks

A B2B SaaS client running a customer-facing AI assistant on GPT-4o was burning roughly $14,000 per month in OpenAI bills. They had no router - every query, including "hi" and "what's my account balance," went to GPT-4o. We did three things over five weeks. Week one: built a 60-prompt eval set covering their top intents. Week two: added a small classifier (GPT-4o-mini, prompted with their intent taxonomy) that routed each query to one of three tiers - tier 1 (GPT-4o-mini), tier 2 (Claude Haiku), tier 3 (GPT-4o). Weeks three through five: tuned the classifier, added a cascading fallback for low-confidence cases, ran shadow traffic to validate. Final monthly bill: $5,400. Accuracy on the eval set was unchanged within statistical noise. Time to break even on the engineering investment: 11 days.

5 Common Routing Mistakes

1. Routing without evals. If you do not measure accuracy before and after, you will save money and never notice you also lost 8 points of correctness. Eval set is non-negotiable.
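A before-and-after check does not need much machinery. The sketch below assumes an eval set of (query, expected answer) pairs and an exact-match grader; both are simplifications, and most real eval sets need a fuzzier grader. All names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

EvalSet = Sequence[Tuple[str, str]]  # (query, expected answer)


def exact_match(got: str, expected: str) -> bool:
    """Simplest possible grader: case-insensitive string equality."""
    return got.strip().lower() == expected.strip().lower()


def accuracy(answer_fn: Callable[[str], str],
             eval_set: EvalSet,
             is_correct: Callable[[str, str], bool] = exact_match) -> float:
    """Fraction of eval queries that `answer_fn` gets right."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(is_correct(answer_fn(q), expected) for q, expected in eval_set)
    return hits / len(eval_set)


def regression_points(baseline_fn: Callable[[str], str],
                      routed_fn: Callable[[str], str],
                      eval_set: EvalSet) -> float:
    """Accuracy lost by routing, in percentage points (negative means a gain)."""
    return 100.0 * (accuracy(baseline_fn, eval_set) - accuracy(routed_fn, eval_set))
```

Run `regression_points` with the frontier-only pipeline as the baseline and the routed pipeline as the candidate, and block the rollout if the number exceeds whatever tolerance you have agreed on.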

2. Ignoring fallback latency. Cascading routing roughly doubles latency on the queries that escalate, because they wait for both the cheap call and the expensive one. Make sure your UX can absorb it (or set a max-attempts cap).

3. No cost ceiling. A bad classifier can route everything to the expensive model. Set a per-day cost ceiling and alert on it.
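A per-day ceiling can be as simple as a running total that resets at midnight. This sketch keeps the total in memory, which is an assumption that only holds for a single process; a real deployment would likely keep it in a shared store and wire the breach signal into alerting. The class and method names are hypothetical.

```python
import datetime as dt
from typing import Optional


class DailyCostCeiling:
    """Tracks spend per calendar day and reports when the ceiling is breached."""

    def __init__(self, ceiling_usd: float):
        self.ceiling_usd = ceiling_usd
        self._day: Optional[dt.date] = None
        self._spent = 0.0

    def record(self, cost_usd: float, today: Optional[dt.date] = None) -> bool:
        """Add one call's cost; return True if today's spend now exceeds the ceiling."""
        today = today or dt.date.today()
        if today != self._day:
            # New day: reset the running total.
            self._day, self._spent = today, 0.0
        self._spent += cost_usd
        return self._spent > self.ceiling_usd
```

What happens on a breach is a product decision: alert only, or also force all traffic to the cheapest tier until someone looks at the classifier.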

4. Routing on the wrong axis. Some teams route by user tier when the right axis is query complexity. The cheapest cleanup is usually intent-based routing, not user-based.

5. Treating routing as set-and-forget. Query distribution shifts. Re-evaluate the router monthly. The classifier you trained in February may be sending tier-3 work to tier-1 in May.
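One cheap signal to check between full re-evaluations is whether the share of traffic going to each tier has moved. The sketch below compares two samples of routing decisions; the function names and the 10-point threshold are assumptions. A shift does not prove the router is wrong, and a stable split does not prove it is right, so this complements the eval set rather than replacing it.

```python
from collections import Counter
from typing import Dict, Iterable


def tier_shares(routes: Iterable[str]) -> Dict[str, float]:
    """Fraction of decisions that went to each tier."""
    counts = Counter(routes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no routing decisions to analyze")
    return {tier: n / total for tier, n in counts.items()}


def drifted_tiers(baseline_routes: Iterable[str],
                  current_routes: Iterable[str],
                  threshold: float = 0.10) -> Dict[str, float]:
    """Tiers whose traffic share changed by more than `threshold`, with the signed change."""
    before = tier_shares(baseline_routes)
    after = tier_shares(current_routes)
    changes = {}
    for tier in set(before) | set(after):
        delta = after.get(tier, 0.0) - before.get(tier, 0.0)
        if abs(delta) > threshold:
            changes[tier] = delta
    return changes
```

A tier-1 share that climbs from 70% to 90% is exactly the February-to-May pattern described above and is worth a fresh eval run.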

Frequently Asked Questions

Does OpenRouter's auto-routing work?

For general workloads, surprisingly well - it picks reasonable models per query and abstracts provider quirks. For domain-specific workloads (legal, medical, code), a custom classifier trained on your taxonomy almost always beats it. Use OpenRouter for v1, replace with a custom classifier when you have eval data.

What is the simplest router I can ship?

A single GPT-4o-mini call that returns "easy" or "hard" based on the query, plus an if-statement that picks the model. You can ship that in an afternoon and it captures most of the savings of a more sophisticated system.
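That afternoon version might look like the sketch below. `label_fn` stands in for the GPT-4o-mini call (it takes the prompt and returns the model's text), and the prompt wording and model name strings are assumptions. Anything other than a clean "easy" goes to the strong model, so a confused classifier costs money rather than quality.

```python
from typing import Callable

# Hypothetical prompt; tune the wording against your own eval set.
ROUTER_PROMPT = (
    "Classify the user query as 'easy' or 'hard'.\n"
    "Reply with exactly one word.\n\n"
    "Query: {query}"
)


def simplest_router(query: str,
                    label_fn: Callable[[str], str],
                    cheap_model: str = "gpt-4o-mini",
                    strong_model: str = "gpt-4o") -> str:
    """One small-model call plus an if-statement."""
    label = label_fn(ROUTER_PROMPT.format(query=query)).strip().lower()
    if label == "easy":
        return cheap_model
    return strong_model  # "hard", or any reply we could not parse
```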

How does routing combine with semantic caching?

Cache hits return before the router runs. Route only on cache miss. The two compound: caching saves on repeated queries, routing saves on the long tail of unique queries. Together they typically cut bills 70-80%.
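The ordering can be sketched with a toy in-memory semantic cache. Everything here is illustrative: `embed` stands in for a real embedding model, the linear scan would not scale, and the 0.9 similarity threshold is an assumption that needs tuning, since a threshold that is too loose returns wrong answers for free.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored answer when a new query is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        vec = self._embed(query)
        best = max(self._entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self._threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))


def answer_query(query: str,
                 cache: SemanticCache,
                 route_and_call: Callable[[str], str]) -> Tuple[str, str]:
    """Check the cache first; only a miss reaches the router."""
    cached = cache.get(query)
    if cached is not None:
        return cached, "cache"
    result = route_and_call(query)
    cache.put(query, result)
    return result, "router"
```

Here `route_and_call` is whatever function routes the query and calls the chosen model; on a cache hit it never runs, so neither the classifier hop nor the model call is paid for.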

Can I route to open-source models like Llama or Mistral?

Yes - and the savings get bigger. The hardest part is operating the inference, whether through a hosted provider (Together, Fireworks, Groq) or self-hosted on Modal/RunPod. For tier-1 traffic (high volume, low complexity), open-source can drop costs another 5-10x on top of the routing savings.

How do I know if my workload will benefit from routing?

Pull a sample of 200 production queries. Manually score each as "easy" (any small model can handle it) or "hard" (needs the frontier model). If more than 50% of your traffic is "easy," routing will pay back fast. Most production workloads land at 70-85% easy.
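Once the sample is labeled, the check is a one-liner. The sketch assumes the labels are the strings "easy" and "hard" and uses the 50% cutoff from the answer above; the function names are hypothetical.

```python
from typing import Sequence


def easy_share(labels: Sequence[str]) -> float:
    """Fraction of manually scored queries labeled 'easy'."""
    if not labels:
        raise ValueError("no labels provided")
    return sum(1 for label in labels if label == "easy") / len(labels)


def routing_likely_pays_off(labels: Sequence[str], min_easy_share: float = 0.5) -> bool:
    """True when more than `min_easy_share` of the sample is easy."""
    return easy_share(labels) > min_easy_share
```

With 200 labels, each query moves the share by half a percentage point, which is precise enough for a go/no-go decision but not for forecasting the exact bill.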

Conclusion

LLM routing is the highest-leverage cost optimization in production AI in 2026 - bigger than caching, bigger than prompt compression, bigger than every other trick combined. The teams that route are paying 60% less than the teams that do not. The teams that route plus cache plus tune prompts are paying 80% less. None of it is hard. All of it requires evals.

Want a second opinion on whether routing makes sense for your specific workload? Happy to look at your traffic mix in a free 30-minute call.

