How I Cut an AI Agent's LLM Cost by 60%

In Q4 2024 a B2B SaaS client came to me with a problem that is becoming uncomfortably common: their OpenAI bill was scaling linearly with active users, eating roughly 18% of revenue, and the CFO was asking hard questions. Eight weeks later we had cut that bill by 60% with zero measurable drop in CSAT. This is the exact playbook - what we did, what we measured, and what we would do differently.

TL;DR - Key Takeaways

Semantic caching caught 31% of repeat queries with sub-50ms latency, eliminating those LLM calls entirely.
Prompt compression on the top 12 chains cut average input tokens by 38% with no quality regression.
Intent-aware model routing sent ~60% of requests to GPT-4o-mini while reserving GPT-4o for complex reasoning.
Total result: 60% spend reduction, +0.2 CSAT delta (within noise), 220ms P50 latency improvement.

The Starting Point

The client runs a high-traffic B2B support platform - thousands of tickets per day routed through an AI agent that responded directly when confident and escalated when not. Spend had been climbing 12% month-over-month for two quarters. Their team had already done the obvious thing: switched some endpoints from GPT-4 to GPT-4o. That bought them 30 days. The real problem was structural, not model-selection.

Three constraints made this engagement harder than a typical "just downgrade the model" exercise:

Quality bar was sacred. CSAT was at 4.6/5 and any drop was a non-starter.
Latency budget was tight. P95 had to stay under 4 seconds end-to-end.
No model migration risk. Production was on OpenAI; we had two weeks to ship the first cost cut without changing providers.

Step 1: The Cost Audit (Weeks 1-2)

Most LLM cost programs fail because teams optimize before they measure. The first thing I did was instrument every LLM call with five tags: endpoint, user tier, prompt-template version, model, and input/output token counts. I then loaded 30 days of historical logs into BigQuery and built a Looker dashboard on top of them.
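To make the tagging concrete, here is a minimal sketch of what a per-call record and a spend-share rollup could look like. The `LLMCallRecord` class, the model names, and the price table are illustrative assumptions, not the client's actual schema or rates; in the engagement the aggregation ran in BigQuery rather than in application code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCallRecord:
    """One logged LLM call, carrying the five tags described above."""
    endpoint: str
    user_tier: str
    template_version: str
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per 1M tokens as (input, output).
# Placeholder values: substitute your provider's current rates.
PRICE_PER_MTOK = {
    "big-model": (2.50, 10.00),
    "small-model": (0.15, 0.60),
}


def call_cost(rec: LLMCallRecord) -> float:
    """Dollar cost of a single call under the placeholder price table."""
    in_price, out_price = PRICE_PER_MTOK[rec.model]
    return (rec.input_tokens * in_price + rec.output_tokens * out_price) / 1_000_000


def spend_share_by_endpoint(records: list[LLMCallRecord]) -> dict[str, float]:
    """Fraction of total spend attributable to each endpoint."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.endpoint] += call_cost(rec)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {endpoint: cost / grand_total for endpoint, cost in totals.items()}
```

Sorting the result by share is usually enough to show whether a handful of endpoints dominate the bill.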

The dashboard surfaced three uncomfortable truths the team did not know:

Three endpoints accounted for 71% of total spend.
The top endpoint had an average prompt length of 4,800 tokens - of which roughly 2,100 tokens were duplicated context already in the system prompt.
About 35% of queries on the top endpoint were semantically near-duplicates of queries from the last 24 hours.

That third number is what made the engagement viable. If a third of your queries are repeats, semantic caching alone can remove up to a third of the LLM calls on that endpoint before you touch a single prompt.

Lesson: Never start optimizing without 30 days of token-level cost telemetry. You cannot improve what you cannot see.

Step 2: Semantic Caching with pgvector (Weeks 3-4)

We built a semantic cache layer in front of the three highest-volume endpoints. The implementation: every incoming query gets embedded with text-embedding-3-small, and we look up the nearest cached response, serving it only if its cosine similarity to the query is above a tuned threshold.

Python

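# Assumes `pg` is an asyncpg connection with the pgvector codec registered
# (pgvector.asyncpg.register_vector), that created_at defaults to NOW(), and
# that embed(), call_llm(), and the per-endpoint THRESHOLD dict exist elsewhere.
async def get_cached_or_call(query: str, endpoint: str):
    embedding = await embed(query)
    # <=> is pgvector's cosine-distance operator, so similarity = 1 - distance.
    # Ordering by distance ascending returns the nearest cached entry first.
    hit = await pg.fetchrow("""
        SELECT response, 1 - (embedding <=> $1) AS score
        FROM cache_entries
        WHERE endpoint = $2
          AND created_at > NOW() - INTERVAL '7 days'
        ORDER BY embedding <=> $1
        LIMIT 1
    """, embedding, endpoint)

    # Serve from cache only when the nearest entry clears this endpoint's threshold.
    if hit and hit["score"] > THRESHOLD[endpoint]:
        return hit["response"], "cache_hit"

    # Cache miss: call the model, then store the answer for future near-duplicates.
    response = await call_llm(query)
    await pg.execute(
        "INSERT INTO cache_entries (endpoint, embedding, response) VALUES ($1, $2, $3)",
        endpoint, embedding, response
    )
    return response, "cache_miss"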

Two non-obvious choices made this work in production:

Per-endpoint similarity thresholds. The order-status endpoint tolerated 0.92 similarity. The refund-policy endpoint required 0.97 because the wrong cached answer there has compliance consequences.
Source-doc invalidation hooks. Whenever a knowledge-base article was updated, we evicted all cache entries whose embeddings were similar to that article. Without this, customers would receive stale policy answers.
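The two choices above can be sketched in a few lines of plain Python. This is an in-memory illustration under stated assumptions, not the production pgvector code: the endpoint keys, the `CacheEntry` class, and the `INVALIDATION_SIMILARITY` cutoff are hypothetical (the text gives the hit thresholds but not the eviction cutoff).

```python
import math
from dataclasses import dataclass

# Hit thresholds from the text; the endpoint keys are illustrative names.
THRESHOLD = {"order_status": 0.92, "refund_policy": 0.97}

# Assumed cutoff for "similar to the updated article"; not given in the text.
INVALIDATION_SIMILARITY = 0.80


@dataclass
class CacheEntry:
    endpoint: str
    embedding: list[float]
    response: str


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_cache_hit(endpoint: str, score: float) -> bool:
    """Apply the per-endpoint threshold to a similarity score."""
    return score > THRESHOLD[endpoint]


def evict_for_updated_article(
    cache: list[CacheEntry], article_embedding: list[float]
) -> list[CacheEntry]:
    """Return the cache with entries similar to the updated article removed."""
    return [
        entry
        for entry in cache
        if cosine_similarity(entry.embedding, article_embedding) < INVALIDATION_SIMILARITY
    ]
```

The same score of 0.95 is a hit on the order-status endpoint and a miss on the refund-policy endpoint, which is the point of keeping thresholds per endpoint.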

Cache hit rate stabilized at 31% across the three endpoints. Median cache-hit latency was 47ms - effectively free compared to a 1.8-second LLM call.

Step 3: Prompt Compression (Weeks 5-6)

The audit revealed that the top 12 prompts had grown organically over 14 months - bolted-on instructions, redundant few-shot examples, lazy "include the entire knowledge base" patterns. We did three things:

Deduplicated system prompts. Removed instructions that contradicted or repeated each other. Average reduction: 600 tokens per call.
Lazy-loaded context. Instead of cramming the entire customer history into every prompt, we passed only the 3 most recent interactions and a summary. Tools could fetch more if needed.
Pruned few-shot examples. Reduced from 8 examples to 3 carefully chosen ones. Quality scores held; tokens dropped by 1,100 per call on the worst offender.

Net effect: average input tokens on the top 12 chains dropped 38%. We re-ran the full eval suite (more on that below) and saw zero measurable quality regression.
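The lazy-loading idea is simple enough to sketch. The function below is a hypothetical illustration, assuming interactions are stored oldest-first as plain strings and that a summary has already been generated elsewhere; the name `build_context` and the output layout are not from the client's codebase.

```python
def build_context(summary: str, interactions: list[str], max_recent: int = 3) -> str:
    """Build a compact context block: a summary plus only the most recent interactions.

    `interactions` is assumed to be ordered oldest-first.
    """
    # Guard against max_recent == 0, where a [-0:] slice would return everything.
    recent = interactions[-max_recent:] if max_recent > 0 else []
    lines = [f"Customer summary: {summary}", "Recent interactions:"]
    lines.extend(f"- {item}" for item in recent)
    return "\n".join(lines)
```

Older history stays out of the prompt by default and is only fetched through a tool call when the model actually needs it.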

Step 4: Intent-Aware Model Routing (Weeks 6-7)

Not every query needs GPT-4o. We built a lightweight classifier - a fine-tuned DistilBERT model running on CPU - that tagged each query with one of four intents: simple_lookup, factual_qa, complex_reasoning, or action_required. We routed simple_lookup and factual_qa to GPT-4o-mini, kept complex_reasoning on GPT-4o, and sent action_required through a supervisor flow.

Roughly 60% of traffic ended up on GPT-4o-mini. With GPT-4o-mini at roughly 1/15th the cost of GPT-4o, this single change drove the largest dollar reduction of the engagement.
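Once the classifier has produced an intent label, the routing itself is just a lookup. The sketch below takes the label as input and leaves the classifier out; the `ROUTES` table mirrors the mapping described above, while the fallback to the larger model for unknown intents and the `blended_cost_ratio` helper are assumptions added for illustration.

```python
# Intent -> target, following the mapping described in the text.
ROUTES = {
    "simple_lookup": "gpt-4o-mini",
    "factual_qa": "gpt-4o-mini",
    "complex_reasoning": "gpt-4o",
    "action_required": "supervisor_flow",
}


def route(intent: str) -> str:
    """Pick a target for a classified query.

    Unknown intents fall back to the larger model: an assumption that
    errs toward quality rather than cost.
    """
    return ROUTES.get(intent, "gpt-4o")


def blended_cost_ratio(cheap_share: float, cheap_price_ratio: float) -> float:
    """Cost after routing relative to sending everything to the large model.

    Assumes equal token volume per request on both models, which is a
    simplification; real traffic will differ.
    """
    return (1 - cheap_share) + cheap_share * cheap_price_ratio
```

Under that equal-volume simplification, 60% of traffic at 1/15th the price gives `blended_cost_ratio(0.6, 1 / 15)` of about 0.44, so the routed calls would cost a bit under half of what they did before.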

Step 5: The Eval Harness That Made It Safe

Every cost cut had to pass through an LLM-as-judge eval suite plus a human spot-check. The eval suite scored every change against 240 golden tickets across the four major intents, on three axes: factual accuracy, helpfulness, and tone. We required no axis to drop more than 0.05 from baseline before shipping.
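The shipping rule reduces to a small gate function. This sketch assumes each run of the eval suite has already been averaged into one score per axis; the axis keys and function names are hypothetical, and the score scale is whatever the judge uses.

```python
# Maximum allowed drop per axis, as stated in the text.
MAX_DROP = 0.05
AXES = ("factual_accuracy", "helpfulness", "tone")


def failing_axes(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = MAX_DROP,
) -> list[str]:
    """Return the axes where the candidate dropped more than max_drop below baseline."""
    return [axis for axis in AXES if baseline[axis] - candidate[axis] > max_drop]


def passes_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = MAX_DROP,
) -> bool:
    """A change ships only if no axis regresses beyond the allowed drop."""
    return not failing_axes(baseline, candidate, max_drop)
```

Returning the list of failing axes, rather than a bare boolean, makes it easier to see which dimension of quality a change has damaged.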

The compression change failed the first eval pass - we had pruned a few-shot example that turned out to be load-bearing for tone consistency on premium-tier customers. We added it back, re-ran, passed.

Lesson: Ship cost cuts behind an eval harness, not behind your gut. The cost savings vanish the moment CSAT drops.

The Results

60% reduction in monthly LLM spend

31% semantic cache hit rate

+0.2 CSAT delta (within noise)

220ms P50 latency improvement

The savings repaid the engagement in 5 weeks and freed budget to roll out LLM features to the free tier. Cost observability is now a standing part of the platform team's weekly review.

The 4-Layer LLM Cost Framework

The TJ LLM Cost Framework

Measure first. Per-endpoint, per-template token telemetry for at least 30 days. No telemetry, no optimization.
Cache the obvious. Semantic cache on your top 3 endpoints. Hit rate above 20% means you have already won.
Compress the chronic. Audit your top 10 prompts. Most are 30-50% bloated.
Route the routine. Intent-classify and send simple traffic to a smaller model. Reserve flagship models for complex reasoning.

5 Common Mistakes Teams Make

Switching models without an eval harness. The cost win evaporates the day a customer notices the quality drop.
Caching without invalidation. Stale policy answers create support tickets - a cost cut that creates new costs.
Compressing prompts blindly. Few-shot examples are often load-bearing. Test before you trim.
Routing on string matching instead of intent. A regex router will misclassify long-tail queries to the cheap model and tank quality.
Treating cost as a one-time project. LLM cost drift is constant. Make it a standing weekly review or it comes back.

Frequently Asked Questions

How long does an LLM cost optimization engagement typically take?

For a system processing 100k+ requests per day, expect 6-10 weeks for a 40-60% reduction. The first 2 weeks are pure measurement; the bulk of dollar wins land in weeks 4-8.

Will semantic caching hurt response quality?

Only if you set thresholds too loose or skip invalidation. With per-endpoint thresholds and source-doc invalidation hooks, cached responses are indistinguishable from fresh ones.

What is the cheapest model in 2026 that still produces production-quality answers?

For routine support workloads: GPT-4o-mini, Claude 3.5 Haiku, and Gemini 1.5 Flash all sit in the same cost band and produce strong answers when paired with good RAG. Pick based on your latency profile and existing tooling.

How do you handle cost spikes from a sudden traffic burst?

Three layers: per-tenant rate limits at the gateway, fallback routing to a cheaper model when concurrency exceeds a threshold, and an automated alert when spend exceeds the trailing 7-day average by more than 25%.
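The third layer, the spend alert, is the easiest to sketch. The function below is a minimal illustration assuming daily spend totals are already available as a list ordered oldest-first; the function name and signature are hypothetical.

```python
def spend_alert(
    trailing_daily_spend: list[float],
    today_spend: float,
    threshold: float = 0.25,
) -> bool:
    """Return True when today's spend exceeds the trailing 7-day average by more than threshold.

    `trailing_daily_spend` is assumed to be ordered oldest-first and to
    exclude today; only the last 7 entries are used.
    """
    if not trailing_daily_spend:
        return False  # No history yet, so there is nothing to compare against.
    window = trailing_daily_spend[-7:]
    average = sum(window) / len(window)
    return today_spend > average * (1 + threshold)
```

With a trailing average of $100 per day, the alert fires at $130 but not at $120.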

Do these techniques work on Anthropic and Google models too?

Yes. The patterns are model-agnostic. Caching, compression, and routing apply to any LLM provider. The specific dollar wins shift based on each provider's pricing.

Conclusion

LLM cost optimization is not glamorous work, but it is high-leverage. Most production systems are leaving 40-60% on the table because they ship features without a measurement layer. The wins are real, repeatable, and protected by a good eval harness.

If your monthly LLM spend has crossed five figures and the slope is still up-and-to-the-right, it is time for an audit. Happy to look at your specific architecture in a free 30-minute call.

Burning Through Your OpenAI Budget?

Free 30-minute audit. I will look at your top spend endpoints and tell you where the easy wins are.

