A production LLM cost program that combined semantic caching, prompt compression, and cross-model routing to drop monthly API spend from six figures to mid-five figures while holding quality flat.
The Challenge
This high-traffic B2B SaaS platform saw its margins compress as Large Language Model (LLM) API costs scaled linearly with active users. API spend had reached approximately 18% of platform revenue and was still climbing, which put financial pressure on the team to cut it. The constraint was that cost cuts could not cause any measurable regression in output quality or customer satisfaction (CSAT) scores. The engineering team started with no observability into which API endpoints or user cohorts drove cost. Every conversational chain defaulted to an expensive frontier model (GPT-4), even for simple classification tasks that a smaller model could handle. Over the preceding twelve months, prompt templates had doubled in token length as instructions accumulated without pruning, inflating token usage without adding business value.
Pain points we set out to solve
- LLM spend was around 18% of revenue and climbing
- No observability into which endpoints drove cost
- Every chain defaulted to GPT-4, including tasks a smaller model could handle
- Prompts had grown 2x longer over 12 months with no pruning
Objectives
- 01 Cut monthly LLM spend by at least 50%
- 02 Hold CSAT within plus or minus 1 point of baseline
- 03 Ship cost dashboards so the team can self-serve from here on
- 04 Leave a routing layer the team can extend to new models
Approach
How we delivered — phased, with clear checkpoints and evidence at each step.
- Week 1-2
Cost forensics
Instrumented every LLM call with endpoint, user tier, and token counts. Loaded 30 days of logs into a warehouse and built a Looker dashboard that exposed where the spend actually went.
- Week 3-4
Semantic cache layer
Built a pgvector-backed semantic cache in front of the top three highest-volume endpoints. Added embedding-similarity thresholds and cache-invalidation on source-doc change events.
- Week 5-6
Prompt compression and routing
Audited and rewrote the 12 longest prompts, cutting average length by 38%. Added an intent-classifier router that sends simple tasks to GPT-4o-mini and reserves GPT-4o for complex reasoning.
- Week 7-8
Quality guardrails and rollout
Ran a CSAT-safe eval harness (LLM-as-judge plus human spot-check) on every cache hit and routed response. Rolled out to 100% with a kill switch on every optimization.
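The rollout step can be illustrated with a minimal sketch of per-optimization kill switches plus a CSAT tolerance check. The flag names, the `answer` helper, and the callables passed into it are all hypothetical; the only figure taken from the engagement is the plus or minus 1 point CSAT tolerance.

```python
from dataclasses import dataclass, field


@dataclass
class KillSwitches:
    """Per-optimization on/off flags. The flag names are hypothetical."""
    flags: dict = field(default_factory=lambda: {
        "semantic_cache": True,
        "prompt_compression": True,
        "model_router": True,
    })

    def enabled(self, name: str) -> bool:
        # Unknown flags are treated as off, so a typo fails safe.
        return self.flags.get(name, False)

    def kill(self, name: str) -> None:
        self.flags[name] = False


def csat_within_tolerance(baseline: float, observed: float,
                          tolerance: float = 1.0) -> bool:
    """True when observed CSAT is within +/- tolerance points of baseline."""
    return abs(observed - baseline) <= tolerance


def answer(query, switches, cache_lookup, call_model):
    """Serve from cache only while the cache switch is on; otherwise call the model."""
    if switches.enabled("semantic_cache"):
        cached = cache_lookup(query)
        if cached is not None:
            return cached
    return call_model(query)
```

If the eval harness reports CSAT outside tolerance, calling `switches.kill("semantic_cache")` sends all traffic back to the unoptimized path without a deploy.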
The Solution
The solution is a three-layer cost optimization stack that intercepts every LLM request and reduces token consumption and model cost before the request reaches an external API. First, a semantic cache backed by pgvector on Postgres matches incoming queries by embedding similarity and serves about 31% of query volume from cache with sub-50ms latency. Second, a prompt audit and compression pass cut average prompt length by 38% across the twelve highest-volume templates through deduplication, lazy-loaded context, and few-shot pruning. Third, a lightweight classifier router scores request complexity, sending 62% of simple tasks to cheaper models such as GPT-4o-mini and Claude Haiku and reserving expensive frontier models for multi-step reasoning. Each layer sits behind a feature flag, spend is tracked in Looker dashboards, and an automated evaluation harness checks that output quality stays at parity with the baseline.
pgvector semantic cache
Caches responses keyed by embedding similarity, not exact string match. Covers about 31% of total query volume with sub-50ms hit latency.
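A minimal sketch of the lookup and invalidation logic, assuming embeddings are computed elsewhere and passed in as plain lists. It uses an in-memory list so it runs without a database; in the pgvector-backed version the linear scan would become a nearest-neighbour query in Postgres. The class name, the 0.92 threshold, and the `source_doc_ids` field are illustrative assumptions, not the production design.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class CacheEntry:
    embedding: list
    response: str
    source_doc_ids: frozenset  # documents the cached answer was derived from


@dataclass
class SemanticCache:
    threshold: float = 0.92  # assumed value; tune per endpoint
    entries: list = field(default_factory=list)

    def get(self, embedding):
        """Return the most similar cached response at or above the threshold, else None."""
        best, best_score = None, self.threshold
        for entry in self.entries:
            score = cosine_similarity(embedding, entry.embedding)
            if score >= best_score:
                best, best_score = entry, score
        return best.response if best else None

    def put(self, embedding, response, source_doc_ids=()):
        self.entries.append(
            CacheEntry(list(embedding), response, frozenset(source_doc_ids))
        )

    def invalidate_doc(self, doc_id):
        """Drop every entry derived from doc_id; returns how many were removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if doc_id not in e.source_doc_ids]
        return before - len(self.entries)
```

Wiring `invalidate_doc` to source-doc change events is what keeps a hit from serving an answer built on stale content.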
Prompt audit and compression
Cut average prompt length by 38% on the top 12 chains through deduplication, lazy-loaded context, and few-shot pruning.
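Two of the techniques named here, deduplication and few-shot pruning, can be sketched in a few lines. This is an illustration under simple assumptions: duplicates are detected by normalized line text, few-shot examples are ranked by word overlap with the query, and token count is approximated by whitespace splitting. The actual audit may have used different criteria and a real tokenizer.

```python
def dedupe_lines(prompt: str) -> str:
    """Drop repeated instruction lines, ignoring case and extra whitespace."""
    seen = set()
    kept = []
    for line in prompt.splitlines():
        key = " ".join(line.lower().split())
        if key and key in seen:
            continue  # an earlier line already said this
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def prune_few_shot(examples, query: str, k: int = 2):
    """Keep the k examples sharing the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(example: str) -> int:
        return len(query_words & set(example.lower().split()))

    # sorted() is stable, so ties keep their original order.
    return sorted(examples, key=overlap, reverse=True)[:k]


def rough_token_count(text: str) -> int:
    """Whitespace word count as a crude stand-in for a tokenizer."""
    return len(text.split())
```

Comparing `rough_token_count` before and after gives a quick per-template estimate of the reduction.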
Intent-aware model router
Lightweight classifier routes requests to GPT-4o-mini, GPT-4o, or Claude 3.5 based on complexity and latency budget.
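The routing decision might look like the sketch below. The keyword heuristic stands in for the trained intent classifier, which is not described here, and the thresholds, the 2000 ms latency cutoff, and the rule for choosing between the two larger models are assumptions made for illustration. The model strings are display labels, not API model identifiers.

```python
from dataclasses import dataclass

# Hypothetical cues that a request needs multi-step reasoning.
REASONING_HINTS = ("why", "compare", "explain", "analyze", "step by step")


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def complexity_score(text: str) -> float:
    """Heuristic score in [0, 1]; a real router would use a trained classifier."""
    lowered = text.lower()
    score = 0.0
    if len(lowered.split()) > 60:
        score += 0.5  # long inputs tend to need more reasoning
    score += 0.25 * sum(1 for hint in REASONING_HINTS if hint in lowered)
    return min(score, 1.0)


def route(text: str, latency_budget_ms: int, threshold: float = 0.5) -> Route:
    """Pick a model label from complexity and latency budget (assumed rules)."""
    if complexity_score(text) < threshold:
        return Route("gpt-4o-mini", "simple task")
    if latency_budget_ms < 2000:
        return Route("gpt-4o", "complex task, tight latency budget")
    return Route("claude-3.5", "complex task, relaxed latency budget")
```

Keeping the model choice in one function is what lets the team add a new model by changing a single rule.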
Cost observability
A Looker dashboard broken down by endpoint, customer tier, and model, with alerts on unexpected spikes.
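The aggregation behind such a dashboard can be sketched as follows. The record fields mirror the dimensions named above; the per-1K-token prices are supplied by the caller and are placeholders, and the spike rule (today's spend above a multiple of the trailing average) is an assumed alerting policy rather than the one configured in Looker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCall:
    endpoint: str
    customer_tier: str
    model: str
    prompt_tokens: int
    completion_tokens: int


def call_cost(call: LLMCall, price_per_1k: dict) -> float:
    """Cost of one call; price_per_1k maps model -> (input price, output price)."""
    price_in, price_out = price_per_1k[call.model]
    return (call.prompt_tokens / 1000 * price_in
            + call.completion_tokens / 1000 * price_out)


def spend_by(calls, price_per_1k: dict, key: str) -> dict:
    """Total spend grouped by 'endpoint', 'customer_tier', or 'model'."""
    totals = defaultdict(float)
    for call in calls:
        totals[getattr(call, key)] += call_cost(call, price_per_1k)
    return dict(totals)


def is_spike(today: float, trailing: list, factor: float = 2.0) -> bool:
    """Flag today's spend if it exceeds factor x the trailing average."""
    if not trailing:
        return False
    return today > factor * (sum(trailing) / len(trailing))
```

Grouping the same call records by each of the three keys yields the endpoint, tier, and model views from one data source.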
Technology stack
Picked for latency, cost, and long-term maintainability — not for novelty.
AI / Models
Caching
Routing
Observability
Results
Business impact
The savings repaid the engagement in 5 weeks and freed budget to roll out LLM features to the free tier. Cost observability is now a standing part of the platform team's weekly review.
Key takeaways
- Most LLM cost problems are prompt-length and model-selection problems in disguise
- Semantic caching only pays off with good invalidation; pair it with source-of-truth change events
- Never ship cost cuts without an eval harness that catches quality regressions in production