LLM Cost Optimization for SaaS Platform

A production LLM cost program that combined semantic caching, prompt compression, and cross-model routing to drop monthly API spend from six figures to mid-five figures while holding quality flat.

The Challenge

This high-traffic B2B SaaS platform saw its margins compress as Large Language Model (LLM) API costs scaled linearly with active users. API spend had reached approximately 18% of total platform revenue, and finance needed it cut. The constraint was that cost cuts could not cause any measurable regression in output quality or customer satisfaction (CSAT) scores. The engineering team started with no observability into which API endpoints or user cohorts drove cost. Every conversational chain defaulted to an expensive frontier model (GPT-4), even for simple classification tasks. Over the preceding twelve months, prompt templates had doubled in token length as instructions accumulated without pruning, inflating token usage without adding business value.

Pain points we set out to solve

LLM spend was around 18% of revenue and climbing
No observability into which endpoints drove cost
Every chain defaulted to GPT-4, including tasks a smaller model could handle
Prompts had grown 2x longer over 12 months with no pruning

Objectives

01. Cut monthly LLM spend by at least 50%
02. Hold CSAT within ±1 point of baseline
03. Ship cost dashboards so the team can self-serve from here on
04. Leave a routing layer the team can extend to new models

Approach

Delivered in phases, with clear checkpoints and evidence at each step.

Weeks 1-2: Cost forensics
Instrumented every LLM call with endpoint, user tier, and token counts. Loaded 30 days of logs into a warehouse and built a Looker dashboard that exposed where the spend actually went.

Weeks 3-4: Semantic cache layer
Built a pgvector-backed semantic cache in front of the three highest-volume endpoints. Added embedding-similarity thresholds and cache invalidation on source-doc change events.

Weeks 5-6: Prompt compression and routing
Audited and rewrote the 12 longest prompts, cutting average length by 38%. Added an intent-classifier router that sends simple tasks to GPT-4o-mini and reserves GPT-4o for complex reasoning.

Weeks 7-8: Quality guardrails and rollout
Ran a CSAT-safe eval harness (LLM-as-judge plus human spot-check) on every cache hit and routed response. Rolled out to 100% of traffic with a kill switch behind every optimization.
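
The rollout pattern in the last phase can be sketched as follows. This is a minimal illustration, not the team's implementation: `KillSwitches`, `guarded`, and `passes_quality_gate` are hypothetical names, the flag store is an in-memory stand-in for a real flag service, and the judge scores are assumed to come from whatever LLM-as-judge harness is in use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class KillSwitches:
    """In-memory flag store; a real deployment would read from a flag service."""
    enabled: Dict[str, bool] = field(default_factory=dict)

    def is_on(self, name: str) -> bool:
        # Unknown flags default to off, so a new optimization is opt-in.
        return self.enabled.get(name, False)


def guarded(
    flags: KillSwitches,
    name: str,
    optimized: Callable[[str], str],
    baseline: Callable[[str], str],
) -> Callable[[str], str]:
    """Return a handler that uses `optimized` only while the flag is on."""

    def handler(request: str) -> str:
        if not flags.is_on(name):
            return baseline(request)
        try:
            return optimized(request)
        except Exception:
            # Any failure in the optimization falls back to the baseline path.
            return baseline(request)

    return handler


def passes_quality_gate(
    judge_scores: List[float], baseline_mean: float, tolerance: float
) -> bool:
    """True if the mean judge score is at most `tolerance` below baseline."""
    if not judge_scores:
        return False  # no evidence means no pass
    mean = sum(judge_scores) / len(judge_scores)
    return mean >= baseline_mean - tolerance
```

Flipping a flag off sends traffic straight back to the unoptimized path, which is what makes a 100% rollout reversible without a deploy.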

The Solution

The solution is a three-layer cost optimization stack that intercepts every LLM request and reduces token consumption or reroutes the call before it reaches an external API. First, a semantic cache backed by pgvector on Postgres matches incoming queries by embedding similarity and serves about 31% of query volume from cache with sub-50ms latency. Second, a prompt audit and compression pass cut prompt length by 38% across twelve prompt templates through deduplication, lazy-loaded context, and few-shot pruning. Third, a lightweight classifier router scores request complexity, sending 62% of simple tasks to cheaper models like GPT-4o-mini and Claude Haiku and reserving frontier models for multi-step reasoning. Each layer sits behind a feature flag, spend is tracked in Looker dashboards, and an automated evaluation harness checks that output quality stays at parity with baseline.

pgvector semantic cache

Caches responses keyed by embedding similarity, not exact string match. Covers about 31% of total query volume with sub-50ms hit latency.
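
A minimal sketch of the lookup-and-invalidate logic is shown below, using an in-memory list in place of Postgres so it runs on its own. The class name, the 0.92 default threshold, and the `source_doc_id` field are assumptions for illustration; in a pgvector deployment the similarity scan would typically be an `ORDER BY embedding <=> query LIMIT 1` query, where `<=>` is pgvector's cosine distance operator (similarity is 1 minus that distance).

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class CacheEntry:
    embedding: List[float]
    response: str
    source_doc_id: str  # hypothetical link back to the document the answer used


class SemanticCache:
    def __init__(self, threshold: float = 0.92) -> None:
        # Assumed threshold; the right value depends on the embedding model.
        self.threshold = threshold
        self._entries: List[CacheEntry] = []

    def put(self, embedding: Sequence[float], response: str, source_doc_id: str) -> None:
        self._entries.append(CacheEntry(list(embedding), response, source_doc_id))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the most similar cached response at or above the threshold."""
        best: Optional[CacheEntry] = None
        best_score = self.threshold
        for entry in self._entries:
            score = cosine_similarity(embedding, entry.embedding)
            if score >= best_score:
                best, best_score = entry, score
        return best.response if best else None

    def invalidate(self, source_doc_id: str) -> int:
        """Drop every entry derived from a changed document; return the count."""
        before = len(self._entries)
        self._entries = [e for e in self._entries if e.source_doc_id != source_doc_id]
        return before - len(self._entries)
```

Calling `invalidate` from the source-doc change event is what keeps a hit from returning an answer built on stale content.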

Prompt audit and compression

Cut average prompt length by 38% on the top 12 chains through deduplication, lazy-loaded context, and few-shot pruning.
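
Two of the techniques named above, deduplication and few-shot pruning, can be sketched in a few lines. This is an illustrative approximation rather than the audit tooling itself: `estimate_tokens` counts whitespace-separated words as a stand-in for a real tokenizer, and the relevance scores passed to `prune_few_shot` are assumed to be supplied by the caller.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def dedupe_lines(prompt: str) -> str:
    """Drop repeated non-blank lines, keeping the first occurrence and order."""
    seen = set()
    kept = []
    for line in prompt.splitlines():
        # Normalize case and spacing so near-identical instructions collide.
        key = " ".join(line.split()).lower()
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def prune_few_shot(examples: List[Tuple[float, str]], token_budget: int) -> List[str]:
    """Keep the highest-scoring examples that fit within the token budget."""
    kept: List[str] = []
    used = 0
    for _score, text in sorted(examples, key=lambda e: e[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= token_budget:
            kept.append(text)
            used += cost
    return kept
```

Any change of this kind would still need to pass through an eval harness, since a pruned example can be the one that was carrying the output format.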

Intent-aware model router

A lightweight classifier routes each request to GPT-4o-mini, GPT-4o, or Claude 3.5 Sonnet based on complexity and latency budget.
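
The routing decision can be illustrated with a toy version. Everything here is an assumption made for the sketch: the keyword heuristic stands in for the trained intent classifier, the thresholds are arbitrary, and only two tiers are shown because the text does not say when the third model is chosen.

```python
from dataclasses import dataclass

SMALL_MODEL = "gpt-4o-mini"
LARGE_MODEL = "gpt-4o"

# Hypothetical cues that a request needs multi-step reasoning.
REASONING_HINTS = ("why", "compare", "explain", "step by step", "trade-off")


@dataclass(frozen=True)
class Request:
    text: str
    latency_budget_ms: int


def complexity_score(text: str) -> float:
    """Toy heuristic in [0, 1]: reasoning cues and longer inputs score higher."""
    lowered = text.lower()
    score = 0.0
    # Plain substring match; a real classifier would be a trained model.
    if any(hint in lowered for hint in REASONING_HINTS):
        score += 0.6
    score += 0.4 * min(len(lowered.split()) / 200.0, 1.0)
    return score


def route(
    req: Request,
    complexity_threshold: float = 0.5,
    min_large_latency_ms: int = 1500,
) -> str:
    """Pick a model name from complexity and the caller's latency budget."""
    # Assumption: a tight latency budget always forces the small model.
    if req.latency_budget_ms < min_large_latency_ms:
        return SMALL_MODEL
    if complexity_score(req.text) >= complexity_threshold:
        return LARGE_MODEL
    return SMALL_MODEL
```

Keeping `route` a pure function of the request makes it straightforward to extend with new models and to replay logged traffic against a candidate policy before switching it on.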

Cost observability

A Looker dashboard broken down by endpoint, customer tier, and model, with alerts on unexpected spikes.
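
The aggregation behind such a dashboard can be sketched as below. The model names and per-million-token prices are hypothetical placeholders, not real rates, and the spike rule (spend above a multiple of the trailing average) is one simple assumption about what "unexpected" means.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Hypothetical USD prices per 1M tokens as (input, output); replace with real rates.
PRICE_PER_MILLION = {
    "small-model": (0.5, 1.5),
    "large-model": (5.0, 15.0),
}


@dataclass(frozen=True)
class LLMCall:
    endpoint: str
    tier: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(call: LLMCall) -> float:
    """Cost of one call in USD under the placeholder price table."""
    price_in, price_out = PRICE_PER_MILLION[call.model]
    return (call.input_tokens * price_in + call.output_tokens * price_out) / 1_000_000


def spend_by_dimension(calls: Iterable[LLMCall]) -> Dict[Tuple[str, str, str], float]:
    """Total spend keyed by (endpoint, customer tier, model)."""
    totals: Dict[Tuple[str, str, str], float] = defaultdict(float)
    for call in calls:
        totals[(call.endpoint, call.tier, call.model)] += call_cost(call)
    return dict(totals)


def is_spike(today: float, trailing: List[float], factor: float = 2.0) -> bool:
    """Flag a day whose spend exceeds `factor` times the trailing average."""
    if not trailing:
        return False  # no history yet, so nothing to compare against
    return today > factor * (sum(trailing) / len(trailing))
```

Logging the three dimensions on every call is what lets the same records answer both "which endpoint is expensive" and "which tier is unprofitable".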

Technology stack

Picked for latency, cost, and long-term maintainability, not for novelty.

AI / Models

OpenAI GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, text-embedding-3-small

Caching

pgvector, Redis

Routing

Custom intent classifier, LangChain

Observability

Looker, Datadog, LangSmith

Results

65% reduction in monthly LLM spend

31% semantic cache hit rate

+0.2 CSAT delta (within noise)

220ms improvement in P50 response time

Business impact

The savings repaid the engagement in 5 weeks and freed budget to roll out LLM features to the free tier. Cost observability is now a standing part of the platform team's weekly review.

Key takeaways

Most LLM cost problems are prompt-length and model-selection problems in disguise
Semantic caching only pays off with good invalidation, so pair it with source-of-truth change events
Never ship cost cuts without an eval harness that catches quality regressions in production
