SaaS, 8 weeks, Sep 2024 - Nov 2024

LLM Cost Optimization for SaaS Platform

Cut OpenAI API spend by 65% on a high-volume SaaS without a single-point drop in CSAT - through semantic caching, prompt compression, and intent-aware model routing.

Cost Optimization, Semantic Caching, Prompt Compression, Model Routing
Client: High-traffic B2B SaaS
Industry: SaaS
Engagement: 8 weeks
Timeline: Sep 2024 - Nov 2024

A production LLM cost program that combined semantic caching, prompt compression, and cross-model routing to drop monthly API spend from six figures to mid-five figures while holding quality flat.

01

The Challenge

This high-traffic B2B SaaS platform was losing margin because its LLM API costs scaled linearly with active-user growth. API spend had reached roughly 18% of total platform revenue and was still climbing, which put real pressure on the team to cut it. The constraint was that any savings could not cause a measurable regression in output quality or customer satisfaction (CSAT).

The engineering team had no observability into which API endpoints or user cohorts drove cost. Every chain defaulted to GPT-4, including simple, low-complexity classification tasks. Over the preceding twelve months, prompt templates had also doubled in token length as instructions accumulated without pruning, inflating token usage with no corresponding business value.

Pain points we set out to solve

  • LLM spend was around 18% of revenue and climbing
  • No observability into which endpoints drove cost
  • Every chain defaulted to GPT-4, including tasks a smaller model could handle
  • Prompts had grown 2x longer over 12 months with no pruning

02

Objectives

  1. Cut monthly LLM spend by at least 50%
  2. Hold CSAT within plus or minus 1 point of baseline
  3. Ship cost dashboards so the team can self-serve from here on
  4. Leave a routing layer the team can extend to new models

03

Approach

How we delivered — phased, with clear checkpoints and evidence at each step.

  1. Week 1-2

    Cost forensics

    Instrumented every LLM call with endpoint, user tier, and token counts. Loaded 30 days of logs into a warehouse and built a Looker dashboard that exposed where the spend actually went.

  2. Week 3-4

    Semantic cache layer

    Built a pgvector-backed semantic cache in front of the three highest-volume endpoints. Added embedding-similarity thresholds and cache invalidation triggered by source-doc change events.

  3. Week 5-6

    Prompt compression and routing

    Audited and rewrote the 12 longest prompts, cutting average length by 38%. Added an intent-classifier router that sends simple tasks to GPT-4o-mini and reserves GPT-4o for complex reasoning.

  4. Week 7-8

    Quality guardrails and rollout

    Ran a CSAT-safe eval harness (LLM-as-judge plus human spot-check) on every cache hit and routed response. Rolled to 100% behind kill switches for every optimization.
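
The kill switches in the rollout step can be pictured with a minimal sketch like the one below. It assumes an in-process flag store. `KillSwitches`, `answer`, and the flag names are hypothetical names chosen for illustration, not the engagement's actual code; a production setup would more likely read flags from a feature-flag service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class KillSwitches:
    """In-memory flag store (assumption: stands in for a real flag service)."""
    flags: Dict[str, bool] = field(
        default_factory=lambda: {"semantic_cache": True, "model_router": True}
    )

    def enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a typo fails safe.
        return self.flags.get(name, False)

    def kill(self, name: str) -> None:
        self.flags[name] = False


def answer(
    query: str,
    switches: KillSwitches,
    cache_lookup: Callable[[str], Optional[str]],
    pick_model: Callable[[str], str],
    call_model: Callable[[str, str], str],
    baseline_model: str = "gpt-4o",
) -> str:
    """Apply each optimization only while its switch is on."""
    if switches.enabled("semantic_cache"):
        cached = cache_lookup(query)
        if cached is not None:
            return cached
    # With the router killed, every request falls back to the baseline model.
    model = pick_model(query) if switches.enabled("model_router") else baseline_model
    return call_model(model, query)
```

Because each optimization is checked independently, turning one off restores the pre-optimization behaviour for that layer without a redeploy and without touching the others.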

04

The Solution

The solution is a three-layer cost optimization stack that sits in front of the external LLM APIs and reduces token consumption and model cost before a request is sent. First, a semantic cache backed by pgvector on Postgres matches incoming queries by embedding similarity and serves about 31% of total query volume with sub-50ms hit latency. Second, a prompt audit cut average prompt length by 38% across the top 12 chains through deduplication, lazy-loaded context, and few-shot pruning. Third, a lightweight classifier router scores request complexity, sending 62% of simple tasks to cheaper models like GPT-4o-mini and Claude Haiku and reserving frontier models for multi-step reasoning. Each optimization sits behind a feature-flag kill switch, spend is tracked in Looker dashboards, and an automated evaluation harness checks that output quality holds at the baseline.

pgvector semantic cache

Caches responses keyed by embedding similarity, not exact string match. Covers about 31% of total query volume with sub-50ms hit latency.
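
As a rough illustration of similarity-keyed caching with source-driven invalidation, here is an in-memory sketch. It is a stand-in for the pgvector table, not the production code: `SemanticCache`, `Entry`, and the 0.92 threshold are assumptions, and embeddings are passed in as plain lists rather than fetched from an embedding model.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    embedding: List[float]
    response: str
    source_doc_id: str  # the document the cached answer was derived from


class SemanticCache:
    def __init__(self, threshold: float = 0.92) -> None:
        # Hypothetical default; the threshold would need tuning per endpoint.
        self.threshold = threshold
        self.entries: List[Entry] = []

    def put(self, embedding: Sequence[float], response: str, source_doc_id: str) -> None:
        self.entries.append(Entry(list(embedding), response, source_doc_id))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the nearest cached response if it clears the threshold."""
        best = max(self.entries, key=lambda e: cosine(embedding, e.embedding), default=None)
        if best is not None and cosine(embedding, best.embedding) >= self.threshold:
            return best.response
        return None

    def invalidate(self, source_doc_id: str) -> int:
        """Drop every entry derived from a changed document; return the count."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if e.source_doc_id != source_doc_id]
        return before - len(self.entries)
```

In Postgres the linear scan would become a nearest-neighbour query. pgvector's `<=>` operator returns cosine distance, so a similarity threshold of t corresponds to a distance of at most 1 - t.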

Prompt audit and compression

Cut average prompt length by 38% on the top 12 chains through deduplication, lazy-loaded context, and few-shot pruning.
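
Two of the three techniques named above, deduplication and few-shot pruning, can be sketched in a few lines. The functions below are hypothetical illustrations. Word overlap is a deliberately crude relevance score used here as an assumption; an embedding-based score would be a natural substitute. Lazy-loaded context is omitted because the source does not describe how it was done.

```python
from typing import List, Tuple


def dedupe_lines(prompt: str) -> str:
    """Drop repeated instruction lines, keeping the first occurrence of each.

    Lines are compared case-insensitively with whitespace collapsed.
    Blank lines are always kept so paragraph breaks survive.
    """
    seen = set()
    kept = []
    for line in prompt.splitlines():
        key = " ".join(line.lower().split())
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def prune_few_shots(
    query: str, examples: List[Tuple[str, str]], keep: int = 2
) -> List[Tuple[str, str]]:
    """Keep the `keep` examples whose inputs share the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(example: Tuple[str, str]) -> int:
        return len(query_words & set(example[0].lower().split()))

    # sorted() is stable, so tied examples keep their original order.
    return sorted(examples, key=overlap, reverse=True)[:keep]
```

Sending only the most relevant examples per request, rather than the full example bank every time, is where most of the token savings in this sketch would come from.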

Intent-aware model router

Lightweight classifier routes requests to GPT-4o-mini, GPT-4o, or Claude 3.5 based on complexity and latency budget.
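
The routing idea can be sketched as a classifier plus a route table. Everything below is an assumption for illustration: the keyword heuristic stands in for the trained intent classifier, and `REASONING_HINTS`, `ROUTES`, and the 60-word cutoff are invented. The latency budget and the Claude 3.5 route are left out because the source does not specify the rules for them.

```python
from typing import Dict, Tuple

# Hypothetical cues that a request needs multi-step reasoning.
REASONING_HINTS: Tuple[str, ...] = (
    "why", "compare", "explain", "step by step", "trade-off",
)

# Hypothetical route table; extending to a new model means adding a row.
ROUTES: Dict[str, str] = {
    "simple": "gpt-4o-mini",
    "complex": "gpt-4o",
}


def complexity(text: str, long_input_words: int = 60) -> str:
    """Toy stand-in for the intent classifier: returns 'simple' or 'complex'."""
    lowered = text.lower()
    long_input = len(lowered.split()) > long_input_words
    needs_reasoning = any(hint in lowered for hint in REASONING_HINTS)
    return "complex" if long_input or needs_reasoning else "simple"


def route(text: str) -> str:
    """Map a request to a model name via the route table."""
    return ROUTES[complexity(text)]
```

Keeping the classifier and the route table separate is one way to meet the objective of a routing layer the team can extend: a new model changes the table, not the classifier.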

Cost observability

A Looker dashboard broken down by endpoint, customer tier, and model - with alerts on unexpected spikes.
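
The data behind such a dashboard can be sketched as one record per LLM call plus a group-by. The record fields follow the instrumentation described in the approach (endpoint, user tier, token counts). The prices, model names, and the spike rule are placeholders and assumptions, not real list prices or the alerting logic actually used.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Placeholder USD prices per 1M tokens as (input, output). NOT real list prices.
PRICE_PER_MTOK: Dict[str, Tuple[float, float]] = {
    "small-model": (1.0, 2.0),
    "large-model": (10.0, 20.0),
}


@dataclass(frozen=True)
class LLMCall:
    endpoint: str
    user_tier: str
    model: str
    prompt_tokens: int
    completion_tokens: int


def call_cost(call: LLMCall) -> float:
    """Cost of one call in USD under the placeholder price table."""
    price_in, price_out = PRICE_PER_MTOK[call.model]
    return (call.prompt_tokens * price_in + call.completion_tokens * price_out) / 1_000_000


def spend_by(calls: Iterable[LLMCall], dimension: str) -> Dict[str, float]:
    """Total spend grouped by 'endpoint', 'user_tier', or 'model'."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[getattr(call, dimension)] += call_cost(call)
    return dict(totals)


def is_spike(daily_spend: List[float], factor: float = 2.0) -> bool:
    """Flag the latest day if it exceeds `factor` times the mean of earlier days.

    Expects at least one value; with no history there is nothing to compare.
    """
    *history, today = daily_spend
    if not history:
        return False
    return today > factor * (sum(history) / len(history))
```

Logging the dimensions at call time is what makes the later breakdowns cheap: the same records answer "which endpoint", "which tier", and "which model" without re-instrumenting.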

05

Technology stack

Picked for latency, cost, and long-term maintainability — not for novelty.

AI / Models

OpenAI GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, text-embedding-3-small

Caching

pgvector, Redis

Routing

Custom intent classifier, LangChain

Observability

Looker, Datadog, LangSmith

06

Results

65% reduction in monthly LLM spend
31% semantic cache hit rate
+0.2 CSAT delta (within noise)
220ms improvement in P50 response time

Business impact

The savings repaid the cost of the engagement in 5 weeks and freed budget to roll out LLM features to the free tier. Cost observability is now a standing part of the platform team's weekly review.

07

Key takeaways

  • Most LLM cost problems are prompt-length and model-selection problems in disguise
  • Semantic caching only pays off with good invalidation - pair it with source-of-truth change events
  • Never ship cost cuts without an eval harness that catches quality regressions in production
