Your AI bill is higher
than it needs to be.
40–60% of LLM spend is recoverable waste. Most of it goes to context the model never uses — and most teams discover this only after they've already scaled the inefficiency.
Failure type
Token waste / cost efficiency
Where it shows up
Context design, model routing
User impact
Costs that scale faster than value
Four sources of waste.
Most stacks have all four.
01
Full documents in context
RAG retrieves whole documents when a single paragraph is relevant. The other 95% occupies tokens, adds to latency, and costs as much as the signal it buries.
02
Uncompressed conversation history
Every turn passed verbatim until the context limit cuts it off. A 20-turn session carries 12,000 tokens of history — most of it redundant to the current query.
03
No model routing
Structured tasks — extraction, classification, yes/no reasoning — run through frontier models when a smaller model produces identical output at 10% of the cost.
04
No caching layer
Repeated queries, stable reference data, common lookups — all re-run the full pipeline on every call. Prompt prefix caching and semantic similarity caching are rarely configured.
The waste is structural.
Cutting budgets doesn't fix it.
No token attribution
One consolidated bill. No breakdown by feature, team, or call type. You can't hold anyone accountable for a cost they can't see — and you can't optimise what you can't measure.
Architecture set too early
The prompt structure is decided in the first sprint and never revisited. A workflow burning 90K tokens per call today costs 10× as much when traffic grows 10×. Inefficiencies compound directly.
Missing query classification
Without an orchestration layer that classifies intent before routing, every query hits the frontier model. A regex or lightweight classifier handles 30–50% of queries at near-zero cost.
Context with no budget
Token estimates are logged but not enforced. When session history grows or a document gets uploaded, context quietly doubles. No pruning, no fallback, no ceiling.
Three teams that fixed
the architecture, not the budget.
The LMSYS team at UC Berkeley trained a matrix factorization router on Chatbot Arena human preference data. Core premise: most queries don't require a frontier model. Structured tasks, factual lookups, simple transformations — all handled equally well by a smaller, cheaper model.
Result
On MT Bench: 85% cost reduction at 95% GPT-4 quality. The router sends only 14% of calls to GPT-4. 86% of queries run on a faster, cheaper model with no meaningful quality loss. Generalises to new model pairs without retraining.
Uber had 60+ LLM use cases across 30 engineering teams. Each team had its own integration — different providers, different models, no shared routing. Teams defaulted to frontier models for everything because there was no infrastructure to route simpler calls to cheaper ones.
Result
Built a unified gateway serving 16 million queries per month with per-team cost attribution and usage alerts. Teams could see their spend for the first time. Switching model or provider became a config change. Cost guardrails prevented individual teams from silently driving up spend.
ByteDance needed to run multimodal LLMs at scale for content moderation and video understanding — processing billions of videos daily. The cost and latency profile of standard inference was untenable. They needed to optimise without sacrificing accuracy.
Result
50% cost reduction deploying multimodal LLMs on AWS Inferentia2 using tensor parallelism and model quantisation. Inference cost per video dropped by half while maintaining accuracy.
Every token earns its place.
The rest get cut.
Saarthi runs in clinical settings where both cost and reliability matter. The token budget is enforced structurally — not as a guideline, but as a hard constraint every layer is designed around.
Query planner (minimal fetch)
Before any data is fetched, the planner maps intent to the minimal D1 fetch plan. A "latest labs?" query fetches observations only. A "change the medication?" query fetches conditions, history, and guidelines. No over-retrieval.
<6,000 token context budget
A hard ceiling — not a guideline. When context would exceed the budget, the lowest-priority layers are dropped first. The model never gets an unconstrained context.
Zero-cost history compression
Conversation history older than 6 turns is compressed to the first sentence of each assistant turn — no LLM call required. Works because the system prompt requires headline-first responses.
Regex orchestrator (50% of queries)
Deterministic queries — specific medication lookups, appointment data, known-format requests — are handled by direct database calls. Zero LLM cost, zero LLM latency.
If your AI spend is growing faster
than your value, we'll find why.
The Cost Diagnostic Sprint audits every layer: context design, model routing, caching configuration, and token attribution. One week. Specific findings with dollar figures attached.