NewThe Context Audit — first cohort now open.Register free →

Reliability failureToken Cost

Your AI bill is higher
than it needs to be.

40–60% of LLM spend is recoverable waste. Most of it goes to context the model never uses — and most teams discover this only after they've already scaled the inefficiency.

Failure type

Token waste / cost efficiency

Where it shows up

Context design, model routing

User impact

Costs that scale faster than value

[01]what it looks like in production

Four sources of waste.
Most stacks have all four.

Full documents in context

RAG retrieves whole documents when a single paragraph is relevant. The other 95% occupies tokens, adds to latency, and costs as much as the signal it buries.

Uncompressed conversation history

Every turn passed verbatim until the context limit cuts it off. A 20-turn session carries 12,000 tokens of history — most of it redundant to the current query.

No model routing

Structured tasks — extraction, classification, yes/no reasoning — run through frontier models when a smaller model produces identical output at 10% of the cost.

No caching layer

Repeated queries, stable reference data, common lookups — all re-run the full pipeline on every call. Prompt prefix caching and semantic similarity caching are rarely configured.

[02]root causes

The waste is structural.
Cutting budgets doesn't fix it.

No token attribution

One consolidated bill. No breakdown by feature, team, or call type. You can't hold anyone accountable for a cost they can't see — and you can't optimise what you can't measure.

Architecture set too early

The prompt structure is decided in the first sprint and never revisited. A workflow burning 90K tokens per call today costs 10× as much when traffic grows 10×. Inefficiencies compound directly.

Missing query classification

Without an orchestration layer that classifies intent before routing, every query hits the frontier model. A regex or lightweight classifier handles 30–50% of queries at near-zero cost.

Context with no budget

Token estimates are logged but not enforced. When session history grows or a document gets uploaded, context quietly doubles. No pruning, no fallback, no ceiling.

[03]what the fix looks like in production

Three teams that fixed
the architecture, not the budget.

LMSYS — RouteLLM2024

The LMSYS team at UC Berkeley trained a matrix factorization router on Chatbot Arena human preference data. Core premise: most queries don't require a frontier model. Structured tasks, factual lookups, simple transformations — all handled equally well by a smaller, cheaper model.

Result

On MT Bench: 85% cost reduction at 95% GPT-4 quality. The router sends only 14% of calls to GPT-4. 86% of queries run on a faster, cheaper model with no meaningful quality loss. Generalises to new model pairs without retraining.

LMSYS Blog / ICLR 2025, July 2024 ↗

Uber — GenAI Gateway2024

Uber had 60+ LLM use cases across 30 engineering teams. Each team had its own integration — different providers, different models, no shared routing. Teams defaulted to frontier models for everything because there was no infrastructure to route simpler calls to cheaper ones.

Result

Built a unified gateway serving 16 million queries per month with per-team cost attribution and usage alerts. Teams could see their spend for the first time. Switching model or provider became a config change. Cost guardrails prevented individual teams from silently driving up spend.

Uber Engineering Blog, July 2024 ↗

ByteDance — content moderation2024

ByteDance needed to run multimodal LLMs at scale for content moderation and video understanding — processing billions of videos daily. The cost and latency profile of standard inference was untenable. They needed to optimise without sacrificing accuracy.

Result

50% cost reduction deploying multimodal LLMs on AWS Inferentia2 using tensor parallelism and model quantisation. Inference cost per video dropped by half while maintaining accuracy.

AWS / ByteDance case study, 2024 ↗

[04]case study · saarthi

Every token earns its place.
The rest get cut.

Saarthi runs in clinical settings where both cost and reliability matter. The token budget is enforced structurally — not as a guideline, but as a hard constraint every layer is designed around.

Query planner (minimal fetch)

Before any data is fetched, the planner maps intent to the minimal D1 fetch plan. A "latest labs?" query fetches observations only. A "change the medication?" query fetches conditions, history, and guidelines. No over-retrieval.

<6,000 token context budget

A hard ceiling — not a guideline. When context would exceed the budget, the lowest-priority layers are dropped first. The model never gets an unconstrained context.

Zero-cost history compression

Conversation history older than 6 turns is compressed to the first sentence of each assistant turn — no LLM call required. Works because the system prompt requires headline-first responses.

Regex orchestrator (50% of queries)

Deterministic queries — specific medication lookups, appointment data, known-format requests — are handled by direct database calls. Zero LLM cost, zero LLM latency.

If your AI spend is growing faster
than your value, we'll find why.

The Cost Diagnostic Sprint audits every layer: context design, model routing, caching configuration, and token attribution. One week. Specific findings with dollar figures attached.

Start with a diagnostic

Your AI bill is higherthan it needs to be.

Four sources of waste.Most stacks have all four.

The waste is structural.Cutting budgets doesn't fix it.

Three teams that fixedthe architecture, not the budget.

Every token earns its place.The rest get cut.

If your AI spend is growing fasterthan your value, we'll find why.

Your AI bill is higher
than it needs to be.

Four sources of waste.
Most stacks have all four.

The waste is structural.
Cutting budgets doesn't fix it.

Three teams that fixed
the architecture, not the budget.

Every token earns its place.
The rest get cut.

If your AI spend is growing faster
than your value, we'll find why.