Every added second costs users.
Slow AI responses aren't an infrastructure problem to throw compute at. They're an architecture problem. The fix is almost always in the query path — not the model.
Failure type: Latency / throughput
Where it shows up: Query pipeline, context assembly
User impact: Slow responses, abandoned sessions
Four patterns that add up to seconds, not milliseconds.
01. LLM calls on every request
Simple, deterministic queries — status checks, lookups, confirmations — routed through the full LLM pipeline. Each one adds 800–2000ms and costs real money. No orchestration layer screening them out.
02. Uncapped context windows
Prompts grow with session history, retrieved documents, and injected context until they hit model limits or blow through latency targets. No pruning. No budget enforcement. Just longer and slower.
03. No caching layer
Identical or near-identical queries re-run the full retrieval and generation pipeline. Frequently asked questions, repeated lookups, stable reference data — all regenerated from scratch on every call.
04. Serial dependencies
Retrieval, enrichment, and generation run sequentially even when some of the steps do not depend on each other. Two independent steps of 800ms and 1200ms take 2000ms back to back, but only 1200ms when run in parallel.
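A minimal sketch of the idea in Python, assuming retrieval and enrichment are independent of each other while generation needs both. The function names (`retrieve`, `enrich`, `generate`) and the scaled-down sleep times are hypothetical placeholders, not measurements from a real pipeline.

```python
import asyncio
import time

# Hypothetical stand-ins for real pipeline steps. The sleeps simulate
# network latency and are illustrative only.
async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.2)
    return [f"doc for {query}"]

async def enrich(query: str) -> dict[str, str]:
    await asyncio.sleep(0.3)
    return {"user_tier": "pro"}

async def generate(query: str, docs: list[str], profile: dict[str, str]) -> str:
    await asyncio.sleep(0.1)  # stand-in for the LLM call
    return f"answer({query}, {len(docs)} docs, {profile['user_tier']})"

async def answer_serial(query: str) -> str:
    # Each step waits for the previous one, even though the first two
    # do not depend on each other.
    docs = await retrieve(query)
    profile = await enrich(query)
    return await generate(query, docs, profile)

async def answer_parallel(query: str) -> str:
    # Independent steps run concurrently; generation still waits for both.
    docs, profile = await asyncio.gather(retrieve(query), enrich(query))
    return await generate(query, docs, profile)

def timed(coro_fn, query: str):
    # Run one variant to completion and return (result, elapsed seconds).
    start = time.perf_counter()
    result = asyncio.run(coro_fn(query))
    return result, time.perf_counter() - start
```

Calling `timed(answer_serial, "q")` and `timed(answer_parallel, "q")` returns the same answer, with the parallel variant paying only for the slower of the two independent steps. Steps with a true data dependency, such as generation over retrieved documents, still have to wait.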
The slowdown is architectural, not incidental.
No query classification
Without an orchestration layer that classifies intent before routing, every query hits the LLM regardless of complexity. A regex or lightweight classifier can handle 30–50% of queries at near-zero cost and latency.
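As a rough illustration, a regex pre-router can be sketched in a few lines of Python. The intents, patterns, and the `tool:` / `llm` route labels here are hypothetical; a real deployment would derive its patterns from logged queries and its own tool names.

```python
import re
from typing import Optional

# Hypothetical intents and patterns, for illustration only.
ROUTES = [
    ("order_status", re.compile(r"\b(status|where is)\b.*\border\b", re.I)),
    ("confirm", re.compile(r"^\s*(yes|no|ok|confirm|cancel)\s*[.!]?\s*$", re.I)),
]

def classify(query: str) -> Optional[str]:
    """Return the first matching deterministic intent, or None."""
    for intent, pattern in ROUTES:
        if pattern.search(query):
            return intent
    return None

def route(query: str) -> str:
    """Send matched queries to a direct tool call, everything else to the LLM."""
    intent = classify(query)
    return f"tool:{intent}" if intent else "llm"
```

Anything the patterns do not recognise falls through to the LLM path, so a missed pattern costs latency rather than correctness.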
Prompt bloat
Context accumulates (conversation history, retrieved chunks, guardrails, static reference data) with no layer that prunes by priority when the budget is exceeded. A 10,000-token prompt can take 2–3× longer to process than a 3,000-token one.
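One way to picture priority pruning is the Python sketch below. The `Layer` type, the priority numbers, and the whitespace-based `count_tokens` are assumptions made for illustration; a real system would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    text: str
    priority: int  # lower number = more important

def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def fit_to_budget(layers: list[Layer], budget: int) -> list[Layer]:
    """Drop the least important layers until the total fits the budget.

    Surviving layers keep their original order in the prompt.
    """
    kept = list(layers)
    total = sum(count_tokens(layer.text) for layer in kept)
    while kept and total > budget:
        victim = max(kept, key=lambda layer: layer.priority)  # least important
        kept.remove(victim)
        total -= count_tokens(victim.text)
    return kept
```

Dropping whole layers keeps the sketch simple. Truncating within a layer (for example, keeping only the most recent turns of history) is a finer-grained variant of the same idea.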
Missing semantic cache
Prompt caching (Anthropic's prefix caching, OpenAI's automatic caching) is often misconfigured or unused. Semantic similarity caching for near-duplicate queries is almost never implemented.
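The semantic-cache half of this can be sketched as follows. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the 0.8 threshold is arbitrary, and the linear scan would be replaced by a vector index at any real scale.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # arbitrary value chosen for this sketch
        self.entries = []  # list of (vector, answer) pairs

    def get(self, query: str):
        """Return the cached answer of the most similar query, if close enough."""
        vec = embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

def answer(query: str, cache: SemanticCache, generate) -> str:
    """Serve from the cache when possible; otherwise generate and store."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    result = generate(query)
    cache.put(query, result)
    return result
```

The threshold is the main design choice: set it too low and users receive answers to questions they did not ask, set it too high and the cache rarely hits. Cached answers over data that changes also need an expiry policy, which this sketch omits.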
Single-provider dependency
No fallback routing means latency spikes from provider incidents are absorbed directly by users. A 99.9% uptime provider still has ~9 hours of downtime per year — concentrated in the worst moments.
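A minimal fallback router might look like the Python sketch below. The provider callables (`slow_primary`, `healthy_secondary`) and the timeout value are hypothetical; real provider clients would replace them.

```python
import asyncio

async def call_with_fallback(prompt: str, providers, timeout_s: float):
    """Try each (name, call) pair in order.

    Moves on to the next provider when one raises ConnectionError or
    takes longer than timeout_s. Returns (provider_name, text).
    """
    last_error = None
    for name, call in providers:
        try:
            text = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, text
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error

# Hypothetical providers used to exercise the router.
async def slow_primary(prompt: str) -> str:
    await asyncio.sleep(1.0)  # simulates a provider latency spike
    return "primary: " + prompt

async def healthy_secondary(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return "secondary: " + prompt
```

With `providers=[("primary", slow_primary), ("secondary", healthy_secondary)]` and `timeout_s=0.1`, the call returns the secondary's response after roughly the timeout instead of waiting out the spike. The timeout itself is added latency on every failover, so it should sit just above the primary's normal tail.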
Three teams that found the real bottleneck.
Fin is Intercom's AI support agent, handling 40M+ resolved conversations. The team discovered their agentic pipeline had "eager requests" — preparatory LLM calls and context assembly that fired before it was confirmed they were needed. These executed on every interaction regardless of whether Fin was actually engaged.
Result
2 seconds shaved off median time to first token (TTFT) from the eager requests fix alone. Overall, a 60% reduction in median TTFT, bringing it below 8 seconds. Cost per interaction was embedded directly into production traces, so future regressions surface without waiting for the daily warehouse refresh.
Uber had 60+ LLM use cases across 30 teams — from Eats recommendations to customer support to code review. Each team had built its own integration: different providers, different models, no shared routing, no cost attribution. Teams defaulted to frontier models because there was no infrastructure to route simpler tasks to cheaper, faster ones.
Result
Built a unified gateway serving 16 million queries per month with cost guardrails and per-team attribution. Routing across OpenAI, Vertex AI, and Uber-hosted models became a config change, not an engineering project.
The LMSYS team at UC Berkeley trained a matrix factorization router on Chatbot Arena preference data. Core hypothesis: most queries don't need a frontier model. A classifier trained on human preference signals can identify which queries require GPT-4 and route the rest to smaller, faster models.
Result
On MT Bench: 85% cost reduction at 95% of GPT-4 quality. The matrix factorization router sends only 14% of calls to GPT-4 when trained on augmented data, so 86% of queries run on a faster, cheaper model while overall quality stays at 95% of GPT-4's.
Clinical AI that doctors can't afford to wait on.
Saarthi is used mid-consultation. A doctor asking about a patient's labs can't wait 4 seconds. The latency budget is tight, and the query path is designed around it.
Regex orchestrator (~50% of queries)
Intent classification runs before the LLM pipeline. Deterministic queries — specific medication lookups, appointment data, known-format requests — are handled by direct D1 tool calls. Zero LLM latency. Zero LLM cost.
Query planner (selective fetch)
Instead of fetching all patient data for every query, the planner identifies the minimal data set needed. A lab review query fetches observations only. A diagnosis query fetches conditions and history. Retrieval time drops proportionally.
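The planner's core can be pictured as a lookup from intent to the record types it needs. The Python sketch below is illustrative only: the intent names, the `ALL_TYPES` list, and the fetch-everything fallback for unknown intents are assumptions, not Saarthi's actual implementation.

```python
# Hypothetical mapping from query intent to the record types it needs.
PLAN = {
    "lab_review": ["observations"],
    "diagnosis": ["conditions", "history"],
}

# Assumed full set of record types, used only as a fallback.
ALL_TYPES = ["observations", "conditions", "history", "medications", "appointments"]

def plan_fetch(intent: str) -> list[str]:
    """Return the minimal record types for an intent.

    Unknown intents fall back to fetching everything (an assumption of
    this sketch), trading speed for safety.
    """
    return PLAN.get(intent, ALL_TYPES)

def fetch(patient_store: dict, intent: str) -> dict:
    """Read only the planned record types from a patient's data."""
    return {record_type: patient_store[record_type] for record_type in plan_fetch(intent)}
```

Fewer record types fetched also means fewer tokens entering the prompt, so the planner reduces generation time as well as retrieval time.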
Context budget (<6,000 tokens)
A hard token target — not a soft guideline — keeps generation time within bounds. Layer pruning drops lowest-priority context first when the budget is tight.
Provider fallback (OpenRouter + Anthropic direct)
Primary routing goes through OpenRouter, with Anthropic direct as the fallback. Provider latency spikes don't hit users directly. The model can be swapped in the routing config without code changes.
If your AI is slow, we'll find the bottleneck.
The Cost Diagnostic Sprint includes a full latency audit: query classification gaps, context bloat, caching opportunities, and provider routing. One week. Specific findings.