
Reliability failure · Latency

Every added second
costs users.

Slow AI responses aren't an infrastructure problem to throw compute at. They're an architecture problem. The fix is almost always in the query path — not the model.

Failure type: Latency / throughput

Where it shows up: Query pipeline, context assembly

User impact: Slow responses, abandoned sessions

[01] what it looks like in production

Four patterns that add up
to seconds, not milliseconds.

LLM calls on every request

Simple, deterministic queries — status checks, lookups, confirmations — routed through the full LLM pipeline. Each one adds 800–2000ms and costs real money. No orchestration layer screening them out.

Uncapped context windows

Prompts grow with session history, retrieved documents, and injected context until they hit model limits or blow through latency targets. No pruning. No budget enforcement. Just longer and slower.

No caching layer

Identical or near-identical queries re-run the full retrieval and generation pipeline. Frequently asked questions, repeated lookups, stable reference data — all regenerated from scratch on every call.

Serial dependencies

Retrieval, enrichment, and generation run strictly in sequence, even where steps are independent and could overlap. An 800ms retrieval step and a 1200ms step that does not depend on it take 2000ms back to back, but only 1200ms when run concurrently.
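A minimal sketch of that overlap, using Python's `asyncio.gather`. The two functions `fetch_documents` and `fetch_profile` are hypothetical stand-ins for steps that do not depend on each other, and the sleeps are scaled down 10x from the 800ms and 1200ms figures above. Steps that consume another step's output (generation usually needs retrieved context) still have to wait for it.

```python
import asyncio
import time

# Hypothetical stand-ins for two independent pipeline steps.
# Durations are scaled down 10x from the 800ms / 1200ms figures in the text.
async def fetch_documents() -> str:
    await asyncio.sleep(0.08)
    return "documents"

async def fetch_profile() -> str:
    await asyncio.sleep(0.12)
    return "profile"

async def run_serial() -> tuple[list[str], float]:
    start = time.perf_counter()
    docs = await fetch_documents()      # second step waits for the first
    profile = await fetch_profile()
    return [docs, profile], time.perf_counter() - start

async def run_parallel() -> tuple[list[str], float]:
    start = time.perf_counter()
    # gather() starts both coroutines together; total time tracks the slower one.
    results = await asyncio.gather(fetch_documents(), fetch_profile())
    return list(results), time.perf_counter() - start

serial_results, serial_s = asyncio.run(run_serial())
parallel_results, parallel_s = asyncio.run(run_parallel())
print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Both paths return the same data; only the wall-clock time differs.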

[02] root causes

The slowdown is architectural,
not incidental.

No query classification

Without an orchestration layer that classifies intent before routing, every query hits the LLM regardless of complexity. A regex or lightweight classifier can handle 30–50% of queries at near-zero cost and latency.
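One way such a pre-LLM classifier could look, as a rough sketch. The patterns, the handlers (`order_status`, `opening_hours`), and `fake_llm` are all illustrative assumptions, not taken from any production system; a real deployment would tune the patterns against actual query logs.

```python
import re
from typing import Callable

# Hypothetical deterministic handlers; in a real system these would query a database or API.
def order_status(match: re.Match) -> str:
    return f"Looking up status for order {match.group('order_id')}"

def opening_hours(match: re.Match) -> str:
    return "Open 09:00-17:00, Monday to Friday"

# Illustrative patterns only. Each pairs a compiled regex with its handler.
ROUTES: list[tuple[re.Pattern, Callable[[re.Match], str]]] = [
    (re.compile(r"\b(?:status|where is)\b.*\border\s*#?(?P<order_id>\d+)", re.I), order_status),
    (re.compile(r"\b(?:opening hours|when are you open)\b", re.I), opening_hours),
]

def route(query: str, llm: Callable[[str], str]) -> tuple[str, str]:
    """Return (path, answer), where path is 'deterministic' or 'llm'."""
    for pattern, handler in ROUTES:
        match = pattern.search(query)
        if match:
            return "deterministic", handler(match)
    # Nothing matched: fall through to the full LLM pipeline.
    return "llm", llm(query)

def fake_llm(query: str) -> str:
    # Placeholder for the real model call.
    return f"[LLM answer to: {query}]"

print(route("What is the status of order #1234?", fake_llm))
print(route("Why was my refund smaller than expected?", fake_llm))
```

The first query never reaches the model; the second does, because no pattern claims it.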

Prompt bloat

Context accumulates — conversation history, retrieved chunks, guardrails, static reference data — with no layer that prunes by priority when budget is exceeded. A 10,000-token prompt takes 2–3× longer to process than a 3,000-token one.
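A minimal sketch of priority-based pruning under a token budget. The `Layer` type, the priority numbers, and the word-count tokenizer are assumptions made for illustration; a real pipeline would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    text: str
    priority: int  # lower number = more important

def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def prune(layers: list[Layer], budget: int) -> list[Layer]:
    """Drop the lowest-priority layers, one at a time, until the total fits the budget."""
    kept = sorted(layers, key=lambda layer: layer.priority)
    while kept and sum(count_tokens(layer.text) for layer in kept) > budget:
        kept.pop()  # the last element has the lowest priority
    return kept

def words(n: int) -> str:
    # Helper that fabricates n placeholder tokens for the demo.
    return " ".join(["tok"] * n)

layers = [
    Layer("history", words(10), priority=3),
    Layer("guardrails", words(5), priority=0),
    Layer("retrieved_chunks", words(6), priority=2),
    Layer("user_query", words(4), priority=1),
]
kept = prune(layers, budget=15)
print([layer.name for layer in kept])
```

With 25 tokens of input and a budget of 15, only the conversation history is dropped.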

Missing semantic cache

Prompt caching (Anthropic's prefix caching, OpenAI's automatic caching) is often misconfigured or unused. Semantic similarity caching for near-duplicate queries is almost never implemented.
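A toy sketch of semantic similarity caching. The `embed` function here is a bag-of-words stand-in so the example runs without dependencies; in practice it would be replaced by a real embedding model, and the 0.8 threshold is an assumed value that would need tuning against false-hit rates.

```python
import math
from collections import Counter
from typing import Optional

def embed(text: str) -> Counter:
    # Stand-in embedding: word counts. A real system would call an embedding model.
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def add(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def lookup(self, query: str) -> Optional[str]:
        """Return the cached answer of the most similar stored query, if similar enough."""
        q = embed(query)
        best = max(self.entries, key=lambda entry: cosine(q, entry[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

cache = SemanticCache()
cache.add("how do I reset my password", "Use the reset link on the login page.")
print(cache.lookup("how do I reset my password please"))
print(cache.lookup("what is your refund policy"))
```

A linear scan is fine for a sketch; at scale the lookup would typically move to a vector index.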

Single-provider dependency

No fallback routing means latency spikes from provider incidents are absorbed directly by users. A provider with 99.9% uptime still has roughly 8.8 hours of downtime per year (0.1% of 8,760 hours), and it tends to land at the worst moments.
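A rough sketch of fallback routing with a per-provider timeout, built on `asyncio.wait_for`. The provider functions, their names, and the 50ms timeout are hypothetical; real calls would go through each provider's client library.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Provider = Callable[[str], Awaitable[str]]

async def call_with_fallback(
    prompt: str,
    providers: list[tuple[str, Provider]],
    timeout_s: float,
) -> tuple[str, str]:
    """Try providers in order; move on when one errors or exceeds timeout_s."""
    last_error: Optional[Exception] = None
    for name, provider in providers:
        try:
            answer = await asyncio.wait_for(provider(prompt), timeout=timeout_s)
            return name, answer
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc  # remember why this provider was skipped
    raise RuntimeError("all providers failed") from last_error

# Hypothetical providers: the primary is having a slow incident.
async def slow_primary(prompt: str) -> str:
    await asyncio.sleep(0.5)
    return "answer from primary"

async def fast_fallback(prompt: str) -> str:
    return "answer from fallback"

result = asyncio.run(
    call_with_fallback(
        "hello",
        [("primary", slow_primary), ("fallback", fast_fallback)],
        timeout_s=0.05,
    )
)
print(result)
```

The timeout is the design lever: set it too high and users still absorb the spike, too low and healthy but slow requests get rerouted unnecessarily.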

[03] what the fix looks like in production

Three teams that found
the real bottleneck.

Intercom: Fin AI (2024–2025)

Fin is Intercom's AI support agent, handling 40M+ resolved conversations. The team discovered their agentic pipeline had "eager requests" — preparatory LLM calls and context assembly that fired before it was confirmed they were needed. These executed on every interaction regardless of whether Fin was actually engaged.

Result

2 seconds shaved off median time to first token from the eager requests fix alone. Overall, a 60% reduction in median TTFT — bringing it below 8 seconds. Cost-per-interaction was embedded directly into production traces to catch future regressions before daily warehouse refreshes.

Honeycomb / Intercom engineering, March 2025 ↗

Uber: GenAI Gateway (2024)

Uber had 60+ LLM use cases across 30 teams — from Eats recommendations to customer support to code review. Each team had built its own integration: different providers, different models, no shared routing, no cost attribution. Teams defaulted to frontier models because there was no infrastructure to route simpler tasks to cheaper, faster ones.

Result

Built a unified gateway serving 16 million queries per month with cost guardrails and per-team attribution. Routing across OpenAI, Vertex AI, and Uber-hosted models became a config change, not an engineering project.

Uber Engineering Blog, July 2024 ↗

LMSYS: RouteLLM (2024)

The LMSYS team at UC Berkeley trained a matrix factorization router on Chatbot Arena preference data. Core hypothesis: most queries don't need a frontier model. A classifier trained on human preference signals can identify which queries require GPT-4 and route the rest to smaller, faster models.

Result

On MT Bench: 85% cost reduction at 95% of GPT-4 quality. When trained on augmented data, the matrix factorization router sends only 14% of calls to GPT-4, so 86% of queries run on a faster, cheaper model while overall quality stays at 95% of GPT-4's.
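To illustrate the routing idea, here is a small sketch of threshold-based routing. It is not the RouteLLM API: the function names are hypothetical, and the scores stand in for whatever a trained router would output as its estimate that a query needs the strong model. The threshold is calibrated on sample scores so that a target share of traffic (14% here, matching the figure above) goes to the strong model.

```python
def calibrate_threshold(scores: list[float], strong_fraction: float) -> float:
    """Pick a cutoff so roughly `strong_fraction` of calibration queries go to the strong model."""
    ranked = sorted(scores, reverse=True)
    n_strong = max(1, round(len(ranked) * strong_fraction))
    return ranked[n_strong - 1]

def choose_model(score: float, threshold: float) -> str:
    # Scores at or above the cutoff are treated as needing the strong model.
    return "strong" if score >= threshold else "weak"

# Synthetic calibration scores, evenly spread between 0.00 and 0.99.
calibration_scores = [i / 100 for i in range(100)]
threshold = calibrate_threshold(calibration_scores, strong_fraction=0.14)

routed = [choose_model(score, threshold) for score in calibration_scores]
print(threshold, routed.count("strong"), routed.count("weak"))
```

Raising the target fraction trades cost for quality; the calibration step just makes that trade explicit.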

LMSYS Blog / ICLR 2025, July 2024 ↗

[04] case study · saarthi

Clinical AI that doctors
can't afford to wait on.

Saarthi is used mid-consultation. A doctor asking about a patient's labs can't wait 4 seconds. The latency budget is tight, and the query path is designed around it.

Regex orchestrator (~50% of queries)

Intent classification runs before the LLM pipeline. Deterministic queries — specific medication lookups, appointment data, known-format requests — are handled by direct D1 tool calls. Zero LLM latency. Zero LLM cost.

Query planner (selective fetch)

Instead of fetching all patient data for every query, the planner identifies the minimal data set needed. A lab review query fetches observations only. A diagnosis query fetches conditions and history. Retrieval time drops proportionally.
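A simplified sketch of selective fetching. The intent names, the resource names beyond those mentioned above, and the in-memory `patient` record are assumptions for illustration, not Saarthi's actual schema.

```python
# Hypothetical intent-to-resource plan, following the two examples in the text.
PLAN: dict[str, tuple[str, ...]] = {
    "lab_review": ("observations",),
    "diagnosis": ("conditions", "history"),
}
ALL_RESOURCES = ("observations", "conditions", "history", "medications", "appointments")

def plan_fetch(intent: str) -> tuple[str, ...]:
    # Unknown intents fall back to fetching everything, trading speed for safety.
    return PLAN.get(intent, ALL_RESOURCES)

def fetch(patient: dict[str, list], intent: str) -> dict[str, list]:
    """Return only the resources the plan asks for."""
    return {name: patient[name] for name in plan_fetch(intent)}

# Illustrative in-memory record standing in for the real data store.
patient = {
    "observations": ["HbA1c 6.1%"],
    "conditions": ["type 2 diabetes"],
    "history": ["appendectomy 2015"],
    "medications": ["metformin"],
    "appointments": ["follow-up in 3 months"],
}
print(fetch(patient, "lab_review"))
print(sorted(fetch(patient, "diagnosis")))
```

Fewer resources fetched also means fewer tokens assembled into the prompt, so the saving shows up twice.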

Context budget (<6,000 tokens)

A hard token target — not a soft guideline — keeps generation time within bounds. Layer pruning drops lowest-priority context first when the budget is tight.

Provider fallback (OpenRouter + Anthropic direct)

Primary routing goes through OpenRouter, with Anthropic direct as the fallback. Provider latency spikes don't hit users directly. The model can be swapped in routing config without code changes.

If your AI is slow,
we'll find the bottleneck.

The Cost Diagnostic Sprint includes a full latency audit: query classification gaps, context bloat, caching opportunities, and provider routing. One week. Specific findings.

Start with a diagnostic
