NewThe Context Audit — first cohort now open.Register free →

Reliability failureContext

Your AI read it.
It just didn't pay attention.

Context window mismanagement is the most expensive invisible problem in production AI. You're paying for tokens the model isn't using, and the model is missing tokens it needs.

Failure type

Context design / RAG architecture

Where it shows up

Prompt assembly, retrieval pipeline

User impact

Inconsistent answers, token waste

[01]what it looks like in production

Four ways your context window
is working against you.

Context stuffing

Everything goes in: full conversation history, all retrieved documents, static reference data, verbose guardrails. No prioritisation. The model reads it all but only uses a fraction. You pay for the rest.

Lost-in-the-middle

Critical context — the most relevant retrieved chunks, the specific data point the user asked about — ends up buried in the middle of a long prompt where model attention is weakest. The answer drifts.

No context budget enforcement

Token estimates are logged but not enforced. When session history grows or a document gets uploaded, context quietly doubles. No pruning. No fallback. Just higher cost and lower reliability.

Static prompt structure

The same prompt layout regardless of query type. A simple lookup gets the same full context assembly as a complex multi-document synthesis. No routing, no trimming, no query-aware construction.

[02]root causes

The prompt is an architecture.
Most teams treat it as a string.

No context budget

Without a hard token limit enforced at runtime, context grows to fill available space. Past a threshold, more context reverses — more opportunity for the middle-zone attention trough to swallow relevant data.

Flat injection order

Prompt assembly treats all context as equal. There's no deliberate placement of high-priority information at primacy (beginning) or recency (end) zones where attention is highest. Critical content can end up mid-prompt by accident.

Monolithic retrieval

RAG pulls a fixed number of chunks regardless of query complexity. Simple queries get the same retrieval depth as complex ones. No query planner determines the minimal data set needed to answer the question.

Lossless history

Full conversation history is included verbatim until it hits a hard limit, then truncated bluntly. No compression. No summarisation. No priority-based pruning of older turns.

[03]what production teams found

Context architecture isn't optional
at scale.

Intercom — Fin AI2024–2025

Fin's agentic pipeline had "eager requests" — LLM calls and context assembly that fired before it was confirmed they were needed. The system assembled full context for interactions where Fin was never ultimately engaged. This wasted tokens on every such interaction and added latency before any user-visible work began.

Finding

Discovered via cost-per-interaction metrics embedded directly in production traces. Engineers could see token waste in real time rather than waiting for daily warehouse aggregates. Fixing the eager requests alone saved 2 seconds off median TTFT and eliminated a category of unnecessary context assembly.

Honeycomb / Intercom engineering, March 2025 ↗

Databricks — benchmark study2024

Databricks benchmarked Llama-3.1-405B and GPT-4-0125-preview on long-context tasks at production-representative lengths. The goal: establish where context length starts meaningfully degrading output quality, so engineering teams could design context budgets around real thresholds rather than model marketing specs.

Finding

Llama-3.1-405B starts degrading measurably after 32K tokens. GPT-4-0125-preview holds up until around 64K before accuracy begins falling. Both follow the U-shaped attention curve. Context budget decisions in production need to be anchored to measured thresholds, not advertised context limits.

Databricks research, 2024 ↗

Production teams at scale2024–2025

At tens of thousands of queries per day, teams running full long-context approaches ran cost/latency analyses against RAG architectures. The comparison wasn't close: long-context approaches ran 30–60x slower and cost 8–82x more per query. Teams at this scale were forced to build context budget enforcement and selective retrieval not for performance but for economic survival.

Finding

At 10K queries/day with a 100K-token prompt at GPT-4 Turbo pricing, that's $15,000/day in input tokens alone. Teams that architected for context discipline at lower traffic didn't hit this wall. The ones that didn't had to make painful architectural changes under production pressure.

TianPan.co analysis / RAGFlow 2025 review / multiple engineering blogs

[04]case study · saarthi

6,000 tokens.
Every one deliberate.

Saarthi's context window is constrained by design — not by cost, but by reliability. Every token placement is a deliberate engineering decision.

Zone-aware assembly

The system prompt is structured by attention priority. Safety-critical data and the doctor's question are placed at primacy. Current documents and guardrails at recency. Historical data sits in the middle — where it's less likely to cause grounding failures.

Query planner

Before any data is fetched, the query planner maps intent to the minimal D1 fetch plan. A "latest labs?" query fetches observations only. A "change the medication?" query fetches conditions, history, and guidelines. Retrieval depth matches query complexity.

<6,000 token target

Not just a cost target — a reliability constraint. Liu et al. (2023) shows the middle-zone attention trough scales with context length. At 6,000 tokens, primacy and recency zones are close enough that even middle content stays in tolerable attention range.

Zero-cost history compression

Conversation history older than the last 6 messages is compressed to the first sentence of each assistant turn. Zero LLM cost — works because G9 (Conversational Brevity) forces headline-first responses, making first-sentence extraction semantically useful.

If your context window is a mess,
we'll audit every layer.

The Cost Diagnostic Sprint includes a full context window audit: prompt structure, zone design, retrieval architecture, history compression, and token budget enforcement. One week. Specific findings.

Start with a diagnostic

Your AI read it.It just didn't pay attention.

Four ways your context windowis working against you.

The prompt is an architecture.Most teams treat it as a string.

Context architecture isn't optionalat scale.

6,000 tokens.Every one deliberate.

If your context window is a mess,we'll audit every layer.

Your AI read it.
It just didn't pay attention.

Four ways your context window
is working against you.

The prompt is an architecture.
Most teams treat it as a string.

Context architecture isn't optional
at scale.

6,000 tokens.
Every one deliberate.

If your context window is a mess,
we'll audit every layer.