Your AI read it.
It just didn't pay attention.
Context window mismanagement is the most expensive invisible problem in production AI. You're paying for tokens the model isn't using, and the model is missing tokens it needs.
Failure type
Context design / RAG architecture
Where it shows up
Prompt assembly, retrieval pipeline
User impact
Inconsistent answers, token waste
Four ways your context window
is working against you.
01
Context stuffing
Everything goes in: full conversation history, all retrieved documents, static reference data, verbose guardrails. No prioritisation. The model reads it all but only uses a fraction. You pay for the rest.
02
Lost-in-the-middle
Critical context — the most relevant retrieved chunks, the specific data point the user asked about — ends up buried in the middle of a long prompt where model attention is weakest. The answer drifts.
03
No context budget enforcement
Token estimates are logged but not enforced. When session history grows or a document gets uploaded, context quietly doubles. No pruning. No fallback. Just higher cost and lower reliability.
04
Static prompt structure
The same prompt layout regardless of query type. A simple lookup gets the same full context assembly as a complex multi-document synthesis. No routing, no trimming, no query-aware construction.
The prompt is an architecture.
Most teams treat it as a string.
No context budget
Without a hard token limit enforced at runtime, context grows to fill available space. Past a threshold, more context reverses — more opportunity for the middle-zone attention trough to swallow relevant data.
Flat injection order
Prompt assembly treats all context as equal. There's no deliberate placement of high-priority information at primacy (beginning) or recency (end) zones where attention is highest. Critical content can end up mid-prompt by accident.
Monolithic retrieval
RAG pulls a fixed number of chunks regardless of query complexity. Simple queries get the same retrieval depth as complex ones. No query planner determines the minimal data set needed to answer the question.
Lossless history
Full conversation history is included verbatim until it hits a hard limit, then truncated bluntly. No compression. No summarisation. No priority-based pruning of older turns.
Context architecture isn't optional
at scale.
Fin's agentic pipeline had "eager requests" — LLM calls and context assembly that fired before it was confirmed they were needed. The system assembled full context for interactions where Fin was never ultimately engaged. This wasted tokens on every such interaction and added latency before any user-visible work began.
Finding
Discovered via cost-per-interaction metrics embedded directly in production traces. Engineers could see token waste in real time rather than waiting for daily warehouse aggregates. Fixing the eager requests alone saved 2 seconds off median TTFT and eliminated a category of unnecessary context assembly.
Databricks benchmarked Llama-3.1-405B and GPT-4-0125-preview on long-context tasks at production-representative lengths. The goal: establish where context length starts meaningfully degrading output quality, so engineering teams could design context budgets around real thresholds rather than model marketing specs.
Finding
Llama-3.1-405B starts degrading measurably after 32K tokens. GPT-4-0125-preview holds up until around 64K before accuracy begins falling. Both follow the U-shaped attention curve. Context budget decisions in production need to be anchored to measured thresholds, not advertised context limits.
At tens of thousands of queries per day, teams running full long-context approaches ran cost/latency analyses against RAG architectures. The comparison wasn't close: long-context approaches ran 30–60x slower and cost 8–82x more per query. Teams at this scale were forced to build context budget enforcement and selective retrieval not for performance but for economic survival.
Finding
At 10K queries/day with a 100K-token prompt at GPT-4 Turbo pricing, that's $15,000/day in input tokens alone. Teams that architected for context discipline at lower traffic didn't hit this wall. The ones that didn't had to make painful architectural changes under production pressure.
TianPan.co analysis / RAGFlow 2025 review / multiple engineering blogs
6,000 tokens.
Every one deliberate.
Saarthi's context window is constrained by design — not by cost, but by reliability. Every token placement is a deliberate engineering decision.
Zone-aware assembly
The system prompt is structured by attention priority. Safety-critical data and the doctor's question are placed at primacy. Current documents and guardrails at recency. Historical data sits in the middle — where it's less likely to cause grounding failures.
Query planner
Before any data is fetched, the query planner maps intent to the minimal D1 fetch plan. A "latest labs?" query fetches observations only. A "change the medication?" query fetches conditions, history, and guidelines. Retrieval depth matches query complexity.
<6,000 token target
Not just a cost target — a reliability constraint. Liu et al. (2023) shows the middle-zone attention trough scales with context length. At 6,000 tokens, primacy and recency zones are close enough that even middle content stays in tolerable attention range.
Zero-cost history compression
Conversation history older than the last 6 messages is compressed to the first sentence of each assistant turn. Zero LLM cost — works because G9 (Conversational Brevity) forces headline-first responses, making first-sentence extraction semantically useful.
If your context window is a mess,
we'll audit every layer.
The Cost Diagnostic Sprint includes a full context window audit: prompt structure, zone design, retrieval architecture, history compression, and token budget enforcement. One week. Specific findings.