Context engineering takes 6 days, not 6 months

I was building Saarthi, a clinical AI for India. The system prompt told the model to be India-specific. It wasn’t.

Queries about diagnostic thresholds were coming back with answers that felt right but weren’t grounded in anything we had imported. So we did what most teams do: we assumed the model didn’t have enough India-specific data and imported more: Indian Council of Medical Research (ICMR) workflows, clinical guidelines, the full set of standards we could find.

That’s when it got worse.

The model started answering with more confidence. And one of those confident answers was wrong in a way that mattered. It told us that ICMR follows an older standard, not the WHO standard, for impaired fasting glucose (IFG) cutoffs for Indians. The actual ICMR document we had imported said the opposite: ICMR explicitly aligns with WHO values for IFG.

The model had read around the document and answered from what it already believed.

In a clinical context, a wrong answer about a diagnostic threshold isn’t a UX problem. It’s a patient safety problem.

The wrong instinct

Our first instinct — add more India-specific data — was an iteration response. We saw a symptom and tried to fix it by adding more of the thing we thought was missing.

That instinct is exactly how teams end up six months into a problem they could have diagnosed in two days.

The real question wasn’t “do we have enough data?” It was “what is actually in the context window when this query fires?”

When we looked — properly looked, not assumed — we found that the PDFs we had ingested weren’t being parsed correctly. The model was receiving malformed text from the documents and filling the gaps with its own parametric knowledge.

It was hallucinating confidently. The documents were decoration.
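A check like the one below could have flagged the problem before any query fired. It is a minimal sketch, not what Saarthi ran: the function names, the word-shape heuristic, and the 0.6 threshold are all assumptions chosen for illustration. The idea is simply to score extracted text by how much of it looks like real words or numbers, and to refuse to index anything that scores low.

```python
import re

# Heuristic (an assumption, not a standard): a token "looks like text" if it is
# a word of two or more letters, optionally hyphenated and with trailing
# punctuation, or a plain number.
_WORDLIKE = re.compile(
    r"[A-Za-z]{2,}(?:-[A-Za-z]+)*[.,;:]?|\d+(?:\.\d+)?%?[.,;:]?"
)


def extraction_quality(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like text."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = sum(1 for tok in tokens if _WORDLIKE.fullmatch(tok))
    return good / len(tokens)


def looks_malformed(text: str, threshold: float = 0.6) -> bool:
    """Flag extracted text whose quality score falls below the threshold."""
    return extraction_quality(text) < threshold


clean = "Fasting plasma glucose values are interpreted using the cutoffs in this guideline."
garbled = "F a s t i n g (cid:12)(cid:7) pl@sm# ~~~ g l u"

print(looks_malformed(clean), looks_malformed(garbled))  # False True
```

A heuristic this crude will misjudge tables, formulas, and non-English text, so the threshold needs tuning against documents you have read by hand. Its value is that it forces someone to look at what the parser actually produced.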

This is the failure mode that most teams never name. It doesn’t show up in evals. It doesn’t show up in your test suite. It shows up in production, on a query you didn’t write a test for, with an answer that sounds authoritative.

What the fix actually looked like

We stopped iterating and started diagnosing. Three changes, each traceable to a specific root cause:

We extracted the standards properly. Instead of feeding raw PDFs into the pipeline, we extracted structured text from the source documents and stored it in persistent memory. Now when the model retrieved a standard, it was reading the actual content, not a garbled render of a scanned PDF.

We separated the retrieval paths. Patient documents (images, scanned reports) went through one pipeline. Clinical standards went through another. RAG retrieval and reranking ran against the extracted documents, not the original files. This meant the model could no longer confuse its own training data with what we had actually given it.

We validated against the exact failure. The IFG cutoff query — the one that had confidently hallucinated — became our first regression test. Not the only test. The first one.
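Here is roughly what a regression check for that query can look like. This is a sketch under assumptions: the `Answer` type, the `source_ids` field, and the `icmr` id prefix are hypothetical stand-ins for whatever your pipeline returns, and the string checks are deliberately crude. It encodes only what the imported document said: ICMR aligns with WHO values for IFG.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Answer:
    """Hypothetical pipeline output: the answer text plus ids of cited sources."""
    text: str
    source_ids: List[str] = field(default_factory=list)


def check_ifg_regression(answer: Answer) -> List[str]:
    """Return a list of failure reasons; an empty list means the check passed."""
    failures = []
    # The answer must be grounded in the imported ICMR document.
    if not any(sid.lower().startswith("icmr") for sid in answer.source_ids):
        failures.append("no imported ICMR document cited")
    # The imported document says ICMR aligns with WHO values for IFG.
    if not re.search(r"\bWHO\b", answer.text):
        failures.append("WHO alignment not mentioned")
    # The original wrong answer claimed an older standard.
    if "older standard" in answer.text.lower():
        failures.append("repeats the 'older standard' claim")
    return failures


good = Answer("ICMR aligns with WHO values for IFG.", ["icmr-guideline"])
bad = Answer("ICMR follows an older standard for IFG cutoffs for Indians.")

print(check_ifg_regression(good))
print(check_ifg_regression(bad))
```

In a real suite the `Answer` would come from the live pipeline rather than a literal, and the checks would compare against the extracted passage itself instead of keywords.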

The confident wrong answer disappeared. Not because we had improved the model. Because we had fixed what was in the context window when the query fired.

The three ways context fails in production

I’ve run this same diagnostic across clinical AI, voice AI, and financial AI stacks since Saarthi. The symptom is always different. The root cause class is almost always one of three things:

[Figure: Three context failure modes: Context Bloat, Grounding Gap, and Context Ordering]

Context Bloat. Too much is entering the window: retrieved chunks that are plausible but not relevant, system prompt instructions that repeat across turns, conversation history that overrides current intent. The signal gets buried under noise, outputs get inconsistent, and costs compound. A voice AI company running 50,000 daily calls at 12,000 tokens per call processes 600 million tokens a day; at a blended rate of about $10 per million tokens, that is $6,000/day in API costs. Typically, about 40% of that is recoverable through context restructuring alone.
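The arithmetic behind the voice AI figure is worth spelling out, because the per-token rate is implied rather than stated. The sketch below uses only the numbers in the paragraph above; the function name and the idea of a single blended rate are simplifying assumptions, since real pricing usually differs for input and output tokens.

```python
def daily_context_cost(calls_per_day: int, tokens_per_call: int, daily_spend_usd: float):
    """Derive daily token volume and the implied blended rate per million tokens."""
    tokens_per_day = calls_per_day * tokens_per_call
    rate_per_million = daily_spend_usd / (tokens_per_day / 1_000_000)
    return tokens_per_day, rate_per_million


tokens_per_day, rate = daily_context_cost(50_000, 12_000, 6_000)
recoverable_usd = 6_000 * 40 / 100  # the 40% figure quoted in the text

print(tokens_per_day)   # 600000000
print(rate)             # 10.0
print(recoverable_usd)  # 2400.0
```

At those numbers, the recoverable share is about $2,400 a day before any model or prompt change.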

Grounding Gap. Retrieval returns documents that are close but not correct or, in Saarthi’s case, documents that look complete but aren’t readable. The model fills the gap with what it already knows. It answers confidently. It’s wrong. This is the pattern behind the Air Canada chatbot case: the bot told a passenger he could claim a bereavement fare retroactively, the airline’s policy said otherwise, and a Canadian tribunal held Air Canada liable for the answer.

Context Ordering. Where content sits in the window changes how the model weights it. Instructions buried 80% of the way into the window tend to get ignored. Recent turns can override system instructions. The architecture is correct but the layout is wrong, and the output is unpredictably wrong as a result.
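One way to make ordering visible is to measure where a critical instruction lands in the assembled prompt. The sketch below is illustrative only: `relative_depth` is a hypothetical helper, and it measures position in characters as a rough proxy for tokens.

```python
from typing import Optional


def relative_depth(context: str, snippet: str) -> Optional[float]:
    """Return how far into the context the snippet starts (0.0 = top), or None if absent."""
    if not context:
        return None
    idx = context.find(snippet)
    if idx == -1:
        return None
    return idx / len(context)


# Toy context: 80 characters of filler, the instruction, then 16 more of filler.
context = "A" * 80 + "RULE" + "B" * 16
depth = relative_depth(context, "RULE")

print(depth)  # 0.8
```

A `None` result is a finding in its own right: the instruction you thought was in the window never arrived.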

The context decision that caused the failure was almost always made in sprint 1. Nobody traced it when things went wrong.

Why teams spend six months on a six-day problem

When something fails in production, the default is iteration.

Try a different prompt. Upgrade the model. Add more data. Each cycle takes a sprint. By month three you’ve narrowed it down. By month five you have something that mostly works. By month six you declare good enough.

[Figure: The iteration trap vs the diagnostic path: why context engineering takes 6 months or 6 days]

The engineers doing this aren’t bad engineers. They’re applying the only framework they have: try things until something changes.

The problem is that nobody teaches you to audit the context window first. The training you get — from courses, from documentation, from how LLM products are sold — is “iterate on prompts until it works.” Diagnosis as a method doesn’t exist in most teams’ vocabulary.

What compresses it to six days

Cross-stack pattern recognition.

If you’ve seen a grounding gap fail in a clinical pipeline, a voice pipeline, and a financial pipeline, you recognise it within two hours of opening a context audit. You don’t spend six weeks eliminating the wrong hypotheses.

The six days aren’t six days of fixing. They’re:

Day 1–2: Audit — read the assembled context as the model sees it. Not what you intended to put in. What actually arrived.
Day 3: Diagnose — identify which failure class it falls into and trace it to its origin decision.
Day 4–5: Fix — one targeted change per root cause. Not a prompt rewrite. A structural change.
Day 6: Validate — the original failure case, and a regression suite built from it.
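The audit step can start very small. The sketch below assumes your pipeline can hand over the final prompt as an ordered list of labelled segments; the `audit_context` name, the segment labels, and the 80-character preview are all assumptions made for illustration. It reports how much of the window each segment occupies and shows the first characters of what actually arrived.

```python
from typing import Dict, List, Tuple


def audit_context(segments: List[Tuple[str, str]], preview_chars: int = 80) -> List[Dict]:
    """Summarise each labelled segment of an assembled prompt, in order."""
    total = sum(len(text) for _, text in segments) or 1  # avoid dividing by zero
    report = []
    for label, text in segments:
        report.append({
            "label": label,
            "chars": len(text),
            "share": len(text) / total,      # fraction of the window this segment uses
            "preview": text[:preview_chars],  # what actually arrived, not what was intended
        })
    return report


# Hypothetical assembled prompt: a short system prompt and a larger retrieved block.
segments = [
    ("system", "x" * 100),
    ("retrieved", "y" * 300),
]

for row in audit_context(segments):
    print(f"{row['label']:<10} {row['chars']:>6} chars  {row['share']:.0%}  {row['preview'][:20]!r}")
```

Reading the previews by eye is the point. A segment that takes most of the window, or a preview full of parser debris, is where the diagnosis on Day 3 usually begins.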

The audit is the step that doesn’t exist in most teams’ process. Adding it is what changes the timeline.

If your LLM has a reliability or cost problem right now

The question isn’t how long it will take to fix. The question is whether anyone has diagnosed it.

If you want to run this diagnostic yourself, I’m building a structured course — the same framework, taught as a repeatable process any AI engineering team can run independently. First cohort is free — register here.

Or if you want us to run it for you, book 20 minutes here.