
Reliability failure · Grounding

Your AI is confident
and wrong.

Hallucination isn't a model problem. It's a design problem. When production AI gives wrong answers with full confidence, the root cause is almost always a grounding architecture that was never enforced.

Failure type: Grounding / attribution

Where it shows up: Any RAG or retrieval system

User impact: Wrong answers served with confidence

[01] what it looks like in production

Four signals your team
is probably ignoring.

Confident fabrication

The model generates plausible-sounding answers that contradict the source data. No hedging. No "I don't know." Wrong with authority.

Citation drift

RAG retrieves the right document but the response drifts from it — paraphrasing, inferring, or filling gaps with training data instead of the retrieved text.

Guardrail bypass

Behavioural instructions say "only use provided data" but aren't enforced structurally. The model complies in eval and drifts in production.

Silent contradiction

Two documents contain conflicting information. The model picks one silently — no flag, no caveat, no audit trail. The wrong one often wins.

[02] root causes

Not the model. The architecture.

Weak grounding instructions

Instructions like "answer based on the documents" are advisory. Without structural enforcement — citation requirements, explicit prohibitions on inference — the model treats them as style guidelines.
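One way to make a citation requirement structural rather than advisory is to validate the output before it is served. The sketch below assumes a hypothetical convention in which the model must end every sentence with a marker such as [doc:policy_12]; the marker format and the function name are illustrative, not taken from any particular system.

```python
import re

# Hypothetical citation marker: [doc:<id>] somewhere in each sentence.
CITATION = re.compile(r"\[doc:([A-Za-z0-9_\-]+)\]")


def find_ungrounded_sentences(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return sentences with no citation, or citing a document that was not retrieved."""
    problems = []
    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = set(CITATION.findall(sentence))
        # Flag the sentence if it cites nothing or cites an unknown document.
        if not cited or not cited <= retrieved_ids:
            problems.append(sentence)
    return problems


if __name__ == "__main__":
    answer = (
        "Fares must be requested before travel [doc:policy_12]. "
        "Refunds can be claimed retroactively."
    )
    # The second sentence carries no citation, so it is flagged.
    print(find_ungrounded_sentences(answer, {"policy_12"}))
```

A response with any flagged sentence can be blocked, regenerated, or routed to a fallback instead of being shown to the user. The check only proves a citation is present and points at a retrieved document; whether the sentence is actually supported by that document needs a separate verification step.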

Middle-zone burial

Relevant context placed in the middle of a long prompt is used less reliably than context at the start or end (the "lost in the middle" effect, Liu et al. 2023). The model then tends to fill the gap with pre-training knowledge instead of retrieved facts.
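A common mitigation is to reorder retrieved chunks so the strongest ones sit at the edges of the context block and the weakest land in the middle. The following is a minimal sketch, assuming the retriever already returns chunks sorted from most to least relevant; the function name is hypothetical.

```python
def order_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end, the least relevant in the middle.

    Input must be sorted from most relevant to least relevant.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


if __name__ == "__main__":
    ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
    print(order_for_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Reordering reduces the chance that the key passage is buried, but it does not replace trimming the context to what the question actually needs.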

No contradiction detection

When the retrieval pipeline surfaces conflicting data, there's no guardrail to flag it. The model resolves the contradiction silently during generation.
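A contradiction guardrail can run before generation, on the retrieved material itself. The sketch below assumes an upstream step has already extracted (document, field, value) facts from each retrieved document; that extraction step, the Fact type, and the field names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    doc_id: str
    field: str
    value: str


def find_conflicts(facts: list[Fact]) -> dict[str, dict[str, list[str]]]:
    """Map each field that has more than one distinct value to {value: [doc_ids]}."""
    by_field: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for fact in facts:
        # Light normalisation so case and padding differences are not flagged.
        by_field[fact.field][fact.value.strip().lower()].append(fact.doc_id)
    return {f: dict(vals) for f, vals in by_field.items() if len(vals) > 1}


if __name__ == "__main__":
    # Made-up facts for illustration only.
    facts = [
        Fact("faq_page", "refund_window", "Before travel"),
        Fact("help_article", "refund_window", "within 90 days"),
        Fact("faq_page", "baggage_limit", "23 kg"),
        Fact("help_article", "baggage_limit", "23 kg"),
    ]
    print(find_conflicts(facts))
```

When the result is non-empty, the pipeline can surface both values with their sources, or refuse to answer, rather than letting the model choose one during generation.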

Eval–prod mismatch

Evals use curated queries on clean data. Production surfaces edge cases — unusual queries, sparse retrieval, conflicting documents — that eval never covered.

[03] what it costs in production

Three cases where grounding
failed publicly.

Air Canada · 2024

A passenger asked Air Canada's website chatbot about bereavement fares. The chatbot told him the fare could be claimed retroactively, directly contradicting the policy on another page of the same website. Air Canada tried to disclaim liability by arguing that the chatbot was a "separate legal entity" responsible for its own actions. The tribunal rejected this and held the airline liable. Award: about CA$812, including tribunal fees.

Root cause

The RAG pipeline had two documents with conflicting information. No contradiction detection. No attribution requirement. The chatbot picked the wrong one with full confidence.

British Columbia Civil Resolution Tribunal, February 2024

Google AI Overviews · 2024

Google launched AI Overviews in Search at I/O 2024. Within days, screenshots went viral of the feature advising users to add glue to pizza and to eat at least one small rock per day, among other dangerous misinformation. Google had to manually remove examples and scale back the rollout. The feature that shipped had passed internal evals.

Root cause

Generation was not structurally grounded against authoritative sources. The model synthesised answers from the open web, including satirical articles and Reddit posts, with no citation enforcement or source quality filter.

Google I/O 2024, widely reported, May 2024

AI-generated legal briefs · 2023–2025

Over 120 documented cases of lawyers filing court documents citing AI-generated case law that does not exist. In one case, a $31,100 sanction was imposed. In another, a federal judge ordered the offending attorney to send copies of the sanction order to every judge cited in the fabricated brief.

Root cause

LLMs generate plausible case citations by pattern-matching against training data. Without retrieval grounding against a verified legal database and citation verification, fabricated citations are indistinguishable from real ones in the output.

ABA Journal / court records, 2023–2025

[04] case study · saarthi

Clinical AI that cannot
afford to be wrong.

Saarthi is a clinical decision support AI used by doctors to review patient data. The grounding requirements are non-negotiable — a fabricated lab value or a missed drug interaction is a patient safety event.

G1 — Source Grounding

Every clinical claim must cite a specific document or data point. This is not advisory: the requirement is structurally enforced in the recency zone (the end of the prompt), where attention is highest.

G2 — Deterministic Grounding

Lab values quoted exactly as they appear in source data. No rounding, no interpretation. "Hb: 9.2 g/dL (document: CBC_2024-11-12)" — not "low haemoglobin."
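A rule like this can be checked mechanically after generation. The sketch below is not Saarthi's implementation; it assumes responses use the claim format quoted above and that each source document stores values in the same "name: value unit" layout. The function name and the sources mapping are hypothetical.

```python
import re

# Matches the quoted format, e.g. "Hb: 9.2 g/dL (document: CBC_2024-11-12)".
CLAIM = re.compile(r"(\w+): ([\d.]+ \S+) \(document: ([\w\-]+)\)")


def unverified_values(response: str, sources: dict[str, str]) -> list[str]:
    """Return claims whose 'name: value unit' text is not found verbatim in the cited document."""
    bad = []
    for match in CLAIM.finditer(response):
        name, value, doc_id = match.groups()
        # An unknown document id yields empty text, so the claim is flagged.
        source_text = sources.get(doc_id, "")
        if f"{name}: {value}" not in source_text:
            bad.append(match.group(0))
    return bad


if __name__ == "__main__":
    sources = {"CBC_2024-11-12": "Hb: 9.2 g/dL"}
    exact = "Hb: 9.2 g/dL (document: CBC_2024-11-12)"
    rounded = "Hb: 9 g/dL (document: CBC_2024-11-12)"
    print(unverified_values(exact, sources))    # []
    print(unverified_values(rounded, sources))  # the rounded claim is flagged
```

Exact string matching is deliberately strict: a rounded or reformatted value fails the check even if it is clinically equivalent, which is the behaviour the rule asks for. Claims that do not follow the format at all are not seen by this check and would need a separate rule.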

G8 — Contradiction Detection

When two documents contain conflicting data, the model flags it explicitly instead of resolving it silently. The doctor decides — the AI doesn't.

Question Bookending

The doctor's question is injected at both primacy and recency — never buried in the middle. The model's attention stays on what it's supposed to answer, not what it already knows.
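Bookending amounts to a prompt-assembly rule. The sketch below shows one possible layout, with the question stated before the context and repeated after it; the wording of the template and the function name are assumptions, not the actual Saarthi prompt.

```python
def build_bookended_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt with the question at the start (primacy) and at the end (recency)."""
    context = "\n\n".join(context_chunks)
    return (
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        "Answer using only the context above. "
        f"Question (repeated): {question}"
    )


if __name__ == "__main__":
    prompt = build_bookended_prompt(
        "What is the most recent haemoglobin value?",
        ["Hb: 9.2 g/dL (document: CBC_2024-11-12)"],
    )
    print(prompt)
```

The grounding instruction sits next to the repeated question so that both land in the recency position, after all retrieved material.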

If your AI is fabricating answers,
we'll find exactly where.

The Cost Diagnostic Sprint includes a full grounding audit: context design, retrieval attribution, guardrail enforcement gaps, and eval coverage. One week. Specific findings.

Start with a diagnostic
