Telling an AI not to hallucinate is like telling a person not to make mistakes

IBM 1979: a computer can never be held accountable, therefore a computer must never make a management decision

Slide from an internal IBM training presentation, 1979. The image first surfaced online in 2017 via a former IBM employee; the original document is reportedly lost (destroyed in a flood) and IBM's own archives have not been able to locate a copy. Background on Know Your Meme; Simon Willison's commentary.

The sticky note

In August 2024, a Reddit user poking through the macOS 15.1 developer beta found a folder at /System/Library/AssetsV2/com_apple_MobileAsset_UAF_FM_GenerativeModels/ containing 29 JSON files. Inside were the literal system prompts powering Apple Intelligence. Two of the strings, reported widely at the time, have aged in interesting ways:

Do not hallucinate.

Do not make up factual information.

A trillion-dollar company's documented strategy for the hardest open problem in generative AI, as of mid-2024, was a sticky note that said "be smart." The internet laughed for a week and then moved on, which is a shame, because the joke turned out to be a fairly accurate description of how the rest of the industry was already operating. Almost two years later, that same mental model is still the dominant pattern in production AI systems. Most "AI safety" stories, if you scroll them carefully enough, are a longer version of the same sticky note.

Asking the system that produces the error to suppress it

Telling an LLM not to hallucinate is the engineering equivalent of telling a person not to make mistakes. The thing being asked to suppress the error is the same thing producing it, and the request itself adds no new information to the model's situation. It is, more or less, a polite tap on the shoulder of a stochastic process.

It is worth holding this next to how every other branch of software engineering deals with unreliability. No production database ships with a config flag for corruption: false. We do not get to ask the thing that breaks to please not break. Reliability in software comes from putting a separate checking system in front of the thing that breaks, where the verifier and the generator are different artifacts with different failure surfaces. Hardware fault-tolerance, distributed consensus, type checkers, schema validators, formal methods, even basic input sanitization, all reach for the same shape. AI engineering, oddly, hasn't internalised that pattern yet, and the Apple prompt is what the gap looks like when it gets shipped to a billion devices.

Two errors no prompt can fix

Ask a current frontier LLM how many R's are in "strawberry." Watch it answer two. Ask it again, in a different chat, with a different system prompt, and you will usually see the same answer. The model is not being lazy and it is not making a content error in the usual sense. It is unable to see letters. Tokenization splits "strawberry" into something like ["str", "aw", "berry"] before the model ever processes the string, so what the model is actually counting is tokens that vaguely correlate with the letter R, not the letters themselves (tokenization writeup). Adding "be careful when counting characters" to the system prompt is an instruction to be careful with a thing the model cannot perceive in the first place.

The second example sits a layer up. Ask the same model "the car wash is 50 meters away, should I walk or drive?" and most leading models will tell you to walk. They optimise for the distance question and miss the goal, which is getting the car cleaned. IBM ran this test against 53 models earlier this year and found that only 11 got it right on a single shot, and only 5 of those got it right ten times in a row. It is a world-model failure rather than a knowledge failure: the relevant information is all available to the model, but there is no embedded representation of "things have to physically move with you" for the model to draw on. A system prompt does not write a missing world model into existence.

Both errors are produced by the policy. Both errors are, importantly, invisible to a judge built from the same model, because the judge inherits the same blind spots. The Apple prompt operates on an implicit assumption that the model can self-diagnose. The strawberry-R answer and the car wash answer, between them, are most of the argument that it cannot.

Why it structurally fails

Here is roughly what a prompt-only guardrail looks like in practice:

flowchart TD
  user([User input])
  policy["Policy LLM<br/>generates answer + the hallucination"]
  judge["Judge LLM<br/>rates the answer"]
  prompt[/'do not hallucinate' prompt/]
  out([Approved output])

  user --> policy
  policy --> judge
  prompt -.-> judge
  judge --> out

The judge and the policy share blind spots: same training corpus in most cases, the same model family in many, often the same weights with a different system prompt. When the judge sees a confabulation from the policy, what it sees is plausible text, and plausible text passes. The judge approves.

This effect is now measurable in a way that was not true even a year ago.

Study	Setup	Result
Snorkel self-critique paradox	98%-accurate task pipeline + LLM self-critique loop	Accuracy fell to 57%
No Free Labels (arxiv 2503.05061)	Models scoring their own outputs vs human raters	GPT-4o overscores by ~10%, older Claude by ~25%
Carnegie Mellon office-task benchmark	Frontier agent on routine office tasks	65–70% failure rate
Contextual bias in security review (arxiv 2603.18740)	LLM reviewer on vuln-reintroducing PRs with security framing	88.2% approved or marked unclear

What an LLM judge is actually doing, in the relevant sense, is plausibility detection against its own training distribution. Truth is a separate question, and that question is not one the architecture can answer from the inside.

The Apple prompt is what you write when you try to fix this from inside the box. The interesting follow-on question is what else gets written when an entire industry tries to fix it from inside the box.

The same trap in code review

The most concrete version of this pattern in production right now is the "AI writes the PR, AI reviews the PR" pipeline. Engineering orgs have spent the last year wiring this loop in everywhere: a coding agent opens a PR, a reviewer agent gates it, sometimes the human is in the loop and sometimes there is no human, sometimes the reviewer is a different model and sometimes it is literally the same model with a stricter system prompt. It is the Apple sticky note translated into a release engineering pipeline, except that now the sticky note has a CI status check next to it.

The numbers are starting to come in and they are bad. The Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review paper measured what happens when an attacker frames a vuln-reintroducing PR with security-positive language ("urgent fix", "security improvement"). Across multiple frontier models, adversarial framing flipped 88.2% of vuln-reintroducing PRs to approval or to an unclear recommendation, and the same PRs were rejected once the framing was debiased. The reviewer model and the policy model were often the same model. The persuasion did not require novel exploits or sophisticated prompts. It required commit messages that sounded like security work.

The other half of the failure mode is the reverse. Are LLMs Reliable Code Reviewers? measured what happens when you ask an LLM reviewer to explain its reasoning and propose fixes, which is the default in most agentic code review tools. The result was systematic overcorrection: when asked to articulate reasons, the reviewer became more likely, not less, to flag correct code as broken, and the false-negative rate against requirements rose substantially. Articulating reasons does not move the model closer to ground truth, it moves it into a regime where it confidently invents justifications for hallucinated flaws.

What both findings have in common is the structural point. In production "AI writes, AI reviews" pipelines, the judge does not just have the same blind spots as the policy, the judge in many cases is the policy with a different system prompt. The pipeline is running the Apple sticky-note pattern at industrial scale and calling it review.

sequenceDiagram
    participant U as User / PR template
    participant W as LLM (writer hat)
    participant R as LLM (reviewer hat)
    participant M as Merge queue
    U->>W: "write feature X"
    W-->>U: PR with code + security-framed commit msg
    U->>R: review PR
    Note over W,R: same weights, same blind spots,<br/>different system prompt
    R-->>M: ✅ approved (88.2% of the time<br/>under adversarial framing)
    M->>M: ship to prod

Context as a poison vector

There is a second mechanism, separate from same-model-judge bias, that the sticky note cannot address: the fact that mistakes in the context window do not just persist, they compound.

When a wrong claim lands in a model's context, whether that is a hallucinated function name, a false premise in an earlier turn, or a wrong line of reasoning the model walked itself into, the model has no built-in mechanism to distinguish that claim from a true one (Elastic on context poisoning). Everything in the context window is treated as a premise. Subsequent turns inherit those premises and build on top of them. The model that hallucinated database.exec_many() in turn 1 will, in turn 4, refactor database.exec_many() calls with confidence, because the context says "this is your code."

Chroma formalised the broader phenomenon in 2025 with what they called context rot (Context Rot: How Increasing Input Tokens Impacts LLM Performance). They tested 18 frontier models and every single one of them got worse as input length grew. A model with a 200K-token window can show meaningful degradation at 50K tokens, well before any "context limit" warning fires. The "lost in the middle" effect, originally measured by Liu et al. in 2023, shows that attention to information sitting in the middle of a long prompt falls off compared to information at the start or the end, dropping retrieval accuracy by more than 30% on multi-document QA over 20 documents. The concrete failure modes are by now well documented: GPT-4.1 starts inserting duplicate words; Gemini loses the plot and produces unrelated paragraphs drawn from its training data; multiple models exhibit topic drift the further into a long input you go.

The reason this matters for agentic AI specifically is that the cascade is structural. Every tool call adds tokens. Every turn inherits the errors of the previous turn. A 20-turn coding agent run is not just longer than a 5-turn run, it is operating on a noisier and increasingly polluted context, where small errors from the early turns are still sitting there as premises in the late turns. The system prompt, sitting at the top of the context window, gets steadily out-weighted by the accumulating debris underneath it.

flowchart TD
    P["System prompt<br/>'do not hallucinate'"]
    T1["Turn 1<br/>hallucinates db.exec_many()"]
    T2["Turn 2<br/>writes tests for db.exec_many()"]
    T3["Turn 3<br/>refactors db.exec_many()"]
    T4["Turn 4<br/>'review this code'<br/>approves db.exec_many()"]
    P -.-> T1
    P -.-> T2
    P -.-> T3
    P -.-> T4
    T1 --> T2 --> T3 --> T4
    T1 -. "bad premise persists,<br/>compounds, gets approved" .-> T4

Even a perfectly compliant model would inherit that debris, and any review pass would inherit it too. "Do not hallucinate" cannot reach back through the conversation history and remove the hallucination that already happened.

What actually works

Move the judge out of the box.

# Doesn't work:
answer = llm(prompt=prompt, system="Do not hallucinate.")

# Works:
answer   = llm(prompt=prompt)
claims   = extract_claims(answer)
evidence = [retrieve_from_grounded_source(c) for c in claims]
verified = [c for c, e in zip(claims, evidence) if deterministic_match(c, e)]
if len(verified) < len(claims):
    return handoff_to_human(answer, claims, verified)

A useful verifier has three properties: it is a different system from the generator, ideally a different paradigm entirely (retrieval, schema validation, type checkers, SQL EXPLAIN, parsers, formal methods, unit tests against known-good data); it is deterministic wherever possible, so JSON.parse and a type checker and a SQL planner are all closer to "perfect guardrails" than another LLM grading vibes; and it lives outside the generation loop, so the model never sees it as instruction and has nothing to bypass through prompt engineering. The same logic applies to the code review case: the only escape from the "same-model judge" problem is to make the judge structurally different, which usually means tests, static analysis, formal verification, or a human, in some combination.

This is also the structural point Yann LeCun has been making for years about agentic AI needing a world model, namely that you need a separate system capable of predicting consequences before you let the LLM act, and it is what Karpathy's "decade of agents" framing is gesturing at when it points to the gap between current capability and the agent demos. Whatever the verification substrate eventually looks like, a longer system prompt is not going to be the thing that gets us there.

Why this matters more in agentic AI

A hallucination in single-turn chat is wrong text on a screen. A user reads it, frowns, and asks again. A hallucination in an agentic loop is a wrong action: a deleted row, a sent email, a kubectl apply to prod, a transferred file, a closed PR. The errors compound through the loop, the context poisons over time, and the "do not hallucinate" sticky note now governs a system that will go and execute the lie. Simon Willison has been warning about exactly this kind of failure mode for two years under the name lethal trifecta, where untrusted input, private data, and tool use end up sharing a context window, and every viable mitigation he documents has the same shape as the verifier pattern above: externalise the judge, narrow the action surface, and make the bad state physically unreachable rather than nominally forbidden.

Deterministic systems are not going anywhere

There is a clarification worth making here because the discourse keeps muddling it. A trained neural network is in principle deterministic at inference time: identical weights, temperature zero, fixed seed, identical input, you get identical output. Production LLMs almost never run that way. They run with sampling, with shifting system prompts, with context windows that are slightly different on every call, which means every answer is in practice a draw from a probability distribution over possible completions. The mean of that distribution may be correct on a good task. The tails are wrong. The model has no introspective signal that tells you which side of the distribution the current draw sits on.

The strawberry-R example sits exactly here. Anyone who has run the question through a non-tool-using model a few dozen times will recognise the pattern: the answer is "two" with very high probability and the occasional "three" is the outlier rather than a sign that the model is closing in on the right answer. Karpathy walked through why the failure mode is structural rather than statistical in his nanochat discussion on the same problem: the model has to break tokens into characters, hold a counter, and compare each character to the target, and the path from input to output simply does not run that subroutine. Adding samples sharpens the wrong answer; it does not shift it.

This is where the "just stack more LLM judges" school of thought runs into a hard ceiling. Particle physicists call a result a discovery at five sigma, roughly one chance in three and a half million of being noise, and the dream behind majority-vote ensembles of LLM verifiers is to chase that kind of reliability number by piling on more agents. The math behind that approach only works when the samples are statistically independent. LLM samples from the same model family are not independent in the way coin flips are. They share training data, tokenization schemes, attention patterns, and consequently they share blind spots. The 2026 arxiv paper Existing LLMs Are Not Self-Consistent For Simple Tasks measured this directly, showing that even DeepSeek-R1 and GPT-4 fail to give themselves the same answer twice on tasks as basic as comparing two points on a line. Stacking inconsistent models does not produce a consistent ensemble; it produces correlated noise. What stacked LLM judges do reliably is argue about taste: two reviewer models will happily disagree on whether to call a function extractClaims or parseClaims while the missing null check on line fourteen sails through untouched.

The unglamorous answer to "where do guardrails actually go" was already sitting on the shelf before any of this started. The deterministic-tooling stack is mature, decades old, has well-understood false-positive and false-negative profiles, and is auditable: type checkers, schema validators, linters, code-smell detectors, contract tests, property-based tests, fuzzers, formal-method verifiers. None of it is glamorous and none of it is being funded at the rate AI IDE wrappers are. All of it is what actually catches the bugs that ship.

The cleanest illustration of all this is the punchline OpenAI has been quietly writing for two years. The "how many R's in strawberry" failure was never closed by a better system prompt, a stack of three GPT-4 verifiers, or a different sampling temperature. It was closed by giving the model a tool. OpenAI never published a formal changelog for this, only a celebratory ChatGPT tweet in April 2026, but the mechanism is documented in OpenAI's own Code Interpreter guide: when ChatGPT now answers the strawberry question correctly, the model is being routed through a code-interpreter sandbox that runs "strawberry".count("r") in Python and reads the deterministic integer back. The reason it took so long to ship is described well in the tokenization writeup that originally explained the failure. The fix lives outside the model. It is a separate, deterministic, externally verifiable system the model is allowed to call. Coverage of the rollout notes the strawberry answer is fixed while other tokenization-driven confident-mistake patterns ("how many R's in cranberry," for example) still leak through, which is precisely what you would expect: the surface bug was patched with a tool, the underlying blind spot remains, and every new failure mode in the same family will need its own deterministic tool.

Deterministic programs are not going anywhere. The bill for an industry that wants reliability without paying for the architecture has started arriving in measurable form, and the cheapest way to start paying it down is to invest in the unglamorous tooling that already knows how to catch the bugs the model cannot see in itself. LLMs become first-class authors that submit to the verification gauntlet human-written code has always submitted to. The Apple sticky note will sound, in retrospect, like the moment the field nearly convinced itself otherwise.