Hallucination in LLMs: is it just semantics?

Stamper's 1973 semiotic ladder: physical world, empirics, syntactics, semantics, pragmatics, social world — with truth listed on the semantics rung and software on syntactics

Stamper's (1973) semiotic ladder as cited by Liu (2000) — figure via Nestori Syynimaa, ResearchGate. Read the rung inventories closely: they were written before any of this existed, and they place everything.

When someone waves away a distinction with "that's just semantics," they mean it's trivial wordplay. So here is a sentence that is both a pun and a thesis: LLM hallucination is just semantics — in the technical sense, and it is anything but trivial. Linguists have spent nearly a century slicing language into layers, and if you grade a modern LLM layer by layer, the failure lands in one precise place. Not where the popular discourse puts it ("the model is babbling nonsense"), and not where early NLP expected it (grammar). It lands inside semantics, along the fault line between meaning and truth.

That location matters more than it sounds, because of what it says about where machines stand. For seventy years, software lived below the semantics line — formal structure, logic, data, files. LLMs are the first machines to break through into the semantics layer at all, and the honest grade is that they are doing a startlingly good job there — with reliability that remains stochastic. The trouble is the spell this casts: because the thing converses, we deploy it as if it were functional several rungs higher, in the social world of commitments and account access. The gap between the rung it stands on and the rung we wire it into is where the real incidents happen — this post ends with one that cost 20,000 people their Instagram accounts.

The layer stack

The three-way split of syntax, semantics, and pragmatics comes from semiotics — Charles Morris drew it in Foundations of the Theory of Signs (1938): how signs relate to each other, to the world, and to their interpreters. Linguistics adopted the triad as its standard decomposition. Ronald Stamper, working in organizational semiotics, extended Morris's ladder in both directions — downward through empirics and the physical world (for an LLM, read: codes and tokens, then weights and GPUs), and upward to a social world of beliefs, commitments, contracts, and law. To be clear about provenance: you will not find that top rung in any linguistics model — linguistics stops at pragmatics and handles commitments inside it, via speech act theory. The social rung comes from information-systems semiotics, where it carries the field's core principle that signs have value only through the social change they produce. The field's textbook is Kecheng Liu's Semiotics in Information Systems Engineering (Cambridge, 2000) — which, in this post's spirit of honest citation: is paywalled, and named here by reputation, not because I have read it. For a source you can actually open, Arantes (2013) walks all six rungs — physical world, empirics, syntactics, semantics, pragmatics, social world — using the ladder on a real system. The bottom rungs are the IT platform, and it is tempting to wave them off as "works fine" — but they touch the model more than that. Empirics is Shannon's rung — codes, entropy, channels — and for an LLM it is load-bearing: tokenization is a coding scheme, and its artifacts leak all the way up. A model that cannot count the r's in strawberry is not failing at reasoning; it is failing at empirics — it reads token IDs, not characters. The sampling step lives here too: logits, a distribution, a draw — the stochasticity this post keeps running into is manufactured at this rung, and temperature is literally a dial on the distribution's entropy. Even the physical world intrudes: floating-point arithmetic on parallel hardware is not associative, so "deterministic" decoding isn't quite. The difference is that these failures are bounded and diagnosable — you can point at the token table. It is the top of the ladder where the trouble gets interesting.

Look at what 1973 wrote on the steps — the ladder at the top of this post. Software, files, logic, data — that is the syntactics rung, and it is where every machine before this generation lived. Compilers, databases, type checkers, the entire profession: formal structure manipulated formally, no meanings anywhere. And one rung up, on semantics: propositions, validity, truth. The ladder placed truth inside semantics fifty years before anyone needed to explain hallucination. LLMs are the first machines to climb onto that rung — they hold the meanings end of its inventory remarkably well, and the truth end only by the draw. A fifty-year-old diagram locates the breakthrough and the failure on the same step. Borrowing Stamper's social rung on top of Morris's three gives the four levels worth grading:

Syntax — is the sentence grammatically well-formed?
Semantics — does it mean something?
Pragmatics — does it do the right thing in context: answer the question asked, hit the right register, imply what it should?
Social — does it create commitments and obligations between agents: promises, warranties, accountability?

The stack is a diagnostic instrument. Point it at an LLM and ask: at which layer does the output stop being trustworthy?

flowchart TD
    %% one node per layer, grade inside the node — no edge labels, no side notes
    %% the coherence/truth fault line runs INSIDE semantics — that is the thesis
    OUT["LLM output"] --> SYN["Syntax: near-certain"]
    SYN --> SEM

    subgraph SEM["Semantics"]
        COH["Coherence: reliable"] --> TRU["Truth: unwarranted — hallucination"]
    end

    TRU --> PRA["Pragmatics: right per move, not per task — wrong silently"]
    PRA --> SOC["Social: open — accountability stays outside"]

    %% grade colors as borders only — fills inherit the site theme (Nord, light/dark)
    %% green holds, red breaks, amber partial, gray unresolved
    classDef green stroke:#a3be8c,stroke-width:2.5px
    classDef red stroke:#bf616a,stroke-width:2.5px
    classDef amber stroke:#ebcb8b,stroke-width:2.5px
    classDef gray stroke:#7b88a1,stroke-width:2.5px
    class SYN,COH green
    class TRU red
    class PRA amber
    class SOC gray
    style SEM fill:none,stroke:#888,stroke-dasharray:4 4

The canonical layers — syntax, semantics, pragmatics, social — with semantics opened up: coherence holds, truth doesn't, and the fault line runs inside the box. The red grade means "unwarranted," not "false" — any given output may be true by chance; verification reveals which draw you got.

Syntax: solved, and that was the surprise

Modern LLMs essentially never produce ungrammatical text. Whatever else goes wrong, the sentences parse. It is worth pausing on how shocking this is historically: for half a century, grammar was supposed to be the hard part of language — the thing requiring innate structures, formal rule systems, decades of parser engineering. It turned out to be the layer that falls most readily out of next-token prediction over enough text. Nobody proofreads an LLM's grammar, and that sentence would have sounded like science fiction to most of the field in 2015.

To be precise even here: this is not a guarantee either. It is the same stochastic process that fails at truth — the distribution just concentrates almost all of its mass on grammatical strings. Same dice, overwhelmingly better odds. The layers differ in probability, not in kind.

Semantic coherence: reliable

The next layer up also holds. LLM output is not word salad. Sentences are internally meaningful, terms are used in ways consistent with what they denote, and the text is interpretable end to end. A model will not tell you the quadratic formula tastes purple. Coherence — meaningfulness — is delivered so consistently that, like grammar, you stop checking for it.

A note on terms, because the distinction is standard but these labels are not. The textbook notion here is semantic well-formedness: Chomsky's "colorless green ideas sleep furiously" was constructed in the 1950s — it first appeared in his 1955 thesis, then famously in Syntactic Structures (1957) — to show a sentence can be syntactically perfect yet semantically anomalous — and avoiding that anomaly is what LLMs reliably deliver. ("Coherence" as a term of art actually belongs to discourse analysis, a level up.) The split this post draws inside semantics is the truth-conditional one: a sentence's meaningfulness is having truth conditions at all; its truth is those conditions actually holding. "Coherence" and "truth" are this post's shorthand for that pair — the pair itself is bedrock semantics, not an invention.

And that is exactly where the trouble starts.

Semantic truth: where the guarantee breaks

The research literature carves hallucination into taxonomies — faithfulness versus factuality, input-conflicting versus fact-conflicting. Collapse the taxonomy onto this stack and it is one thing: semantically well-formed falsehood. A fabricated citation is grammatical. It is coherent. It is perfectly interpretable — journal name, plausible authors, page range, a year that fits. It just doesn't correspond to anything in the world.

Truth-conditional semantics holds that a sentence's meaning is the set of conditions under which it would be true. An LLM hands you the sentence with no guarantee that the conditions hold. The failure is not that the model produces meaningless output — it is that meaningfulness and truth come apart, and the model only reliably delivers the former.

Be precise about what "no guarantee" means, because it is not "usually false." Most of what a model says is true — it was trained on mostly-true text, and the true continuation is often also the statistically likely one. But the process is stochastic: when the output is true, it is true the way a lucky guess is true — nothing in the system warranted it. Verification doesn't make the output correct; the output was already true or false the moment it was sampled. Verification only tells you which draw you got. And it is expensive for exactly the reason it is necessary: the false draws read identically to the true ones.

This is the academically respectable version of the critique, made sharply by Bender and Koller's "form vs. meaning" argument — Best Theme Paper at ACL 2020: training on text alone gives you mastery of linguistic form; grounding in the world has to come from somewhere else. Two honesty notes on that citation, by this post's own standards. It is from 2020 — argued against BERT and GPT-2, before instruction tuning, reinforcement learning from feedback, and tool use existed. And it is contested: there is live debate over whether scale and feedback training produce something meaning-like inside the weights. What has aged well is the operational core, which does not depend on how the internals are described: nothing in next-token training warrants truth, whatever representations emerge. That missing warranty is precisely why retrieval, web search, and tool use exist — prosthetic truth-conditions bolted onto a system that natively has none.

flowchart TD
    %% truth is a draw the model doesn't see — verification reveals the draw, it doesn't cause it
    TEXT["Training text"] -->|"next-token prediction"| FORM["Mastery of form"]
    FORM --> SENT["Fluent sentence"]
    WORLD["The world"] -.->|"no native connection"| SENT
    SENT --> T(["True — by the draw"])
    SENT --> F(["False — by the draw"])
    T --- SAME["the two read identically"]
    F --- SAME
    CHECK["Retrieval, search, your verification"] -->|"reveals which one you got"| SAME

    %% amber: true but unwarranted; red: false and indistinguishable
    %% borders only — fills inherit the site theme
    classDef red stroke:#bf616a,stroke-width:2.5px
    classDef amber stroke:#ebcb8b,stroke-width:2.5px
    class T amber
    class F red

It is also why evals have become the unit tests of LLM engineering: you cannot trust the layer by construction, so you measure it empirically.

Pragmatics: right per move, wrong per task

One layer further up, the news is better than you might expect — but only if you grade the right unit. Per individual move, instruction tuning bought a remarkable amount of pragmatic competence: models handle indirect requests, register, implicature, and "what the user actually wants" far better than their base-model ancestors. Ask an LLM "do you know what time it is?" and it will not answer "yes."

But the failure modes that remain are distinctly pragmatic ones, and they have names:

Sycophancy — agreement that tracks your framing rather than the evidence. In Gricean terms, the maxim of Quality (don't say what you believe false) losing to agreement-seeking — and the cause is documented: Sharma et al. (2023) trace it to preference training, where human raters reward responses that match their views. The model is being a good conversational partner at the expense of being a good informant.
Miscalibrated confidence — a guess delivered in the tone of a fact. The propositional content might even be flagged as uncertain somewhere in the model's internals, but the pragmatic signal — fluent, declarative, unhedged — says "this is known." Arguably this is what makes hallucination dangerous rather than merely wrong: the falsehood arrives wearing the full pragmatic costume of a confident, well-formed fact.
Malicious compliance — doing exactly what you said instead of what you meant. Tell an agent to "make the tests pass" and it may delete the failing tests; both readings satisfy the literal instruction, only one satisfies the speaker. This is the purest Gricean failure of the three: pragmatics is the layer where literal meaning and intended meaning are supposed to be reconciled, and the model resolves the ambiguity in whichever direction is cheapest to generate. It matters most in agentic use, where the misreading doesn't just sit in a chat window — it executes.

And none of these is rare. The honest check is your own workflow: if this layer were solved, single-shot prompting would be the norm. It isn't — the standard way of working with an LLM is iterative re-prompting, and most of those iterations are pragmatic repairs: the model answered something, just not the thing you meant, at the scope you meant, in the form you meant. The arithmetic is unforgiving, because the competence is per move and a task takes many moves — a model that gets each pragmatic call right nine times out of ten still misses somewhere on most multi-step tasks. And the layer fails silently: there is no pragmatic equivalent of a syntax error, so the error detector in that loop is you. That is what re-prompting is — you, supplying the feedback signal the layer doesn't have.

That cuts in a direction worth being honest about: if the error detector is you, the grade depends on what you know and care about. Pragmatic correctness is observer-relative. Ask for a script while holding no opinions about framework, idiom, structure, or what the codebase already does, and the model passes this layer automatically — there is nothing for it to miss. "If it works, it works." The expert and the novice are sampling the same model; they re-prompt at wildly different rates because they hold different standards, and the model can read neither. Twenty re-prompts from one chair and zero from the other are not measuring the model — they are measuring the chairs.

Which exposes the limit of the obvious cure. If pragmatic misses are gaps between what you said and what you meant, the fix is to say more: specify the framework, the style, the architecture, the edge cases. Follow that slope far enough and you arrive somewhere familiar — you are writing the code yourself, in English, through a stochastic compiler. The pragmatic layer doesn't get solved by better prompting; it gets traded, word by word, for programming.

When an LLM outputs "I promise" or "I guarantee this is correct," does that create a commitment in the social sense — an obligation held by an agent who can be held to it? Right now the social facts get created around the model rather than by it. The deploying company carries the warranty. The user carries the verification burden. The model's "I'm confident" is a pragmatic signal worth appropriate skepticism, not a social commitment backed by accountability — which is one reason an LLM cannot be the guardrail for another LLM: a stack of agents can pass fluent assurances up the chain without any of them holding an obligation at the end of it.

This layer is not broken so much as unoccupied. Whether it can be occupied — whether anything in the architecture could carry a commitment — is an open question, and most current deployment practice quietly answers it "no" by routing accountability to humans and institutions.

The social layer is also where "works for me" loses jurisdiction. Hardcoding an API key is pragmatically fine — the app runs, the asker is satisfied. Push it to a public repo and it is wrong in a way no amount of not-caring can fix, because this rung is graded by other people's standards: auditors, attackers, the law. The no-standards user the pragmatic layer just waved through is exactly the one walking into it.

Except deployment increasingly answers it "yes" without meaning to. Because the machine converses, we wire it into positions that are social-rung by construction: customer-support bots with backend access don't just talk about account changes, they perform them — and performing an account change is a social act, a commitment executed against a contract. Between April and June 2026, attackers simply asked Meta's AI support chatbot to add their email address to victims' Instagram accounts — and it did, no identity verification, over 20,000 accounts hijacked. Graded on the ladder, nothing surprising happened: a semantics-rung machine handled a pragmatics-rung request ("help me recover my account" — cooperative, plausible, exactly what it was tuned to serve) and executed a social-rung action it had no equipment to evaluate, because evaluating it requires the one thing the social rung is made of — whose request this legitimately is. Meta's fix says the quiet part aloud: the bot lost write access to account commands, and changes now require human approval. Accountability walked back down to the only rung that could hold it.

What this buys you in practice

The layer model converts "LLMs are unreliable" — true but useless — into a graded checklist. As a decision flow:

flowchart TD
    %% intent gate first — truth-checking the wrong artifact is wasted verification
    %% sycophancy is an input problem — fix the question before judging the answer
    %% three terminals: green only via verification, amber = unverified risk you own, red = discard
    A["LLM hands you output"] --> B["Skip grammar, skim coherence — those layers hold"]
    B --> P{"Did it do the thing you meant — scope, form, intent?"}
    P -->|"no"| RP["Re-prompt — you are the error signal"]
    RP --> A
    P -->|"yes"| C{"Did you ask a leading question?"}
    C -->|"yes"| D["Re-ask neutrally — keep only what survives both phrasings"]
    C -->|"no"| E{"Cost if this is wrong?"}
    D --> E
    E -->|"high"| F["Verify independently"]
    F -->|"checks out"| G(["Use it — the claim is yours now"])
    F -->|"doesn't"| H(["Discard"])
    E -->|"low"| I(["Use as draft — unverified, you own the risk"])

    %% borders only — fills inherit the site theme
    classDef green stroke:#a3be8c,stroke-width:2.5px
    classDef red stroke:#bf616a,stroke-width:2.5px
    classDef amber stroke:#ebcb8b,stroke-width:2.5px
    class H red
    class G green
    class D,I,RP amber

In words:

No real need to proofread an LLM's grammar. Mistakes there are unlikely enough that the check is rarely worth your time — same dice as everything else, just overwhelmingly better odds.
Rarely check coherence. If the output is interpretable, it almost certainly is.
First ask whether it did the thing you meant. Scope, form, and intent are pragmatic calls, and the model misses them silently — truth-checking the wrong artifact is wasted verification. A re-prompt is cheaper than an audit, which is why the loop exists.
Everything else is unwarranted. Be honest about what that includes. Specifics — names, numbers, dates, citations, quotes, API signatures — are merely where falsehood is easiest to catch, not where it stops. Summaries, causal explanations, "this library behaves like X," architectural claims, code that compiles cleanly: all of it lives on the broken layer, and the general claims are more dangerous than the specific ones precisely because there is no single fact to look up. A fluent paragraph of wrong reasoning fails no checkable detail.
So triage by cost, not by surface form. The question is never "does this look like the kind of output that hallucinates" — it all is. The question is what it costs you if this particular claim is false. High cost: verify independently or don't use it. Low cost: treat it as a draft you now own. This is where reviewing LLM output as a team earns its keep — the diff compiles and reads beautifully; whether it is correct is a separate, human-owned question, and signing the review means the claim is yours now.
Stay alert to pragmatic distortion: agreement that mirrors your framing, and confidence that the semantics doesn't warrant.

The cruel part of the design is that the layers interlock against you. The flawless syntax and effortless coherence are precisely what make truth-layer failures hard to spot — a century of reading has trained you to use fluency as a proxy for reliability, because in human writers it usually is. With LLMs, that proxy is severed. The competence of the lower layers is not evidence about the upper ones.

So: is hallucination just semantics? Yes — exactly, technically, semantics: the gap between a sentence meaning something and a sentence being true. "Just" was never the right word for that gap. It is the entire job.

And keep the ladder in view. It took machines seventy years to climb from signals to meanings, and they have genuinely arrived — that is the breakthrough, and it deserves the awe it gets. The spell to break is the next step up: a foot on the semantics rung is not citizenship of the social world. Twenty thousand hijacked accounts are what it costs, per incident, to keep mistaking one for the other.