Context engineering: the job prompt engineering became

A library card catalogue — information architecture before software. Photo: Dr. Marcus Gossler, CC BY-SA 3.0.

A rename that was secretly a promotion

In 2023 the hot job title was "prompt engineer," and the popular image of it was a person who knew the magic words — say "you are an expert," say "think step by step," offer the model a tip, threaten it slightly — and got better answers out of ChatGPT than the rest of us. There was a real skill in there, but the framing was about phrasing, and phrasing is a thin thing to build a discipline on. Most "prompt engineering" advice aged into folklore within a model generation, because the next model didn't need the incantation.

In mid-2025 the field quietly renamed the serious version of this work, and the rename was a promotion. The term is context engineering, and it got its canonical framing over a few days in the second half of June 2025. Shopify CEO Tobi Lütke defined it as "the art of providing all the context for the task to be plausibly solvable by the LLM" and called it a core skill. Days later, Andrej Karpathy endorsed it on X: "+1 for 'context engineering' over 'prompt engineering'," describing the discipline as "the delicate art and science of filling the context window with just the right information for the next step." On June 27, Simon Willison picked it up and made the sharpest point of the three: "prompt engineering" had collapsed in popular usage into "typing into a chatbot," and the new term was needed precisely to reclaim the actual, non-trivial work from that dismissive reading.

The thesis of this post is that the rename describes a real migration of leverage: from the wording of a single message to the architecture of everything the model can see. That second thing is genuine engineering in a way the first never was.

The window is the resource, and it is scarce, expensive, and leaky

To see why context engineering is real engineering and prompt-whispering wasn't, you have to take the context window seriously as a resource, the way you take memory or bandwidth seriously. It has three properties that make managing it an engineering problem rather than a writing problem.

It is scarce. The window has a hard token ceiling, and everything competes for it: the system prompt, the conversation history, retrieved documents, tool definitions, tool outputs, the user's actual question. You cannot fit everything, so you are constantly making allocation decisions, and allocation under a fixed budget is the most engineering thing there is.
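As a rough illustration of what "allocation under a fixed budget" means in practice, here is a minimal sketch of a greedy allocator. Everything in it is an assumption made for the example: the `Segment` type, the priority convention, and especially `estimate_tokens`, which is a crude four-characters-per-token stand-in for whatever tokenizer your model actually uses.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


@dataclass
class Segment:
    name: str
    text: str
    priority: int  # hypothetical convention: lower number = more important


def allocate(segments: list[Segment], budget: int) -> tuple[list[Segment], list[Segment]]:
    """Greedily admit segments in priority order until the token budget is spent."""
    kept, dropped = [], []
    remaining = budget
    for seg in sorted(segments, key=lambda s: s.priority):
        cost = estimate_tokens(seg.text)
        if cost <= remaining:
            kept.append(seg)
            remaining -= cost
        else:
            dropped.append(seg)
    return kept, dropped


# Placeholder contents sized to make the arithmetic obvious.
segments = [
    Segment("system_prompt", "x" * 400, priority=0),   # ~100 tokens
    Segment("user_question", "x" * 200, priority=0),   # ~50 tokens
    Segment("retrieved_doc", "x" * 2000, priority=1),  # ~500 tokens
    Segment("old_history", "x" * 4000, priority=2),    # ~1000 tokens
]
kept, dropped = allocate(segments, budget=700)
print([s.name for s in kept])     # ['system_prompt', 'user_question', 'retrieved_doc']
print([s.name for s in dropped])  # ['old_history']
```

A greedy pass is the simplest possible policy. A real system would more likely shrink a low-priority segment than drop it outright, but the shape of the decision is the same: something has to lose.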

It is expensive. You pay per token, on input and output, on every single call. A bloated context isn't just slower; it is a recurring line item. An agent loop that re-sends a growing history on every iteration is paying compound interest on whatever junk it accumulated early.
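To put rough numbers on the re-sent history, here is a small sketch. The figures are invented for illustration: a 2,000-token starting context, a 20-step loop, and either 5,000 or 500 tokens added to the history per step. The function name is hypothetical, and only input tokens are counted.

```python
def cumulative_input_tokens(base: int, added_per_step: int, steps: int) -> int:
    """Total input tokens billed when every call re-sends the whole history."""
    total = 0
    context = base
    for _ in range(steps):
        total += context            # this call pays for everything accumulated so far
        context += added_per_step   # the step's output and tool result join the history
    return total


# Assumed figures, not measurements.
raw = cumulative_input_tokens(base=2_000, added_per_step=5_000, steps=20)
curated = cumulative_input_tokens(base=2_000, added_per_step=500, steps=20)
print(raw, curated)  # 990000 135000
```

Under these assumptions the unpruned loop sends about seven times as many input tokens as the curated one. The total grows with the square of the step count, so the gap widens the longer the loop runs.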

And — the property the marketing skips — it is leaky: model performance degrades as the window fills, well before the limit. Chroma's 2025 context rot study tested 18 frontier models and found every one of them got worse as input length grew, with meaningful degradation showing up at 50K tokens on a model nominally rated for 200K. The older lost-in-the-middle finding showed attention to information in the middle of a long prompt falling off sharply — retrieval accuracy dropping more than 30% on multi-document tasks. So you are not just budgeting a scarce, expensive resource. You are budgeting one that gets less reliable per token the more you put in. More context is not more better. Past a point, more context is actively worse, and knowing where that point is for your task is the job.

Once you internalise that, "just stuff everything relevant into the prompt" stops being a strategy and starts being a bug. The actual work is deciding what the model should and should not see at each step — and that is an information-architecture problem, not a phrasing problem.

What the job actually consists of

Karpathy's own list of what goes into the window is a decent job description: task instructions, few-shot examples, retrieved facts, multimodal data, tools, state, history, all carefully compacted into a limited window. Turn each of those into a verb and you have the discipline:

Retrieval — choosing what to pull in from outside (RAG, search, database lookups) and, harder, what to leave out. The skill is precision and recall on the inputs, because a retriever that returns ten documents when two are relevant has just poisoned the window with eight distractors.
Tool-result curation — a tool call can return a 5,000-token JSON blob. Do you feed all of it back into context, or extract the three fields that matter? This single decision is the difference between a loop that stays coherent and one that drowns in its own observations.
Memory and state — what persists across turns, what gets summarised, what gets dropped. An agent that remembers everything is an agent that has filled its window with last hour's irrelevance.
Compaction — the active discipline of summarising or pruning history so the window stays inside the high-performance zone. This is the closest thing context engineering has to garbage collection, and it is just as load-bearing.
Ordering — given lost-in-the-middle, where in the window something sits affects whether the model attends to it. The most important context goes at the edges, not buried in the middle.
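Two of the items above, tool-result curation and ordering, are small enough to sketch directly. This is one possible shape, not a standard recipe: `curate_tool_result` and `order_for_edges` are hypothetical helpers, the JSON payload is made up, and the alternating placement is just one simple way to push the least important items toward the middle.

```python
import json


def curate_tool_result(raw_json: str, wanted: list[str]) -> str:
    """Keep only the named top-level fields of a JSON tool result."""
    data = json.loads(raw_json)
    slim = {key: data[key] for key in wanted if key in data}
    return json.dumps(slim, separators=(",", ":"))


def order_for_edges(items_by_importance: list[str]) -> list[str]:
    """Place the most important items at the start and end, the least in the middle.

    Input is sorted most-important-first. Items are dealt alternately to a
    front list and a back list, and the back list is reversed before joining,
    so importance falls off toward the middle.
    """
    front, back = [], []
    for i, item in enumerate(items_by_importance):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


# Made-up tool output: three useful fields and a large blob nobody asked for.
raw = json.dumps({
    "status": "ok",
    "order_id": "A-17",
    "total": 42.5,
    "debug_trace": ["..."] * 500,
})
slim = curate_tool_result(raw, ["status", "order_id", "total"])
print(slim)                                        # {"status":"ok","order_id":"A-17","total":42.5}
print(order_for_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Whether edge placement helps on a given model and task is something to check on your own evaluations rather than assume.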

flowchart TD
    SYS["System prompt"]
    RET["Retrieved facts<br/>(selected, not dumped)"]
    TOOL["Tool results<br/>(extracted, not raw)"]
    MEM["Memory / state<br/>(compacted)"]
    HIST["History<br/>(pruned)"]
    Q["The actual task"]
    W{{"Context window<br/>(scarce · expensive · leaky)"}}
    SYS --> W
    RET --> W
    TOOL --> W
    MEM --> W
    HIST --> W
    Q --> W
    W --> OUT["Next step"]
    OUT -. "feeds back, must be compacted" .-> MEM

None of these is about wording. Every one of them is about what information exists in the budget at the moment of generation, which is architecture. You could write each piece in mediocre prose, and a well-architected context would still outperform a beautifully phrased prompt stuffed with the wrong twenty documents.

Why this is real engineering and prompt-whispering wasn't

The honest distinction is that prompt engineering optimised over a space with almost no structure — the phrasing of one message — and context engineering optimises over a space with hard constraints, measurable costs, and a feedback signal. You can measure context engineering: token count, cost per call, retrieval precision, performance-versus-window-fill curves. You can regression-test it. You can profile it the way you profile memory usage, find the call that balloons the context, and fix it. That measurability is the line between engineering and folklore. The 2023 prompt-engineering tip "add 'think step by step'" had no stable measurement behind it and stopped working when reasoning models internalised the behaviour. "Compact tool outputs before re-injecting them" is a claim you can verify on your own traces and that stays true across models because it is about the resource, not the model's quirks.
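As a sketch of what profiling a context looks like, assume you log the input token count of every call in an agent run. The trace below is invented, and both helper names are hypothetical. The first function finds the call that balloons the context, and the second is the kind of check you could run as a regression test.

```python
def largest_jump(token_counts: list[int]) -> tuple[int, int]:
    """Return (step_index, growth) for the call whose context grew the most.

    Expects at least two entries, since growth is measured between calls.
    """
    growth = [b - a for a, b in zip(token_counts, token_counts[1:])]
    step = max(range(len(growth)), key=growth.__getitem__)
    return step + 1, growth[step]


def check_budget(token_counts: list[int], ceiling: int) -> list[int]:
    """Regression check: list the steps whose context exceeded the ceiling."""
    return [i for i, n in enumerate(token_counts) if n > ceiling]


# Hypothetical trace: input tokens sent on each call of one agent run.
trace = [1_800, 2_400, 3_100, 9_600, 10_200, 10_900]
print(largest_jump(trace))                 # (3, 6500)
print(check_budget(trace, ceiling=8_000))  # [3, 4, 5]
```

Here step 3 is where something large was injected, and every later call inherits it. That is the call to go and look at.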

This is also where context engineering connects to the system-prompt-as-product argument from the post on AI IDEs being skins over the same models. Those leaked system prompts — tens of thousands of tokens of scaffolding — are context engineering artefacts, not prompt engineering ones. Their length is a context-budget decision, their ordering is an attention decision, and the fact that a hidden "keep outputs concise" rule silently overrides a user's stated preference is a context-priority decision. The product those IDEs sell is, quite literally, a context-engineering decision baked into a config file. Drew Breunig's line that the model sets the ceiling and the system prompt determines whether the peak is reached is a statement about context engineering: the model is fixed, and the only lever left is the architecture of what surrounds it in the window.

The skeptical bit: it is real, but it is not magic

I want to be careful not to do to context engineering what the industry did to prompt engineering: inflate a real skill into a cure-all. Context engineering is genuine and it is leverage, but it has the same ceiling everything else in this stack has: it cannot make the model do something the model cannot do. Perfect context will not, on its own, make a model reliably count the R's in "strawberry", because that failure comes from how the model tokenizes text, not from what is in the window. Perfect context will not give a model a missing world model. Context engineering moves the model closer to its own ceiling, efficiently and measurably. It does not raise the ceiling.

It also doesn't escape the failure modes from the rest of this series — it mostly operationalises the defence against them. The reason context engineering matters so much in agentic systems is that an agent loop is a context-management machine whether you designed it as one or not: every iteration adds tokens, every tool result is a budget decision, and an agent with no compaction discipline is just a context-rot generator with a goal. The thing that keeps a long loop coherent is exactly the curation, compaction, and ordering described above. And on the team side, the deterministic map argued for in reviewing LLM diffs as a team is, among other things, a context-engineering artefact for the humans — a compacted, accurate representation of a change so the reviewer's own limited attention budget is spent on signal. The discipline generalises past the model's window to the team's.
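A minimal sketch of compaction inside an agent loop, under loud assumptions: `estimate_tokens` is a crude four-characters-per-token proxy, `summarise` is a placeholder where a real system would call a model, and the 500-token budget and the keep-two-recent-turns policy are arbitrary choices for the example.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def summarise(turns: list[str]) -> str:
    """Placeholder summariser. A real system would call a model here."""
    return f"[summary of {len(turns)} earlier entries]"


def compact(history: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Fold older entries into one summary once the history exceeds the budget."""
    total = sum(estimate_tokens(t) for t in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    # `older` may itself contain a previous summary, which gets re-summarised.
    return [summarise(older)] + recent


history: list[str] = []
for step in range(6):
    history.append(f"step {step}: " + "observation " * 50)  # ~150 tokens each
    history = compact(history, budget=500)

print(len(history))  # 3
print(history[0])    # [summary of 3 earlier entries]
```

The loop never carries more than a summary plus a few recent turns, so its per-call cost stays roughly flat instead of growing with every iteration. The price is whatever the summariser loses, which is why what to keep verbatim is a design decision and not a default.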

So the right framing is modest and load-bearing at once. Context engineering is the part of this work that is real, durable, and worth getting good at, precisely because it is about a resource with hard constraints rather than about words that flatter a model. It is also not the thing that breaks the model's fundamental limits, and anyone selling it as such is repeating the 2023 mistake with a more respectable vocabulary.

A short close

Prompt engineering was a job title built on the wording of a single message, and it aged the way folklore ages — into advice that stopped working when the model changed. Context engineering is the job that work matured into, and the maturation was a promotion: from phrasing to architecture, from incantation to allocation. The reason it deserves the word "engineering" is that the context window is a real resource — scarce, billed per token, and degrading as it fills — and managing a scarce, expensive, leaky resource under a fixed budget is the oldest engineering discipline there is, just with a new substrate.

The leverage in 2026 is not in knowing the magic words. The model has heard every magic word. The leverage is in deciding, precisely and measurably, what the model gets to see at the moment it has to act — and that is the kind of work that stays valuable across model generations, because it is about the shape of the information, not the mood of the prompt.