Twenty LLMs do not make a team

Pieter Bruegel the Elder, The Tower of Babel (c. 1563), Kunsthistorisches Museum, Vienna

A photo I took at the Kunsthistorisches Museum in Vienna, 8 July 2025 — Pieter Bruegel the Elder's The Tower of Babel (c. 1563). I love what it captures: progress on even the greatest and most well-made project can be wiped out by nothing more than a failure of communication and shared comprehension.

"For the human makers of things, the incompletenesses and inconsistencies of our ideas become clear only during implementation." — Fred Brooks, No Silver Bullet, 1986 (republished as ch. 17 of The Mythical Man-Month, anniversary edition, 1995). Full essay (UNC TR86-020).

Implementation is where understanding is built. Every loop reveals the case the design missed, every bug exposes an assumption that did not hold, every refactor surfaces the abstraction the original code was missing. When an LLM does the implementation, those revelations happen inside the model rather than inside the engineer, and the team inherits the merged diff without the learning attached. Much of the 2026 productivity discussion lives downstream of that quiet swap.

The mythical agent-month

In July 2025, METR published the results of a randomised trial that the AI-coding discourse has mostly continued to talk past. Sixteen experienced open-source developers were given 246 real tasks in mature codebases they already worked in, with the tasks randomly split into an arm where AI tools were permitted and an arm where they were forbidden. The frontier-tools assumption was honoured: most participants used Cursor Pro with Claude 3.5 or 3.7 Sonnet, which were state-of-the-art at the time.

When the data came in, the developers reported an average self-perceived speedup of about 20% from using AI, while the measured wall-clock data showed them running roughly 19% slower (95% confidence interval −26% to +9%, which crosses zero but still rules out the kind of large speedup the discourse was promising). The methodology in METR's writeup and the arXiv preprint is unusually careful for this corner of the industry, and the gap between perception and measurement is the finding that has aged best. A follow-up cohort METR ran in February 2026 with a newer model generation softened the measured effect to roughly −4%, again with a confidence interval that crosses zero, but the perception gap remained: developers continued to overestimate how much faster they were going.

The simple Brooksian framing of "more producers, more delay" does not quite carry across to the LLM era, since LLMs genuinely do sometimes ship in an afternoon what would once have taken a week. A handful of writers, including Wes McKinney and a recent O'Reilly Radar piece, have already been calling the agent-coordination version of this the "mythical agent-month," largely focused on how agents step on each other when they share state. That framing is worth knowing about, even though the angle this post wants to run is one step adjacent: what happens to the humans supposed to be running these agents, and what cost is being silently added to the team's balance sheet.

That cost has accumulated enough of a literature now that it has a name. Most of the people writing about it are converging on the same one: cognitive debt.

Output scales. Comprehension does not.

Consider a working codebase staffed by twenty senior engineers. Twenty individual mental models exist in the room. Twenty people have read the auth middleware end to end at some point, twenty can sketch the database schema from memory, and when one of them rotates off, the team still has nineteen functional copies of the system's design. The bus factor was already twenty before any code was written.

Now consider the same codebase staffed by twenty engineers each working primarily through a frontier coding model. The diff volume is significantly higher, possibly an order of magnitude. The mental-model count is somewhere between one and zero, because nobody had to sit with the design long enough to internalise it — the model did that work, and the model does not attend the post-incident review.
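The contrast between the two staffing scenarios can be made concrete with a toy calculation. The sketch below is illustrative only: it assumes a hypothetical mapping from each module to the set of engineers who can explain it unaided, and it defines the bus factor as the smallest number of departures that leaves some module with nobody who understands it. The data structure, the module names, and the function name are all invented for this example.

```python
def bus_factor(knowers_by_module: dict[str, set[str]]) -> int:
    """Smallest number of departures that leaves some module unexplained.

    Under this (assumed) definition the answer is the size of the smallest
    group of people who understand any single module.
    """
    if not knowers_by_module:
        return 0
    return min(len(people) for people in knowers_by_module.values())


# Scenario one: twenty seniors who have each read every module.
seniors = {f"engineer_{i}" for i in range(20)}
hand_built = {"auth_middleware": seniors, "db_schema": seniors}

# Scenario two: the diffs landed, but almost nobody sat with the design.
agent_built = {"auth_middleware": {"engineer_0"}, "db_schema": set()}

print(bus_factor(hand_built))   # 20
print(bus_factor(agent_built))  # 0
```

The number is only as honest as the mapping behind it, and "can explain it unaided" is exactly the property that git blame does not record.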

This is more than an intuition. Three pieces of 2025 data point at it from different angles.

Stack Overflow's annual survey, summarised in Pixelmojo's vibe-coding tech-debt piece, tracks two trend lines that should not normally cross in a healthy adoption story. Developer trust in AI coding tools fell from 43% to 29% across eighteen months, while developer usage climbed past 84%. The failure mode most frequently cited was not outright wrongness, which developers find easy to filter, but rather "almost, but not quite, right" suggestions, which 45% of respondents flagged as the most painful category. These near-misses are particularly expensive because catching them requires understanding the system better than the model did.

A GitHub Copilot PR analysis referenced in the same piece reported that AI-suggested blocks were accepted at high rates yet showed roughly two to three times the bug-fix rate in subsequent commits compared to human-authored blocks, which is the comprehension gap leaving a paper trail in git log.

The 2025 DORA report phrases the same pattern in its careful, institutional language: AI behaves as an amplifier of existing team capability rather than a corrector of existing team dysfunction. Thoughtworks's read is more direct — strong teams accelerate, while weaker teams primarily produce broken code at higher velocity. Nowhere in the data is there a scenario in which a team that did not previously understand its codebase came to understand it through adopting AI tooling.

flowchart TD
    A["Before LLM adoption"] --> B["Output per dev-week"]
    A --> C["Comprehension per dev-week"]
    B --> D["After LLM adoption:<br/>output grows fast"]
    C --> E["After LLM adoption:<br/>comprehension flat or down"]
    D --> F["Gap widens every quarter"]
    E --> F
    F --> G["Cognitive debt accumulates<br/>on the comprehension axis"]

The diagram captures the argument compactly. Output and comprehension start coupled and diverge once the tooling absorbs most of the writing work, and the gap between the two accumulates into a liability that few engineering balance sheets currently track.

Cognitive debt is the bill for skipping the understanding step

The MIT Media Lab study led by Nataliya Kosmyna (arXiv:2506.08872) ran EEG measurements on participants writing essays under three conditions: a brain-only cohort with no tools, a search-engine cohort, and an LLM cohort. Across four sessions the researchers measured neural connectivity and behavioural outcomes, and in the final session the LLM cohort was asked to write without the tool while the brain-only cohort was given it. The project overview and the associated publication describe the design in some detail.

The headline finding was that the LLM cohort exhibited the weakest connectivity across the relevant networks and underperformed the other groups at neural, linguistic and behavioural levels, with one further effect that arguably deserved more attention than it received in coverage: participants who had been using the LLM struggled to re-engage the networks needed for independent writing once the tool was removed. Outsourced work proved difficult to take back. The paper is a preprint, and its sample size and ecological validity have been reasonably challenged, so it should not be cited as proof that LLMs cause cognitive damage. What it does support is a direction of effect that aligns with what experienced engineers have been observing informally since 2024: skill that is not being exercised tends not to remain available on demand.

The folk version of this finding is considerably older than the EEG data. "Good judgement comes from experience, and experience comes from bad judgement" has been circulating in print since at least 1932 (attribution history), and what it names is the same chain the MIT study measured collapsing under tool use: struggle produces the substrate, the substrate produces judgement, and removing the struggle quietly removes the judgement along with it. In software terms, every wrong first guess that gets pre-empted by an accepted suggestion, every dead-end refactor that the model silently routes around, every confusing error message someone else's tooling resolves for you, is a small instance of the chain getting shortened. The end state is a workforce able to ship code at unprecedented velocity and increasingly unable to reason about what it has shipped.

The same pattern translates fairly cleanly into software work. Each accepted suggestion that was never first attempted unaided is a small transaction against future capability. A single such transaction is harmless, but the accumulation over a year of dense LLM-assisted work produces a codebase whose nominal authors cannot reliably reconstruct the reasoning behind it, even where git blame records their name on every line.

It helps to keep the vocabulary clean here, because the current discourse tends to collapse three distinct phenomena under the umbrella term "tech debt":

Technical debt describes a shortcut taken in the code itself, payable later in refactoring effort, with the cost residing in the repository.
Cognitive debt describes a shortcut taken in the developer's own thinking, payable later as eroded ability to debug, design, or onboard, with the cost residing in the human.
Comprehension debt — a term originally introduced by Jason Gorman, formalised in the Comprehension Debt in GenAI-Assisted Software Engineering Projects paper, and sharpened by Simon Willison in February 2026 — describes the team-level analogue, in which no individual on the team can confidently explain the system end to end, with the cost residing in the organisation.

These are three separate ledgers with three separate forms of insolvency, and LLM-driven workflows are unusually effective at running balances on all three simultaneously.

A concrete instance of comprehension debt is common enough in mature codebases that it borders on cliché. Somewhere in almost every multi-year repository there is a utils/ or helpers/ directory that grew organically over time — string_utils.py, date_helpers.ts, misc.go — accumulating fifty to two hundred small functions, most of them written by an engineer who has since rotated off the team or left the company. The directory has no real index, the names are not always discoverable, the docstrings are uneven, and nobody currently on the team carries a complete mental map of what already lives in there.

The mechanical consequence is duplication. A new ticket lands that needs to format a duration as 2h 14m. The engineer working it writes a fresh format_duration in the file they are already editing, because searching the kitchen-sink directory feels like more work than just writing the four lines. The codebase now contains two implementations of the same idea, possibly three. Six months later somebody patches a bug in one of the copies, and the other copies stay broken until a future incident surfaces them.
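A minimal sketch of how the copies drift apart, with hypothetical file paths and function names chosen for illustration. In a real repository both functions would simply be called format_duration and live in different files; the second is renamed here only so that the two can sit in one runnable block.

```python
# reports/summary.py (hypothetical path): the copy that received the bug fix.
def format_duration(total_seconds: int) -> str:
    """Format a non-negative duration in seconds as, for example, '2h 14m'."""
    hours, remainder = divmod(total_seconds, 3600)
    return f"{hours}h {remainder // 60}m"


# utils/date_helpers.py (hypothetical path): the forgotten copy, still broken.
def format_duration_unpatched(total_seconds: int) -> str:
    hours = total_seconds // 3600
    minutes = total_seconds // 60  # bug: whole hours are never subtracted
    return f"{hours}h {minutes}m"


print(format_duration(8040))            # 2h 14m
print(format_duration_unpatched(8040))  # 2h 134m
```

Nothing in the test suite of the patched file will ever exercise the unpatched copy, which is why it tends to surface through an incident rather than through review.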

LLM-assisted workflows do not reliably correct for this and often accelerate it. The model rarely runs a project-wide search for "is there already a helper that does this?" before producing a new function, and even when it does, kitchen-sink names like misc.py or common.ts are noisy enough that the model confidently writes a fresh implementation rather than re-using the one already living a few lines deeper in the same file. The architectural pattern-match — recognising that two stretches of code are restating the same concept and lifting that concept into a single shared utility — is precisely the kind of work that requires holding the entire codebase in a single head. The model holds only the context window it was handed at the start of the turn, and the engineer who once carried the full map of the helpers directory left in Q2. The pattern is empirical, not anecdotal: a January 2026 study, More Code, Less Reuse, found that agent-generated pull requests systematically disregard code-reuse opportunities and introduce more redundancy per change than human-authored code, which is exactly the duplication geometry described above.
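The project-wide search described above is cheap to approximate. The sketch below uses Python's standard ast module to list top-level function names that are defined in more than one file. It is deliberately crude: it catches only same-name duplicates, not the same concept restated under a different name, and the file paths in the example are hypothetical.

```python
import ast
from collections import defaultdict
from pathlib import Path


def load_sources(root: str) -> dict[str, str]:
    """Read every .py file under root into a {path: source} mapping."""
    return {str(p): p.read_text(encoding="utf-8") for p in Path(root).rglob("*.py")}


def find_duplicate_functions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each top-level function name to the files defining it, if more than one."""
    defined_in = defaultdict(list)
    for path, source in sources.items():
        tree = ast.parse(source, filename=path)
        for node in tree.body:  # top-level statements only
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined_in[node.name].append(path)
    return {name: sorted(paths) for name, paths in defined_in.items() if len(paths) > 1}


# In-memory stand-in for a repository; load_sources("path/to/repo") would
# build the same mapping from disk.
example_sources = {
    "utils/date_helpers.py": "def format_duration(s):\n    return s\n",
    "reports/summary.py": "def format_duration(s):\n    return s\n\ndef render():\n    pass\n",
}
print(find_duplicate_functions(example_sources))
# {'format_duration': ['reports/summary.py', 'utils/date_helpers.py']}
```

A check like this could run before an agent is allowed to add a new helper, though it does nothing for the harder case of two differently named functions doing the same job.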

The Pixelmojo piece projects 2026–2027 as the window in which the accumulated debt from the 2023–2024 vibe-coding wave reaches crisis levels, and the projection looks plausible. The shape is one the industry has seen before: a new tool lets a generation of practitioners skip a previously load-bearing step, the step turns out to have been doing real work, and the bill arrives somewhere around three years later when the original authors have rotated off and the next team inherits artefacts no one wrote on purpose.

Twenty LLMs on a project nobody understands

The thought experiment the title is gesturing at is worth running explicitly.

Imagine running a team. The reasoning goes that since one engineer plus a coding agent is now plausibly two-engineers-fast, twenty engineers each paired with their own agent should be approximately forty-engineers-fast, and a quarter's worth of roadmap should be deliverable in a month. The agents are spun up, the IDEs are licensed, the dashboards report green.

What tends to happen, in something like this order:

Throughput on closed-form tickets climbs. Bug fixes, small features, schema migrations the model has seen many times before all ship faster, and the velocity dashboard reflects a clear win.
Architectural drift increases more sharply than throughput, because the un-instrumented work of keeping new code consistent with existing code shape is not being done. Each agent picks the local optimum it knows, and the codebase grows new dialects.
Review quality degrades. Reviewers are themselves under deadline pressure and end up using the model to assist their reviews, which produces the AI-writes / AI-reviews failure mode discussed at length in the post on guardrails, where the verifier and the generator are effectively the same artefact and their blind spots overlap.
A genuine production incident eventually lands — a behavioural bug under load, of the sort that requires real understanding rather than a clean stack trace. The on-call engineer pulls up the file and finds that nobody on the team can explain why it behaves the way it does. The model can offer a plausible guess, but plausible guesses in this category are often "almost, but not quite, right," and a couple of hours can disappear chasing one.
The single engineer who still carries a working mental model of the system gradually absorbs the operational load, burns out under the volume, and leaves, taking the team's last reliable internal representation of the codebase with them. The dashboards never reflected the fact that the bus factor was effectively one.

flowchart TD
    A["20 generator agents"] --> B["20× diffs"]
    B --> C["1 reviewer human<br/>(also using the model)"]
    C --> D["Merged to main"]
    D --> E["Production incident"]
    E --> F["1 debugger human:<br/>load-bearing mental model"]
    F --> G["Burnout / departure"]
    G --> H["Team has no one left<br/>who understands the system"]

This is the cost the dashboards never measured. A team functions as a distributed cache of shared comprehension as much as it functions as a producer of code, and that cache is invalidated the moment its last informed maintainer logs off. Twenty LLMs do not constitute a team in any meaningful sense, because LLMs are excellent at producing artefacts and largely useless as places where a durable mental model can live. The cache cannot be scaled by adding more writers; it scales only when its readers continue to be readers.

Context switching is a tax. LLMs hand you twenty new tabs.

The context-switching literature predates LLMs and has been remarkably consistent over the last decade. Industry analyses (Hivel, Jellyfish) keep reproducing a similar shape, mostly anchored on Gloria Mark's long-running UC Irvine research on interruption recovery: developers report somewhere around twelve to fifteen significant context switches per working day, with recovery times in the twenty-to-twenty-three-minute range, summing to roughly four and a half hours of deep focus lost per day. Specific numbers vary by methodology, but the direction has held steady for years.
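The arithmetic behind that figure is easy to check. The sketch below multiplies the quoted ranges together, assuming (as a simplification for this example, not something the cited analyses state) that every switch incurs the full recovery time and that recoveries never overlap.

```python
def focus_lost_hours(switches_per_day: int, recovery_minutes: float) -> float:
    """Hours of focus lost per day if every switch costs a full recovery."""
    return switches_per_day * recovery_minutes / 60


low = focus_lost_hours(12, 20)   # fewest switches, shortest recovery
high = focus_lost_hours(15, 23)  # most switches, longest recovery
print(f"{low:.2f} to {high:.2f} hours")  # 4.00 to 5.75 hours
```

The four-and-a-half-hour figure sits inside that band, towards its lower end.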

LLM workflows have not reduced this number, though they have altered the geometry of the switching. A 2026 day for a heavy LLM user typically layers the following on top of the standard interruption pattern: writing one's own code for short stretches, dropping into a 200-line agent-produced diff to read it carefully, parsing the model's confident-but-occasionally-wrong explanation, tabbing out to verify a claim the model made about an external API, re-prompting with sharper context, and dropping back into parts of the codebase the model didn't read in order to verify the assumptions it baked into its suggestions. The cycle then repeats, several times per hour in active development.

The Stack Overflow 2025 numbers (via Pixelmojo once more) reflect the cost of this geometry directly: 45% of developers report that debugging AI-generated code is more time-consuming than writing it from scratch, which is less a complaint about model quality than a measurement of what it costs to drop a stranger's code into the middle of one's own flow and be expected to vouch for it.

The 2025 DORA report is, as ever, careful with its language on burnout. The survey data did not show a statistically significant correlation between AI adoption and self-reported burnout in the current measurement window, though the report also flagged stalled work, rising restart rates, and substantially higher parallel-thread counts as operational consequences accumulating beneath the survey instruments (InfoQ coverage). A reasonable reading of this is that the cost has not yet shown up in the form that HR surveys are designed to detect, while operational data is already pointing in that direction.

Flow state remains the only mode in which design-level thinking reliably happens, which is the kind of thinking that notices when a schema is wrong, when a class wants to become two classes, or when a new feature contradicts a load-bearing invariant. Tools that fragment attention tax every cognitive layer above the typing one, even where they appear to subsidise typing itself.

The unpopular guardrail: measure understanding, not output

The guardrail missing from the 2026 LLM-coding workflow is not architectural in any conventional sense, and a model upgrade, a smarter policy LLM, or yet another VS Code fork is unlikely to supply it. As argued in the previous post, the model cannot police itself in any structurally meaningful way. The guardrail still available — and largely absent from current practice — is a personal and team-level discipline of refusing to ship code that the human author cannot explain in their own words.

A few concrete consequences follow from taking that discipline seriously, in no particular order:

For any non-trivial accepted LLM diff, the author should be able to write a one-paragraph summary, in their own words, of why the change is correct. The point is not to describe what the diff does, which the diff already shows, but to articulate why this is the right change, and if that paragraph cannot be written, the change is not ready to merge.
Lines-of-code and PRs-per-week were already weak productivity proxies before LLMs and have become weaker since. The metric worth tracking is whether a new engineer could be reasonably onboarded to a given part of the code, because a no on that question signals comprehension debt regardless of how the velocity dashboard looks.
The framing that resisting the model is anti-productivity should be retired. The METR data quietly suggests, with confidence intervals, that thoughtful resistance on comprehension-critical paths is itself productive, and that the faster-feeling path has been costing more than it saves.
LLM-generated changes deserve a human-authored explanation in the PR description as a matter of course, which forces the comprehension step into the workflow at a point where teammates can notice if it has been skipped.
The commit message can serve as a kind of receipt for cognitive work done. A change that cannot be summarised by its author in their own words should probably not land that day.
Deterministic infrastructure — types, tests, linters, schemas, lockfiles — remains the most reliable floor under LLM-assisted code, because the compiler tolerates exactly no ambiguity even when the model is happy to invent some.
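The why-paragraph requirement can be enforced mechanically, at least as far as presence goes. The sketch below assumes a team convention, invented here for illustration, that every PR description carries a Markdown section headed "Why this change is correct" with a minimum word count. The heading text, the threshold, and the function name are all hypothetical, and a check like this can confirm that a paragraph exists but not that it is true.

```python
import re

WHY_HEADING = "Why this change is correct"  # hypothetical team convention
MIN_WORDS = 40                              # hypothetical threshold


def has_why_paragraph(description: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the description has a long-enough 'why' section."""
    # Capture everything between the heading and the next heading (or the end).
    pattern = rf"^#+\s*{re.escape(WHY_HEADING)}\s*$(.*?)(?=^#+\s|\Z)"
    match = re.search(
        pattern, description, flags=re.MULTILINE | re.DOTALL | re.IGNORECASE
    )
    if match is None:
        return False
    return len(match.group(1).split()) >= min_words


thin = "## Summary\nRefactor.\n\n## Why this change is correct\nLGTM\n"
print(has_why_paragraph(thin))  # False
```

Wired into CI, a failing check would block the merge until the author has written the explanation, which puts the comprehension step where teammates can see whether it was skipped.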

None of this should be read as an argument against using LLMs. The interesting and frequently conflated positions in the discourse are "do not use LLMs" and "rate-limit your own cognitive offloading," which are not the same position and should not be treated as one. Reading everything a model writes before committing it should be a baseline professional standard rather than a controversial one.

Review huge diffs as a team, and draw a deterministic map

Single-reviewer code review made sense when a typical diff fit comfortably in one engineer's head. Hand-authored changes were usually small enough that the author and one reviewer could share the full mental model between them, and the social cost of disagreement was bounded by the relationship between two people who both already knew what they were arguing about.

LLM-assisted diffs break both halves of that assumption. The diff is often large enough — six files, three hundred lines, a couple of refactors snuck in along the way — that one reviewer cannot reliably hold all of it in working memory long enough to vouch for it. The empirical record on review size is long enough to make this concrete: SmartBear and Cisco's classic study found that defect-detection effectiveness drops below 70% once the diff exceeds 400 lines and below 50% past 1,000, and Microsoft Research found reviewers spend roughly 6 minutes per file on small PRs and only 1.5 minutes per file on large ones. The diff is not getting smaller; reviewer attention per file is. And the author is not the diff's craftsman in the traditional sense, so the social weight of disagreement no longer sits on a colleague's craft; it sits on the model's output. Two practices follow from that, and they reinforce each other.

The first is team review for non-trivial LLM-generated changes. The practice is not new: extreme programming has used pair and ensemble (mob) review for over two decades, with multiple reviewers giving synchronous feedback on the same change. The case for it sharpens considerably, though, when the diff being reviewed was not written by anyone on the team. Three reviewers, each holding one slice of the diff, spread the comprehension across the distributed cache the team already is, rather than concentrating it in whoever happened to draw the review rotation that morning. The point is not consensus or quorum; it is that three mental models examining the change at the same time leave fewer blind spots than one mental model trying to absorb a multi-file refactor under deadline.

The reframe is social as much as procedural. Because the diff is the model's output rather than a colleague's craft, reviewers can be more direct and more opinionated than they would have been on handwritten code, and authors can take the criticism as material to act on rather than as a personal challenge to defend against. The dynamic ends up closer to editing a draft from a third party than to critiquing a teammate's work, and the feedback loop runs faster as a result.

There is also a warning sign in the empirical data here. A 2026 study, These Aren't the Reviews You're Looking For, found that human reviewers currently express more neutral or positive sentiment toward AI-authored pull requests than toward human-authored ones, which is the inverse of the dynamic this section is arguing for. The reframe is available, but it is not the default behaviour the data shows teams settling into.
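One way to decide which changes get the three-reviewer treatment is a size-based routing rule. The sketch below is an illustration built on assumptions: the 400-line limit echoes the SmartBear and Cisco figure quoted earlier, while the five-file limit, the DiffStats type, and the function name are hypothetical choices a team would tune for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffStats:
    files_changed: int
    lines_changed: int


def review_mode(stats: DiffStats, line_limit: int = 400, file_limit: int = 5) -> str:
    """Return 'team' when a diff is too large for one reviewer, else 'single'."""
    if stats.lines_changed > line_limit or stats.files_changed > file_limit:
        return "team"
    return "single"


print(review_mode(DiffStats(files_changed=6, lines_changed=300)))  # team
print(review_mode(DiffStats(files_changed=2, lines_changed=80)))   # single
```

The rule is intentionally blunt. Its job is to make the escalation automatic, so that nobody has to argue for extra reviewers on the day the deadline is tightest.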

The second practice is to pair the diff with a deterministic, machine-generated map of the code flow: a sequence diagram of the new call path, a control-flow graph of the changed function, a schema-diff visualisation, or a dependency graph extracted by the language's own tooling. The tooling for this exists today: AppMap auto-generates runtime sequence diagrams for Ruby, Python, Java, and JavaScript projects, and tools such as CodeAnt AI ship an auto-generated sequence diagram alongside every PR to show reviewers the runtime flow introduced by the change. The artefacts are available; the open question is whether teams choose to put them in front of reviewers as a matter of course. The key word is deterministic. The map is produced from the code itself, not narrated by the model, so the picture and the code cannot disagree with each other. Reviewers can verify structure against the map in parallel with reading the prose in the PR description, and the places where the two diverge are often exactly where the "almost, but not quite, right" errors discussed earlier in the post tend to hide. The map is also the closest thing the team has to a portable mental model of the change: it can be saved, attached to the ticket, and revisited the next time someone needs to understand why the system behaves the way it does.
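As a sketch of what "produced from the code itself" can mean at the smallest scale, the following uses Python's standard ast module to extract a call graph from source text and print it as Mermaid flowchart lines. It makes no claim about how AppMap or CodeAnt AI work internally. It is a static approximation that only sees direct calls to plain names (method calls such as obj.method() are ignored), and the sample functions are hypothetical.

```python
import ast


def call_graph(source: str) -> dict[str, list[str]]:
    """Map each top-level function to the plain-name functions it calls."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = set()
            for inner in ast.walk(node):
                # Only direct calls like parse(x); attribute calls are skipped.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
            graph[node.name] = sorted(callees)
    return graph


def to_mermaid(graph: dict[str, list[str]]) -> str:
    """Render the graph as Mermaid flowchart source, one edge per line."""
    lines = ["flowchart TD"]
    for caller in sorted(graph):
        for callee in graph[caller]:
            lines.append(f"    {caller} --> {callee}")
    return "\n".join(lines)


sample = (
    "def load(path):\n    return parse(read(path))\n\n"
    "def read(path):\n    return open(path).read()\n\n"
    "def parse(text):\n    return text.split()\n"
)
print(to_mermaid(call_graph(sample)))
# flowchart TD
#     load --> parse
#     load --> read
#     read --> open
```

Because the same source always yields the same graph, a reviewer can diff the map before and after the change and see new edges without trusting anyone's narration of them.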

flowchart TD
    M["LLM produces large diff<br/>(multi-file, multi-refactor)"]
    M --> A["Curator (author):<br/>writes the why-paragraph"]
    M --> D["Deterministic map<br/>generated from the code<br/>(sequence diagram, call graph,<br/>schema diff, dependency graph)"]
    A --> R1["Reviewer 1<br/>slice: data model"]
    A --> R2["Reviewer 2<br/>slice: control flow"]
    A --> R3["Reviewer 3<br/>slice: integration points"]
    D --> R1
    D --> R2
    D --> R3
    R1 --> V["Cross-check prose<br/>against the map"]
    R2 --> V
    R3 --> V
    V --> MERGE["Merge with comprehension<br/>shared across the team"]

Together, these two practices extend the team's comprehension cache instead of letting the LLM hollow it out — a wider review surface to match a wider diff, and a structural representation that the model cannot quietly hallucinate around. The deeper version of both practices, with worked examples, tooling recommendations, and the cultural shift the empirical data says they require, is in the follow-up post.

Returning to Brooks briefly: The Mythical Man-Month held up not because people are intrinsically slow but because coordination is a function of comprehension, and comprehension does not parallelise. Understanding cannot be split across nine people the way a feature can be. LLMs accelerate the part Brooks already identified as cheap — producing code — and leave the part he identified as expensive, namely sharing the mental model of the system, essentially untouched. The geometry of the cost has not really moved; only the tooling has.

The teams that come through the next five years well are likely to be those that still understand their own code, regardless of how many agents they have running in parallel.