Review LLM diffs as a team, and draw a deterministic map

Rembrandt van Rijn, The Anatomy Lesson of Dr. Nicolaes Tulp (1632), oil on canvas, Mauritshuis, The Hague. A team examining something none of them produced, with a deterministic structural reference (Vesalius's anatomical atlas) open in the foreground — the practice this post is arguing for, on canvas, four centuries early.

The paradox in the data

A study published in May 2026, "These Aren't the Reviews You're Looking For", measured how humans review AI-generated pull requests in practice and surfaced a finding that deserves more alarm than it has received: reviewers express more neutral or positive sentiment toward AI-authored pull requests than toward human-authored ones. The default behaviour in 2026 is for reviewers to be quietly gentler on code the model wrote than on code their colleagues wrote, which is the opposite of what the evidence on AI code quality calls for. A separate January 2026 study, "More Code, Less Reuse", found that agent-generated PRs systematically disregard code-reuse opportunities and introduce more redundancy per change than human-authored code. Less reuse, more redundancy, gentler review. That is the shape of the loop the industry is currently running.

The previous post in this series, "Twenty LLMs do not make a team", argued that comprehension is the bottleneck the LLM-coding stack accelerates around rather than through. The closing section sketched a two-part practice for keeping a team's comprehension cache intact under heavy LLM use. This post is the deeper version of that section: what to actually do at the review boundary, what tools already exist, and what cultural pattern the data says needs to change for the practice to land at all.

The headline claim is small. Single-reviewer code review was already breaking before LLMs were producing pull requests at scale. LLM-sized diffs broke it harder. The fix is two old ideas — team review and machine-generated code-flow maps — both available, both well-understood, and both currently absent from most engineering workflows.

Single-reviewer review was already broken

The empirical record on diff size versus review effectiveness has been remarkably consistent for the better part of two decades. SmartBear and Cisco's classic study found that defect-detection effectiveness drops below 70% once a single review exceeds roughly 400 lines of code and below 50% past 1,000. Microsoft Research found that reviewers spend roughly six minutes per file on small PRs (under five files) and only about 1.5 minutes per file on large ones (over twenty files). Google's published analyses of internal code review report the same shape: change lists over a thousand lines take a median 24 hours to turn around and receive substantially fewer substantive comments than smaller changes.

None of those numbers were collected with LLM diffs in mind. They were collected when the diff under review had been written by a human who was in the room, who knew what the change was for, who could answer follow-up questions on Slack, and who would feel something specific about getting the review wrong. Even under those conditions, a single reviewer running past four hundred lines was already catching fewer than 70% of defects, and past a thousand lines the detection rate fell below a coin flip.

LLM-assisted development pushes against every one of those assumptions at once. GitHub's own metrics, shared in a recent blog post, report that more than one in five code reviews on the platform now involve an agent, with Copilot code review alone having processed sixty million reviews and grown by an order of magnitude in less than a year. The diffs in that traffic are larger on average, span more files, and carry more refactors-snuck-in-on-the-side than equivalent human-authored changes. The author is not the diff's craftsman in the traditional sense — they are closer to a curator of the model's output — which means several of the social and informational assumptions single-reviewer review was built on no longer hold. The review system has not changed; the input has.

Team review, with a sharper case than it had in 1996

The first half of the practice is team review for non-trivial LLM-generated changes. The idea predates the LLM era by about thirty years. Pair programming was formalised in extreme programming in the late 1990s, and ensemble (or mob) programming, in which three or more engineers work on the same change synchronously, has been a documented Agile practice since the mid-2000s. Both approaches treat review as a multi-participant activity rather than a single-reviewer rubber stamp at the end of a pipeline.

What the LLM era adds to that argument is empirical sharpness. The diffs are larger, the authors are less attached to them in the craftsmanship sense, and the data on default reviewer behaviour suggests that single reviewers are now systematically under-critical of AI-authored code rather than appropriately critical. Three reviewers looking at the same change in parallel produce three sets of independent attention and three slightly different mental models, which leaves fewer blind spots than one reviewer attempting to absorb a multi-file refactor at deadline pressure.

In practice it does not have to be a full mob session. The asynchronous version is most of the benefit at a fraction of the calendar cost:

For LLM-generated PRs above some threshold (one heuristic: more than three files, or more than two hundred lines of change), require three approvals rather than one. The threshold matters less than the structural fact that the PR cannot land on one person's reading.
Pre-assign reviewer slices in the PR template. Reviewer A: data model and persistence. Reviewer B: control flow and error handling. Reviewer C: integration points and tests. Three different reading orders, three different mental models built in parallel.
Drop the consensus requirement. The point is not a quorum vote; the point is that three independent passes leave fewer blind spots than one. Any one reviewer can block; the team is not trying to agree on the diff, it is trying to understand it.
Make team review the default for any change that touches load-bearing code, irrespective of the author of the diff. The practice loses its meaning the moment it is reserved for "important" PRs only.
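The approval rule in the list above can be sketched as a small policy function. This is a minimal illustration, assuming the heuristic quoted there (more than three files, or more than two hundred changed lines); the `ChangeStats` type, its fields, and `required_approvals` are hypothetical names, not part of any particular review tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeStats:
    """Hypothetical summary of a pull request, filled in by your own tooling."""
    files_changed: int
    lines_changed: int
    llm_generated: bool
    touches_load_bearing_code: bool = False


def required_approvals(change: ChangeStats,
                       max_files: int = 3,
                       max_lines: int = 200) -> int:
    """Return how many approvals the change needs before it can land."""
    over_threshold = (change.files_changed > max_files
                      or change.lines_changed > max_lines)
    # Load-bearing code gets team review regardless of who wrote the diff.
    if change.touches_load_bearing_code:
        return 3
    # LLM-generated changes above the size threshold also get three readers.
    if change.llm_generated and over_threshold:
        return 3
    return 1


# The 280-line, five-file agent refactor discussed later would need three.
print(required_approvals(ChangeStats(5, 280, llm_generated=True)))
```

The exact numbers are defaults to argue about; the structural point is that the function can never return 1 for a large LLM-generated change.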

The social side of the reframe matters here. Because the diff was not personally crafted by a colleague — it was curated from a model's output — reviewers can be more direct and more opinionated than they would have been on handwritten code. The criticism lands on the model rather than on a person. The author is in a better position to take the feedback as material to act on rather than as a personal challenge to defend against. The dynamic ends up closer to editing a draft from a third party than to critiquing a teammate's work, which is exactly the dynamic the reviewer-sentiment paradox above suggests is not the current default.

Deterministic maps: what to generate, and what they replace

The second half of the practice is to pair every non-trivial LLM-generated diff with a deterministic, machine-generated map of the code flow before merge. The map is not narrative; it is produced from the code itself by a tool that does not negotiate with the model and cannot be talked into a more flattering picture of the change.

The kinds of maps that pay back for the time they take:

Sequence diagrams of the new call path or the modified call path, generated from runtime instrumentation rather than from the diff prose. AppMap does this for Ruby, Python, Java, and JavaScript projects by recording a representative test or interactive session. The output is a UML sequence diagram that mirrors the actual execution path, not the model's narration of it.
Control-flow and call graphs of the changed functions, extracted by the language's own tooling: pycallgraph for Python, the callgraph command from golang.org/x/tools for Go, cargo-call-stack for Rust, IntelliJ's structural diagram features for the JVM. The picture is built from the code itself, whether by static analysis or by tracing a run; the model is not in the loop.
Schema-diff visualisations for any change that touches a database. Tools like Atlas and pg-diff produce side-by-side renderings of schema state before and after the migration, including the indices and constraints the diff prose tends to gloss over.
Dependency-graph deltas for any change that touches package.json, Cargo.toml, pyproject.toml, go.mod, or equivalent. npm ls, cargo tree, pipdeptree, go mod graph — every ecosystem has the tool. Diff the graph before and after; surface the transitive additions the lockfile is hiding.
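As one illustration of the dependency-graph delta, the sketch below diffs two edge lists. It assumes each line holds a parent and a child separated by whitespace, which is the shape `go mod graph` prints; other ecosystems would need their own parser. The function names and the sample module names are hypothetical.

```python
from __future__ import annotations


def parse_edges(graph_text: str) -> set[tuple[str, str]]:
    """Parse 'parent child' lines into a set of dependency edges."""
    edges = set()
    for line in graph_text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            edges.add((parts[0], parts[1]))
    return edges


def graph_delta(before: str, after: str) -> dict[str, list[tuple[str, str]]]:
    """Return the edges added and removed between two graph snapshots."""
    old, new = parse_edges(before), parse_edges(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Hypothetical snapshots taken on the base branch and on the PR branch.
BEFORE = "app libA@1.0\nlibA@1.0 libC@2.0\n"
AFTER = "app libA@1.0\nlibA@1.0 libC@2.1\napp libB@0.3\n"

delta = graph_delta(BEFORE, AFTER)
for kind, edges in delta.items():
    for parent, child in edges:
        print(f"{kind}: {parent} -> {child}")
```

Because the delta is computed from the resolved graph rather than from the manifest, a transitive bump such as the `libC` change shows up even when the diff only touched a direct dependency.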

The key adjective is deterministic. The map is produced from the code, not from the model. The picture and the code cannot disagree with each other, because the picture is a projection of the code. When a reviewer reads the PR description and the description says "the new caching wrapper short-circuits the repository call when the key is fresh," the reviewer can check the sequence diagram and see whether that is actually what the call path does, rather than taking the model's word for it. The places where prose and map diverge are exactly where the "almost, but not quite, right" errors hide.

Tooling for this is no longer speculative. CodeAnt AI ships an auto-generated sequence diagram alongside every pull request to show reviewers the runtime flow introduced by the change. AppMap ships a VS Code extension that produces sequence diagrams from recorded sessions. Schema-diff tooling has been a solved problem for ten years. The artifacts are available; the practice question is whether teams choose to make them a default part of the PR template or leave them as a thing one engineer occasionally runs locally.

A worked example

Concrete: a 280-line refactor across five files, produced by an agent inside a couple of minutes. The PR description, drafted by the curator-author, says: "Extract a caching wrapper around UserRepository.find_by_id to reduce duplicate DB hits inside the request lifecycle." It links to a ticket. It looks reasonable.

What the diff actually does, in increasing order of subtlety:

Adds a CachedUserRepository decorator class, with a TTL of sixty seconds.
Adds a new helper format_cache_key(user_id) that produces f"user:{user_id}:v1". The codebase already has a make_redis_key in utils/cache.py that produces f"user:{user_id}". Two different cache key conventions for the same entity, silently coexisting.
Catches RedisConnectionError inside the decorator and falls back to the underlying repository. The fallback works. It also silently catches ValueError, which the surrounding code was previously using to signal a malformed ID — that signal now disappears.
Modifies the cache invalidation in UserService.update_profile to clear the new key, but not the old make_redis_key form. So any code path that still reads through the old key keeps serving stale data after a profile update.
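The swallowed ValueError is the hardest of these to see in a diff, so here is one way it could play out. This is a sketch under stated assumptions, not the code from the PR: it assumes the malformed-ID check lives in key construction and that the repository's own lookup is lenient. `RedisConnectionError` is a local stand-in class, a plain dict stands in for Redis, and the sixty-second TTL is left out for brevity.

```python
class RedisConnectionError(Exception):
    """Stand-in for the cache client's connection error."""


def format_cache_key(user_id):
    # Assumption: key construction is where malformed IDs were rejected.
    if not str(user_id).isdigit():
        raise ValueError(f"malformed user id: {user_id!r}")
    return f"user:{user_id}:v1"


class UserRepository:
    """Hypothetical repository with a lenient lookup: unknown IDs give None."""

    def __init__(self, rows):
        self._rows = rows

    def find_by_id(self, user_id):
        return self._rows.get(str(user_id))


class CachedUserRepository:
    def __init__(self, inner, cache):
        self._inner = inner
        self._cache = cache  # any dict-like object

    def find_by_id(self, user_id):
        try:
            key = format_cache_key(user_id)
            if key in self._cache:
                return self._cache[key]
            user = self._inner.find_by_id(user_id)
            self._cache[key] = user
            return user
        except (RedisConnectionError, ValueError):
            # Intended for cache outages, but ValueError rides along:
            # a malformed ID now quietly falls through to the repository.
            return self._inner.find_by_id(user_id)


repo = CachedUserRepository(UserRepository({"42": {"name": "Ada"}}), {})
print(repo.find_by_id("42"))         # the normal path still works
print(repo.find_by_id("not-an-id"))  # no exception reaches the caller
```

Under these assumptions a caller that used to see a ValueError for a malformed ID now gets the same answer as for an unknown user, which is why the change reads as harmless in the diff.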

A single reviewer running at 1.5 minutes per file on a five-file PR has about seven and a half minutes to find all four. The first finding is obvious. The other three are not. In the absence of a deterministic map of the call path, the reviewer is taking the model's word for what the diff is doing.

A three-reviewer team with a generated sequence diagram and a call graph of the changed functions lands on a different outcome. The data-model reviewer notices the two cache-key conventions colliding because the call graph shows both helpers reachable from the same service. The control-flow reviewer notices the swallowed ValueError because the sequence diagram shows the new decorator catching an exception the old call path let through. The integration reviewer flags the missed invalidation because the sequence diagram of update_profile shows the old key path still live. All three findings surface in a 24-hour async review cycle, rather than in production at 3 a.m. some Wednesday eight weeks later when a customer support ticket lands describing stale profile data.
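The missed invalidation is also easy to pin down once someone knows to look for it. The sketch below reproduces the two key conventions from the example and checks which keys survive an update. The two key formats come from the worked example; the dict standing in for Redis, the simplified `update_profile`, and the `stale_keys_after_update` helper are assumptions made for illustration.

```python
def make_redis_key(user_id):
    """Existing convention in utils/cache.py, per the example."""
    return f"user:{user_id}"


def format_cache_key(user_id):
    """Convention added by the diff."""
    return f"user:{user_id}:v1"


def update_profile(cache, user_id):
    """Mirrors the diff: only the new key form is invalidated."""
    cache.pop(format_cache_key(user_id), None)


def stale_keys_after_update(cache, user_id):
    """Return this user's keys, under either convention, still in the cache."""
    candidates = {make_redis_key(user_id), format_cache_key(user_id)}
    return sorted(key for key in candidates if key in cache)


# Both code paths have cached the same user before the update.
cache = {"user:7": {"name": "old"}, "user:7:v1": {"name": "old"}}
update_profile(cache, 7)
print(stale_keys_after_update(cache, 7))
```

A check of this shape belongs in the test suite once the integration reviewer has raised the issue, so that a later refactor cannot quietly reintroduce a third key convention.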

The cultural shift this requires

There are two predictable objections, and the data answers both.

The first objection is "this will slow us down." The METR finding from the previous post (that experienced developers were measured running roughly 19% slower with AI tools while feeling about 20% faster) is the relevant counter. The slowdown was already happening. The practice this post is proposing trades a known cost (thirty minutes of team review and a few seconds of machine-generated diagram) against an unknown cost that was already being paid invisibly in production debugging, comprehension debt, and the eventual departure of the engineer who happened to know which cache key was the load-bearing one. The trade looks bad on the velocity dashboard. It looks better in incident review.

The second objection is "this is process for process's sake." The reviewer-sentiment paradox directly contradicts that framing. The current default is not "review thoughtfully and merge what survives"; the current default is "review more leniently because the model is the author and the social weight of criticism feels lighter." The practice this post is proposing is the corrective, not the imposition. If review attention on AI-authored code matched review attention on human-authored code today, the case for team review and deterministic maps would be weaker. The data says it does not match, which is the case for putting two structural counterweights in place: one social (more reviewers, with the criticism explicitly aimed at the model's output), one mechanical (a deterministic map the model cannot soften).

When the writer of a diff is no longer in the room, the team becomes the reader of last resort, and a deterministic map of the change is the only artifact of it the team can still rely on a year later.