Vibe coding and the tech-debt bill

The surviving wall of the Marshalsea debtors' prison, London, where the bill eventually arrived. Photo: Russell Kenny, CC BY 3.0.

The term was always about throwing it away

Andrej Karpathy coined "vibe coding" in a throwaway tweet on 2 February 2025: "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good." He described the loop as seeing stuff, saying stuff, running stuff, and copy-pasting stuff, and "it mostly works." The tweet was viewed millions of times and the phrase became Collins Dictionary's word of the year for 2025. Karpathy himself later called it "a shower of thoughts throwaway tweet that I just fired off."

The part that almost everyone forgets is that the original framing was explicitly about disposable work. Karpathy was describing weekend projects where speed matters more than scrutiny — code you never intend to maintain, read back, or hand to anyone. That is a genuinely good use of the technique. I have built throwaway tools this way and I will keep doing it, because the cost of being wrong is that I close the tab.

The trouble is that the workflow did not stay on the weekend. It walked into production, kept its name, and dropped its caveat. The 2026 audits are now showing what the bill looks like when the code you chose to forget lives in a codebase other people have to keep alive. The short version of the argument: vibe coding is useful at the prototype layer and ruinous at the production layer, and the debt it runs up is not the kind a refactor sprint pays down, because most of it is comprehension debt, which is deferred, not erased.

The bill, itemised

Start with the numbers, because the discourse around vibe coding has been long on vibes and short on figures. Pulling from Pixelmojo's 2026 compilation of recent audits: roughly 63 percent of developers report spending more time debugging AI-generated code than they would have spent writing it themselves. AI-assisted development is running about 12 percent more expensive than baseline in year one, and — this is the line that should make a CTO sit up — without active debt remediation, year-two maintenance costs are landing around 4x baseline. Around 45 percent of AI-generated code is being found to contain vulnerabilities.
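A rough worked example shows why the year-two figure matters more than the year-one premium. The 1.12 and 4.0 multipliers are the ones quoted above; the baseline amounts (100 units to build, 30 units of year-two maintenance) are invented purely for illustration, and the sketch assumes each multiplier applies cleanly to its year, which real budgets will not.

```python
# Multipliers quoted in the audits above.
YEAR_ONE_PREMIUM = 1.12      # ~12 percent more expensive in year one
YEAR_TWO_MAINTENANCE = 4.0   # ~4x baseline maintenance without remediation


def two_year_cost(build: float, maintenance: float, ai_assisted: bool) -> float:
    """Total cost over two years, in whatever unit the inputs use.

    `build` and `maintenance` are the baseline (non-AI) year-one and
    year-two costs. Both are hypothetical inputs, not audit data.
    """
    if not ai_assisted:
        return build + maintenance
    return build * YEAR_ONE_PREMIUM + maintenance * YEAR_TWO_MAINTENANCE


# Hypothetical baseline: 100 units to build, 30 units to maintain in year two.
baseline = two_year_cost(100, 30, ai_assisted=False)
with_ai = two_year_cost(100, 30, ai_assisted=True)
print(f"baseline={baseline:.0f} ai_assisted={with_ai:.0f} ratio={with_ai / baseline:.2f}")
```

On these made-up inputs the year-one premium adds 12 units and the year-two multiplier adds 90, so nearly all of the gap comes from maintenance rather than from the build.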

The security picture has a particularly clean case study attached. In 2025, security researchers audited 1,645 applications built on Lovable, the vibe-coding platform whose entire pitch is "describe an app, get an app." They found that 170 of them — about 10 percent — carried critical vulnerabilities, exposing 303 vulnerable endpoints leaking names, emails, API keys, and financial records. The root cause was missing or misconfigured Row Level Security policies in the generated Supabase backends; the flaw got a CVE, CVE-2025-48757. Attackers needed no credentials — the public anon key embedded in the client let them dump entire tables. Lovable's eventual fix, a "security scan" feature, only checked whether RLS was present, not whether it actually worked, which is a fairly exact metaphor for the whole problem: a vibe check is not a verification.
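The difference between checking that RLS is present and checking that it works can be sketched with a toy in-memory model. This is not Supabase or Postgres code: `Table`, `presence_check`, and `behaviour_check` are hypothetical names, and the model only mimics the rule that, with RLS enabled, a row is visible when at least one policy allows it.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Row = dict
# A policy decides whether `user_id` (None for an anonymous caller) may read `row`.
Policy = Callable[[Optional[str], Row], bool]


@dataclass
class Table:
    rows: list[Row]
    rls_enabled: bool = False
    policies: list[Policy] = field(default_factory=list)

    def select(self, user_id: Optional[str]) -> list[Row]:
        """Return the rows this caller is allowed to read."""
        if not self.rls_enabled:
            return list(self.rows)  # no RLS: every caller sees everything
        # With RLS on, a row is visible only if at least one policy allows it.
        return [r for r in self.rows if any(p(user_id, r) for p in self.policies)]


def presence_check(table: Table) -> bool:
    """The weak check: RLS is switched on and some policy exists."""
    return table.rls_enabled and len(table.policies) > 0


def behaviour_check(table: Table) -> bool:
    """The stronger check: query as real callers and inspect what comes back."""
    if table.select(user_id=None):  # an anonymous caller must see nothing
        return False
    owners = {r["owner"] for r in table.rows}
    # Each owner must see only their own rows.
    return all(
        all(r["owner"] == owner for r in table.select(owner)) for owner in owners
    )


rows = [{"owner": "alice", "email": "a@example.com"},
        {"owner": "bob", "email": "b@example.com"}]

allow_all: Policy = lambda user_id, row: True                      # misconfigured
owner_only: Policy = lambda user_id, row: user_id == row["owner"]  # intended

leaky = Table(rows, rls_enabled=True, policies=[allow_all])
sound = Table(rows, rls_enabled=True, policies=[owner_only])

print(presence_check(leaky), behaviour_check(leaky))  # True False
print(presence_check(sound), behaviour_check(sound))  # True True
```

Both tables pass the presence check. Only the check that queries as an anonymous caller tells them apart.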

These are not abstract risks. They are the predictable output of a workflow whose defining instruction is to forget that the code exists, applied to code that very much continues to exist, on the internet, holding other people's data.

The number that should have ended the hype

The single most important data point in this whole discussion is also the most counterintuitive, and it has been talked past relentlessly. In July 2025, METR ran a randomised controlled trial: sixteen experienced open-source developers, 246 real tasks in mature repositories they already maintained, tasks randomly split into an AI-allowed arm and an AI-forbidden arm. Most participants used Cursor Pro with Claude 3.5 or 3.7 Sonnet — frontier tooling at the time.

The developers predicted AI would make them about 24 percent faster. After the fact, they still believed it had made them roughly 20 percent faster. The measured wall-clock data showed them running about 19 percent slower when AI was allowed. The arXiv preprint is unusually careful for this corner of the field. A follow-up cohort METR ran in February 2026 with a newer model generation softened the result to a slowdown of roughly 4 percent, with a confidence interval that crosses zero. The slowdown may have largely closed, but the perception gap did not: developers continued to feel faster than the stopwatch said they were.
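To put the perception gap in minutes, here is a small worked example. It assumes the three percentages are read as changes in task completion time, and it uses a hypothetical 60-minute task; neither the task length nor the helper name comes from the study.

```python
def minutes_with_ai(baseline_minutes: float, change_in_time: float) -> float:
    """Completion time given a fractional change in time (-0.20 = 20% less time)."""
    return baseline_minutes * (1 + change_in_time)


BASELINE = 60.0  # hypothetical task length without AI, in minutes

predicted = minutes_with_ai(BASELINE, -0.24)  # what developers forecast
perceived = minutes_with_ai(BASELINE, -0.20)  # what they believed afterwards
measured = minutes_with_ai(BASELINE, +0.19)   # what the stopwatch recorded

gap = measured - perceived
print(f"predicted={predicted:.1f} perceived={perceived:.1f} measured={measured:.1f}")
print(f"perception gap: {gap:.1f} minutes on a {BASELINE:.0f}-minute task")
```

Under that reading, a task that felt like 48 minutes took about 71, so the developer's sense of the work was off by roughly 23 minutes for every hour of baseline effort.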

That gap is the engine of the tech-debt bill. The feeling of velocity is real and immediate; the cost is real and deferred. The work feels done because a plausible diff appeared, and the parts that did not get done — reading it, understanding it, checking it against the rest of the system — do not register as undone until much later. This is the same dynamic I wrote about in Twenty LLMs do not make a team: output scales, comprehension does not, and the gap between the two curves is where the liability lives.

```mermaid
flowchart TD
    A["Vibe-coded change<br/>feels done"] --> B["Plausible diff appears"]
    B --> C["Merged on the feeling<br/>of velocity"]
    C --> D["Reading / understanding /<br/>checking step skipped"]
    D --> E["Comprehension debt<br/>accrues silently"]
    E --> F["Year-two maintenance<br/>~4x baseline"]
    F --> G["Bill arrives ~3 years<br/>after the original authors<br/>rotated off"]
```

Why a refactor sprint will not save you

The instinct, when the maintenance number lands, is to treat this as ordinary technical debt: schedule a remediation quarter, pay it down, move on. That instinct underestimates the problem, because vibe coding runs balances on three separate ledgers, and only one of them is paid down by editing code.

It helps to keep the vocabulary clean, the way Twenty LLMs do not make a team lays it out. Technical debt is a shortcut in the code, payable in refactoring effort — the cost lives in the repository. Cognitive debt is a shortcut in the developer's own thinking, payable as eroded ability to debug and design — the cost lives in the human. Comprehension debt, a term sharpened by Simon Willison in early 2026, is the team-level analogue: no one on the team can confidently explain the system end to end — the cost lives in the organisation.

A refactor sprint pays down the first ledger. Vibe coding's heaviest balances are on the second and third. When you forget that the code exists at the moment of writing it, you skip the step where understanding gets built. As Fred Brooks observed in The Mythical Man-Month, "the incompletenesses and inconsistencies of our ideas become clear only during implementation." The implementation still happens; it just happens inside the model. The team inherits the merged diff without the learning attached. There is no later sprint that retroactively installs the mental model the author never built, because the author was, by design, not paying attention. The debt was deferred to a future engineer who will have even less context than the original author had, which is to say, none.

The empirical shape of this is already in the literature. A January 2026 study, More Code, Less Reuse, found that agent-generated pull requests systematically disregard code-reuse opportunities and introduce more redundancy per change than human-authored code. That is comprehension debt with a paper trail: the model does not hold the whole codebase in its head, so it writes a fresh format_duration rather than finding the one already living in utils/, and the duplicates quietly diverge until an incident surfaces them.
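The duplicated-helper problem is cheap to surface mechanically. Here is a minimal sketch using Python's standard `ast` module; the two-file repository and the `find_duplicate_functions` helper are hypothetical, and matching on name alone is a crude heuristic that would also flag legitimately repeated names such as `__init__`.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each function name defined in more than one file to those files.

    `sources` maps a file path to its Python source text.
    """
    defined_in: dict[str, set[str]] = defaultdict(set)
    for path, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined_in[node.name].add(path)
    return {
        name: sorted(paths)
        for name, paths in defined_in.items()
        if len(paths) > 1
    }


# Hypothetical repository where the same helper has been written twice
# and the two copies already format their output differently.
repo = {
    "utils/time.py": (
        "def format_duration(seconds):\n"
        "    return f'{seconds // 60}m {seconds % 60}s'\n"
    ),
    "billing/report.py": (
        "def format_duration(secs):\n"
        "    return f'{secs / 60:.1f} min'\n"
        "\n"
        "def total(xs):\n"
        "    return sum(xs)\n"
    ),
}
print(find_duplicate_functions(repo))
# {'format_duration': ['billing/report.py', 'utils/time.py']}
```

A check like this only shows that a duplicate exists. Deciding which copy is correct still takes someone who understands both call sites.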

The prototype layer is where it belongs

None of this is an argument against vibe coding. It is an argument for keeping it on the side of the line Karpathy originally drew. The technique is genuinely excellent at the prototype layer, where its core assumption — that the code is disposable — actually holds:

Spikes and proofs of concept. When the goal is to find out whether an idea is worth building at all, the fastest path to a runnable answer wins, and throwing the result away is the plan, not the failure mode.
Throwaway internal tooling. A script you will run twice and delete does not need a mental model behind it. If it breaks, you regenerate it.
Learning the shape of an unfamiliar API. Vibe coding against a new library to see what the calls feel like is a good use of the loop, provided you do not ship the exploration.
The first draft you fully intend to rewrite. Generating a rough skeleton to react against can be faster than a blank file, as long as "rewrite" is genuinely on the schedule and not a comforting fiction.

The common thread is that the cost of being wrong is bounded and local. The moment the output is going to live in a shared codebase, hold real data, or be maintained by someone else, the assumption that makes vibe coding fast — forget the code exists — becomes the assumption that generates the bill.

This is also why the choice of AI IDE is a red herring here. As I argued in Every new AI IDE is the same model with a different system prompt, the tech-debt numbers move with the model and with how it is used, not with whether you vibe-coded the mess in Cursor or Windsurf or Kiro. The wrapper sells ergonomics, which are real; it does not sell comprehension, which is the thing being skipped.

The discipline that turns vibe coding back into engineering

If the production layer is where the debt accrues, the production layer is where the discipline has to live. The single most useful guardrail is also the least technological: refuse to ship code you cannot explain in your own words.

Concretely, the practices that keep the bill down are the ones Twenty LLMs do not make a team and Review LLM diffs as a team lay out in detail, and they apply with extra force to vibe-coded changes:

For any non-trivial accepted diff, the author should be able to write a one-paragraph "why this is correct" in their own words before merge. Not what the diff does — the diff shows that — but why it is the right change. If that paragraph will not come, the change is not done; it only feels done.
Treat large LLM-authored diffs as material for team review rather than single-reviewer rubber-stamping. The defect-detection data has shown for two decades that one reviewer past 400 lines is a coin flip, and the diffs are getting bigger, not smaller.
Lean on deterministic infrastructure — types, tests, linters, schema checks, RLS policies that are actually tested rather than merely present. The compiler tolerates exactly no ambiguity even when the model is happy to invent some, and a real test would have caught the Lovable RLS gap that a "vibe check" did not.
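The first two practices can be turned into a crude pre-merge gate. The sketch below is one hypothetical shape for it: the `PullRequest` fields, the 40-word minimum, and the two-reviewer rule are assumptions made for illustration, and only the 400-line threshold comes from the text above.

```python
from dataclasses import dataclass

SINGLE_REVIEWER_LIMIT = 400   # lines; the threshold cited in the text
MIN_RATIONALE_WORDS = 40      # arbitrary stand-in for "one paragraph"


@dataclass
class PullRequest:
    changed_lines: int
    reviewers: int
    rationale: str  # the author's "why this is correct" paragraph


def merge_blockers(pr: PullRequest) -> list[str]:
    """Return the reasons this PR should not merge yet (empty list = clear)."""
    blockers = []
    if len(pr.rationale.split()) < MIN_RATIONALE_WORDS:
        blockers.append("rationale is missing or too short")
    if pr.changed_lines > SINGLE_REVIEWER_LIMIT and pr.reviewers < 2:
        blockers.append("large diff needs more than one reviewer")
    return blockers


pr = PullRequest(changed_lines=950, reviewers=1, rationale="LGTM, tests pass")
print(merge_blockers(pr))
# ['rationale is missing or too short', 'large diff needs more than one reviewer']
```

A word count cannot tell whether the author understands the change. All a gate like this does is make the missing explanation visible before merge instead of after the incident.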

The reframe is not "stop vibe coding." It is "vibe-code the prototype, then do the engineering before it becomes production." The model wrote the draft fast; the comprehension step is the part you are being paid for, and it is the part the velocity dashboard does not measure.

A short close

Vibe coding is real, it is fun, and at the prototype layer it is one of the better things to happen to software in years. The mistake was never the technique. It was forgetting that Karpathy's caveat — throwaway weekend project — was load-bearing, and carrying the workflow into production while leaving the caveat behind.

The bill itemised above — 4x year-two maintenance, 45 percent vulnerability rates, the Lovable CVE, the METR slowdown that everyone felt as a speedup — is not a verdict on AI coding. It is a verdict on skipping the comprehension step and calling the result done. The debt was deferred to whoever inherits the codebase next, and on current trends, that is increasingly going to be a team that no longer remembers how to read code it did not generate. The teams that come through well will be the ones that kept vibe coding on the weekend and did the engineering on Monday.