ZFS, btrfs, and when to leave mdadm + ext4

The classic Linux storage stack is two independent layers: md/RAID handles redundancy at the block level, and a filesystem like ext4 sits on top, treating the array md presents as a single device. It works, it's mature, and it has one structural blind spot: the two layers don't talk to each other.

md knows it has four disks but has no idea what a file is. ext4 knows what a file is but thinks it's writing to one perfect disk. So when a disk silently returns the wrong bytes (bit-rot, a firmware glitch, a cosmic-ray flip that the drive doesn't flag as an error), md hands them up the stack as if they were correct. A check pass can notice that two mirror copies or a stripe and its parity disagree, but md has no way to tell which side is right, so a repair can just as easily copy the bad bytes over the good ones or recompute parity around them. ext4's metadata_csum catches corruption in filesystem metadata, but your actual file data has no checksum anywhere in the stack. Nothing can reliably identify the rot, so nothing can heal it.

Fixing that requires checksums and redundancy in the same layer, so the system can say "this block's checksum is wrong, fetch the good copy from the mirror, and repair the bad one." That's the whole pitch for the integrated storage systems.
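
To make that concrete, here is a minimal Python sketch of a checksummed two-way mirror. It is a toy model, not how ZFS or btrfs are implemented: the ChecksummedMirror class, its in-memory "disks", and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum stored separately from the data it describes."""
    return hashlib.sha256(data).hexdigest()


class ChecksummedMirror:
    """Toy two-way mirror that keeps one expected checksum per block."""

    def __init__(self) -> None:
        self.copies = [{}, {}]  # two hypothetical disks: block number -> bytes
        self.sums = {}          # block number -> expected checksum

    def write(self, block: int, data: bytes) -> None:
        for disk in self.copies:
            disk[block] = data
        self.sums[block] = checksum(data)

    def read(self, block: int) -> bytes:
        expected = self.sums[block]
        for disk in self.copies:
            data = disk[block]
            if checksum(data) == expected:
                # Verified copy found: heal any copy that disagrees with it.
                for other in self.copies:
                    if other[block] != data:
                        other[block] = data
                return data
        raise IOError(f"block {block}: no copy matches its checksum")


mirror = ChecksummedMirror()
mirror.write(0, b"family photos")
mirror.copies[0][0] = b"family phot0s"  # simulate silent corruption on disk 0
recovered = mirror.read(0)              # disk 0 fails verification, disk 1 passes
```

The important detail is that the checksum tells the reader which copy is the good one. A plain mirror sees two copies that differ and has no tiebreaker.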

flowchart TD
    DISK["A disk silently returns wrong bytes"] --> ME["mdadm + ext4: file data has no checksum in either layer"]
    DISK --> ZF["ZFS / btrfs: every block carries a checksum"]
    ME --> MEX["Bad bytes mirrored or parity-protected as if correct — undetected"]
    ZF --> ZFX["Checksum mismatch on read — good copy fetched, bad block repaired"]

    %% color = outcome claim: red propagates the rot, green catches and heals it
    classDef bad stroke:#bf616a,stroke-width:2.5px
    classDef good stroke:#a3be8c,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class MEX bad
    class ZFX good
    class DISK,ME,ZF plain

ZFS

ZFS fuses the volume manager, the RAID layer, and the filesystem into one thing. You give it disks, it makes a pool, and datasets draw from the pool. What that integration buys:

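End-to-end checksums. Every block is checksummed; on read, a mismatch triggers a fetch of a good copy from the pool's redundancy and an automatic repair of the bad one. Without redundancy the error is still detected and reported, just not healed. Run a periodic scrub and the pool actively walks every block looking for rot before you hit it.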
Its own RAID: mirrors and RAIDZ1/2/3 (single/double/triple parity, analogous to RAID 5/6 but without the classic write hole, because parity and data commit atomically).
Snapshots and clones that are instant and cheap (copy-on-write), plus send/recv for efficient replication to another box.
Compression (lz4 is basically free and usually a net win) and strong dataset-level controls.
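
Why are copy-on-write snapshots instant and cheap? A rough Python sketch helps. Everything here is hypothetical: the CowDataset class and its flat block map are assumptions for illustration, and real filesystems share a tree of block pointers rather than copying a dictionary.

```python
from typing import Optional


class CowDataset:
    """Toy copy-on-write dataset: a block map pointing into an append-only store."""

    def __init__(self) -> None:
        self.store = []      # append-only list of block contents
        self.live = {}       # logical block number -> index into store
        self.snapshots = {}  # snapshot name -> frozen copy of the block map

    def write(self, block: int, data: bytes) -> None:
        # Never overwrite in place: append the new data, then repoint the live map.
        self.store.append(data)
        self.live[block] = len(self.store) - 1

    def snapshot(self, name: str) -> None:
        # Copies pointers only; no block contents are duplicated.
        self.snapshots[name] = dict(self.live)

    def read(self, block: int, snapshot: Optional[str] = None) -> bytes:
        block_map = self.live if snapshot is None else self.snapshots[snapshot]
        return self.store[block_map[block]]


ds = CowDataset()
ds.write(0, b"v1")
ds.snapshot("before-upgrade")
ds.write(0, b"v2")  # lands in a new block; the snapshot still points at the old one
```

After the second write, the live dataset reads v2 and the snapshot still reads v1, and the store holds exactly two blocks. The snapshot only costs space as the live data diverges from it.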

The costs are real. ZFS wants RAM: its ARC is hungry, and the usual rule of thumb is about a GB of RAM per TB of storage, more if you enable dedup (which you usually shouldn't). It lives outside the mainline kernel for licensing reasons (its CDDL license is generally considered incompatible with the GPL), so you install it as a separate module and carry a small risk around kernel-upgrade timing. And its vdev model is rigid: historically you couldn't add a single disk to a RAIDZ group (RAIDZ expansion landed in OpenZFS 2.3, easing but not erasing the rigidity). You plan a ZFS pool up front more than you grow it organically.

btrfs

btrfs brings the same copy-on-write, checksumming, snapshot, and subvolume ideas, with one big advantage: it's in the mainline kernel, no separate module. For a single disk or a mirror it's solid and widely deployed (it's the default on Fedora Workstation and openSUSE), and its snapshots are excellent.

The long-standing asterisk: btrfs RAID 5/6 is still considered unsafe — the parity write hole was never fully resolved, and the project's own documentation warns against it for important data. So in practice you use btrfs in single, RAID 1, or RAID 10 profiles, and if you need parity-style efficiency you reach for ZFS RAIDZ or mdadm RAID 6 instead.
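
The write hole is easier to see with single XOR parity in hand. The Python sketch below is a generic RAID 5 style illustration, not btrfs code; the xor_blocks helper and the four-byte blocks are assumptions for the example.

```python
from functools import reduce
from operator import xor


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (single parity, RAID 5 style)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# A three-disk stripe: two data blocks plus their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Normal case: disk 1 dies, and d1 is rebuilt from d0 and parity.
assert xor_blocks(d0, parity) == d1

# Write hole: d0 is overwritten in place, then power fails before parity is updated.
d0 = b"CCCC"
stale_parity = parity

# Later disk 1 dies. Rebuilding d1 from the new d0 and the stale parity
# produces garbage, even though d1 itself was never written to.
rebuilt_d1 = xor_blocks(d0, stale_parity)
```

The rebuilt block no longer matches the original d1. Avoiding this means never leaving data and parity out of step on disk, whether through a journal, a write-intent log, or copy-on-write full-stripe writes.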

(bcachefs is the newest entrant with similar ambitions; it was merged into the mainline kernel, then removed again in late 2025 after a governance falling-out, and now ships as an out-of-tree DKMS module. Promising on paper, but not a homelab default — watch it, don't bet on it.)

When mdadm + ext4 is still the right call

Integrated storage is not automatically the upgrade. The old stack keeps winning in a few situations:

Simplicity and maturity. ext4 on md is the most boring, best-understood, most-debugged storage on Linux. When it breaks, every tool and every forum answer applies. No module to match to your kernel, no pool semantics to learn.
Flexible reshaping. md will happily grow an array, change levels, and reshape in place in ways ZFS's vdev model resists.
Modest RAM. No ARC to feed; the box can spend its memory on workloads.
The data is replaceable anyway. This is the real one. If the array holds media and caches you could re-acquire, the bit-rot ZFS prevents is an annoyance, not a catastrophe — and you've drawn the line so that anything irreplaceable lives elsewhere, small and properly backed up. That's exactly the tradeoff behind the mismatched four-disk RAID 10 I actually run: mdadm + ext4, because losing it would be irritating, not ruinous.

Picking

Irreplaceable data, integrity matters, can plan capacity up front, have the RAM: ZFS. The checksums and scrubs are the point.
Single disk or a mirror, want snapshots, want it kernel-native: btrfs.
Need parity efficiency (RAID 6-style) with proven tooling: ZFS RAIDZ2, or mdadm RAID 6 under ext4/xfs.
Bulk replaceable data, value simplicity and flexibility, back up the precious bits separately: mdadm + ext4, unfashionably and correctly.
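
If it helps to see those rules as logic, here is one way to write them down in Python. Treat it as a sketch: the pick_stack function and its parameter names are hypothetical, and the order in which the rules are checked is my assumption, since the list above doesn't rank them.

```python
def pick_stack(irreplaceable: bool, can_plan_capacity: bool, has_ram: bool,
               needs_parity: bool, wants_snapshots: bool) -> str:
    """Map the rules of thumb above to a suggestion (rule order is an assumption)."""
    if irreplaceable and can_plan_capacity and has_ram:
        return "ZFS"
    if needs_parity:
        return "ZFS RAIDZ2, or mdadm RAID 6 under ext4/xfs"
    if wants_snapshots:
        return "btrfs (single disk or mirror)"
    return "mdadm + ext4"


# Bulk, replaceable media on a box with little spare RAM:
suggestion = pick_stack(irreplaceable=False, can_plan_capacity=False,
                        has_ram=False, needs_parity=False, wants_snapshots=False)
```

For that replaceable-media case the function falls through to mdadm + ext4, which matches the last rule in the list.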

The decision isn't "which is best" — it's "what does this data cost me if it rots, and am I willing to pay ZFS's price to prevent it." Answer that per array, not per dogma.