What is Change Data Capture (CDC)?

Change Data Capture (CDC) is the practice of detecting row-level changes in a source database (inserts, updates, and deletes) and emitting them as a stream of change events that downstream systems can apply. Instead of asking "what does the whole table look like now?", CDC answers "what changed since I last looked?" and ships only that delta. It is the mechanism that keeps a warehouse, a search index, or a Kafka¹ topic closely in sync with an operational database without anyone copying the entire table over and over.

🔗 Learn more — ¹ What is Apache Kafka?

What CDC solves

The naive way to sync two systems is a full reload: every hour, truncate the destination and copy the source table in its entirety. For a 10-row config table that is fine. For a 500-million-row orders table it is absurd — you move hundreds of gigabytes to capture a few thousand actual changes, you hammer the source during the copy, and your freshness is bounded by how long a full dump takes.

Incremental sync is the alternative, and CDC is how you do incremental sync correctly. You capture only the rows that changed and apply just those downstream. The hard part is reliably knowing which rows changed, including the ones that were deleted: a full reload handles deletes for free, but an incremental approach has to detect them explicitly. That is the problem every CDC method is trying to solve.

The three methods

There are three established ways to capture changes, and the choice between them dominates everything else about a CDC setup.

flowchart TD
    DB["Source OLTP database — handles live inserts/updates/deletes"] --> LOG["Log-based — read the write-ahead/transaction log"]
    DB --> QUERY["Query-based — poll WHERE updated_at > last_seen"]
    DB --> TRIG["Trigger-based — DB triggers write to an audit table"]
    LOG --> SINK["Change stream — to Kafka / warehouse / lake"]
    QUERY --> SINK
    TRIG --> SINK

    %% color = recommended (green) vs compromised (amber)
    classDef good stroke:#a3be8c,stroke-width:2.5px
    classDef meh stroke:#ebcb8b,stroke-width:2.5px
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    class LOG good
    class QUERY,TRIG meh
    class DB,SINK plain

Log-based. The database already writes every committed change to a durable log for its own crash recovery or replication: Postgres has the write-ahead log (WAL), MySQL has the binary log (binlog), Oracle has redo logs. Log-based CDC reads that log and decodes it into change events. It sees every change, including deletes, in commit order, and it adds very little load to the source because reading the log does not run queries against tables. The cost is operational: you depend on a database-specific log format, you must keep the log retained until you have consumed it, and you need elevated privileges and configuration (in Postgres, for example, logical replication enabled and a role allowed to use a replication slot).

Query-based. You add an updated_at timestamp or a monotonic version column and periodically run SELECT ... WHERE updated_at > :last_seen. It is simple, portable, and needs no special privileges. But it runs real queries against the live table, its latency is bounded by the poll interval (so it is not truly real-time), it sees only the latest state of each row and so misses intermediate changes between polls, and, the fatal flaw, a plain timestamp poll cannot see hard deletes, because a deleted row is simply gone and matches no query.

Trigger-based. You attach database triggers to each table so that every insert, update, or delete also writes a record into an audit (shadow) table, which you then drain. This captures deletes, and because the trigger runs inside the source transaction, the audit record is consistent with the write. The price is that every write now does extra work synchronously, which adds latency to the application's own transactions, and the triggers become a maintenance burden as schemas evolve.
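
The difference between the query-based and trigger-based methods is easy to see in a small experiment. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the orders table, the orders_audit table, the trigger names, and the helper functions poll_changes and drain_audit are all hypothetical names chosen for illustration, and an integer version column stands in for updated_at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        status  TEXT NOT NULL,
        version INTEGER NOT NULL   -- monotonic counter, stand-in for updated_at
    );
    -- Trigger-based capture: every write also lands in an audit table.
    CREATE TABLE orders_audit (
        seq      INTEGER PRIMARY KEY AUTOINCREMENT,
        op       TEXT NOT NULL,
        order_id INTEGER NOT NULL,
        status   TEXT
    );
    CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
        INSERT INTO orders_audit (op, order_id, status)
        VALUES ('insert', NEW.id, NEW.status);
    END;
    CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_audit (op, order_id, status)
        VALUES ('update', NEW.id, NEW.status);
    END;
    CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
        INSERT INTO orders_audit (op, order_id, status)
        VALUES ('delete', OLD.id, NULL);
    END;
""")


def poll_changes(conn, last_seen):
    """Query-based capture: rows whose version is newer than last_seen."""
    return conn.execute(
        "SELECT id, status, version FROM orders "
        "WHERE version > ? ORDER BY version",
        (last_seen,),
    ).fetchall()


def drain_audit(conn, last_seq):
    """Trigger-based capture: audit records written after last_seq."""
    return conn.execute(
        "SELECT seq, op, order_id, status FROM orders_audit "
        "WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()


conn.execute("INSERT INTO orders VALUES (1, 'new', 1)")
conn.execute("INSERT INTO orders VALUES (2, 'new', 2)")
first_poll = poll_changes(conn, 0)              # sees both new rows
last_seen = max(version for _, _, version in first_poll)

conn.execute("UPDATE orders SET status = 'paid', version = 3 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

second_poll = poll_changes(conn, last_seen)
audit_ops = [(op, order_id) for _, op, order_id, _ in drain_audit(conn, 0)]

print(second_poll)  # [(1, 'paid', 3)]  -- the delete of order 2 is invisible
print(audit_ops)    # [('insert', 1), ('insert', 2), ('update', 1), ('delete', 2)]
```

The second poll returns only the updated row, so a downstream copy would keep order 2 forever. The audit table records the delete, at the price of one extra insert inside every source write.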

Why log-based won for streaming

For real-time pipelines, log-based CDC is the default, and the reasons stack up. It is low-impact on the source because it taps a log the database writes anyway. It is complete: deletes and updates appear just like inserts. It preserves commit order, which matters when you are reconstructing state downstream. And it is low-latency: changes typically surface within milliseconds to a few seconds of commit, rather than on a poll interval. This is exactly the niche Debezium occupies: it reads Postgres, MySQL, MongoDB, and other logs and publishes the changes, most commonly into Apache Kafka, where a whole ecosystem of consumers can react.
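
To make the shape of such a change stream concrete, here is a minimal sketch that replays events in log order to rebuild table state. The envelope (an op code of c, u, or d plus before and after row images) is loosely modeled on Debezium's event format but heavily simplified; the ChangeEvent class, its lsn and key fields, and the replay function are hypothetical names for illustration, not any library's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    lsn: int                 # position in the source log (hypothetical field)
    op: str                  # "c" = create, "u" = update, "d" = delete
    key: int                 # primary key of the affected row
    before: Optional[dict]   # row image before the change (None for creates)
    after: Optional[dict]    # row image after the change (None for deletes)


def replay(events):
    """Fold events, in log order, into a dict keyed by primary key."""
    state = {}
    for ev in sorted(events, key=lambda e: e.lsn):
        if ev.op == "d":
            state.pop(ev.key, None)       # deletes are first-class events
        elif ev.op in ("c", "u"):
            state[ev.key] = ev.after      # creates and updates carry the new row
        else:
            raise ValueError(f"unknown op: {ev.op!r}")
    return state


events = [
    ChangeEvent(1, "c", 1, None, {"id": 1, "status": "new"}),
    ChangeEvent(2, "c", 2, None, {"id": 2, "status": "new"}),
    ChangeEvent(3, "u", 1, {"id": 1, "status": "new"}, {"id": 1, "status": "paid"}),
    ChangeEvent(4, "d", 2, {"id": 2, "status": "new"}, None),
]

replica = replay(events)
print(replica)  # {1: {'id': 1, 'status': 'paid'}}
```

Because the delete arrives as an ordinary event in its log position, the replica ends up with exactly the rows the source has, which is the completeness and ordering property described above.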

The canonical use is replicating an OLTP² database into an analytical store: stream changes from the operational Postgres behind your app into a data warehouse³ or a data lake⁴ so analysts query fresh data without ever touching production. The same stream feeds search indexes, caches, and microservices — one change log, many subscribers.

🔗 Learn more — ² OLTP vs OLAP: two opposite jobs

🔗 Learn more — ³ What is a data warehouse?

🔗 Learn more — ⁴ What is a data lake?

The honest caveats

CDC is not free. Schema changes are the perennial pain: when someone adds or drops a column on the source, the change stream's shape shifts mid-flight and naive consumers break. Handling schema evolution gracefully is a real engineering task, not a checkbox. Deletes demand a decision: do you hard-delete downstream to match, or keep a soft-delete tombstone for audit and history? The two give very different downstream tables. And ordering and exactly-once delivery are subtle: across partitions, or after a connector restart and replay, you can get events out of order or duplicated, so consumers generally have to be idempotent⁵. In practice that means applying each change by primary key and keeping only the latest version of a row, so that re-processing the same change twice is harmless. Get those three right and CDC turns a brittle nightly batch into a continuous, low-load mirror of your database. Get them wrong and you have a fast pipeline that quietly drifts out of sync.
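
One common way to get that idempotence is a version-guarded upsert. The sketch below assumes each event carries a per-row version that only increases (a log position would do) and uses soft-delete tombstones; the apply_change function, the event fields, and the in-memory dict standing in for the destination table are all hypothetical and chosen for illustration.

```python
def apply_change(table, event):
    """Apply one change event to `table` (a dict keyed by primary key).

    Returns True if the event changed the table, False if it was
    stale or a duplicate and was therefore ignored.
    """
    key, version = event["key"], event["version"]
    current = table.get(key)
    if current is not None and current["version"] >= version:
        return False  # this version, or a newer one, was already applied
    if event["op"] == "delete":
        # Soft delete: the tombstone keeps the version, so a late,
        # older update cannot resurrect the row.
        table[key] = {"version": version, "deleted": True, "row": None}
    else:
        table[key] = {"version": version, "deleted": False, "row": event["row"]}
    return True


events = [
    {"key": 1, "version": 10, "op": "upsert", "row": {"status": "new"}},
    {"key": 1, "version": 12, "op": "delete", "row": None},
    {"key": 1, "version": 11, "op": "upsert", "row": {"status": "paid"}},  # arrives late
    {"key": 1, "version": 12, "op": "delete", "row": None},                # replayed duplicate
]

table = {}
results = [apply_change(table, ev) for ev in events]

print(results)  # [True, True, False, False]
print(table)    # {1: {'version': 12, 'deleted': True, 'row': None}}
```

The late update and the replayed delete are both dropped, so the table converges to the same state however many times the stream is replayed. A hard-delete design would need to remember the deleted row's version somewhere else to get the same protection.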

🔗 Learn more — ⁵ What is idempotency (in data pipelines)?