Pipelines & ingestion

6 articles in this category.

18 Jun 2026

Batch vs stream processing

Batch processing handles finite datasets on a schedule for throughput; stream processing handles unbounded events continuously for low latency. When to use each.

18 Jun 2026

What is a DAG (and why orchestrators use them)?

A DAG models a pipeline as tasks (nodes) and dependencies (edges) with no cycles — so a valid run order always exists. Why orchestrators rely on it.

18 Jun 2026

What is a data pipeline?

A data pipeline moves data from sources through ingest, transform, store, and serve — reliably, on a schedule or as a stream. The stages, batch vs streaming, and where pipelines rot.

18 Jun 2026

What is Change Data Capture (CDC)?

CDC streams inserts, updates, and deletes out of a database as they happen, using log-based, query-based, or trigger-based capture, so downstream systems stay in sync.

18 Jun 2026

What is ETL (and how is ELT different)?

ETL extracts data, transforms it, then loads it. ELT loads raw first and transforms inside the warehouse. Why cheap cloud compute flipped the order, and where each still fits.

18 Jun 2026

What is idempotency (in data pipelines)?

An idempotent step gives the same result whether it runs once or ten times — the property that lets a crashed, re-run pipeline stay correct instead of double-counting.