What is a data pipeline?

A data pipeline is the plumbing that moves data from where it is produced to where it is useful, applying whatever cleaning, reshaping, and combining it needs along the way. Instead of someone manually exporting a spreadsheet every Monday, a pipeline does it automatically, repeatedly, and on a defined trigger — pulling from sources, transforming the data, landing it somewhere durable, and exposing it to whoever consumes it. The word "pipeline" is literal: data flows through a series of connected stages, each one taking the previous stage's output as its input.

The stages: source to serve

Almost every pipeline, however it is built, walks through the same five stages:

flowchart TD
    SRC["Source: app DB, API, event stream, files"] --> ING["Ingest: pull or receive raw data"]
    ING --> TF["Transform: clean, type, join, aggregate"]
    TF --> ST["Store: warehouse, lake, or table format"]
    ST --> SV["Serve: dashboards, APIs, ML, reports"]

    %% color = reliability hinge of the pipeline
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    classDef warn stroke:#ebcb8b,stroke-width:2.5px
    class SRC,ING,ST,SV plain
    class TF key

Source. The system that produces data — an application database, a third-party API, an event stream, an SFTP drop of files. You usually do not control it, which is the root of most pipeline pain.
Ingest. Getting the raw data out and into your control, either by pulling it on a schedule or receiving it as it arrives. This is where ETL¹ and ELT differ: ETL transforms the data before loading it into the destination, while ELT loads it raw and transforms it inside the destination.
Transform. The actual work: cleaning nulls, enforcing types, deduplicating, joining datasets, aggregating into the shapes consumers need. This is the stage that holds business logic, which is why the diagram marks it as the hinge: get it wrong and every downstream stage still reports success while serving wrong numbers.
Store. Landing the result somewhere durable and queryable — typically a data warehouse² for modelled analytics, or a lake for cheap open storage.
Serve. Exposing the data: BI dashboards, an API, a feature store for ML, a scheduled report. The stage that justifies the other four.

🔗 Learn more — ¹ What is ETL (and how is ELT different)?

🔗 Learn more — ² What is a data warehouse?
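
To make the five stages concrete, here is a minimal sketch in Python that treats each stage as a function taking the previous stage's output. Everything in it is an assumption for illustration: the `RAW_ORDERS` records, the function names, and the plain dict standing in for a warehouse. A real pipeline would replace each of these with a connector, a transformation framework, and real storage.

```python
from collections import defaultdict

# Hypothetical raw records, standing in for what a source might emit.
RAW_ORDERS = [
    {"order_id": "1", "customer": "ada", "amount": "19.99"},
    {"order_id": "2", "customer": "bob", "amount": "5.00"},
    {"order_id": "2", "customer": "bob", "amount": "5.00"},  # duplicate row
    {"order_id": "3", "customer": "ada", "amount": None},     # missing amount
]


def ingest(source):
    """Ingest: copy raw records out of the source without changing them."""
    return [dict(row) for row in source]


def transform(rows):
    """Transform: drop nulls, deduplicate, enforce types, aggregate."""
    seen = set()
    totals = defaultdict(float)
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        totals[row["customer"]] += float(row["amount"])
    return {customer: round(total, 2) for customer, total in totals.items()}


def store(table, warehouse):
    """Store: land the result somewhere durable (a dict stands in here)."""
    warehouse["revenue_by_customer"] = table
    return warehouse


def serve(warehouse, customer):
    """Serve: answer a consumer's question from the stored result."""
    return warehouse["revenue_by_customer"].get(customer, 0.0)


# Each stage consumes the previous stage's output.
warehouse = store(transform(ingest(RAW_ORDERS)), {})
print(serve(warehouse, "ada"))
```

Note that the business rules (what counts as a duplicate, what to do with a missing amount) all live in `transform`, which is why a mistake there flows unchecked into everything served.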

Batch vs streaming

Pipelines split into two shapes by when they move data. A batch pipeline processes data in chunks on a schedule — every hour, every night — and is the default for analytics because it is simpler to reason about, cheaper to run, and easy to re-run when something breaks. Batch processing³ trades freshness for simplicity: your numbers are correct as of the last run, not as of this second.

🔗 Learn more — ³ Batch vs stream processing

A streaming pipeline processes events continuously as they arrive, usually off a log like Kafka⁴. That buys freshness measured in seconds at the cost of real operational complexity: handling out-of-order and late events, exactly-once semantics, and state that must be maintained for as long as the stream runs. Stream processing earns its complexity only when staleness genuinely costs money (fraud detection, live inventory); for most reporting, batch is the honest choice.

🔗 Learn more — ⁴ What is Apache Kafka?
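
The out-of-order and late-event problem is easier to see in code. The sketch below counts events in fixed one-minute windows keyed by event time, and drops anything that arrives too far behind the newest event seen. It is a deliberately simplified illustration, not how any particular stream processor works: `WindowedCounter`, the window size, and the lateness allowance are all assumed names and values, and the "watermark" here is just the highest event time seen so far.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # size of each window (assumed)
ALLOWED_LATENESS = 30  # how far behind the watermark an event may arrive (assumed)


class WindowedCounter:
    """Count events per event-time window, rejecting events that are too late."""

    def __init__(self):
        self.counts = defaultdict(int)  # window start time -> number of events
        self.watermark = 0              # highest event time seen so far
        self.dropped = 0                # events rejected as too late

    def process(self, event_time):
        self.watermark = max(self.watermark, event_time)
        if event_time < self.watermark - ALLOWED_LATENESS:
            self.dropped += 1
            return False
        window_start = event_time - (event_time % WINDOW_SECONDS)
        self.counts[window_start] += 1
        return True


counter = WindowedCounter()
# Event times in arrival order: 50 arrives after 70 (out of order but within
# the allowance); 65 arrives after 130 (too late, so it is dropped).
for event_time in [10, 70, 50, 130, 65]:
    counter.process(event_time)

print(dict(counter.counts), counter.dropped)
```

Even this toy version has to hold state across events and decide what to do with data that shows up after its window should have closed. A batch job sidesteps both problems by waiting until the data is complete.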

What keeps it from breaking

A pipeline that runs once is a script. A pipeline you can trust at 3am is engineered. The properties that separate the two:

Orchestration. Something has to decide what runs, when, and in what order — running transform only after ingest succeeds, and skipping serve if transform failed. This is the job of an orchestrator, which models the pipeline as a dependency graph and handles scheduling so steps don't run on stale or missing inputs.
Idempotency⁵. Re-running a step on the same input must leave the destination in the same state as running it once, with no duplicate rows or double-counted totals. Idempotency is what makes retries safe: without it, the automatic recovery that should heal a pipeline instead corrupts it.
Retries. Transient failures (a timed-out API, a flaky network) are normal. A reliable pipeline retries them automatically with backoff instead of paging a human for a problem that fixes itself.
Observability. You need to know a run happened, how long it took, how many rows moved, and whether anything looked wrong — before a consumer notices the dashboard is empty. Logging, metrics, and data-quality checks are not optional polish; they are how you find out the pipeline broke.
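
Orchestration and retries can be sketched together in a few lines. The example below is a toy under stated assumptions, not a substitute for a real orchestrator: `run_with_retries` and `run_pipeline` are hypothetical helpers, the task names are made up, and the delays are kept tiny so the example runs quickly. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to order steps by dependency, retries only error types assumed to be transient, and skips anything downstream of a failure.

```python
import time
from graphlib import TopologicalSorter


def run_with_retries(task, attempts=3, base_delay=0.01,
                     retry_on=(TimeoutError, ConnectionError)):
    """Call task(); on a transient error, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return task()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the failure surface
            time.sleep(base_delay * 2 ** attempt)


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order; skip steps whose inputs did not succeed.

    tasks: name -> zero-argument callable
    dependencies: name -> set of step names that must succeed first
    """
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        if any(status.get(dep) != "ok" for dep in dependencies.get(name, ())):
            status[name] = "skipped"
            continue
        try:
            run_with_retries(tasks[name])
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
    return status


calls = {"ingest": 0}


def flaky_ingest():
    # Fails once with a transient error, then succeeds on the retry.
    calls["ingest"] += 1
    if calls["ingest"] < 2:
        raise TimeoutError("source timed out")


def broken_transform():
    # Not a transient error, so it is not retried.
    raise ValueError("unexpected column type")


dependencies = {"ingest": set(), "transform": {"ingest"}, "serve": {"transform"}}
tasks = {"ingest": flaky_ingest, "transform": broken_transform, "serve": lambda: None}

status = run_pipeline(tasks, dependencies)
print(status)
```

The returned status map doubles as a crude form of observability: it records that every step was considered and what happened to each, which is the minimum you need before adding timings and row counts.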

🔗 Learn more — ⁵ What is idempotency (in data pipelines)?
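
One common way to make a load step idempotent is to replace a whole partition inside a single transaction instead of appending to it. The sketch below shows the idea with the standard library's `sqlite3` module. The table name `daily_totals`, its columns, the run date, and the helper `load_daily_totals` are assumptions made for illustration; a warehouse would typically offer its own partition-overwrite or merge mechanism.

```python
import sqlite3


def load_daily_totals(conn, run_date, totals):
    """Replace the partition for run_date so a re-run overwrites, never appends."""
    with conn:  # one transaction: the delete and inserts commit together or not at all
        conn.execute("DELETE FROM daily_totals WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_totals (run_date, customer, total) VALUES (?, ?, ?)",
            [(run_date, customer, total) for customer, total in totals.items()],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (run_date TEXT, customer TEXT, total REAL)")

totals = {"ada": 19.99, "bob": 5.0}
load_daily_totals(conn, "2024-01-01", totals)
load_daily_totals(conn, "2024-01-01", totals)  # a retry of the same run

rows = conn.execute(
    "SELECT customer, total FROM daily_totals ORDER BY customer"
).fetchall()
print(rows)
```

Had the function been a bare `INSERT`, the retry would have left four rows and doubled every total. With delete-then-insert in one transaction, running it once or ten times leaves the same two rows.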

Where pipelines rot

The honest part: pipelines decay, and they rarely do it loudly. The two classic killers are silent failures and schema drift. A silent failure is a run that "succeeds" while producing wrong or partial data: a transform that swallowed an error, a join that quietly dropped half the rows, a source that returned an empty file the pipeline happily loaded as zero rows. Nothing turns red; the numbers are simply wrong, and you find out weeks later when someone questions a report.

Schema drift is the upstream source changing without telling you — a renamed column, a type change, a new required field. Because you don't own the source, you find out when the pipeline ingests garbage or crashes. The defense is the same for both: validate inputs at ingest, assert on outputs after transform, and alert on the absence of expected data, not just on exceptions. A pipeline that only fails loudly when code throws is a pipeline already rotting quietly in the cases that matter most.
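
The three defenses above can be sketched as plain checks that raise when data is wrong or missing, not only when code throws. This is a minimal illustration under assumptions: the expected schema in `EXPECTED_COLUMNS`, the 90% row-survival threshold, and the names `DataQualityError`, `validate_input`, and `assert_output` are all hypothetical, and a real pipeline would route these errors to an alerting system.

```python
# Assumed contract with the source: column name -> expected Python type.
EXPECTED_COLUMNS = {"order_id": str, "customer": str, "amount": float}


class DataQualityError(Exception):
    """Raised when data is present but wrong, or when expected data is absent."""


def validate_input(rows):
    """Ingest-side check: catch empty input, renamed columns, and type changes."""
    if not rows:
        raise DataQualityError("source returned no rows")
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS.keys() - row.keys()
        if missing:
            raise DataQualityError(f"row {i} is missing columns: {sorted(missing)}")
        for column, expected_type in EXPECTED_COLUMNS.items():
            if not isinstance(row[column], expected_type):
                raise DataQualityError(
                    f"row {i}: {column} is {type(row[column]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return rows


def assert_output(rows_in, rows_out, min_ratio=0.9):
    """Transform-side check: catch a join or filter that quietly dropped rows."""
    if len(rows_out) < min_ratio * len(rows_in):
        raise DataQualityError(
            f"only {len(rows_out)} of {len(rows_in)} rows survived the transform"
        )
    return rows_out


good = [{"order_id": "1", "customer": "ada", "amount": 19.99}]
drifted = [{"order_id": "1", "customer_name": "ada", "amount": 19.99}]  # renamed column

validate_input(good)  # passes silently
for bad_input in ([], drifted):
    try:
        validate_input(bad_input)
    except DataQualityError as error:
        print("blocked:", error)
```

The empty-list case is the important one: no exception would ever be thrown by loading zero rows, so the only way to catch it is to treat the absence of data as a failure in its own right.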