What is ETL (and how is ELT different)?

ETL stands for Extract, Transform, Load: the three stages of moving data from where it is produced to where it is analyzed. You pull data out of source systems (extract), reshape and clean it (transform), and write the result into a destination like a data warehouse¹ (load). The order of those words is not decoration — it says transformation happens before the data lands. ELT, the more common pattern today, simply swaps the last two letters and the whole architecture along with them.

🔗 Learn more — ¹ What is a data warehouse?

The three stages

Every analytics pipeline does these three jobs, regardless of the order:

Extract — read data out of source systems: an application's OLTP² database, a SaaS API, log files, a stream. This is mostly about reliability and not overloading the source, often via incremental reads or change data capture³ rather than full dumps.
Transform — clean, deduplicate, type-cast, join, and aggregate the raw data into a model fit for analysis. This is where business logic lives: what counts as an "active user," how revenue is recognized, which currency to normalize to.
Load — write the result into the destination, typically a columnar data warehouse or a data lake⁴.

🔗 Learn more — ² OLTP vs OLAP: two opposite jobs

🔗 Learn more — ³ What is Change Data Capture (CDC)?

🔗 Learn more — ⁴ What is a data lake?

The interesting question is when the transform step runs. That single choice is the difference between ETL and ELT.

The ETL to ELT shift, and why it happened

In classic ETL the transform runs on a dedicated machine between the source and the destination. You pulled raw data into a staging server, ran transformation jobs there, and loaded only the finished, modelled tables into the warehouse. This made sense when warehouse storage and compute were scarce and expensive: you did not want to pay to store raw junk or to compute heavy joins inside the warehouse, so you cleaned data outside it and loaded only the polished result.

Cheap cloud changed the economics. Object storage made it nearly free to keep raw data, and warehouses like BigQuery⁵, Snowflake, and Redshift⁶ separated storage from compute and made large SQL transforms fast and on-demand. Once transforming data inside the warehouse was cheaper and simpler than maintaining a separate transform tier, the order flipped:

🔗 Learn more — ⁵ What is BigQuery?

🔗 Learn more — ⁶ What is Amazon Redshift?

flowchart TD
    SRC["Source systems"] --> E["Extract"]

    E --> ETL_T["ETL: Transform on a separate engine"]
    ETL_T --> ETL_L["Load modelled data"]
    ETL_L --> WH1["Warehouse holds clean tables only"]

    E --> ELT_L["ELT: Load raw data"]
    ELT_L --> WH2["Warehouse holds raw data"]
    WH2 --> ELT_T["Transform with SQL inside the warehouse"]

    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    classDef alt stroke:#ebcb8b,stroke-width:2.5px
    %% color = green: ELT path, amber: classic ETL path, grey: shared
    class SRC,E plain
    class ETL_T,ETL_L,WH1 alt
    class ELT_L,WH2,ELT_T key

ELT (Extract, Load, Transform) loads raw data into the warehouse first, then transforms it in place with SQL. The raw data stays around, so you can re-derive your models any time the logic changes, and tools like dbt⁷ let analysts express transformations as version-controlled SQL instead of a separate ETL application. For most cloud OLAP stacks this is now the default.

🔗 Learn more — ⁷ What is dbt?

Where each still fits

ELT is not strictly better; it is better when the destination has cheap, elastic compute and you are allowed to land raw data there. ETL still wins in real cases:

Regulatory or PII constraints — if certain raw fields must never be stored in the warehouse, you transform (mask, drop, tokenize) before loading. You cannot load-then-transform data you are not permitted to load.
Heavy non-SQL transforms — work that is awkward in SQL (complex parsing, ML feature generation, image or text processing) often runs better on a dedicated engine like Spark⁸ before loading.
Streaming pipelines — transforming events in flight, before they reach a serving store, is transform-then-load by nature.
Expensive or rigid destinations — if warehouse compute is metered steeply, doing transforms elsewhere can be cheaper.

🔗 Learn more — ⁸ What is Apache Spark?

The honest tradeoff

ELT buys flexibility and simplicity at the cost of storing more data and pushing transform compute (and its bill) into the warehouse. You keep raw data, so mistakes are recoverable and new questions are answerable without re-extracting — but you also store data you may never use, and a careless SQL model can run up a real compute invoice. ETL buys a smaller, cleaner warehouse and tight control over what lands, at the cost of an extra system to run and the loss of the raw data you discarded on the way in.

The short version: ETL transforms before the warehouse, ELT transforms inside it. Cheap, separable cloud compute is what made ELT the default — but the moment your constraint is what you are allowed to store or what SQL can't do, ETL is still the right shape.