What is a data warehouse?

A data warehouse is a database built specifically for analytics. It is where a company's data goes to be queried at scale — every order ever placed, joined and aggregated for a dashboard — without touching the live application database. The defining fact about a warehouse is that it is engineered the opposite way from your app's database: it is an OLAP¹ system, and that one choice shapes everything about it.

🔗 Learn more — ¹ OLTP vs OLAP: two opposite jobs

An OLAP database, by design

Your app's database is OLTP: many small reads and writes of whole rows, optimized for low latency per operation. A warehouse serves the other workload — few queries, each scanning enormous numbers of rows but only a handful of columns, aggregating as it goes. To do that well it makes choices an OLTP database never would:

Columnar storage. Data is laid out by column, not by row, so a query that touches 5 of 200 columns reads only those 5 off disk. This is the same idea Parquet² exploits on the file level, applied to the whole engine.
Massively parallel processing (MPP). A single query is split across many machines (or many cores), each scanning a slice of the data, with the partial results merged at the end. Scans scale out horizontally.
Built to read, not to mutate. Warehouses optimize bulk loads and giant scans, and are comparatively slow at single-row inserts and updates — exactly backwards from OLTP, on purpose.

🔗 Learn more — ² How Parquet works: columnar storage explained

How data gets there, and how it is shaped

A warehouse does not collect data itself; data is moved in from the source systems by a pipeline, then modelled for analytics.

flowchart TD
    SRC["Sources: app databases, APIs, event streams"] --> ELT["ETL / ELT pipeline — extract, load, transform"]
    ELT --> WH["Data warehouse — columnar, MPP"]
    WH --> STAR["Star schema: fact tables + dimension tables"]
    STAR --> BI["BI tools, SQL, ML"]

    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    class WH key
    class SRC,ELT,STAR,BI plain

Inside, analytical data is usually arranged as a star schema³: a central fact table (one row per event — a sale, a click) surrounded by dimension tables (customer, product, date) it references. Unlike an OLTP database, which normalizes aggressively to avoid duplication, a warehouse denormalizes — repeating data to avoid expensive joins at query time. Read speed beats storage efficiency here, because storage is cheap and the queries are huge.

🔗 Learn more — ³ Star schema and dimensional modeling

Modern cloud warehouses (Snowflake, BigQuery⁴, Redshift⁵) add one more idea: separating storage from compute, so data sits cheaply in object storage and you spin up compute only when a query runs — which is also the seam where the warehouse and the lakehouse start to blur.

🔗 Learn more — ⁴ What is BigQuery?

🔗 Learn more — ⁵ What is Amazon Redshift?

The honest take on cost

The warehouse is the right place to serve interactive analytics: data on fast local storage, a real buffer pool, a planner built for big scans. That is exactly where querying files straight out of a lake on every request falls down, and why I argue for keeping the lake and the warehouse separate.

But "warehouse" and "expensive cloud subscription" are not synonyms, even though the marketing treats them that way. The managed clouds bill per credit or per byte scanned, and the bill grows with use in a way that quietly becomes one of the larger line items in a data org. You do not always need them: ClickHouse⁶ (open source, C++) delivers warehouse-grade columnar performance you can self-host, and DuckDB gives you an in-process columnar engine that handles a startling amount of analytics on a single machine with no server at all. Reach for a paid cloud warehouse when you genuinely need its scale and operability — not by reflex.

🔗 Learn more — ⁶ What is ClickHouse?

The short version: a data warehouse is an OLAP database — columnar, MPP, denormalized into star schemas — that serves analytics so your app's database does not have to. It is the fast serving layer of an analytics stack; just remember the performance is the product, and the cloud invoice is optional.