Star schema and dimensional modeling

A star schema is the standard way analytical data is laid out inside a data warehouse¹: one central fact table of measurable events, surrounded by dimension tables that describe the context of those events. Drawn out, the fact sits in the middle with dimensions radiating off it — hence "star." It is the physical output of dimensional modeling, a design discipline for shaping data so that humans can query it and the engine can aggregate it fast.

🔗 Learn more — ¹ What is a data warehouse?

Facts and dimensions

A fact table holds the things you measure. Each row is one event — a sale, a click, a shipment — and the columns are foreign keys to dimensions plus numeric measures (amount, quantity, duration). It is long and narrow, and it grows forever: one row per event, billions of rows over time.

A dimension table holds the descriptive context — the who, what, when, and where. A product dimension, a customer dimension, a date dimension. These are short and wide: relatively few rows, but many text and attribute columns you filter and group by ("revenue by region by month").

The fact answers "how much"; the dimensions answer "by what." A query joins the fact to a handful of dimensions, filters on dimension attributes, and sums a measure. That is the entire shape of most BI.

Grain, the decision that comes first

Before anything else you fix the grain of the fact table — exactly what one row represents. One row per order line? Per order? Per daily-aggregated total? The grain is a contract: every measure must be true at that level, and every dimension must apply to it. Getting the grain wrong is the expensive mistake in dimensional modeling, because it is baked into every row you load afterward. Pick the finest grain the source data supports; you can always aggregate up, never down.

Why denormalize on purpose

An application database normalizes aggressively — split data into many small tables so nothing is duplicated, which is correct for OLTP² writes. A star schema does the opposite: it denormalizes, repeating descriptive data so the warehouse touches fewer tables per query. Fewer joins, simpler query plans, faster aggregation. Storage is cheap; analyst time and scan latency are not. This is a deliberate trade against the normalization³ rules, made because the workload is OLAP read-heavy scans, not transactional writes.

🔗 Learn more — ² OLTP vs OLAP: two opposite jobs

🔗 Learn more — ³ Normalization vs denormalization

flowchart TD
    DIM_DATE["Dim: Date"] --> FACT["Fact: Sales — keys + measures"]
    DIM_PROD["Dim: Product"] --> FACT
    DIM_CUST["Dim: Customer"] --> FACT
    DIM_STORE["Dim: Store"] --> FACT

    %% color = role: green = fact table, grey = dimension tables
    classDef fact stroke:#a3be8c,stroke-width:2.5px
    classDef dim stroke:#7b88a1,stroke-width:2.5px
    class FACT fact
    class DIM_DATE,DIM_PROD,DIM_CUST,DIM_STORE dim

The edges point from each dimension into the fact: the fact carries the foreign keys, so the relationship runs dimension-to-fact.

Snowflake, the normalized variant

A snowflake schema is what you get when you normalize the dimensions instead of leaving them flat — the product dimension splits into product, then category, then department, each its own table. The diagram sprouts branches off the points of the star, hence the name. It saves a little storage and removes some redundancy, at the cost of more joins per query and a model that is harder to read. In practice most teams keep dimensions flat (true star) unless a dimension is genuinely huge or volatile; the snowflake's normalization rarely pays for the joins it adds.

Kimball lineage and the honest limits

Dimensional modeling is the body of work associated with Ralph Kimball, whose books defined facts, dimensions, grain, and patterns like slowly changing dimensions (how you track a customer's address changing over time without losing history). The vocabulary here — fact, dimension, grain, conformed dimension — is attribute, not marketing: it is the shared language data engineers actually use.

The honest scope: star schemas are built for BI and OLAP, where queries scan and aggregate. They are a poor fit for OLTP, where you want normalized tables and single-row writes — modeling your application database as a star would be a mistake. And modern columnar warehouses have softened the pattern: column pruning and cheap joins mean a few extra joins or a "one big table" wide layout can perform fine, so the rigid star is less mandatory than it was in the row-store era. But it has not died. The star still gives you the clearest mental model of a business process, and that legibility — not just query speed — is why dimensional modeling endures.

The short version: a star schema is one fact table of events surrounded by dimension tables of context, denormalized on purpose for fast analytics — pick the grain first, flatten the dimensions, and reach for it when the job is BI, not transactions.

Facts and dimensions

Grain, the decision that comes first

Why denormalize on purpose

Snowflake, the normalized variant

Kimball lineage and the honest limits

Sources