What is the medallion architecture (bronze, silver, gold)?
A layered lakehouse pattern: bronze holds raw ingested data, silver is cleaned and conformed, gold is business-ready aggregates. Refinement flows upward.
The medallion architecture is a convention for organizing a data lakehouse [1] into three layers (bronze, silver, and gold), where data gets progressively cleaner and more useful as it moves up. It is not a technology or a specification; it is a naming scheme, popularized by Databricks [2], for a discipline most mature pipelines already followed: keep the raw input, refine in stages, and serve from the top.
[1] Learn more: What is a data lakehouse?
[2] Learn more: What is Databricks?
The three layers
```mermaid
flowchart TD
    SRC["Sources: apps, APIs, CDC, files"] --> BRONZE["Bronze: raw, as-ingested, append-only"]
    BRONZE --> SILVER["Silver: cleaned, deduplicated, conformed, typed"]
    SILVER --> GOLD["Gold: business aggregates and marts"]
    GOLD --> BI["BI, reports, ML features"]
    %% color = refinement level: amber raw, grey cleaned, green serve-ready
    classDef raw stroke:#ebcb8b,stroke-width:2.5px
    classDef mid stroke:#7b88a1,stroke-width:2.5px
    classDef ready stroke:#a3be8c,stroke-width:2.5px
    class BRONZE raw
    class SILVER mid
    class GOLD,BI ready
    class SRC mid
```
- Bronze is the landing zone: data exactly as it arrived, appended without edits. It is ugly on purpose. Its job is to be a faithful, replayable record of what the source actually sent, so you can reprocess everything downstream without re-ingesting.
- Silver is where the cleaning happens: deduplication, type casting, schema enforcement, joining reference data, applying change-data-capture to get current state. Silver is the trustworthy, queryable model of the business entities.
- Gold is the serving layer: aggregations, denormalized marts, metrics, and feature tables shaped for specific consumers — dashboards, reports, ML.
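To make the three roles concrete, here is a minimal in-memory sketch in plain Python. Everything in it is an assumption made for illustration: the order records, the field names (`order_id`, `region`, `amount`, `ordered_on`), and the helper functions `ingest`, `build_silver`, and `build_gold` are hypothetical, and a real lakehouse would use tables in a storage format rather than Python lists.

```python
from collections import defaultdict
from datetime import datetime

# Bronze: raw records exactly as received (all strings, duplicates included).
bronze = []

def ingest(raw_record):
    """Append a copy of the raw record; bronze rows are never edited in place."""
    bronze.append(dict(raw_record))

def build_silver(bronze_rows):
    """Deduplicate on order_id (the latest arrival wins) and cast types."""
    latest = {}
    for row in bronze_rows:
        latest[row["order_id"]] = row  # later rows overwrite earlier ones
    return [
        {
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().upper(),
            "amount": float(row["amount"]),
            "ordered_on": datetime.strptime(row["ordered_on"], "%Y-%m-%d").date(),
        }
        for row in latest.values()
    ]

def build_gold(silver_rows):
    """Aggregate revenue per region for a hypothetical dashboard."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

# Hypothetical source data, including one duplicate delivery of order 1.
ingest({"order_id": "1", "region": " eu ", "amount": "10.50", "ordered_on": "2024-01-05"})
ingest({"order_id": "2", "region": "us", "amount": "20.00", "ordered_on": "2024-01-06"})
ingest({"order_id": "1", "region": " eu ", "amount": "10.50", "ordered_on": "2024-01-05"})
ingest({"order_id": "3", "region": "EU", "amount": "4.50", "ordered_on": "2024-01-07"})

silver = build_silver(bronze)
gold = build_gold(silver)
```

Bronze keeps all four raw records, including the duplicate. Silver holds three typed, normalized rows, and gold is a small per-region summary derived only from silver.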
Why the separation earns its keep
The value is not the medals; it is reproducibility and clear contracts between stages. Because bronze keeps the raw input untouched, any bug in cleaning or aggregation is fixable by replaying from bronze, so you never have to beg the source for last quarter's data again. Each layer is a stable interface: analysts build on silver, executives consume gold, and a change in raw format is absorbed in the bronze-to-silver step without rippling everywhere at once.
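A small sketch of what replaying from bronze can look like, under invented assumptions: the raw events carry amounts in cents, the first cleaning function (`clean_v1`) mishandles them, and the corrected one (`clean_v2`) is rerun over the same untouched bronze records. All names and data here are hypothetical.

```python
# Hypothetical raw events kept untouched in bronze; amounts arrive in cents.
BRONZE = (
    {"order_id": "1", "amount_cents": "1050"},
    {"order_id": "2", "amount_cents": "2000"},
)

def clean_v1(row):
    # Buggy cleaning step: treats cents as whole currency units.
    return {"order_id": int(row["order_id"]), "amount": float(row["amount_cents"])}

def clean_v2(row):
    # Corrected cleaning step: converts cents to currency units.
    return {"order_id": int(row["order_id"]), "amount": int(row["amount_cents"]) / 100}

def rebuild(clean):
    """Recompute silver and gold from bronze using the given cleaning function."""
    silver = [clean(row) for row in BRONZE]
    gold = {"total_revenue": sum(row["amount"] for row in silver)}
    return silver, gold

# First run ships the bug; the fix only needs a rerun over the same bronze data.
_, gold_before = rebuild(clean_v1)
_, gold_after = rebuild(clean_v2)
```

The source system is never contacted again: the fix is a code change plus a rerun, which is only possible because bronze was stored as received rather than already cleaned.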
It also separates concerns that otherwise tangle. Ingestion reliability lives at bronze, data-quality logic at silver, and business semantics at gold. Each can be owned, tested, and reasoned about on its own.
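One way to make a layer testable on its own is to write its contract down and check rows against it. The sketch below assumes a hypothetical silver `orders` table and an invented helper, `contract_violations`; it is an illustration of the idea, not a reference to any particular data-quality tool.

```python
from datetime import date

# Hypothetical contract for a silver "orders" table: column name -> expected type.
SILVER_ORDERS_CONTRACT = {
    "order_id": int,
    "region": str,
    "amount": float,
    "ordered_on": date,
}

def contract_violations(rows, contract):
    """Return readable violations; an empty list means every row conforms."""
    problems = []
    for i, row in enumerate(rows):
        missing = contract.keys() - row.keys()
        extra = row.keys() - contract.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")
        for column, expected in contract.items():
            if column in row and not isinstance(row[column], expected):
                actual = type(row[column]).__name__
                problems.append(f"row {i}: {column} is {actual}, expected {expected.__name__}")
    return problems

good_rows = [{"order_id": 1, "region": "EU", "amount": 10.5, "ordered_on": date(2024, 1, 5)}]
bad_rows = [{"order_id": "1", "region": "EU", "amount": 10.5}]

good_report = contract_violations(good_rows, SILVER_ORDERS_CONTRACT)
bad_report = contract_violations(bad_rows, SILVER_ORDERS_CONTRACT)
```

A check like this belongs to whoever owns silver and can run without knowing how bronze was ingested or how gold aggregates, which is the separation of concerns in practice.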
The honest caveat
Three layers is a default, not a law. A small pipeline moving one clean source into one table does not need three physical copies of the data and the storage and orchestration that implies — that is ceremony, not architecture. The pattern also tempts layer sprawl: bronze-bronze, pre-silver, "platinum," a maze of intermediate tables nobody can map.
Take the discipline, not the dogma. The real lesson is older than the branding: keep your raw data immutable and replayable, refine in explicit stages with clear contracts, and serve from a layer shaped for the consumer. Whether you call them bronze, silver, and gold — or raw, clean, and serving — is cosmetic. The staging is the point.