What is Databricks?
Databricks is a cloud data and AI platform from the creators of Apache Spark, built around the lakehouse: Spark compute over open table formats.
Databricks is a cloud data and AI platform built by the original creators of Apache Spark [1]. Its whole pitch is the lakehouse [2]: instead of separating a cheap data lake [3] from an expensive warehouse, you run Spark compute directly over open table formats sitting in object storage, and govern it all from one catalog. The platform bundles managed Spark, notebooks, a SQL engine, an ML toolkit, and governance into a single product available on AWS, Azure, and GCP.
🔗 Learn more — 1 What is Apache Spark?
🔗 Learn more — 2 What is a data lakehouse?
🔗 Learn more — 3 What is a data lake?
That coherence is the appeal. It is also, from a lakehouse-skeptic seat, the thing to watch: Databricks is selling one platform for ingestion, transformation, BI, and machine learning, and that story deserves scrutiny before you commit a whole data org to it.
Managed Spark, notebooks, and Delta Lake
At the core are managed Spark clusters. You request compute; Databricks provisions JVM-based Spark workers, autoscales them, and tears them down when idle. You drive the cluster from collaborative notebooks in Python, SQL, Scala, or R. This removes the genuine pain of running Spark yourself, and that pain was real.
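To make "request compute" concrete, here is a minimal sketch of the kind of request body the Databricks Clusters API accepts when creating a cluster. The field names (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`) follow that API as far as I know; the helper function, the cluster name, the runtime label, and the instance type are illustrative placeholders, so check the current API reference before relying on any of them.

```python
import json


def build_cluster_spec(name: str, min_workers: int, max_workers: int,
                       idle_minutes: int) -> dict:
    """Return a dict shaped like a Clusters API create request (hypothetical helper)."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "15.4.x-scala2.12",  # placeholder runtime label
        "node_type_id": "i3.xlarge",          # placeholder AWS instance type
        # Autoscaling: the platform adds or removes workers within this range.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Idle teardown: terminate the cluster after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("nightly-etl", min_workers=2, max_workers=8, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

The two knobs that matter for the bill are the autoscale range and the idle timeout: together they bound how many workers you can be charged for and for how long.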
The storage layer is Delta Lake [4], an open table format Databricks created and open-sourced. Delta adds a transaction log on top of Parquet [5] files, which buys you ACID [6] writes, schema enforcement, and time travel over plain files in a data lake. Delta is one of Databricks' real gifts to the ecosystem; you can use it outside Databricks entirely.
🔗 Learn more — 4 What is Delta Lake (and how does it compare to Iceberg)?
🔗 Learn more — 5 How Parquet works: columnar storage explained
🔗 Learn more — 6 What is ACID (database transactions)?
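The transaction-log idea is easier to see in miniature. The sketch below is a toy model, not Delta Lake's implementation or API: data files are immutable, every commit appends one log entry listing the files it adds and removes, and a reader reconstructs any version by replaying the log. Real Delta keeps its commits as files in a `_delta_log` directory next to the Parquet data; the `ToyTable` class and file names here are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model: immutable data files plus an append-only commit log."""
    files: dict = field(default_factory=dict)  # file name -> list of rows
    log: list = field(default_factory=list)    # one entry per committed version

    def commit(self, add=None, remove=()):
        """Publish a new version that adds and/or removes data files."""
        add = add or {}
        # Data files are written first; readers ignore them until they are logged.
        self.files.update(add)
        # Appending the log entry is the single step that makes the change visible.
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1  # version number of this commit

    def read(self, version=None):
        """Replay the log up to `version` and return rows from the live files."""
        if version is None:
            version = len(self.log) - 1
        live = []
        for entry in self.log[: version + 1]:
            live = [f for f in live if f not in entry["remove"]] + entry["add"]
        return [row for f in live for row in self.files[f]]


table = ToyTable()
v0 = table.commit(add={"part-0.parquet": [{"id": 1}, {"id": 2}]})
v1 = table.commit(add={"part-1.parquet": [{"id": 3}]})
# An overwrite is one log entry that removes the old files and adds a new one.
v2 = table.commit(add={"part-2.parquet": [{"id": 9}]},
                  remove=["part-0.parquet", "part-1.parquet"])

print(table.read())            # rows in the latest version
print(table.read(version=v1))  # time travel: rows as of version 1
```

Because old files are never modified in place, time travel costs nothing extra at write time: an earlier version is just a shorter prefix of the same log.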
The catch is the engine underneath. Spark is powerful but JVM-heavy, and you pay Databricks' own usage charge, metered in Databricks Units (DBUs), on top of the cloud provider's bill for the underlying VMs. For a lot of mid-sized work the Spark tax is hard to justify when a single-node engine would finish faster and cheaper.
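The arithmetic behind that "tax" can be sketched in a few lines. Databricks meters its own charge in Databricks Units (DBUs), billed in addition to what the cloud provider charges for the VMs. Every rate and runtime below is a made-up placeholder chosen to show the shape of the calculation, not a published price or a benchmark result.

```python
def cluster_hourly_cost(nodes, vm_rate, dbus_per_node_hour, dbu_price):
    """Hourly cost of a cluster: cloud VM charge plus platform (DBU) charge."""
    cloud_charge = nodes * vm_rate
    platform_charge = nodes * dbus_per_node_hour * dbu_price
    return cloud_charge + platform_charge


# Every number below is a hypothetical placeholder, not a real price or benchmark.
cluster_hourly = cluster_hourly_cost(nodes=4, vm_rate=0.50,
                                     dbus_per_node_hour=1.0, dbu_price=0.25)
cluster_job = cluster_hourly * 0.5  # assume the job takes 30 minutes on the cluster
single_job = 1.00 * 0.5             # assume one larger VM at 1.00/hour, same 30 minutes

print(f"cluster: {cluster_job:.2f}  single node: {single_job:.2f}")
```

Plug in your own rates and measured runtimes. The point is only that the platform charge scales with node count, so the comparison shifts quickly once a job fits on one machine.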
Unity Catalog and SQL warehouses
Unity Catalog is the governance layer: one place to define tables, manage permissions, track lineage, and audit access across every workspace. Centralized governance is genuinely useful once you have many teams and datasets, and it is the connective tissue that makes the one-platform story hang together.
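As a rough illustration of what "manage permissions" looks like in practice: Unity Catalog addresses tables as `catalog.schema.table`, and read access is granted with SQL `GRANT` statements at each level. The sketch below only builds the statement strings and does not execute them. The helper name and the `main.sales.orders` and `analysts` identifiers are hypothetical, and the privilege names reflect my understanding of current Unity Catalog SQL, so verify them against the documentation.

```python
def read_access_grants(catalog: str, schema: str, table: str, principal: str) -> list[str]:
    """Build the GRANT statements that let `principal` query one table (hypothetical helper)."""
    who = f"`{principal}`"  # backticks quote user and group names
    return [
        # The principal needs to be able to use the parent catalog and schema...
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {who}",
        # ...and then to read the table itself.
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {who}",
    ]


for statement in read_access_grants("main", "sales", "orders", "analysts"):
    print(statement + ";")
```

Because the grants live in the catalog and not in any one workspace or cluster, the same rules apply whether the table is read from a notebook or a SQL warehouse.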
For BI-style querying, Databricks offers SQL warehouses powered by Photon, a vectorized query engine [7] written in C++ rather than running on the JVM. Photon stays compatible with Spark's APIs but swaps JVM execution for native code, and it exists precisely because JVM execution is too slow for interactive SQL. That is a quiet admission that the lakehouse needs a non-JVM engine to compete with purpose-built warehouses like Snowflake on price and latency.
🔗 Learn more — 7 What is a query engine (Trino, Presto, and friends)?
MLflow and the AI side
On the machine-learning side, Databricks built and open-sourced MLflow, a framework for tracking experiments, packaging models, and managing a model registry. Like Delta and Spark, MLflow stands on its own and is widely used outside Databricks. The platform leans hard into the "data and AI" framing, integrating MLflow and, more recently, LLM tooling so that training data, features, and models live next to each other.
The honest read
%% color = green: the open-source pieces you can run without the platform
flowchart TD
OBJ["Object storage (S3/ADLS/GCS)"] --> DELTA["Delta Lake / Iceberg tables"]
DELTA --> SPARK["Managed Spark clusters + notebooks"]
DELTA --> PHOTON["SQL warehouses (Photon, C++)"]
SPARK --> ML["MLflow: experiments + model registry"]
UC["Unity Catalog: governance + lineage"] --> DELTA
UC --> SPARK
UC --> PHOTON
classDef grey stroke:#7b88a1,stroke-width:2.5px
classDef green stroke:#a3be8c,stroke-width:2.5px
class DELTA,ML green
class OBJ,SPARK,PHOTON,UC grey
Databricks is genuinely coherent and well-engineered, and its open-source contributions (Apache Spark, Delta Lake, MLflow) shaped the whole field. Its 2024 acquisition of Tabular, the company founded by the original creators of Apache Iceberg [8], was a notable move: rather than fight the rival open format, Databricks bought its way toward Iceberg interoperability, easing the format lock-in that lakehouses are otherwise prone to.
🔗 Learn more — 8 How Apache Iceberg actually works
The skepticism is not about quality. It is about cost and posture. You are buying into Spark and the JVM as the default compute, paying cluster premiums on top of cloud bills, and adopting a vendor that wants to be your single platform for everything. For many workloads a leaner stack of focused tools is faster and cheaper. The reasonable stance: respect what Databricks built, use its open formats freely, and reach for the full platform only when the scale and team size actually justify it.