What is Databricks?

Databricks is a cloud data and AI platform built by the original creators of Apache Spark¹. Its whole pitch is the lakehouse²: instead of separating a cheap data lake³ from an expensive warehouse, you run Spark compute directly over open table formats sitting in object storage, and govern it all from one catalog. The platform bundles managed Spark, notebooks, a SQL engine, an ML toolkit, and governance into a single product available on AWS, Azure, and GCP.

🔗 Learn more — ¹ What is Apache Spark?

🔗 Learn more — ² What is a data lakehouse?

🔗 Learn more — ³ What is a data lake?

That coherence is the appeal. It is also, from a lakehouse-skeptic seat, the thing to watch: Databricks is selling one platform for ingestion, transformation, BI, and machine learning, and that story deserves scrutiny before you commit a whole data org to it.

Managed Spark, notebooks, and Delta Lake

At the core are managed Spark clusters. You request compute; Databricks provisions JVM-based Spark workers, autoscales them, and tears them down when idle. You drive the cluster from collaborative notebooks in Python, SQL, Scala, or R. This removes the pain of running Spark yourself, and that pain was real.

The storage layer is Delta Lake⁴, an open table format Databricks created and open-sourced. Delta adds a transaction log on top of Parquet⁵ files, which buys you ACID⁶ writes, schema enforcement, and time travel over plain files in a data lake. Delta is one of Databricks' real gifts to the ecosystem; you can use it outside Databricks entirely.

🔗 Learn more — ⁴ What is Delta Lake (and how does it compare to Iceberg)?

🔗 Learn more — ⁵ How Parquet works: columnar storage explained

🔗 Learn more — ⁶ What is ACID (database transactions)?
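The transaction-log idea is easier to see in miniature. Below is a toy sketch in plain Python, not Delta's actual log format or API: the `_toy_log` directory, the `op`/`file` action fields, and the `commit` and `snapshot` helpers are all invented for illustration. It shows the two properties the log buys you. A commit either lands as a whole or not at all, and any earlier version can be rebuilt by replaying the log up to that point.

```python
import json
import os
import tempfile


def commit(table_dir, version, actions):
    """Write one commit file; fail if that version already exists."""
    log_dir = os.path.join(table_dir, "_toy_log")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" is exclusive create: a second writer racing for the same
    # version gets FileExistsError instead of silently overwriting.
    with open(path, "x") as f:
        json.dump(actions, f)


def snapshot(table_dir, as_of=None):
    """Replay commits in order and return the set of live data files."""
    log_dir = os.path.join(table_dir, "_toy_log")
    live = set()
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    live.add(action["file"])
                elif action["op"] == "remove":
                    live.discard(action["file"])
    return live


table = tempfile.mkdtemp()
commit(table, 0, [{"op": "add", "file": "part-a.parquet"}])
commit(table, 1, [{"op": "add", "file": "part-b.parquet"}])
# A rewrite (for example, compaction) swaps two files for one in a single commit.
commit(table, 2, [
    {"op": "remove", "file": "part-a.parquet"},
    {"op": "remove", "file": "part-b.parquet"},
    {"op": "add", "file": "part-c.parquet"},
])

latest = snapshot(table)             # current state of the table
as_of_v1 = snapshot(table, as_of=1)  # "time travel" to version 1
print(sorted(latest))
print(sorted(as_of_v1))
```

The data files themselves are never modified in place. Readers only ever see the files named by a complete commit, which is why a half-finished write stays invisible.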

The catch is the engine underneath. Spark is powerful but JVM-heavy, and you pay Databricks for cluster time (metered in Databricks Units, or DBUs) on top of what the cloud provider charges for the underlying compute. For a lot of mid-sized work that Spark tax is hard to justify when a single-node engine would finish faster and cheaper.
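The two-layer bill is simple arithmetic, sketched below. Every number here is hypothetical and picked only to keep the math readable; real VM prices and DBU rates vary by cloud, instance type, and plan. The function name and parameters are likewise made up for this example.

```python
def cluster_hour_cost(nodes, vm_rate, dbus_per_node_hour, dbu_price):
    """Split one cluster-hour into cloud cost and platform cost (USD)."""
    cloud = nodes * vm_rate                            # paid to the cloud provider
    platform = nodes * dbus_per_node_hour * dbu_price  # paid to the platform
    return cloud, platform


# Hypothetical rates: 8 nodes at $0.50/hour each, 2 DBUs per node-hour,
# $0.25 per DBU.
cloud, platform = cluster_hour_cost(
    nodes=8, vm_rate=0.50, dbus_per_node_hour=2.0, dbu_price=0.25
)
cluster_hourly = cloud + platform

# Hypothetical single machine at $1.00/hour running a single-node engine.
single_node_hourly = 1.00

# How many times faster the cluster must finish to cost the same.
break_even_speedup = cluster_hourly / single_node_hourly

print(f"cloud ${cloud:.2f} + platform ${platform:.2f} = ${cluster_hourly:.2f}/hour")
print(f"break-even speedup vs single node: {break_even_speedup:.1f}x")
```

Under these made-up rates the cluster has to finish the job eight times faster than the single machine just to break even. That ratio is the number to estimate for your own workload.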

Unity Catalog and SQL warehouses

Unity Catalog is the governance layer: one place to define tables, manage permissions, track lineage, and audit access across every workspace. Centralized governance pays off once you have many teams and datasets, and it is the connective tissue that makes the one-platform story hang together.
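As a rough mental model, centralized governance comes down to one lookup that every engine consults before touching a table. The sketch below is a toy in plain Python, not Unity Catalog's API. It assumes three-level `catalog.schema.table` names and grants that flow down from parent to child; the `ToyCatalog` class, its methods, and the example principals are invented for illustration, and the real privilege model has more to it.

```python
def securable_chain(full_name):
    """'cat.schema.table' -> ['cat.schema.table', 'cat.schema', 'cat']."""
    parts = full_name.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


class ToyCatalog:
    def __init__(self):
        self.grants = {}     # (principal, securable) -> set of privileges
        self.audit_log = []  # every access check, allowed or not

    def grant(self, principal, privilege, securable):
        self.grants.setdefault((principal, securable), set()).add(privilege)

    def is_allowed(self, principal, privilege, full_name):
        # A grant on the table, its schema, or its catalog is enough.
        allowed = any(
            privilege in self.grants.get((principal, securable), set())
            for securable in securable_chain(full_name)
        )
        self.audit_log.append((principal, privilege, full_name, allowed))
        return allowed


catalog = ToyCatalog()
catalog.grant("analysts", "SELECT", "sales.reporting")  # schema-level grant

can_read_orders = catalog.is_allowed("analysts", "SELECT", "sales.reporting.orders")
can_read_raw = catalog.is_allowed("analysts", "SELECT", "sales.raw.events")
print(can_read_orders, can_read_raw)
print(len(catalog.audit_log), "checks recorded")
```

The point of the design is that permissions and the audit trail live in one place, so a notebook and a SQL warehouse asking about the same table get the same answer.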

For BI-style querying, Databricks offers SQL warehouses powered by Photon, a vectorized query engine⁷ written in C++ rather than for the JVM. Photon exists because Spark's JVM execution struggles to deliver interactive SQL latency, which is a quiet admission that the lakehouse needs a native engine outside the JVM to compete with purpose-built warehouses like Snowflake on price and latency.

🔗 Learn more — ⁷ What is a query engine (Trino, Presto, and friends)?
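"Vectorized" describes the shape of the execution loop. The sketch below contrasts the two shapes in plain Python for a query like "sum of price * qty where qty > 1". It is only an illustration of the control flow, not Photon's implementation, and Python will not show the speedup. In a native engine each per-batch step is a tight compiled loop over contiguous column data. The function names and sample data are invented for this example.

```python
def revenue_row_at_a_time(rows):
    """Evaluate the whole expression once per row."""
    total = 0.0
    for row in rows:
        if row["qty"] > 1:
            total += row["price"] * row["qty"]
    return total


def revenue_vectorized(price, qty, batch_size=1024):
    """Evaluate one operator at a time over a batch of column values."""
    total = 0.0
    for start in range(0, len(price), batch_size):
        p = price[start:start + batch_size]
        q = qty[start:start + batch_size]
        mask = [x > 1 for x in q]                 # filter operator
        products = [a * b for a, b in zip(p, q)]  # multiply operator
        total += sum(v for v, keep in zip(products, mask) if keep)
    return total


# The same four records, stored row-wise and column-wise.
rows = [
    {"price": 10.0, "qty": 2},
    {"price": 4.5, "qty": 1},
    {"price": 3.0, "qty": 3},
    {"price": 8.0, "qty": 5},
]
price = [r["price"] for r in rows]
qty = [r["qty"] for r in rows]

row_result = revenue_row_at_a_time(rows)
vec_result = revenue_vectorized(price, qty, batch_size=2)
print(row_result, vec_result)
```

Both functions compute the same answer. What changes is how much per-row interpretation overhead sits between the data and the arithmetic.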

MLflow and the AI side

On the machine-learning side, Databricks built and open-sourced MLflow, a framework for tracking experiments, packaging models, and managing a model registry. Like Delta and Spark, MLflow stands on its own and is widely used outside Databricks. The platform leans hard into the "data and AI" framing, integrating MLflow and, more recently, LLM tooling so that training data, features, and models live next to each other.

The honest read

%% color = green: the open-source pieces you can run without the platform
flowchart TD
    OBJ["Object storage (S3/ADLS/GCS)"] --> DELTA["Delta Lake / Iceberg tables"]
    DELTA --> SPARK["Managed Spark clusters + notebooks"]
    DELTA --> PHOTON["SQL warehouses (Photon, C++)"]
    SPARK --> ML["MLflow: experiments + model registry"]
    UC["Unity Catalog: governance + lineage"] --> DELTA
    UC --> SPARK
    UC --> PHOTON

    classDef grey stroke:#7b88a1,stroke-width:2.5px
    classDef green stroke:#a3be8c,stroke-width:2.5px
    class DELTA,ML green
    class OBJ,SPARK,PHOTON,UC grey

Databricks is coherent and well-engineered, and its open-source contributions (Apache Spark, Delta Lake, MLflow) shaped the whole field. Its 2024 acquisition of Tabular, a company founded by the original creators of Apache Iceberg⁸, was a notable move: rather than fight the rival open format, Databricks bought its way toward Iceberg interoperability, easing the format lock-in that lakehouses are otherwise prone to.

🔗 Learn more — ⁸ How Apache Iceberg actually works

The skepticism is not about quality. It is about cost and posture. You are buying into Spark and the JVM as the default compute, paying cluster premiums on top of cloud bills, and adopting a vendor that wants to be your single platform for everything. For many workloads a leaner stack of focused tools is faster and cheaper. The reasonable stance: respect what Databricks built, use its open formats freely, and reach for the full platform only when the scale and team size actually justify it.