What is Apache Spark?
Spark is a distributed compute engine that runs one logical query across a cluster of machines. It builds a lazy plan, splits it into parallel tasks, and shuffles data between stages. How the driver, executors, and DAG actually fit together.
Apache Spark is a distributed compute engine: you write what looks like a single query or program, and Spark runs it across a cluster of machines in parallel. It is the workhorse that turns "compute over a petabyte of Parquet1" into a job that actually finishes. The thing to understand is how it splits one logical operation into thousands of parallel tasks — because everything about Spark's performance follows from that.
🔗 Learn more — 1 How Parquet works: columnar storage explained
Lazy plans, not eager steps
When you write Spark code — df.filter(...).groupBy(...).count() — nothing runs. Spark records each transformation into a logical plan and does nothing until you call an action (write the result, collect it, count it). Only then does it optimize the whole plan at once and execute.
This laziness is the point. Because Spark sees the entire pipeline before running anything, its optimizer (Catalyst) can reorder filters, prune columns, and push predicates down into the Parquet scan — reading only the columns and row groups a query needs. An eager, step-by-step engine cannot do this; it has already read the data before it knows what you wanted.
Driver, executors, and the DAG
A Spark application is one driver coordinating many executors.
flowchart TD
CODE["Your code → logical plan"] --> DRV["Driver: optimizes plan, builds DAG of stages"]
DRV --> S1["Stage 1: tasks (one per partition)"]
DRV --> S2["Stage 2: after a shuffle"]
S1 --> E1["Executor 1 (cores = parallel tasks)"]
S1 --> E2["Executor 2"]
S2 --> E1
S2 --> E2
E1 --> SH["Shuffle — repartition data across the network"]
E2 --> SH
classDef plain stroke:#7b88a1,stroke-width:2.5px
classDef key stroke:#a3be8c,stroke-width:2.5px
class DRV,SH key
class CODE,S1,S2,E1,E2 plain
- The driver holds the plan, breaks it into a DAG (directed acyclic graph2) of stages, schedules tasks, and tracks progress. One per application.
- Executors are the worker processes on the cluster. Each has a slice of memory and several cores; each core runs one task at a time.
- Tasks and partitions. The data is split into partitions, and Spark runs one task per partition. 200 partitions across 50 cores means 50 run at once, then the next 50. Partition count is your parallelism — too few and machines sit idle, too many and scheduling overhead dominates.
- Stages and the shuffle. A stage is a run of operations that can happen without moving data between machines (
filter,map). When an operation needs data grouped differently —groupBy,join— Spark must shuffle: every executor sends rows across the network to the executor responsible for each key. The shuffle is the boundary between stages and almost always the most expensive thing in a Spark job.
🔗 Learn more — 2 What is a DAG (and why orchestrators use them)?
Why the shuffle dominates
A filter is embarrassingly parallel — each task works on its own partition and never talks to another. A groupBy("country") is not: rows for EE might live on all 50 executors and must end up together. That means writing intermediate data to disk and sending it over the network — orders of magnitude slower than in-memory work. Most Spark tuning is really shuffle management: reducing how much data moves, avoiding skew where one key (one giant country) lands all on a single overloaded task.
Where it fits
- Batch ETL3. Read raw Parquet or Iceberg4 tables, transform, write back. The dominant heavy-transform engine in the lakehouse.
- Streaming. Structured Streaming reads Kafka5 topics and writes them incrementally into tables, using the same DataFrame API as batch.
- SQL and ML. Spark SQL runs queries over those tables; MLlib trains models on the same distributed data.
🔗 Learn more — 3 What is ETL (and how is ELT different)?
🔗 Learn more — 4 How Apache Iceberg actually works
🔗 Learn more — 5 What is Apache Kafka?
Do you actually need Spark?
Less often than people reach for it. Spark was built for genuinely huge data — multi-terabyte jobs that do not fit on one machine. Most companies do not have that. Their "big data" is tens of gigabytes, and on data that size a single fat machine running a native, in-process engine beats a Spark cluster on wall-clock time and on cost: it skips the JVM startup, the cluster scheduling, and most of the shuffle.
The performant, open-source alternatives are mature, and none of them are JVM:
- DuckDB — a C++ in-process OLAP6 engine that queries Parquet and Iceberg directly. No cluster, no JVM, runs inside your shell or a Python process.
- Polars — a Rust DataFrame library with a lazy, query-optimized API, multithreaded on a single box.
- ClickHouse7 — a C++ columnar database that scans billions of rows per second on one node.
🔗 Learn more — 6 OLTP vs OLAP: two opposite jobs
🔗 Learn more — 7 What is ClickHouse?
Each is native code, starts instantly, and carries no per-query cluster tax. Jordan Tigani's essay "Big Data Is Dead" makes the empirical case: the median analytical query is small, and the industry over-provisions distributed compute for data that fits in RAM. Spark earns its keep at real scale, or when you already run a cluster and the data truly does not fit — not as the reflex first choice. Reaching for a JVM cluster to crunch 20 GB is paying cluster overhead, and usually cloud markup, for nothing.
What Spark is not
It is not a database or a storage system — it is compute that runs over storage like a lakehouse. It is not a scheduler: Spark runs one job, but deciding when to run it and what depends on what is the job of an orchestrator like Apache Airflow8. And it is not free for small data — the cluster, JVM startup, and shuffle machinery are pure overhead under a few gigabytes, where a single-node engine like DuckDB wins easily.
🔗 Learn more — 8 What is Apache Airflow?
The short version: Spark takes one logical query, optimizes the whole thing lazily, and runs it as a DAG of stages split into per-partition tasks across executors — with the network shuffle between stages as the cost you spend the most effort minimizing.