What is Hadoop (and why MapReduce faded)?
Hadoop launched the big-data era with HDFS, MapReduce, and YARN. Foundational, but largely superseded by Spark, S3, and cloud warehouses.
Hadoop is the open-source framework that launched the big-data era: HDFS (a distributed file system) plus MapReduce (a batch compute model) plus YARN (resource scheduling), built to process huge datasets on commodity clusters by moving compute to the data instead of hauling data to the compute. In the late 2000s, when a single machine could not hold or process a multi-terabyte log, that idea was revolutionary. Today it reads as history. Understanding it is still worth your time, because almost every tool you use now is a reaction to its limits.
HDFS: store the data first
The Hadoop Distributed File System solves one problem: putting a dataset too large for any single disk across a cluster of cheap machines, and surviving when those machines die. HDFS splits each file into large blocks (128 MB by default) and spreads them across the cluster. Every block is replicated, typically three times, onto different nodes, so a failed disk or a dead server loses nothing — the data still lives on two other machines, and HDFS quietly re-replicates to restore the count.
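The splitting and replication described above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the round-robin placement, the function names, and the node names are assumptions made for illustration, and real HDFS also weighs rack topology and free space when it places replicas.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size cited above
REPLICATION = 3                 # the typical replication factor cited above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only what is left over
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy, chosen only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


one_gib = 1024 ** 3
blocks = split_into_blocks(one_gib)  # a 1 GiB file becomes 8 blocks of 128 MB
layout = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"])
```

Every entry in `layout` names three different nodes, so losing any single node still leaves two copies of each block it held.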
A NameNode tracks where every block lives; DataNodes hold the actual bytes. The design assumes commodity hardware fails routinely and bakes recovery into normal operation. That was the genuinely durable idea. The weakness was operational: you ran and babysat a physical cluster, the NameNode was a scaling and availability bottleneck, and storage was welded to compute, so you could not scale one without the other.
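A minimal sketch of that bookkeeping, assuming a single in-memory map from block id to the set of nodes holding it. `ToyNameNode` and its methods are hypothetical names, not the Hadoop API, and the real NameNode learns about failures through missed DataNode heartbeats rather than an explicit call.

```python
from collections import defaultdict

REPLICATION = 3  # target replica count, as described above


class ToyNameNode:
    """Tracks which DataNodes hold each block; it never stores the bytes itself."""

    def __init__(self, live_nodes):
        self.live_nodes = set(live_nodes)
        self.block_locations = defaultdict(set)  # block id -> set of node names

    def add_block(self, block_id, nodes):
        self.block_locations[block_id].update(nodes)

    def handle_node_failure(self, dead_node):
        """Forget the dead node, then top up any block that fell below target."""
        self.live_nodes.discard(dead_node)
        for holders in self.block_locations.values():
            holders.discard(dead_node)
            # Only nodes that do not already hold this block can take a new copy.
            candidates = sorted(self.live_nodes - holders)
            while len(holders) < REPLICATION and candidates:
                holders.add(candidates.pop(0))


name_node = ToyNameNode(["n1", "n2", "n3", "n4"])
name_node.add_block("blk_0", ["n1", "n2", "n3"])
name_node.add_block("blk_1", ["n2", "n3", "n4"])
name_node.handle_node_failure("n2")
```

After the failure, both blocks are back to three replicas on the surviving nodes. The sketch also makes the bottleneck visible: every lookup and every recovery decision goes through this one object.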
MapReduce: move compute to the data
MapReduce is the batch processing1 model that made Hadoop famous, lifted from a 2004 Google paper.1 You express a computation as two functions. A map function runs on each block, ideally on a node that already holds that block: this is "moving compute to the data," and it avoids shipping terabytes across the network. Map emits key-value pairs. The framework then shuffles: it sorts those pairs and groups them by key, routing all values for a given key to the same reducer. Finally, a reduce function aggregates each group into a result.
🔗 Learn more — 1 Batch vs stream processing
```mermaid
flowchart TD
    IN["Input blocks on HDFS"] --> MAP["Map: process each block locally"]
    MAP --> SHUF["Shuffle: sort + group by key (writes to disk)"]
    SHUF --> RED["Reduce: aggregate each key group"]
    RED --> OUT["Output written back to HDFS"]
    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef slow stroke:#bf616a,stroke-width:2.5px
    %% color = red: the disk-heavy stage that made MapReduce slow
    class IN,MAP,RED,OUT plain
    class SHUF slow
```
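The map, shuffle, and reduce steps can be sketched as a single-process word count in plain Python. This illustrates the programming model only: the function names and the sample input are made up for the example, and nothing here is distributed or uses the Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_words(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    for word in block.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: sort the pairs by key and collect all values for each key."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: aggregate one key's values into a single result."""
    return key, sum(values)


# Each string stands in for one HDFS block processed by its own map task.
blocks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_words(block) for block in blocks)
word_counts = dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster the three `map_words` calls would run on different machines, and `shuffle` is the step that moves data between them, which is why it dominates the cost.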
The model is correct and scales, but it is disk-heavy and high-latency by construction. Every shuffle writes intermediate data to disk and reads it back. A real job is a chain of many MapReduce stages, and each one round-trips through HDFS, so an iterative workload — anything that loops, like most machine learning or interactive analytics — pays that disk tax over and over. Jobs that should take seconds took minutes.
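A back-of-the-envelope model makes the disk tax concrete. It assumes each stage boundary writes the intermediate data once and reads it back once, and that the whole cluster sustains a fixed aggregate disk throughput. The numbers are hypothetical, chosen only to make the arithmetic visible, and the model ignores replication, network transfer, and compute time.

```python
def chained_job_io_seconds(stages, intermediate_bytes, disk_bytes_per_sec):
    """Seconds spent writing and re-reading intermediate data in a chain of stages.

    Assumes one write plus one read of the data at every boundary between stages.
    """
    boundaries = max(stages - 1, 0)
    return boundaries * 2 * intermediate_bytes / disk_bytes_per_sec


GB = 10 ** 9

# Hypothetical workload: 10 chained stages, 50 GB handed from each stage to the
# next, and 1 GB/s of aggregate disk throughput across the cluster.
io_seconds = chained_job_io_seconds(
    stages=10, intermediate_bytes=50 * GB, disk_bytes_per_sec=1 * GB
)
```

Under those assumptions the job spends 900 seconds on intermediate I/O alone, before any useful computation, and the cost grows linearly with the number of stages.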
YARN, and what replaced all of it
The first version of Hadoop tied scheduling to MapReduce itself. YARN (Yet Another Resource Negotiator) split that out into a general cluster resource manager: it hands out CPU and memory containers to any framework, not just MapReduce. That decoupling is what let other engines run on a Hadoop cluster at all, and it quietly set up MapReduce's replacement.
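The idea of a framework-agnostic resource manager can be sketched as a shared pool that grants containers to whoever asks. `ToyResourceManager` and `Container` are hypothetical names for this illustration. Real YARN adds queues, per-node accounting, and per-application masters, none of which are modeled here.

```python
from dataclasses import dataclass


@dataclass
class Container:
    app: str        # the framework that asked, e.g. "mapreduce" or "spark"
    vcores: int
    memory_mb: int


class ToyResourceManager:
    """Hands out CPU and memory from one shared pool, regardless of framework."""

    def __init__(self, total_vcores, total_memory_mb):
        self.free_vcores = total_vcores
        self.free_memory_mb = total_memory_mb

    def request(self, app, vcores, memory_mb):
        """Return a Container if the pool has room, otherwise None."""
        if vcores > self.free_vcores or memory_mb > self.free_memory_mb:
            return None
        self.free_vcores -= vcores
        self.free_memory_mb -= memory_mb
        return Container(app, vcores, memory_mb)

    def release(self, container):
        """Return a finished container's resources to the pool."""
        self.free_vcores += container.vcores
        self.free_memory_mb += container.memory_mb


rm = ToyResourceManager(total_vcores=8, total_memory_mb=16384)
mr_container = rm.request("mapreduce", vcores=4, memory_mb=8192)
spark_container = rm.request("spark", vcores=4, memory_mb=8192)
denied = rm.request("spark", vcores=1, memory_mb=1024)  # the pool is exhausted
```

The manager never inspects what `app` does with its container, which is the point: scheduling no longer knows or cares that MapReduce exists.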
Apache Spark2 took the same map-shuffle-reduce shape but can keep intermediate data in memory between stages instead of writing it back to HDFS every time, which made iterative and interactive jobs far faster. For most workloads it simply ended MapReduce. The rest of the stack unbundled too. Object storage like Amazon S3 replaced HDFS: cheaper, effectively unlimited in capacity, and decoupled from compute, with no NameNode to babysit. Cloud data warehouses and lakehouses replaced the on-prem cluster, so you query huge datasets without operating any servers. Pieces of the ecosystem survived in new clothing: the Hive metastore3 still backs table metadata in many a data lake4, and "moving compute to the data" lives on, in reduced form, as the filter and column pushdown performed by a storage-compute-separated query engine5.
🔗 Learn more — 2 What is Apache Spark?
🔗 Learn more — 3 What is the Hive metastore?
🔗 Learn more — 4 What is a data lake?
🔗 Learn more — 5 What is a query engine (Trino, Presto, and friends)?
Treat Hadoop as the JVM-era ancestor of today's stack, not a current default. It proved that commodity clusters could store and crunch datasets no single machine could hold, and it normalized fault-tolerant distributed storage. But if you are choosing tools in 2026, you reach for Spark, object storage, and a cloud warehouse — each of which exists because it fixed something Hadoop made you live with.
1 Dean & Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, 2004 — https://research.google/pubs/pub62/