What is Amazon Redshift?

Amazon Redshift is AWS's columnar, massively parallel processing (MPP) data warehouse¹. It was announced in late 2012 and became generally available in early 2013 as a provisioned-cluster service (a leader node coordinating one or more compute nodes), built on technology from the ParAccel analytic engine, which itself derives from PostgreSQL. That heritage is why it speaks a familiar SQL dialect but behaves nothing like a transactional database underneath. It was the first managed cloud warehouse many teams ever touched, and a lot of its design reflects an era before storage and compute were routinely pulled apart.

🔗 Learn more — ¹ What is a data warehouse?

MPP, columnar, and the keys you tune

Two ideas do the heavy lifting. First, columnar storage: Redshift stores each column separately, so an aggregation that reads three columns out of fifty only scans those three, and similar values compress well. This is standard OLAP² warehouse behavior. Second, MPP: a query is split across the compute nodes, each node working on its slice of the data in parallel, and the leader node assembles the result.

🔗 Learn more — ² OLTP vs OLAP: two opposite jobs
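The two ideas above can be sketched together in a few lines of Python. This is a toy model, not Redshift's implementation: the `SLICES` data, the column names, and the helper functions are all invented for illustration. Each slice holds its rows column by column, the aggregation touches only the two columns it names, and a "leader" step merges the per-slice partial results.

```python
from collections import defaultdict

# Hypothetical table, split across two slices and stored column by column.
SLICES = [
    {"region": ["eu", "us", "eu"], "amount": [10, 20, 30], "note": ["a", "b", "c"]},
    {"region": ["us", "us", "apac"], "amount": [5, 15, 25], "note": ["d", "e", "f"]},
]

def slice_partial_sum(slice_columns, group_col, value_col):
    """Work done on one slice: read only the two columns the query names."""
    partial = defaultdict(int)
    for key, value in zip(slice_columns[group_col], slice_columns[value_col]):
        partial[key] += value
    return dict(partial)

def leader_merge(partials):
    """Work done by the leader: combine the per-slice partial sums."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)

def run_query(slices, group_col, value_col):
    # A real cluster runs the slices concurrently; this sketch uses a plain loop.
    partials = [slice_partial_sum(s, group_col, value_col) for s in slices]
    return leader_merge(partials)

# Roughly: SELECT region, SUM(amount) ... GROUP BY region
result = run_query(SLICES, "region", "amount")
```

The `note` column is never read, which is the columnar saving in miniature, and the leader only ever sees small partial results rather than raw rows.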

The part that defines the older Redshift experience is that you decide how data lands on those nodes. Each table has a distribution style; with key distribution, rows are hashed on a distribution key, and that hash decides which node a row goes to. Pick a key that matches your common join column and the join happens locally on each node; pick it badly and the engine shuffles huge amounts of data across the network to line rows up. Each table can also have a sort key, which orders rows on disk so the engine can skip blocks that can't match a filter. Choosing distribution and sort keys, and watching for skew (where one node holds far more data than the others), is real, ongoing operational work. This is exactly the layer that Snowflake and BigQuery³ abstract away: on those platforms you generally don't hand-tune physical layout at all. Redshift can be faster and cheaper when tuned, and frustrating when not.

🔗 Learn more — ³ What is BigQuery?
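To make the distribution-key and sort-key mechanics concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Redshift's actual algorithm: CRC32 stands in for whatever hash the engine uses, and `NUM_SLICES`, the `ORDERS` sample, and the block size are hypothetical. The first half shows hash placement, skew, and when a join needs a shuffle. The second half shows per-block min/max metadata (Redshift's documentation calls these zone maps) being used to skip blocks.

```python
import zlib
from collections import Counter

NUM_SLICES = 4  # hypothetical cluster size

def slice_for(key, num_slices=NUM_SLICES):
    """Place a row by hashing its distribution-key value (CRC32 is a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_slices

def skew_ratio(keys, num_slices=NUM_SLICES):
    """Rows on the fullest slice divided by the mean rows per slice (1.0 = even)."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    return max(counts.values()) / (len(keys) / num_slices)

def rows_needing_shuffle(rows, dist_col, join_col):
    """Count rows stored on a different slice than their join key hashes to."""
    return sum(1 for row in rows if slice_for(row[dist_col]) != slice_for(row[join_col]))

def build_zone_map(sorted_values, block_size):
    """Record (min, max) for each block of a column stored in sort-key order."""
    blocks = [sorted_values[i:i + block_size]
              for i in range(0, len(sorted_values), block_size)]
    return [(block[0], block[-1]) for block in blocks]

def blocks_to_scan(zone_map, low, high):
    """Indexes of blocks whose [min, max] range overlaps the filter [low, high]."""
    return [i for i, (lo, hi) in enumerate(zone_map) if hi >= low and lo <= high]

# A high-cardinality key tends to spread rows; one dominant value piles them up.
even = skew_ratio(range(10_000))
skewed = skew_ratio(["guest"] * 9_000 + list(range(1_000)))

# Distributing on the join column means no row has to move for that join.
ORDERS = [{"order_id": i, "customer_id": i % 7} for i in range(100)]
local_join = rows_needing_shuffle(ORDERS, "customer_id", "customer_id")  # 0
cross_join = rows_needing_shuffle(ORDERS, "order_id", "customer_id")

# 100 sorted values in blocks of 10; a filter on 42..57 overlaps two blocks.
zone_map = build_zone_map(list(range(100)), block_size=10)
touched = blocks_to_scan(zone_map, low=42, high=57)  # [4, 5]
```

In the skewed case at least 90% of rows hash to one slice, so that slice does most of the work while the others sit idle. In the zone-map case eight of ten blocks are never read, which only works because the column is stored in sorted order.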

RA3 and Redshift Spectrum

The original node types coupled storage and compute: more data meant more nodes, even if you didn't need the extra CPU. RA3 nodes broke that coupling. With RA3 and managed storage, hot data sits on local SSD while the full dataset lives in S3, and you size the cluster for compute, scaling storage independently. This closed one of the most-cited gaps against newer competitors, where separating storage from compute was the headline feature from day one.
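The sizing difference is simple arithmetic, sketched below in Python. Every number is hypothetical, and the decoupled case ignores any per-node managed-storage limits that real node types may impose. The point is only that in a coupled design the larger of the two requirements sets the node count.

```python
import math

def nodes_coupled(data_tb, compute_nodes_needed, storage_tb_per_node):
    """Coupled design: the cluster must satisfy storage and compute together,
    so the larger requirement wins."""
    nodes_for_storage = math.ceil(data_tb / storage_tb_per_node)
    return max(nodes_for_storage, compute_nodes_needed)

def nodes_decoupled(compute_nodes_needed):
    """Decoupled design: storage scales separately, so compute alone sets node count."""
    return compute_nodes_needed

# Hypothetical workload: 40 TB of data, but queries only need 4 nodes of CPU.
coupled = nodes_coupled(data_tb=40, compute_nodes_needed=4, storage_tb_per_node=2)  # 20
decoupled = nodes_decoupled(compute_nodes_needed=4)  # 4
```

With these made-up figures the coupled cluster runs 20 nodes to hold the data even though 4 would handle the queries.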

Redshift Spectrum is the other reach outside the cluster. It lets a Redshift query read data sitting in S3 directly (Parquet⁴, ORC, and similar open formats) without first loading it into the warehouse. You can join external S3 tables against tables inside the cluster in a single query. It is AWS's answer to querying a data lake⁵ in place, and it runs on separate Spectrum compute that is billed by the amount of S3 data scanned.

🔗 Learn more — ⁴ How Parquet works: columnar storage explained

🔗 Learn more — ⁵ What is a data lake?
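Because Spectrum is billed by data scanned, file format directly affects cost. The Python sketch below estimates that effect under deliberately crude assumptions: all columns are the same size, compression and any per-query minimum charge are ignored, a terabyte is taken as 10^12 bytes, and the price is a placeholder to be replaced with the current figure from the AWS pricing page.

```python
TB = 10**12  # assumption: decimal terabyte

def bytes_scanned(total_bytes, columns_total, columns_read, columnar):
    """Estimate bytes read from S3 for a query that needs `columns_read` columns."""
    if not columnar:
        # Row-oriented files (for example CSV) have to be read in full.
        return total_bytes
    # Columnar files let the reader fetch only the requested columns.
    return total_bytes * columns_read / columns_total

def scan_cost(scanned_bytes, price_per_tb):
    """Cost of a scan-billed query at a given price per terabyte scanned."""
    return scanned_bytes / TB * price_per_tb

HYPOTHETICAL_PRICE_PER_TB = 5.0  # placeholder, not a quoted price

dataset = 10 * TB
parquet_cost = scan_cost(bytes_scanned(dataset, 50, 3, columnar=True),
                         HYPOTHETICAL_PRICE_PER_TB)
csv_cost = scan_cost(bytes_scanned(dataset, 50, 3, columnar=False),
                     HYPOTHETICAL_PRICE_PER_TB)
```

Under these assumptions the same three-column query scans 0.6 TB from the columnar files and the full 10 TB from the row-oriented ones, so the columnar layout cuts the bill by the same factor.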

%% color = green: managed storage layer that decoupled storage from compute
flowchart TD
    SQL["SQL client"] --> LEADER["Leader node: plans + coordinates"]
    LEADER --> C1["Compute node 1"]
    LEADER --> C2["Compute node 2"]
    C1 --> RMS["RA3 managed storage (S3-backed)"]
    C2 --> RMS
    LEADER --> SPEC["Redshift Spectrum → query S3 directly"]

    classDef grey stroke:#7b88a1,stroke-width:2.5px
    classDef green stroke:#a3be8c,stroke-width:2.5px
    class RMS green
    class SQL,LEADER,C1,C2,SPEC grey

Serverless, and the honest comparison

Redshift Serverless reached general availability in 2022 and removes the cluster from the user's view entirely: no nodes to provision, capacity measured in Redshift Processing Units (RPUs) that scale with demand, and you pay for what you use. It's the clearest sign of Redshift catching up to the on-demand, no-cluster-management model that Snowflake and BigQuery shipped earlier. For new projects it's the friendlier entry point, and it sidesteps a lot of the node-sizing question, though the same distribution and sort key concepts still matter for performance under the hood.
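The pay-for-what-you-use model can be sketched as a sum of RPU-hours. This Python snippet assumes compute charges are simply proportional to RPUs multiplied by active time; it leaves out storage, minimum charges, and anything else on a real bill, and the price is a placeholder rather than a quoted rate.

```python
def usage_cost(intervals, price_per_rpu_hour):
    """Sum RPU-hours over (rpus, seconds) intervals in which queries actually ran."""
    rpu_hours = sum(rpus * seconds / 3600 for rpus, seconds in intervals)
    return rpu_hours * price_per_rpu_hour

HYPOTHETICAL_PRICE_PER_RPU_HOUR = 0.5  # placeholder, not a quoted price

# Hypothetical day: 30 minutes at 8 RPUs, then a 15-minute burst at 32 RPUs.
busy_day = usage_cost([(8, 1800), (32, 900)], HYPOTHETICAL_PRICE_PER_RPU_HOUR)
idle_day = usage_cost([], HYPOTHETICAL_PRICE_PER_RPU_HOUR)
```

The busy day works out to 12 RPU-hours and the idle day to zero, which is the contrast with a provisioned cluster that bills for its nodes whether or not queries are running.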

The fair read: Redshift is the older cloud warehouse, and it shows. For years it felt dated next to Snowflake and BigQuery — node and key tuning was operational overhead those platforms simply didn't impose, and storage stayed bolted to compute long after rivals split them. RA3 and serverless closed real gaps, and a well-tuned Redshift cluster is a genuinely strong MPP engine. But it carries AWS lock-in: it's deeply wired into S3, IAM, and the rest of the AWS data stack, and that gravity is the point. If you're already all-in on AWS, Redshift is a sensible, mature choice. If you're not, the abstraction-first competitors are usually the easier place to start.

The short version: Amazon Redshift is a columnar MPP warehouse with PostgreSQL roots, where classic clusters reward physical tuning that newer platforms hide — modernized by RA3's managed storage, Spectrum's S3 querying, and a serverless option, all firmly inside the AWS ecosystem.

