What is StarRocks?

StarRocks is an open-source MPP (massively parallel processing) analytical database built for fast OLAP¹ queries. Its execution engine is written in C++ and vectorized, and queries are planned by a cost-based optimizer. Its distinguishing trait is that it runs fast multi-table joins both on its own native storage and directly on open table formats like Apache Iceberg², Hudi³, Delta Lake, and Hive. The result is warehouse-grade query speed pointed at the open lake, rather than at data locked into a proprietary store.

🔗 Learn more — ¹ OLTP vs OLAP: two opposite jobs

🔗 Learn more — ² How Apache Iceberg actually works

🔗 Learn more — ³ What is Apache Hudi?

Vectorized execution and a cost-based optimizer

StarRocks processes data in vectorized batches rather than row by row. Instead of evaluating one value at a time, the engine works on columns in chunks, which keeps the CPU busy and uses SIMD instructions and cache far more effectively. This is the same general approach that makes columnar OLAP engines fast, and it is implemented natively in C++ rather than on a JVM.
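
To make the contrast concrete, here is a minimal Python sketch of the two processing styles. It is an illustration only, not StarRocks code: the table, the column names, and the chunk size are made up, and pure Python gets none of the SIMD or cache benefits that a native engine does. What it shows is the shape of the work: the row-at-a-time version touches every field of every row inside one loop, while the batch version runs one tight loop per operator over a chunk of column values.

```python
from typing import Dict, Iterator, List

# A columnar table or batch: column name -> list of values (hypothetical layout).
Chunk = Dict[str, List[float]]


def chunks(columns: Chunk, size: int) -> Iterator[Chunk]:
    """Yield fixed-size column batches from a columnar table."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, size):
        yield {name: col[start:start + size] for name, col in columns.items()}


def revenue_row_at_a_time(rows: List[dict], min_price: float) -> float:
    """Evaluate filter and expression one row at a time."""
    total = 0.0
    for row in rows:
        if row["price"] >= min_price:
            total += row["price"] * row["qty"]
    return total


def revenue_vectorized(columns: Chunk, min_price: float, size: int = 4) -> float:
    """Evaluate the same query one column batch at a time."""
    total = 0.0
    for chunk in chunks(columns, size):
        price, qty = chunk["price"], chunk["qty"]
        # Operator 1: a single pass over the price column builds a selection mask.
        mask = [p >= min_price for p in price]
        # Operator 2: a single pass multiplies and sums only the selected rows.
        total += sum(p * q for p, q, keep in zip(price, qty, mask) if keep)
    return total


columns = {"price": [5.0, 12.0, 8.0, 20.0, 3.0], "qty": [2.0, 1.0, 4.0, 3.0, 10.0]}
rows = [{"price": p, "qty": q} for p, q in zip(columns["price"], columns["qty"])]
```

Both functions return the same answer for the same data. In a native engine the per-column loops are where SIMD instructions and cache locality pay off, because each loop runs over a contiguous array of one type.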

On top of that sits a cost-based optimizer (CBO). For a single-table scan-and-aggregate query, the optimizer has little to decide, because there are few meaningfully different plans. Where it earns its keep is multi-table joins: the optimizer uses table statistics to choose join order, join algorithm, and how to distribute data across nodes. Pick the wrong join order across five tables and a query can run orders of magnitude slower, because an early join that produces a huge intermediate result makes every later join pay for it. The CBO is a large part of why StarRocks holds up on complex joins where engines with simpler planners struggle.
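
The effect of join order can be sketched with a toy cost model. This is not how StarRocks plans queries. It is a deliberately small illustration that assumes made-up row counts and join selectivities, considers only left-deep join orders, and defines cost as the total number of intermediate rows produced. Every name and number below is hypothetical.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Sequence, Tuple

# Hypothetical table statistics: estimated row counts.
ROWS: Dict[str, int] = {"orders": 1_000_000, "customers": 10_000, "regions": 10}

# Hypothetical join selectivities. A pair with no entry has no join predicate,
# so joining it directly is a cross join (selectivity 1.0).
SELECTIVITY: Dict[FrozenSet[str], float] = {
    frozenset({"orders", "customers"}): 1 / 10_000,
    frozenset({"customers", "regions"}): 1 / 10,
}


def plan_cost(order: Sequence[str]) -> float:
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    size = float(ROWS[order[0]])
    cost = 0.0
    for table in order[1:]:
        size *= ROWS[table]
        # Apply every join predicate between the new table and those already joined.
        for prev in joined:
            size *= SELECTIVITY.get(frozenset({prev, table}), 1.0)
        joined.append(table)
        cost += size
    return cost


def best_join_order(tables: Sequence[str]) -> Tuple[str, ...]:
    """Exhaustively pick the cheapest order (fine for a handful of tables)."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    for order in permutations(ROWS):
        print(order, f"{plan_cost(order):,.0f}")
    print("best:", best_join_order(list(ROWS)))
```

Under these assumed statistics, an order that joins `orders` directly to `regions` is forced through a cross join and costs roughly ten times more than the cheapest order. Real optimizers use far richer statistics and search strategies, but the underlying trade-off is the same.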

Strong join performance without denormalization

This is the practical headline. Many fast OLAP engines win benchmarks on flat, single-table queries but degrade badly when you join several large tables. The common workaround is denormalization⁴: flatten everything into one wide table ahead of time so the query engine⁵ never has to join at runtime. That works, but it pushes cost and complexity upstream — you maintain heavy pre-join pipelines, and any schema change ripples through them.

🔗 Learn more — ⁴ Normalization vs denormalization

🔗 Learn more — ⁵ What is a query engine (Trino, Presto, and friends)?

StarRocks is notable for keeping good performance on normalized star and snowflake schemas, so you can join fact and dimension tables at query time and still get fast answers. In practice that means less denormalization to build and maintain, and you can keep your model closer to how the data is actually shaped.
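
As a rough picture of what joining at query time involves, the sketch below runs a hash join between a tiny fact table and a dimension table, then aggregates. The tables and column names are invented for illustration, and a real engine does this in parallel over column batches rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical star schema: one dimension table and one fact table.
customers: List[dict] = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
]
orders: List[dict] = [
    {"order_id": 10, "customer_id": 1, "amount": 30},
    {"order_id": 11, "customer_id": 2, "amount": 50},
    {"order_id": 12, "customer_id": 1, "amount": 20},
]


def revenue_by_region(fact: List[dict], dim: List[dict]) -> Dict[str, int]:
    """Inner hash join of fact to dimension on customer_id, then SUM by region."""
    # Build phase: hash the small dimension table on the join key.
    region_of = {row["customer_id"]: row["region"] for row in dim}
    # Probe phase: stream the fact rows, look up the dimension value, aggregate.
    totals: Dict[str, int] = defaultdict(int)
    for row in fact:
        region = region_of.get(row["customer_id"])
        if region is not None:  # inner join drops fact rows with no match
            totals[region] += row["amount"]
    return dict(totals)
```

Because the region lives only in the dimension table, reassigning a customer to another region is a one-row change there. With a pre-flattened wide table, the same change means rewriting every order row for that customer.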

Querying the lakehouse in place

StarRocks does not require you to load data into it. In its lakehouse query mode it acts as a query engine over data that already lives in a data lakehouse⁶, reading open table formats (Iceberg, Delta Lake, Hudi, Hive) directly from object storage. You point a catalog at the tables and query them with the same SQL and the same vectorized engine, with no copy step.

🔗 Learn more — ⁶ What is a data lakehouse?
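
In practice, "pointing a catalog at the tables" is a SQL statement. The Python sketch below only renders the SQL text so the shape is visible. It assumes an Iceberg catalog backed by a Hive metastore, and the property names reflect my understanding of the StarRocks documentation, so verify them against the version you run. The catalog name, metastore address, database, and tables are all hypothetical placeholders.

```python
def create_iceberg_catalog_sql(catalog: str, metastore_uri: str) -> str:
    """Render a CREATE EXTERNAL CATALOG statement for an Iceberg catalog.

    Assumption: property names follow the StarRocks docs for a
    Hive-metastore-backed Iceberg catalog; check them for your version.
    Inputs are trusted identifiers here; no escaping is done.
    """
    properties = {
        "type": "iceberg",
        "iceberg.catalog.type": "hive",
        "hive.metastore.uris": metastore_uri,
    }
    rendered = ",\n".join(f'    "{key}" = "{value}"' for key, value in properties.items())
    return f"CREATE EXTERNAL CATALOG {catalog}\nPROPERTIES (\n{rendered}\n)"


def revenue_by_region_sql(catalog: str, database: str) -> str:
    """Render a join that addresses lake tables as catalog.database.table."""
    prefix = f"{catalog}.{database}"
    return (
        "SELECT c.region, SUM(o.amount) AS revenue\n"
        f"FROM {prefix}.orders AS o\n"
        f"JOIN {prefix}.customers AS c ON o.customer_id = c.customer_id\n"
        "GROUP BY c.region"
    )


if __name__ == "__main__":
    # Hypothetical names and address, for illustration only.
    print(create_iceberg_catalog_sql("iceberg_lake", "thrift://metastore.example.internal:9083"))
    print(revenue_by_region_sql("iceberg_lake", "sales"))
```

StarRocks accepts connections over the MySQL wire protocol, so statements like these can be sent from a standard MySQL client. Once the catalog exists, the join itself is ordinary SQL; only the three-part table names reveal that the data sits in the lake.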

%% color = green: the open tables StarRocks reads in place, no ingest copy
flowchart TD
    SR["StarRocks (MPP, vectorized, CBO)"] --> NATIVE["native storage: ingested tables"]
    SR --> LAKE["lakehouse catalog"]
    LAKE --> ICE["Iceberg / Delta / Hudi / Hive on object storage"]

    classDef grey stroke:#7b88a1,stroke-width:2.5px
    classDef green stroke:#a3be8c,stroke-width:2.5px
    class ICE green
    class SR,NATIVE,LAKE grey

This is the gap StarRocks tries to bridge. Historically you chose between warehouse speed on a closed, proprietary store and open formats that any engine could query, but more slowly. StarRocks aims to give you warehouse-grade join performance while the data stays in open tables you own.

Materialized views for acceleration

When repeated queries are predictable and expensive, StarRocks offers a materialized view⁷: a precomputed result that the optimizer can transparently substitute for matching queries. You write your normal query against the base tables; if a suitable materialized view exists, StarRocks rewrites the plan to read the precomputed result instead. The view is kept current by refreshing it as the underlying data changes, either automatically or on a schedule you define. This gives you a middle path between fully on-the-fly joins and full denormalization: accelerate the hot queries and leave the rest dynamic.

🔗 Learn more — ⁷ What is a materialized view?
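
The idea of transparent substitution can be sketched with a toy engine. This is a simplification and not StarRocks's rewrite logic: here a query matches a view only when its grouping column and summed column are identical, and the toy happily serves a stale view until `refresh()` is called. A real engine matches far more query shapes and has to decide whether a stale view is still eligible. All class and method names are hypothetical.

```python
from typing import Dict, List, Tuple


def sum_by(rows: List[dict], group_col: str, value_col: str) -> Dict[str, int]:
    """Compute SUM(value_col) GROUP BY group_col directly from base rows."""
    out: Dict[str, int] = {}
    for row in rows:
        out[row[group_col]] = out.get(row[group_col], 0) + row[value_col]
    return out


class ToyEngine:
    """Toy planner that substitutes a precomputed result when one matches."""

    def __init__(self, base_rows: List[dict]) -> None:
        self.base_rows = base_rows
        # (group_col, value_col) -> precomputed aggregate
        self.views: Dict[Tuple[str, str], Dict[str, int]] = {}

    def create_materialized_view(self, group_col: str, value_col: str) -> None:
        self.views[(group_col, value_col)] = sum_by(self.base_rows, group_col, value_col)

    def refresh(self) -> None:
        """Recompute every view from the current base rows."""
        for group_col, value_col in list(self.views):
            self.views[(group_col, value_col)] = sum_by(self.base_rows, group_col, value_col)

    def query(self, group_col: str, value_col: str) -> Tuple[Dict[str, int], str]:
        """Answer the query and report which source the 'plan' read from."""
        key = (group_col, value_col)
        if key in self.views:
            return dict(self.views[key]), "materialized view"
        return sum_by(self.base_rows, group_col, value_col), "base table"
```

The caller issues the same `query(...)` either way. Whether the answer comes from the base rows or the precomputed result is the engine's decision, which is the property that makes the acceleration transparent.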

The honest caveats

StarRocks is younger than its main alternatives. ClickHouse⁸ and Trino have larger communities, more battle-tested operational tooling, broader connector ecosystems, and more people who have run them at scale and written down what breaks. StarRocks has fewer of those miles. If you want maximum single-table scan throughput, a dedicated engine like ClickHouse may still edge it out; if you want the broadest federated reach across many sources, Trino covers more ground.

🔗 Learn more — ⁸ What is ClickHouse?

The short version: StarRocks is a performance-first MPP analytical database with a C++ execution engine that is genuinely good at multi-table joins, both on its own storage and on open lakehouse tables queried in place, with a cost-based optimizer and materialized views doing the heavy lifting. Its weakness is maturity, not design: a smaller ecosystem than the incumbents, in exchange for closing the gap between warehouse speed and open lake storage.