What is a query engine (Trino, Presto, and friends)?
A query engine plans and executes queries over data it doesn't own — decoupling compute from storage. Trino and Presto run SQL across lakes and many sources at once.
A query engine is the part of a data system that turns a query into results: it parses the SQL, plans how to answer it, optimizes that plan, and executes it, often across many machines. The defining modern twist is that a standalone query engine like Trino (formerly PrestoSQL, a fork of Presto) does this over data it does not own. Storage lives elsewhere (Parquet[1] files in a data lake[2], tables in a warehouse, rows in a database) and the engine brings the compute to it. That decoupling of compute from storage is the whole idea.
🔗 Learn more — 1 How Parquet works: columnar storage explained
🔗 Learn more — 2 What is a data lake?
What it actually does
Every query engine runs the same pipeline. Parse the SQL into a syntax tree. Plan a logical sequence of operations (scan, filter, join, aggregate). Optimize that plan — reorder joins, push filters down to the storage layer so less data is read, choose how to distribute the work. Execute it, usually by splitting the work across worker nodes that each process a slice in parallel and combine the results. That last step is massively parallel processing (MPP): the reason a query engine can scan a petabyte is that a hundred workers each scan a sliver at once.
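The execute step can be illustrated with a toy scatter-gather aggregation. This is a minimal sketch, not how Trino or any real engine is implemented: the table, the `make_splits`, `worker_scan`, and `combine` helpers, and the use of threads in place of worker nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table of (region, amount) rows.
ROWS = [("eu", 10), ("us", 25), ("eu", 5), ("us", 40), ("apac", 7), ("eu", 30)]


def make_splits(rows, n):
    """Divide the rows into n slices, one per worker."""
    return [rows[i::n] for i in range(n)]


def worker_scan(split, min_amount):
    """One worker: scan its slice, apply the filter, build a partial SUM."""
    partial = {}
    for region, amount in split:
        if amount >= min_amount:
            partial[region] = partial.get(region, 0) + amount
    return partial


def combine(partials):
    """Coordinator: merge the workers' partial sums into the final result."""
    total = {}
    for partial in partials:
        for region, subtotal in partial.items():
            total[region] = total.get(region, 0) + subtotal
    return total


def run_query(rows, workers=3, min_amount=10):
    # Roughly: SELECT region, SUM(amount) FROM t
    #          WHERE amount >= 10 GROUP BY region
    splits = make_splits(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda s: worker_scan(s, min_amount), splits))
    return combine(partials)


result = run_query(ROWS)
print(result)
```

Because each worker produces a partial aggregate that can be merged, the answer does not depend on how many workers the rows are split across. That property is what lets the same plan scale from one machine to many.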
Optimization is where engines earn their keep. Predicate pushdown (sending WHERE filters down so that Parquet's min/max statistics and partition pruning skip most of the data) often matters more than raw execution speed, because the fastest data to process is the data you never read.
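A rough sketch of how statistics-based skipping works is shown here. The `RowGroup` class is a hypothetical stand-in for a Parquet row group with min/max statistics on one column; it is not a real Parquet reader API, and the data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with min/max stats."""
    min_ts: int
    max_ts: int
    rows: list  # list of (ts, value) tuples


def scan(row_groups, lo, hi):
    """Return rows with lo <= ts <= hi, skipping groups the stats rule out."""
    matched, groups_read = [], 0
    for group in row_groups:
        # If the group's [min, max] range cannot overlap [lo, hi],
        # none of its rows can match, so it is never read.
        if group.max_ts < lo or group.min_ts > hi:
            continue
        groups_read += 1
        matched.extend(r for r in group.rows if lo <= r[0] <= hi)
    return matched, groups_read


groups = [
    RowGroup(0, 99, [(5, "a"), (80, "b")]),
    RowGroup(100, 199, [(120, "c"), (150, "d")]),
    RowGroup(200, 299, [(210, "e"), (290, "f")]),
]

# Roughly: SELECT * FROM t WHERE ts BETWEEN 110 AND 160
rows, groups_read = scan(groups, 110, 160)
print(rows, groups_read)
```

In this toy example only one of the three groups is read. The filter itself is cheap; the saving comes from the two groups that are skipped on their statistics alone.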
Storage is not its job
The key mental model: a query engine like Trino is compute, not storage. It has no tables of its own. You point it at connectors (a lakehouse[3] catalog over Iceberg[4], a relational database[5], an object store of Parquet files) and it queries them in place. That buys two things:
🔗 Learn more — 3 What is a data lakehouse?
🔗 Learn more — 4 How Apache Iceberg actually works
🔗 Learn more — 5 What is a database?
- Federation. One SQL query can join a table in your warehouse against files in your lake against rows in Postgres, because the engine speaks to all of them through connectors.
- Engine choice. Because the data sits in open formats behind a neutral catalog, you can run different engines against the same tables (Trino for ad-hoc queries, Spark[6] for heavy batch jobs) without copying data.
🔗 Learn more — 6 What is Apache Spark?
This is exactly the lakehouse premise: open files, a neutral catalog, and whatever compute fits the query.
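Federation can be sketched in miniature with two independent SQLite databases standing in for connectors. Everything here is an assumption for illustration: the `warehouse` and `postgres` names, the tables, and the `federated_join` helper are hypothetical, and a real engine would plan and distribute this join rather than run it in a single process.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an independent in-memory database standing in for a connector."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical sources: one holds orders, the other holds customers.
warehouse = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount INTEGER)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 50), (2, 20), (1, 30), (3, 5)],
)
postgres = make_source(
    "CREATE TABLE customers (id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)


def federated_join(min_amount):
    """Join orders from one source against customers from another."""
    # Push the filter down to the source so only matching rows come back.
    orders = warehouse.execute(
        "SELECT customer_id, amount FROM orders WHERE amount >= ?",
        (min_amount,),
    ).fetchall()
    # Build a hash table from the smaller side, then probe it in the engine.
    names = dict(postgres.execute("SELECT id, name FROM customers").fetchall())
    totals = {}
    for customer_id, amount in orders:
        if customer_id in names:  # inner join: drop unmatched orders
            name = names[customer_id]
            totals[name] = totals.get(name, 0) + amount
    return totals


totals = federated_join(min_amount=10)
print(totals)
```

Neither source knows about the other. The join happens in the engine's own memory, which is why no data has to be copied into a shared store first.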
Where it fits, and where it doesn't
A standalone query engine shines for ad-hoc and federated analytics over a lake — exploring data in place, joining across systems, avoiding a load step. It is the flexible, bring-compute-to-storage option.
What it is not is a low-latency serving layer. Reading Parquet out of object storage on every query carries real latency, and a query engine typically has no buffer pool or local NVMe holding hot data the way a warehouse does. For sub-second, high-concurrency interactive queries, a warehouse with data on fast local storage beats a query engine scanning blob storage. It is the same cold-path penalty that makes "query the lake directly" the wrong default for interactive work. Use a query engine for breadth and flexibility; use a warehouse when the reads have to be fast.