What is a database?

A database is software that stores data and answers questions about it — safely, concurrently, and fast. That sounds obvious until you ask why you cannot just keep your data in files. The answer is everything a database does around the data: it enforces structure, indexes for speed, lets many clients read and write at once without corrupting anything, and guarantees that a half-finished change never becomes visible. Strip those away and you have a folder; keep them and you have the foundation almost every other data system is built on top of.

What a database gives you over files

Put your data in a CSV¹ and you immediately lack five things a database hands you for free:

🔗 Learn more — ¹ CSV, TSV, and tabular data formats explained

Schema. Columns have types and constraints. The database rejects a row with a missing required field or a string where a number belongs, before it becomes a bug downstream.
A query language. SQL lets you ask "every order over €100 from Estonian customers last month" declaratively — you describe the result, the database figures out how to get it.
Indexes. A query that would otherwise scan ten million rows instead walks a B-tree index and finds its answer in a handful of disk reads. Indexes are the single biggest reason databases are fast.
Concurrency. Thousands of clients read and write simultaneously, and the database uses locks and isolation levels to keep them from stepping on each other.
Transactions (ACID²). A group of changes either all happen or none do, and committed data survives a crash. Move money between two accounts and you are never left with it debited from one and not credited to the other.

🔗 Learn more — ² What is ACID (database transactions)?
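
To make the schema and transaction guarantees above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, its columns, and the function name are hypothetical, chosen only for illustration. SQLite is also more lenient about column types than servers such as Postgres, so the sketch relies on NOT NULL and CHECK constraints rather than type checking.

```python
import sqlite3


def demo_schema_and_transactions():
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts (id, owner, balance) VALUES (?, ?, ?)",
        [(1, "alice", 100), (2, "bob", 50)],
    )
    conn.commit()

    # Schema: a row missing the required owner field is rejected.
    schema_rejected = False
    try:
        conn.execute("INSERT INTO accounts (id, balance) VALUES (3, 10)")
    except sqlite3.IntegrityError:
        schema_rejected = True
    conn.rollback()  # discard the transaction opened by the failed insert

    # Transaction: try to move 500 from alice to bob. The credit succeeds,
    # the debit violates the CHECK constraint, and both are rolled back.
    transfer_failed = False
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute("UPDATE accounts SET balance = balance + 500 WHERE id = 2")
            conn.execute("UPDATE accounts SET balance = balance - 500 WHERE id = 1")
    except sqlite3.IntegrityError:
        transfer_failed = True

    balances = dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY id"))
    conn.close()
    return schema_rejected, transfer_failed, balances


if __name__ == "__main__":
    print(demo_schema_and_transactions())
```

After the failed transfer, both balances are unchanged. Bob's credit was never visible as a committed change, which is the all-or-nothing behavior the transactions item describes.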

Inside: how a query becomes an answer

A relational database is a few cooperating layers. A client sends SQL; the database does the rest.

flowchart TD
    CLIENT["Client sends SQL"] --> PARSE["Parser + planner — pick the cheapest execution plan"]
    PARSE --> EXEC["Executor"]
    EXEC --> IDX["Index (B-tree) — find rows without scanning"]
    EXEC --> ENG["Storage engine — buffer pool + pages on disk"]
    IDX --> ENG
    ENG --> WAL["Write-ahead log — durability + crash recovery"]

    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    class PARSE,IDX key
    class CLIENT,EXEC,ENG,WAL plain

The two highlighted pieces are why it is fast and smart: the planner rewrites your query into the plan it estimates to be cheapest (use this index, join in this order), and indexes let the executor jump straight to matching rows instead of reading the whole table. The write-ahead log is the durability trick: changes are appended to a sequential log before the data pages are updated, so after a crash the database can replay committed changes and discard any that never committed.
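
One way to watch the planner and an index work together is to ask the database for its plan before and after the index exists. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. The orders table, the index name, and the plan_for helper are assumptions made up for this example, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def plan_for(conn, sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

query = "SELECT total FROM orders WHERE customer_id = 42"

# Without an index, the only available plan is a full table scan.
plan_before = plan_for(conn, query)

# With an index on the filtered column, the planner can search the B-tree.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan_for(conn, query)

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan should report a scan of orders and the second a search using idx_orders_customer. The query text never changed; only the plan the database chose for it did.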

Relational and the rest

The default kind is relational (Postgres, MySQL, SQLite): data in tables of typed rows, related by keys, queried with SQL. It is the right answer far more often than the internet suggests. "NoSQL³" is a grab-bag of databases that drop some relational guarantees for a specific gain:

🔗 Learn more — ³ SQL vs NoSQL

Key-value (Redis): at its core a giant hash map, blisteringly fast, built for lookups by key rather than ad hoc queries.
Document (MongoDB): JSON-ish documents, flexible schema.
Wide-column (Cassandra): tuned for huge write volume across many machines.
Graph (Neo4j): relationships are first-class, for data that is mostly connections.

Reach for these when you have the specific problem they solve. For most applications, a relational database is the boring correct choice.

The fork that splits the field

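One database cannot be great at both serving an app and crunching analytics, because the two want opposite designs: an app issues many small reads and writes that each touch a whole row, while analytics scans a few columns across millions of rows. That split, OLTP (online transaction processing) vs OLAP (online analytical processing), is the single most important distinction in data, and it is why a data warehouse exists as a separate system from your app's database.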
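
One common way the two designs differ is storage layout: transactional systems tend to keep each row together, while analytical systems tend to keep each column together. The toy sketch below uses plain Python structures, not a real storage engine, and the sample orders and function names are invented for illustration.

```python
# Row-oriented layout: every field of one record sits together,
# which suits fetching or updating a single order.
rows = [
    {"id": 1, "customer": "alice", "total": 120.0},
    {"id": 2, "customer": "bob", "total": 80.0},
    {"id": 3, "customer": "alice", "total": 45.5},
]

# Column-oriented layout: every value of one field sits together,
# which suits aggregating one column without touching the others.
columns = {
    "id": [r["id"] for r in rows],
    "customer": [r["customer"] for r in rows],
    "total": [r["total"] for r in rows],
}


def fetch_order(rows, order_id):
    # OLTP-style access: one whole record by key.
    return next(r for r in rows if r["id"] == order_id)


def total_revenue(columns):
    # OLAP-style access: one column, every row.
    return sum(columns["total"])


print(fetch_order(rows, 2))
print(total_revenue(columns))
```

Both layouts hold the same data. The difference is which access pattern reads contiguous values and which has to hop between them.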

Worth saying plainly: databases are fast, and the best ones — Postgres, SQLite, ClickHouse⁴, DuckDB — are free and open source. A well-indexed Postgres or a single ClickHouse node handles workloads people assume require a sprawling, expensive cloud stack. A lot of "modern data" tooling is layers of paid convenience over a core that was already fast. Learn the database first; reach for the heavier machinery only when you can show the database actually ran out.

🔗 Learn more — ⁴ What is ClickHouse?

The short version: a database is schema + a query language + indexes + concurrency + transactions wrapped around your data. Everything from a warehouse to a lakehouse⁵ is, at bottom, an attempt to give you database guarantees at a scale or shape one database cannot reach alone.

🔗 Learn more — ⁵ What is a data lakehouse?