CSV, TSV, and tabular data formats explained
Tabular data is rows and columns, but how you serialize it to a file matters more than people think. This article covers CSV and TSV and their traps, JSON and JSONL, and why columnar binary formats like Parquet exist at all.
Tabular data is just rows and columns: a spreadsheet, a database table, a query result. The data is simple; the interesting question is how you write it to a file so another program can read it back. The format you pick decides whether your data is human-readable, whether its types survive the round trip, how big the file is, and how fast a query can skip the parts it does not need. Most people default to CSV and discover the costs later.
Delimited text: CSV and TSV
The oldest and most universal answer is delimited text: one row per line, fields separated by a character. CSV (comma-separated values, loosely standardized in RFC 4180) uses a comma; TSV uses a tab. They are plain text, so any tool on earth can open them, and a human can read them in a text editor. That universality is their whole appeal.
id,name,country,amount
1,Anu,EE,42
2,Mehmet,TR,17
TSV exists mostly because tabs collide with real data far less often than commas do — names and addresses are full of commas, rarely full of tabs — which sidesteps much of the pain below.
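As a rough illustration of that difference, the sketch below writes the same rows as CSV and as TSV using Python's standard `csv` module. The sample rows (including the comma-bearing name) and the `dump` and `load` helpers are made up for this example; they are not part of any particular pipeline.

```python
import csv
import io

# Hypothetical sample rows; the second name contains a comma on purpose.
rows = [
    ["id", "name", "country", "amount"],
    ["1", "Anu", "EE", "42"],
    ["2", "Yilmaz, Mehmet", "TR", "17"],
]

def dump(rows, delimiter):
    """Serialize rows to delimited text in memory."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()

def load(text, delimiter):
    """Parse delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

csv_text = dump(rows, ",")
tsv_text = dump(rows, "\t")

# In CSV the comma-bearing name has to be quoted; in TSV it passes through bare.
print(csv_text.splitlines()[2])
print(tsv_text.splitlines()[2])

# A naive split on the delimiter miscounts the CSV fields but not the TSV ones.
naive_csv = csv_text.splitlines()[2].split(",")
naive_tsv = tsv_text.splitlines()[2].split("\t")
print(len(naive_csv), len(naive_tsv))
```

Both variants round-trip cleanly through a real parser; the difference only shows up when something splits on the delimiter by hand, which is exactly what quick scripts tend to do.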
The traps in delimited text
Plain text looks simple and hides a surprising amount of pain:
- No types. Everything is a string. Is `02141` a number or a zip code? Is `2026-06-18` a date or text? Every reader has to guess, and they guess differently. This is the number-one source of "it worked in my script but broke in yours."
- No schema. The file does not declare its columns or their types. The header row is a convention, not a guarantee, and there is nothing to validate against.
- Quoting and escaping. A comma inside a value (`"Tallinn, Estonia"`) forces quoting; a quote inside that forces escaping; a newline inside a field breaks the one-row-per-line assumption entirely. Every parser handles these edge cases slightly differently.
- Encoding. Is it UTF-8? Latin-1? Does it have a byte-order mark? Spreadsheets love to mangle this, and a wrong guess turns `ä` into garbage.
- No compression, no skipping. To read one column you parse every byte of every row. To find the last row you read the whole file.
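A few of these traps can be reproduced in a handful of lines. The sketch below uses only Python's standard library; the sample values come from the list above, and the variable names are illustrative.

```python
import csv
import io

# Trap: no types. Every field comes back as a string, and a careless
# numeric conversion silently changes the value.
zip_code, day = next(csv.reader(io.StringIO("02141,2026-06-18\n")))
as_number = int(zip_code)  # the leading zero is gone

# Trap: quoting. A quoted field may contain a comma or even a newline,
# so counting physical lines gives the wrong row count.
text = 'id,city\n1,"Tallinn, Estonia"\n2,"line one\nline two"\n'
physical_lines = text.splitlines()
parsed = list(csv.reader(io.StringIO(text)))

# Trap: encoding. The same bytes decode differently under a wrong guess.
raw = "ä".encode("utf-8")
wrong = raw.decode("latin-1")

# Trap: byte-order mark. Decoded as plain UTF-8, the BOM sticks to the
# first header name; the "utf-8-sig" codec strips it.
bom_bytes = "id,name\n".encode("utf-8-sig")
header_plain = bom_bytes.decode("utf-8").split(",")[0]
header_clean = bom_bytes.decode("utf-8-sig").split(",")[0]

print(repr(zip_code), as_number)
print(len(physical_lines), len(parsed))
print(repr(wrong), repr(header_plain), repr(header_clean))
```

None of these are bugs in the parser. They are consequences of the format carrying no types, no schema, and no declared encoding.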
CSV is excellent for what it was meant for: small datasets, quick interchange, eyeballing data. It is a bad choice the moment files get large or pipelines get automated.
JSON and JSONL
When data is nested — a record with a list of items, an object inside an object — flat CSV stops fitting and people reach for JSON. It carries a bit of type information (numbers, strings, booleans, null) and arbitrary nesting, at the cost of being verbose and still entirely text.
For data pipelines the useful variant is JSONL (JSON Lines): one JSON object per line. That makes a file streamable and splittable — you can process it line by line without loading the whole thing — while keeping JSON's flexibility. It is a common landing format for logs and event streams.
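A minimal sketch of the line-by-line pattern, using Python's standard `json` module. The event records and the `write_jsonl` and `read_jsonl` helpers are hypothetical, and an in-memory buffer stands in for a real file.

```python
import io
import json

# Hypothetical event records; "items" is nested, which flat CSV cannot hold directly.
events = [
    {"id": 1, "user": "Anu", "items": ["a", "b"], "paid": True},
    {"id": 2, "user": "Mehmet", "items": [], "paid": False, "note": None},
]

def write_jsonl(records, fh):
    """Write one compact JSON object per line."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(fh):
    """Yield records one at a time, so only one line is held in memory."""
    for line in fh:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)

buf = io.StringIO()  # stands in for a file opened in text mode
write_jsonl(events, buf)
buf.seek(0)

# Stream through the records and keep only what the query needs.
paid_ids = [rec["id"] for rec in read_jsonl(buf) if rec["paid"]]
print(paid_ids)
```

Because `json.dumps` escapes newlines inside strings, each record stays on one physical line, which is what makes the file safe to split at line boundaries.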
Why text breaks down, and what replaces it
flowchart TD
TAB["Tabular data: rows + columns"] --> TEXT["Text formats: CSV, TSV, JSON — readable, untyped, uncompressed"]
TAB --> BIN["Columnar binary: Parquet — typed, compressed, column-skipping"]
TEXT --> SMALL["Good: small data, interchange, human eyes"]
BIN --> BIG["Good: analytics, large data, machines"]
classDef plain stroke:#7b88a1,stroke-width:2.5px
classDef key stroke:#a3be8c,stroke-width:2.5px
class BIN key
class TAB,TEXT,SMALL,BIG plain
At analytical scale, text formats fail on three axes at once: they are large (no built-in compression), slow to parse (every byte, every time), and impossible to read selectively (no way to grab one column without reading all of them). The fix is a columnar binary format, and the de facto standard is Apache Parquet. It stores each column's values together (so a query reads only the columns it needs), embeds the schema and types (no guessing), compresses well (similar values sit adjacent), and carries per-chunk statistics such as minimum and maximum values so a query can skip whole blocks. The tradeoff is that you can no longer open it in a text editor: it is for machines, not eyes.
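The two ideas doing most of the work here, column-wise storage and per-chunk statistics, can be shown with a toy model in plain Python. This is not Parquet's actual file layout or API; the chunk size, the sample rows, and the function names are all assumptions made for illustration.

```python
# Toy illustration only: NOT Parquet's file format, just the two ideas it
# relies on (column-wise storage and per-chunk min/max statistics).

rows = [
    {"id": i, "country": "EE" if i % 2 else "TR", "amount": i * 10}
    for i in range(1, 9)
]

CHUNK_ROWS = 4  # hypothetical chunk size; real formats use far larger chunks

def to_chunks(rows, chunk_rows):
    """Split rows into chunks, storing each column as its own list plus stats."""
    chunks = []
    for start in range(0, len(rows), chunk_rows):
        part = rows[start:start + chunk_rows]
        columns = {name: [r[name] for r in part] for name in part[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        chunks.append({"columns": columns, "stats": stats})
    return chunks

def sum_amount_where_id_at_least(chunks, threshold):
    """Sum 'amount' for rows with id >= threshold, touching only two columns."""
    total, chunks_read = 0, 0
    for chunk in chunks:
        _, max_id = chunk["stats"]["id"]
        if max_id < threshold:
            continue  # statistics prove no row can match, so skip the chunk
        chunks_read += 1
        ids = chunk["columns"]["id"]
        amounts = chunk["columns"]["amount"]
        total += sum(a for i, a in zip(ids, amounts) if i >= threshold)
    return total, chunks_read

chunks = to_chunks(rows, CHUNK_ROWS)
total, chunks_read = sum_amount_where_id_at_least(chunks, 6)
print(total, chunks_read)
```

The query never looks at the `country` column and never opens the first chunk. A row-oriented text file offers neither shortcut, because the values of one column are interleaved with everything else.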
| Format | Human-readable | Typed | Schema | Compressed | Good for |
|---|---|---|---|---|---|
| CSV / TSV | Yes | No | No | No | Small data, interchange |
| JSON | Yes | Partly | No | No | Nested records, APIs |
| JSONL | Yes | Partly | No | No | Logs, event streams |
| Parquet | No | Yes | Yes | Yes | Analytics, large data |
The short version: tabular data is trivial, but the file format is a real decision. Use CSV/TSV for small, human-facing interchange; use Parquet for anything analytical or large. The most common avoidable mistake in data work is shipping multi-gigabyte CSVs between systems that should have been exchanging Parquet — paying in size, parse time, and type bugs for a human-readability nobody is using.