Updated 18 Jun 2026

CSV, TSV, and tabular data formats explained

Tabular data is rows and columns, but how you serialize it to a file matters more than people think. This part covers CSV and TSV and their traps, JSON and JSONL, and why columnar binary formats like Parquet exist at all.

Data & lakehouse
How data is stored: Part 3 of 8

Tabular data is just rows and columns — a spreadsheet, a database table, a query result. The data is simple; the interesting question is how you write it to a file so another program can read it back. The format you pick decides whether your data is human-readable, whether its types survive the round trip, how big the file is, and how fast a query can skip the parts it does not need. Most people default to CSV and discover the costs later.

Delimited text: CSV and TSV

The oldest and most universal answer is delimited text: one row per line, fields separated by a character. CSV (comma-separated values, documented but never strictly standardized by the informational RFC 4180) uses a comma; TSV uses a tab. They are plain text, so practically any tool can open them, and a human can read them in a text editor. That universality is their whole appeal.

id,name,country,amount
1,Anu,EE,42
2,Mehmet,TR,17

TSV exists mostly because tabs collide with real data far less often than commas do — names and addresses are full of commas, rarely full of tabs — which sidesteps much of the pain below.
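
As a minimal sketch of how close the two formats are, the snippet below parses the sample table above as both CSV and TSV using Python's standard `csv` module. The helper name `read_delimited` is made up for illustration; only the delimiter differs between the two calls.

```python
import csv
import io

CSV_TEXT = "id,name,country,amount\n1,Anu,EE,42\n2,Mehmet,TR,17\n"
# The same table as TSV: only the delimiter changes.
TSV_TEXT = CSV_TEXT.replace(",", "\t")


def read_delimited(text: str, delimiter: str) -> list[dict[str, str]]:
    """Parse delimited text into one dict per row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)


csv_rows = read_delimited(CSV_TEXT, ",")
tsv_rows = read_delimited(TSV_TEXT, "\t")

print(csv_rows[0])           # every value is a string, including id and amount
print(csv_rows == tsv_rows)  # both files describe the same table
```

Note that `id` and `amount` come back as the strings "1" and "42", not as numbers; the file carries no type information for the reader to use.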

The traps in delimited text

Plain text looks simple and hides a surprising amount of pain:

  • No types. Everything is a string. Is 02141 a number or a zip code? Is 2026-06-18 a date or text? Every reader has to guess, and they guess differently. This is the number-one source of "it worked in my script but broke in yours."
  • No schema. The file does not declare its columns or their types. The header row is a convention, not a guarantee, and there is nothing to validate against.
  • Quoting and escaping. A comma inside a value ("Tallinn, Estonia") forces quoting; a quote inside a quoted value must itself be escaped, usually by doubling it; a newline inside a field breaks the one-row-per-line assumption entirely. Every parser handles these edge cases slightly differently.
  • Encoding. Is it UTF-8? Latin-1? Does it have a byte-order mark? Spreadsheets love to mangle this, and a wrong guess turns ä into garbage.
  • No built-in compression, no skipping. To read one column you parse every byte of every row. To reach row N you scan everything before it, because rows have variable length and the file carries no index.
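
Three of these traps can be reproduced in a few lines. This is a small sketch using only Python's standard library; the sample values are illustrative, and other parsers may behave differently on the same input, which is exactly the problem.

```python
import csv
import io

# Trap: no types. Every field comes back as a string, and a naive numeric
# conversion silently drops the leading zero of a zip code.
raw = "zip,visited\n02141,2026-06-18\n"
row = next(csv.DictReader(io.StringIO(raw, newline="")))
zip_as_text = row["zip"]      # "02141"
zip_as_int = int(row["zip"])  # 2141: the leading zero is gone

# Trap: quoting and escaping. A comma, a quote, and a newline inside fields
# all require quoting. The csv module handles this; str.split(",") does not.
tricky = ["Tallinn, Estonia", 'She said "hi"', "line one\nline two"]
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(tricky)
encoded = buffer.getvalue()

round_tripped = next(csv.reader(io.StringIO(encoded, newline="")))
naive_fields = encoded.split(",")  # splits inside the quoted first field

# Trap: encoding. UTF-8 bytes decoded as Latin-1 turn one character into two.
mangled = "ä".encode("utf-8").decode("latin-1")

print(repr(encoded))
print(len(tricky), len(naive_fields))
print(mangled)
```

A real CSV parser recovers the three original fields, while the naive split produces four pieces, and the wrongly decoded "ä" shows up as "Ã¤".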

CSV is excellent for what it was meant for: small datasets, quick interchange, eyeballing data. It is a bad choice the moment files get large or pipelines get automated.

JSON and JSONL

When data is nested — a record with a list of items, an object inside an object — flat CSV stops fitting and people reach for JSON. It carries a bit of type information (numbers, strings, booleans, null) and arbitrary nesting, at the cost of being verbose and still entirely text.
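
A short sketch of what "a bit of type information" means in practice, using Python's standard `json` module. The record and its field names (`paid`, `coupon`, `items`, `sku`, `qty`) are invented for illustration.

```python
import json

# A nested record that would not fit in one flat CSV row.
order = {
    "id": 1,
    "name": "Anu",
    "paid": True,
    "coupon": None,
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
}

text = json.dumps(order)
restored = json.loads(text)

print(text)
print(restored == order)     # numbers, booleans, null, and nesting survive
print(type(restored["id"]))  # an int, not the string "1"
```

The type support is only partial: JSON has no date type, so a value like 2026-06-18 still travels as a string that the reader has to interpret.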

For data pipelines the useful variant is JSONL (JSON Lines): one JSON object per line. That makes a file streamable and splittable — you can process it line by line without loading the whole thing — while keeping JSON's flexibility. It is a common landing format for logs and event streams.
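
The streaming property can be sketched as a generator that parses one line at a time. The function name `iter_jsonl` and the event fields are hypothetical; an in-memory `io.StringIO` stands in for an open log file.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, never holding the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for an open log file; the events are made up for illustration.
log = io.StringIO(
    '{"event": "login", "user": "anu"}\n'
    '{"event": "purchase", "user": "mehmet", "amount": 17}\n'
)

# Aggregate while streaming: only one record is in memory at a time.
purchase_total = sum(record.get("amount", 0) for record in iter_jsonl(log))
print(purchase_total)
```

Because each line is a complete JSON document, a large file can also be split at line boundaries and the pieces processed independently.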

Why text breaks down, and what replaces it

flowchart TD
    TAB["Tabular data: rows + columns"] --> TEXT["Text formats: CSV, TSV, JSON — readable, untyped, uncompressed"]
    TAB --> BIN["Columnar binary: Parquet — typed, compressed, column-skipping"]
    TEXT --> SMALL["Good: small data, interchange, human eyes"]
    BIN --> BIG["Good: analytics, large data, machines"]

    classDef plain stroke:#7b88a1,stroke-width:2.5px
    classDef key stroke:#a3be8c,stroke-width:2.5px
    class BIN key
    class TAB,TEXT,SMALL,BIG plain

At analytical scale, text formats fail on three axes at once: they are large (no compression), slow to parse (every byte, every time), and impossible to read selectively (no way to grab one column without reading all of them). The fix is a columnar binary format, and the standard one is Apache Parquet. It stores each column together (so a query reads only the columns it needs), embeds the schema and types (no guessing), compresses hard (similar values sit adjacent), and carries per-chunk statistics so a query can skip whole blocks. The tradeoff is that you can no longer open it in a text editor — it is for machines, not eyes.
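
To make column-wise storage and statistics-based skipping concrete, here is a toy in-memory model in plain Python. It is not Parquet's actual file layout or API; `ColumnChunk`, `to_columnar`, and `sum_where_at_least` are hypothetical names, and the sketch only illustrates the two ideas of storing each column separately and keeping min/max values per block.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """One column's values for a block of rows, plus min/max statistics."""
    values: list
    minimum: object
    maximum: object


def to_columnar(rows: list[dict], chunk_size: int) -> dict[str, list[ColumnChunk]]:
    """Split rows into blocks and store each column of each block separately."""
    table: dict[str, list[ColumnChunk]] = {name: [] for name in rows[0]}
    for start in range(0, len(rows), chunk_size):
        block = rows[start:start + chunk_size]
        for name in table:
            values = [r[name] for r in block]
            table[name].append(ColumnChunk(values, min(values), max(values)))
    return table


def sum_where_at_least(table, column: str, threshold) -> tuple[int, int]:
    """Sum values >= threshold in one column; return (total, chunks_scanned)."""
    total, scanned = 0, 0
    for chunk in table[column]:  # other columns are never touched
        if chunk.maximum < threshold:
            continue  # statistics prove no value qualifies, so skip the block
        scanned += 1
        total += sum(v for v in chunk.values if v >= threshold)
    return total, scanned


rows = [{"id": i, "amount": i * 10} for i in range(1, 9)]  # amounts 10..80
table = to_columnar(rows, chunk_size=4)
total, scanned = sum_where_at_least(table, "amount", 60)
print(total, scanned)
```

With two blocks of four rows, the query scans only the second block of the `amount` column: the first block's maximum is 40, so it is skipped without reading its values, and the `id` column is never read at all.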

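| Format    | Human-readable | Typed  | Schema | Compressed | Good for                |
|-----------|----------------|--------|--------|------------|-------------------------|
| CSV / TSV | Yes            | No     | No     | No         | Small data, interchange |
| JSON      | Yes            | Partly | No     | No         | Nested records, APIs    |
| JSONL     | Yes            | Partly | No     | No         | Logs, event streams     |
| Parquet   | No             | Yes    | Yes    | Yes        | Analytics, large data   |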

The short version: tabular data is trivial, but the file format is a real decision. Use CSV/TSV for small, human-facing interchange; use Parquet for anything analytical or large. The most common avoidable mistake in data work is shipping multi-gigabyte CSVs between systems that should have been exchanging Parquet — paying in size, parse time, and type bugs for a human-readability nobody is using.