CSV explained — the quirks that quietly corrupt your data (and how to avoid them)

CSV looks like the simplest file format in existence. Values, separated by commas, one row per line. You could explain it to someone in ten seconds. And yet it quietly corrupts more data than almost any other format developers touch — leading zeros vanish, dates mutate, accented characters turn to garbage, and a single comma inside a field shifts every column to the right.

The reason is that "comma-separated values" is not really one format. It's a loose family of conventions that mostly agree and occasionally don't. This guide covers what CSV actually is, the specific quirks that bite developers, and how to read and write it correctly in JavaScript, Python, and the command line.

CSV is barely a standard

There is a specification — RFC 4180 — but it was written in 2005, years after CSV was already everywhere, and it documents common practice rather than dictating it. Plenty of tools predate it or ignore it. The result is that any two CSV files can disagree on the delimiter, the quoting rules, the line ending, the encoding, and whether there's a header row, and all of them are still legitimately "CSV."

This is the root of nearly every CSV bug: you assume the file follows the same conventions your code expects, and it doesn't.

The delimiter problem

The comma in "comma-separated values" is not guaranteed. The most common alternatives are the semicolon and the tab.

The big culprit here is Excel. In locales where the comma is used as the decimal separator (much of Europe), Excel writes and expects semicolon-delimited files by default — while still calling them .csv. So a file exported from Excel in Germany opens as a single mangled column in a tool expecting commas, and vice versa.

Comma-delimited:     name,age,city
Semicolon-delimited: name;age;city
Tab-delimited (TSV): name	age	city

If you control both ends, tabs are the most robust choice — real tab characters almost never appear inside data fields, so quoting becomes nearly unnecessary. If you're consuming files from elsewhere, detect the delimiter rather than assuming it: count how many commas, semicolons, and tabs appear in the header line and pick the winner.

Quoting and escaping

The moment a value contains the delimiter, you need quoting. RFC 4180 says: wrap the field in double quotes. And if the field itself contains a double quote, escape it by doubling it.

id,name,note
1,"Smith, John","He said ""hello"""
2,"Doe, Jane","Line one
Line two"

Three things are happening there that break naive parsers:

A comma inside a field (Smith, John) — only the quotes tell the parser this is one value, not two.
A quote inside a field — represented as two double quotes (""hello""), which decode to one.
A newline inside a field — row 2's note spans two physical lines, but it is one logical record.

That last one is the killer. It means you cannot parse CSV by splitting on newlines. A file with quoted line breaks will shatter if you do. This single mistake is behind a huge share of CSV bugs in homegrown parsers.

The newline trap, stated plainly

It is tempting — almost irresistible — to write this:

// DON'T DO THIS
const rows = text.split('\n').map(line => line.split(','));

It works perfectly on your test file and fails the moment real data arrives with a comma or a newline inside a quoted field. There is no regular expression that fully and safely parses CSV either. Use a real parser. Every language has one.

The encoding problem

CSV files carry no encoding metadata. The file is just bytes; how those bytes map to characters is something the reader has to assume. Get it wrong and café becomes cafÃ©.

UTF-8 is the right default in 2026, but two things complicate it:

The BOM. Excel on Windows often writes a UTF-8 byte order mark (the bytes EF BB BF) at the start of the file. Many parsers don't strip it, so your first column header arrives as \ufeffname instead of name — and your lookup for the name column silently fails. Strip the BOM on read.

Excel's legacy encoding. Older Excel "Save as CSV" uses the system's regional code page (like Windows-1252), not UTF-8. Open that file as UTF-8 and every accented character is corrupted. If you're generating CSV for Excel users, writing UTF-8 with a BOM is the pragmatic choice — it's the signal that makes modern Excel read UTF-8 correctly.

The Excel data-corruption traps

Even with the right delimiter and encoding, opening a CSV in Excel can silently destroy the data, because Excel eagerly "interprets" cell values:

Leading zeros disappear. A zip code 07920 or a product code 00451 becomes 7920 and 451 — Excel treats them as numbers.
Long numbers go scientific. A 16-digit credit card or a phone number becomes 1.23457E+15, and the original digits are gone.
Dates mutate. A value like 3-4 (meaning "size 3 to 4") becomes 4-Mar. The famous case: scientific journals had to rename human genes because Excel kept converting gene symbols like SEPT2 and MARCH1 into dates.
Long IDs lose precision. A 20-digit identifier stored as a number loses its trailing digits to floating-point rounding.

The data in the file is fine. Excel corrupts it on import and then re-saves the corruption. The defenses: keep identifiers as text (some workflows prefix a value with a tab or use Excel's import wizard to mark columns as Text), and never round-trip a data file through Excel if you can avoid it. For quick inspection and edits, a tool that treats every cell as text by default — and doesn't try to be clever — will not mangle your IDs.

CSV vs TSV

TSV (tab-separated values) is CSV's quieter, better-behaved cousin. Because tab characters essentially never appear inside real-world fields, TSV usually needs no quoting at all, which makes it far simpler to parse correctly by hand and far less prone to the comma-in-a-field bug. If you get to choose the format for a data export, TSV is often the safer pick.

Parsing CSV in JavaScript

Don't hand-roll it. Papa Parse is the standard and handles quoting, embedded newlines, delimiter detection, and headers:

import Papa from 'papaparse';

const csv = `id,name,note
1,"Smith, John","He said ""hi"""`;

const result = Papa.parse(csv, {
  header: true,        // use first row as keys
  skipEmptyLines: true,
  dynamicTyping: false // keep everything as strings — safer for IDs
});

console.log(result.data);
// [{ id: "1", name: "Smith, John", note: 'He said "hi"' }]

// Going the other way:
const out = Papa.unparse([{ id: 1, name: "Smith, John" }]);

Note dynamicTyping: false. Letting a parser auto-convert types reintroduces the leading-zero and precision problems on the code side. Keep values as strings unless you have a reason not to.

Parsing CSV in Python

The standard library csv module handles the format correctly. For data analysis, pandas is the workhorse — but mind the dtype trap:

import csv

# Standard library — correct quoting and newline handling
with open('data.csv', newline='', encoding='utf-8-sig') as f:
    for row in csv.DictReader(f):
        print(row['name'])

# pandas — read everything as text to protect IDs and zip codes
import pandas as pd
df = pd.read_csv('data.csv', dtype=str, keep_default_na=False)

Two details that prevent real bugs: encoding='utf-8-sig' transparently strips a UTF-8 BOM if present, and dtype=str stops pandas from turning 07920 into 7920 or a long ID into a float.

CSV on the command line

# csvkit — a proper CSV toolkit (pip install csvkit)
csvlook data.csv            # pretty-print as a table
csvcut -c name,city data.csv # select columns by name
csvstat data.csv            # summary stats per column

# Quick peek without extra tools
column -s, -t data.csv | less -S   # align columns (naive — breaks on quoted commas)

That last column trick is handy for a quick look but is itself a naive splitter — it will misalign rows that contain quoted commas. Fine for eyeballing, not for anything that matters.

Converting between CSV and JSON

This is one of the most common CSV tasks: an API speaks JSON, a spreadsheet speaks CSV, and you need to move between them. The mapping is straightforward — each CSV row becomes a JSON object keyed by the header row — but the same quoting and encoding rules apply in both directions. A field containing a comma or newline must be quoted on the way out, and the header names become your JSON keys, so trailing spaces or a stray BOM in the header will show up as malformed keys.

How to avoid the pain

A short checklist that prevents most CSV disasters:

Never split on commas or newlines yourself. Use a real parser, always.
Declare your encoding as UTF-8, and strip the BOM on read. Add a BOM on write only when the consumer is Excel.
Keep values as strings through the parse step. Convert types deliberately, not automatically.
Detect the delimiter instead of assuming a comma, especially for files that may have come from Excel in another locale.
Avoid round-tripping data through Excel. If you must, mark ID columns as Text on import.
Prefer TSV when you control the format and the data may contain commas.

CSV will never be elegant. But once you know where the bodies are buried, it's entirely predictable — which, for a format this universal, is good enough.