Despite their simplicity, CSV files come with surprising complexity. If you’re working with structured data, here’s what you really need to know about how CSV works—and where it doesn’t.
CSV files aim to satisfy several goals at once: they should be easy for humans to read and edit, simple for programs to generate and parse, and usable across a wide range of tools. With these goals in mind, many of the quirks we see in practical usage are easier to understand.
There’s no official, universally followed specification for CSV. The most commonly referenced document is RFC 4180, published in 2005. But it’s not a strict standard—just a best-effort attempt to describe what CSV should look like.
Even RFC 4180 itself acknowledges this:
While there are various specifications and implementations for the CSV format, there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.
In other words: RFC 4180 defines one way CSV could work. But many tools, programming languages, and legacy systems don’t follow it at all.
Here’s a condensed version of what RFC 4180 recommends:
One record per line: Each line is a new row. Line endings should be CRLF (\r\n), but many systems use LF (\n) anyway.

Commas as field separators: It assumes commas, not semicolons or tabs.

Optional header row: The first line may contain column names, but it’s not required.

Fields can be quoted with double quotes: Quoting is needed when a field contains the delimiter, quotes, or line breaks.

Escaping quotes by doubling: If a quoted field includes a double quote, it must be escaped by writing two double quotes ("").
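Python’s standard csv module maps fairly directly onto these recommendations, so a short round trip makes a useful sanity check (the data here is invented for illustration):

```python
import csv
import io

# Write two rows following RFC 4180 conventions: comma delimiter,
# CRLF line endings, quoting only where a field requires it.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", quotechar='"',
                    quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerow(["name", "comment"])                # optional header row
writer.writerow(["Alice", 'She said "hi", twice'])  # needs quoting + doubling

data = buf.getvalue()
print(repr(data))
# 'name,comment\r\nAlice,"She said ""hi"", twice"\r\n'

# Reading it back restores the original field values.
rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['Alice', 'She said "hi", twice']
```

Note how only the field that actually needs quoting gets quoted, and the inner quotes come out doubled, exactly as the rules above describe.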
Sounds reasonable? Yes—but most real-world CSV files deviate from these rules in one way or another. That’s why relying solely on RFC 4180 compatibility isn’t enough.
A CSV file doesn’t need to start with a header. Some do, some don’t. Some use metadata elsewhere (like in a README or separate spec file). You can’t rely on the first row being column names. At the other extreme, some so-called CSV files may even include multiple header rows.
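There’s no reliable way to know; at best you can guess. Python’s csv.Sniffer, for example, offers a has_header() heuristic that compares the first row against the rest; treat its verdict as a hint, not a fact:

```python
import csv

sample = "id,name,price\n1,apple,0.50\n2,pear,0.80\n"

# has_header() guesses that a header is present when the first row
# differs in kind from the rows below it (e.g. text labels sitting
# above numeric columns). It is only a heuristic and can be wrong.
sniffer = csv.Sniffer()
print(sniffer.has_header(sample))  # True for this sample
```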
While “CSV” stands for Comma-Separated Values, many files use semicolons, tabs, or pipes instead. In some regions, semicolon is the default because the comma is used as a decimal separator. Quoted fields can still contain commas, but this often makes the file harder for humans to read. Switching to semicolon as the separator keeps the file easier to scan for human readers in such a case. This is the reason CSV is sometimes explained as Character-Separated Values.
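Most CSV libraries let you set the separator explicitly, and some can guess it. A sketch with Python’s csv.Sniffer, using an invented semicolon-separated example of the kind produced in comma-decimal locales:

```python
import csv
import io

# Semicolon-delimited "CSV" with commas as decimal separators.
data = "name;price\nwidget;1,99\ngadget;24,50\n"

# Restricting the candidate delimiters makes the guess more robust.
dialect = csv.Sniffer().sniff(data, delimiters=";,")
print(dialect.delimiter)  # ';'

rows = list(csv.reader(io.StringIO(data), dialect=dialect))
print(rows[1])  # ['widget', '1,99']
```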
Some files quote every field. Others quote only fields that contain a separator or newline. Quotes within quoted fields are usually escaped by doubling them ("He said ""hi"""), but not always; there’s no universal escaping rule. CSV parsers should aim to be as fault-tolerant as possible. Trouble arises when sloppy quoting leads to ambiguity.
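With Python’s csv module you can see both conventions in action: doubled quotes are the default, and backslash escaping can be handled by switching dialect options (the input strings are made up):

```python
import csv
import io

# RFC 4180 style: inner quotes doubled inside a quoted field.
good = '"He said ""hi"""\n'
print(list(csv.reader(io.StringIO(good))))      # [['He said "hi"']]

# Some producers backslash-escape instead; the reader can be told so.
sloppy = '"He said \\"hi\\""\n'
rows = list(csv.reader(io.StringIO(sloppy),
                       escapechar="\\", doublequote=False))
print(rows)                                     # [['He said "hi"']]
```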
If a field is quoted, it can contain actual line breaks. This means you can’t reliably split a CSV file by line and expect one record per line, though you’ll find lots of such false solutions when people try to process a CSV file using command-line tools. Only if you’re completely sure that no field contains a line break should you use such a naive approach.
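A small demonstration of why line-based tools mislead here, using Python (the data is invented):

```python
import csv
import io

data = 'id,note\n1,"line one\nline two"\n2,plain\n'

# Naive line splitting cuts the quoted field in half:
print(len(data.splitlines()))  # 4 physical lines

# A real CSV parser keeps the embedded newline inside the field:
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))               # 3 logical records
print(rows[1])                 # ['1', 'line one\nline two']
```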
Coming from spreadsheet applications, many users assume columns or fields in a CSV document have some kind of well-defined types like Integer or Float or Date. This isn’t technically possible as in a CSV there’s no metadata that would contain such information. CSV is extremely WYSIWYG—what you see is what you get.
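This is easy to verify: a parser hands you nothing but strings, and any typing happens afterwards, at your own risk. A quick Python illustration:

```python
import csv
import io

data = "id,price,date\n42,9.99,2024-01-31\n"
rows = list(csv.reader(io.StringIO(data)))

# Every field comes back as a plain string, whatever it "looks like".
print(rows[1])                               # ['42', '9.99', '2024-01-31']
print({type(v).__name__ for v in rows[1]})   # {'str'}
```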
Some applications, like Excel and even some CSV-specific tools, try to guess a column’s data type based on its content. While this seems convenient, it can cause all kinds of problems. Perhaps your CSV contains the product number 9007199254740993. An application that reads it and internally converts it to a numeric type using IEEE 754 double-precision floating point turns this number into 9007199254740992 (and Excel, which displays at most 15 significant digits, even shows it as 9007199254740990). If you don’t believe it, try it for yourself in Excel. Or in Python.
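In Python the effect is easy to reproduce: 2**53 + 1 is the first integer a 64-bit IEEE 754 float cannot represent exactly:

```python
product_number = "9007199254740993"   # 2**53 + 1, read from a CSV field

# What a type-guessing tool does internally:
as_float = float(product_number)
print(int(as_float))        # 9007199254740992, off by one

# Keeping the field as a string preserves it exactly.
print(product_number)       # 9007199254740993
```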
If a CSV document contains the line abc,def ,ghi (note the space after “def”), then the three fields have the content abc, “def ” (including the trailing space) and ghi. According to RFC 4180, spaces around the “actual” content are part of that content.
You shouldn’t use spaces (or other whitespace characters) to make a CSV file more readable for humans.
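Python’s csv module follows RFC 4180 here and keeps the whitespace; the skipinitialspace option only trims spaces directly after a delimiter:

```python
import csv
import io

data = "abc,def ,ghi\nabc, def,ghi\n"
rows = list(csv.reader(io.StringIO(data)))
print(rows[0])   # ['abc', 'def ', 'ghi']  (trailing space kept)
print(rows[1])   # ['abc', ' def', 'ghi']  (leading space kept)

# skipinitialspace removes only the space *after* the delimiter:
rows2 = list(csv.reader(io.StringIO(data), skipinitialspace=True))
print(rows2[1])  # ['abc', 'def', 'ghi']
```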
Even if a header row is present, nothing stops it from having duplicate column names. Most software will accept this without warning—and if you just use CSV editors or spreadsheet applications to open and read such a file, there shouldn’t be any problems. The trouble begins when you try to import a CSV file with duplicate column names into a relational database where each column name has to be unique. Or if you convert the CSV document to JSON or XML.
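The data loss can be silent. With Python’s csv module, for example, a positional reader keeps both columns while DictReader collapses them onto one key (example data invented):

```python
import csv
import io

data = "id,name,name\n1,alice,bob\n"

# A positional reader keeps every column.
print(list(csv.reader(io.StringIO(data)))[1])   # ['1', 'alice', 'bob']

# DictReader maps values onto header names; the duplicate key
# silently wins and one value is lost.
row = next(csv.DictReader(io.StringIO(data)))
print(dict(row))                                # {'id': '1', 'name': 'bob'}
```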
CSV is best understood as a convention, not a specification. It’s flexible, and that’s both its strength and its biggest weakness. If you’re building or using tools that handle CSV files, make as few assumptions as possible: don’t count on a header row, a particular delimiter, consistent quoting, or one record per physical line.
Bottom line:
A CSV file is just text—nothing more. Its structure depends entirely on convention and context. Treat it with caution, and always validate what you’re working with.
This tutorial is based on a newsletter issue from The Missing Header, my practical data newsletter. Subscribe to get more real-world insights into data handling, CSV quirks, and tooling tips.