Tutorial: Working with really large CSV files

Jun 8 2019

One design goal when developing Tablecruncher was to enable users to work with really large CSV files in an efficient manner. But this comes at a price: To achieve that good performance Tablecruncher has to store the complete file in memory in a special format that consumes a lot of additional memory. So, when you open for example a 2 GB file it may use more than 6 GB of your machine’s RAM.

If the memory usage of Tablecruncher comes close to the amount of RAM that’s built into your Mac, things may extremely slow down. Your computer will start to store some content of your computer’s memory onto your SSD or, even worse, your hard disk. As a consequence, the performance gain will be lost and your Mac will become rather slow.

This tutorial provides some hints to reduce memory usage and what to expect when working on large CSV files. If your files are smaller than 100 MB though, you usually don’t have to worry about memory usage or performance.

Memory Usage

You can estimate the memory usage of your CSV file with this simple formula:
    memory = 25 * R * C + F
where R is the number of rows, C the number of columns and F the file size in bytes.

One of my test files is 524 MB large, contains 10 columns in 4.4 million rows. Using the formula from above the RAM usage will be about 1.6 GB:
    memory = 25 * 4,400,000 * 10 + 524,000,000 = 1,624,000,000 bytes

While this file is opened in Tablecruncher the Activity Monitor reports 1.4 GB RAM used, so the formula represents a rather accurate guess.

Closing and other actions will take some time

A large CSV file may consist of dozens of millions of cells, each cell has to be stored in memory when opening a file and this memory has again to be cleared when the file is closed. So be aware that closing a window with a large CSV file may take a considerable amount of time. During this time Tablecruncher may seem non-responsive. For example, opening a 500 MB file takes about 18 seconds on my iMac, closing the same file takes three or four seconds.

How Undo works and memory usage

Several actions change multiple cells at once. For such actions (especially moving, deleting and inserting columns or rows as well as execute some Macro code) Tablecruncher stores a copy of the whole table in memory to fast and reliably undo this action if needed. This works great (and fast) for small and medium-sized tables, but it costs a lot of RAM for large tables. If you’re working on a large CSV file, please be aware that every action that possibly changes more than one cell may take several seconds as Tablecruncher is storing the current state. The above mentioned 500 MB test file takes about six seconds to insert an additional column.

When working with really large tables it is recommended to disable Undo functionality. This is done by Edit > Disable Undo …. You may switch on Undo at any time, but please be aware that all changes you did while Undo was disabled are of course not stored and thus can’t be undone. Disabling Undo also deletes all undo states.

Tablecruncher warns you when you open a file consuming more than 20 percent of your RAM and offers to disable Undo functionality for you.

Choose CSV definition manually

For smaller files, Tablecruncher reads the first lines of your CSV file and tries its best to guess the definition of this file. For larger files, on the other hand, a wrong guess would only be obvious after Tablecruncher read the whole file, so it always asks you to enter the correct definition. The definition dialog offers the values Tablecruncher thinks are the best guess for this file, but you may change those values. You can reach this behavior for small files by choosing File > Open with format … too. If you opened a file using the wrong definition, File > Reopen … offers to open this file again, this time presenting the definition dialog.

Outlook

I’ll thrive to make Tablecruncher the best tool to work with even very large tabular data on your Mac. So several improvements are planned for later versions. One improvement will be a better storage engine that should save a lot of memory on large files. A more sophisticated Undo functionality is another feature that should reduce memory consumption.

All tests in this tutorial have been executed on an iMac from 2011 with 16 GB RAM and an SSD. With newer machines, execution times should be lower.

If you’ve questions regarding this tutorial or if you have ideas for further tutorials, don’t hesitate to contact me by e-mail: info@tablecruncher.com

Download (Apple Silicon) Download (Intel)
Version 1.7.0
requires macOS Sequoia, Sonoma, Ventura (Apple Silicon) or macOS Sequoia, Sonoma, Ventura, Monterey, Big Sur, Catalina (Intel)