Macro Tutorial: Find Duplicates in CSV File

Mar 1 2019

A typical task when working with CSV files is to find and process duplicates, very often to delete them. Since version 0.9.7 Tablecruncher provides Javascript as a builtin macro language. This tutorial shows how to find and delete entries with a duplicate column.

Please be aware, that Tablecrucher is still in beta. Always make backups of your files before manipulating them with Tablecruncher. The macro API is subject to change and more powerful features will be added in the future, making tasks like this one easier.

Step 1: Our initial file

This is our initial file that serves as an example for this tutorial.

Step 2: Select the column with the values to check for duplicates

Select the column that contains the values you want to check for duplicates (by simply clicking on the header field, in our example “E-Mail”) and insert a new column to the right using Edit > Insert Column Right.

Step 3: Set a header name for the new column

To set a new header name, uncheck the “Header” button in the tool bar (or use View > Switch Header Row) and enter the name for the new column in the first row. We’re using “Duplicate” in this tutorial.

Step 4: Set headers

Don’t forget to click the “Header” button again, so that our CSV file is represented showing the header names.

Step 5: Sort the column with the values to check for duplicates

Now we’re going to sort the column which possibly contains duplicate entries. This step ensures all rows with duplicates are grouped together. To achieve this, just right-click the header of our column “E-Mail” or use Find > Sort … and choose the correct column from the dropdown.

After clicking sort the table is sorted by the value of the chosen column.

Step 6: Mark lines with duplicates

While still the column “E-Mail” is still selected, choose Macro > Execute Macro …, click the little “+” sign and add a new macro with the name “Find Duplicates”. Insert the following code into the source area of the macro editor.

// Find Duplicates
for( r = ROWMIN+1; r<=ROWMAX; ++r) {
  var cell = getString(r,COLMIN);
  var prevCell = getString(r-1,COLMIN);
  if( cell == prevCell ) {
    setCell(r,COLMIN+1,"DUP");
  } else {
    setCell(r,COLMIN+1,"");
  }
}

This code runs from the second selected row (ROWMIN+1) to the last row (ROWMAX). In cell we store the content of the recent cell, while in prevCell there’s the content of the cell above it. If cell equals prevCell we found a duplicate, so we store the string “DUP” to the cell right to our duplicate cell using setCell(r,COLMIN+1,"DUP"). The first occurence of any multiple values isn’t marked with “DUP”, though.

Just click execute and the macros marks all rows with duplicate entries with “DUP” in the column “Duplicate”:

Step 7: Sort by column “Duplicate”

Now, we’re going to sort the table by the column “Duplicate”. You remember: Right-clicking the header cell opens the sort window with the clicked column preselected.

Click sort to sort the table by the choosen column. This groups all rows containing duplicates at the end of our table. Before sorting you should have checked if all the correct rows have the “DUP” marker.

Step 8: Delete all rows containing “DUP”

To finally delete the rows containing duplicates, select all lines from the first occurence of “DUP” downward. The easiest way is to select the first occurence of “DUP” with the mouse and then use the keyboard, hold down SHIFT and FN keys and press the DOWN key. Now click Edit > Delete Rows … or use the appropriate icon from the toolbar.

Step 9: We’re finished

You have now cleaned your CSV files from the rows containing duplicate entries. You can delete the column “Duplicate” now if you wish.

This tutorial should give you some ideas on how to implement your data cleansing tasks utilizing the power of the builtin Javascript language. If you’ve questions regarding this tutorial or if you have ideas for further tutorials, don’t hesitate to contact me by e-mail: info@tablecruncher.com

Download Beta – Free!
Version 0.9.7
requires Mac OS X El Capitan, macOS Sierra, High Sierra or Mojave