“Corrupt” import file for your Salesforce Data Load?

If you have ever been handed a CSV file to import, then you’ve probably run into “corrupt” files. Either the data contains “gibberish” characters, or the file won’t load at all. And you have probably had to fix the data by importing and exporting it with various tools, or by running mass find-and-replace exercises to swap invalid characters for suitable ones.

Most of the time it’s probably not data corruption at all, but rather a mix-up with character encoding. These days character encoding is not something one normally needs to think about; it is generally handled automatically. But when it comes to importing data, it can be a very important factor. The basic premise is that the same text can be saved as bytes in several different encodings, and it must be read back using the same encoding it was saved with in order to load correctly.
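To make that premise concrete, here is a minimal Python sketch. The name comes from the sample data in this post; the pairing of UTF-8 with CP 437 is only an illustration, not a claim about what any particular tool does internally.

```python
# The same text, turned into bytes and back again.
name = "Jean-Lúc"

# Saving: the text becomes bytes according to the chosen encoding.
utf8_bytes = name.encode("utf-8")      # b'Jean-L\xc3\xbac'

# Reading with the same encoding gives the original text back.
same = utf8_bytes.decode("utf-8")      # 'Jean-Lúc'

# Reading with a different encoding "works", but the two bytes that
# made up the accented letter are shown as two unrelated characters.
wrong = utf8_bytes.decode("cp437")     # 'Jean-L├║c'

print(same)
print(wrong)
```

Nothing in the file is damaged in the second case; the bytes are simply being interpreted with the wrong table.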

Let’s look at three sets of sample contact data, all containing the same values, but each encoded differently:

The first example is the raw export from Salesforce (using LexiLoader). Notice the accented character in Jean-Lúc’s name; this is how it should look.

Default

In the second example, that same file has been saved with “DOS (CP 437)” encoding, and the accented character is now rendered as “gibberish”. Importing this file will result in that incorrect character appearing in Salesforce.
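As a rough illustration of why a CP 437 file misbehaves, the sketch below encodes the sample name as CP 437 and then reads it back two other ways. Which encoding a given importer actually assumes is not stated in this post, so UTF-8 and Windows-1252 here are assumptions chosen for the demonstration.

```python
# In CP 437 the accented letter is stored as the single byte 0xA3.
dos_bytes = "Jean-Lúc".encode("cp437")          # b'Jean-L\xa3c'

# A strict UTF-8 reader rejects that byte outright.
try:
    dos_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient UTF-8 reader substitutes the replacement character U+FFFD.
lenient = dos_bytes.decode("utf-8", errors="replace")   # 'Jean-L\ufffdc'

# A Windows-1252 reader maps 0xA3 to a pound sign instead.
as_windows = dos_bytes.decode("cp1252")                 # 'Jean-L£c'

print(strict_ok, lenient, as_windows)
```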

DOS

This final example was saved using 16-bit “Big Endian” encoding (UTF-16 BE). Salesforce won’t import this file at all; LexiLoader will tell you that it is an invalid format.
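If you want to check a suspect file before opening an editor, one option is to look at its first few bytes for a byte order mark (BOM). The helper below is a hypothetical sketch (`sniff_bom` is my own name, not part of any Salesforce tool), and it only helps when the file was saved with a BOM, which is not guaranteed.

```python
import codecs

# Known byte order marks. The UTF-32 marks are checked first because
# the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None

# A UTF-16 BE file with a BOM: every plain ASCII letter is preceded by a zero byte.
sample = codecs.BOM_UTF16_BE + "Jean-Lúc".encode("utf-16-be")
print(sample[:6])          # b'\xfe\xff\x00J\x00e'
print(sniff_bom(sample))   # utf-16-be
```

Those interleaved zero bytes are a plausible reason a loader expecting single-byte or UTF-8 text would reject the file as an invalid format.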

16BE

But there is a pretty easy way to remedy this. If you open the file with an advanced text editor, like Notepad++ or Sublime Text, you can explicitly set the encoding that is used when the file is saved. Notepad++ in particular does a really good job of auto-detecting the correct encoding. So open the CSV file you are having issues with, then save it with “UTF-8” encoding. You should be able to view and import it correctly now.
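If you have many files, or would rather script the fix, the same open-then-save-as-UTF-8 step can be sketched in Python. This assumes you already know the file's current encoding; `convert_to_utf8` and the file names in the usage note are hypothetical.

```python
def convert_to_utf8(src, dst, source_encoding):
    """Read src using source_encoding and write the same text to dst as UTF-8."""
    # newline="" leaves the CSV's line endings exactly as they were.
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)
```

For example, `convert_to_utf8("contacts_dos.csv", "contacts_utf8.csv", "cp437")` would re-save a CP 437 file. If the source encoding is wrong, Python may raise a `UnicodeDecodeError` or silently produce the wrong characters, so it is worth opening the result and checking the accented names before importing.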

If you’d like some more information on character encoding I recommend this article as a starting point.