August, 2009

From XLS to CSV

The XLS format is often used to transport data. That’s boorish behavior. XLS should never be used to exchange data.


Most of the time a simple CSV file suffices. A friend gave me a two-thousand-line dataset like this:

GIS006003 ARI00601P usr9 4
GIS00300G ATD00302V usr8 l
GIS006003 ATD006019 usr10 6
GIS00700V APC007016 usr11 2

In XLS the file was over 20 MB. Even counting the half byte per character that the ASCII representation of the data wastes, that is a 1 to 250 waste of space.

But sadly, the XLS format doesn’t only waste space: it seriously compromises the meaning of the data. It’s not uncommon for numeric fields to be misinterpreted as text, so that they don’t count in numeric operations like means or sums. You can’t know whether such an error has occurred unless you inspect the file cell by cell, and even then there are some nasty internal problems that cannot be spotted visually.

So please DON’T use the XLS format to exchange data. This is the first piece of good advice from your data char.

When you ask for data, please don’t accept XLS files unless you know how much care the sender put into preparing them.

The files you receive can be dramatically wrong: their state can range from ‘difficult to work with’ to ‘completely useless’. And you can bet you’ll waste time just getting to the data you need, well before you can use it.

And, worst of all, if you skip that work, you will never be sure of the quality of the data you’re working on.

Hence, less is better.

The CSV and TXT file formats don’t have these problems. (They surely have other problems, but ones that are much more compatible with your work.)

Once the data is translated to text, it’s much simpler to find possible errors in the data fields and, what’s even better, to be SURE that you’re working on a mistake-free file (at least as far as representation errors go).

For instance, to be SURE that the fourth column of the previous example doesn’t have letters instead of figures, you can simply use the command

cut -c 33- example1.txt | xargs | sed 's/[ 0-9]//g'

That’s all. Now you know whether any non-numeric character exists in the last field. (You should obviously adapt the command from case to case.)

A little explanation for the brave. The command is a filter that operates line by line.

The first part of the command line (before the first vertical bar, commonly known as a pipe) ‘cuts’ away everything before the 33rd character of each line and keeps the rest (I had to count the columns by hand). All the content of the last column is then lined up on a single line by the xargs command (this is a sort of side effect of that command, which is far more useful for other things). Finally, the sed part deletes spaces and figures from the string, leaving in the result only what should not be there (letters or other characters).
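The sample rows as printed above are not padded to fixed widths, so here is a minimal sketch that builds a small fixed-width file and runs each stage of the pipeline separately. The file name example_fixed.txt, the helper name leftovers and the column widths (12, 12 and 8 characters, so that the fourth field starts at character 33) are assumptions made for illustration only, and the sed expression is written with the g flag so that every space and digit is deleted.

```shell
# Build a small fixed-width file: three padded fields (12 + 12 + 8 = 32
# characters), so the fourth field starts at character 33.
# The file name and the widths are assumptions for this sketch.
printf '%-12s%-12s%-8s%s\n' \
  GIS006003 ARI00601P usr9  4 \
  GIS00300G ATD00302V usr8  l \
  GIS006003 ATD006019 usr10 6 \
  GIS00700V APC007016 usr11 2 > example_fixed.txt

# Stage 1: keep only what starts at character 33 (one value per line).
cut -c 33- example_fixed.txt

# Stage 2: xargs joins those values on a single line.
cut -c 33- example_fixed.txt | xargs

# Stage 3: delete every space and digit; whatever is left is an intruder.
leftovers() {
  cut -c 33- "$1" | xargs | sed 's/[ 0-9]//g'
}
leftovers example_fixed.txt
```

On this sample the first stage prints the four values one per line, the second prints `4 l 6 2`, and the third prints `l`, the one character that should not be there.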

Hence, if the result is an empty string, I’m sure I have only figures, as it should be; but if I get a non-empty string, I have to search for the intruder line by line.

That isn’t difficult either. Just one more command. I can get the offending lines with

grep -En '^.{32}[^ 0-9]' example1.txt

where the command simply finds a non-figure at the 33rd column of the file. The result could be something like:

2:GIS00300G ATD00302V usr8 l

Here the line number is reported first, followed by the full content of the line with the offending character.
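When the fields are separated by whitespace and never contain spaces themselves, counting character positions by hand can be avoided altogether. The following is a minimal sketch under that assumption; awk is swapped in here for the cut and grep commands, and the file name example_fields.txt and the function name check_fourth_field are hypothetical.

```shell
# Sample rows from the article, written to a scratch file
# (the file name is an assumption for this sketch).
printf '%s\n' \
  'GIS006003 ARI00601P usr9 4' \
  'GIS00300G ATD00302V usr8 l' \
  'GIS006003 ATD006019 usr10 6' \
  'GIS00700V APC007016 usr11 2' > example_fields.txt

# Print "line number:content" for every line whose fourth
# whitespace-separated field is not made of digits only.
# Lines with fewer than four fields are reported as well.
check_fourth_field() {
  awk '$4 !~ /^[0-9]+$/ { print NR ":" $0 }' "$1"
}

check_fourth_field example_fields.txt
```

On these four rows it prints `2:GIS00300G ATD00302V usr8 l`. To check another column, change `$4` to the field number you need.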

Even if you use the unmentionable data software™ to work on your data (and your data char heartily recommends that you don’t), you should always use CSV or TXT files to exchange data with your peers, or to make effective use of commands like the ones seen above.

Dragging your data out of XLS

Now that you are concerned about leaving your data in XLS files, it’s time to automate the extraction from that cage.

Do not use ‘Save As…’ in the unmentionable data software™, which is not, by any means, good at this. It is also a drag to do, file by file, sheet by sheet.

So, please move your XLS files to a platform where you can use Perl (most likely Unix, Linux or Mac OS X, but even the unmentionable operating system™ will do) and use the xls2csv program by Ken Prows.

The command is very simple to use. The options are:
-x : filename of the source spreadsheet
-b : the character set the source spreadsheet is in (before)
-c : the filename to save the generated csv file as
-a : the character set the csv file should be converted to (after)
-q : quiet mode
-s : print a list of supported character sets
-h : print help message
-v : get version information
-W : list worksheets in the spreadsheet specified by -x
-w : specify the worksheet name to convert (defaults to the first worksheet)

The following example will convert a spreadsheet that is in the WINDOWS-1252 character set (WinLatin1) and save it as a csv file in the UTF-8 character set.

xls2csv -x "1252spreadsheet.xls" -b WINDOWS-1252 -c "ut8csvfile.csv" -a UTF-8

This example will convert the worksheet named “Users” in the given spreadsheet.

xls2csv -x "multi_worksheet_spreadsheet.xls" -w "Users" -c "users.csv"

The spreadsheet’s charset (-b) will default to UTF-8 if not set.

If the csv’s charset (-a) is not set, the CSV file will be created using the same charset as the spreadsheet (which is not the best option, so always try to use UTF-8).
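To avoid converting file by file, the call can be wrapped in a loop. This is only a sketch: it uses the -x, -c, -a and -q options listed above, assumes xls2csv is on your PATH, and converts just the first worksheet of each file. The function name convert_all_xls is hypothetical, and since -b is not given the source charset falls back to the default described above, so add it if your spreadsheets need it.

```shell
# Convert every .xls file in a directory to a UTF-8 CSV file next to it.
# Only the first worksheet of each file is converted (the xls2csv default).
# Usage: convert_all_xls [directory]   (defaults to the current directory)
convert_all_xls() {
  dir=${1:-.}
  for f in "$dir"/*.xls; do
    # If no .xls file exists the pattern stays unexpanded: skip it.
    [ -e "$f" ] || continue
    out="${f%.xls}.csv"
    if xls2csv -x "$f" -c "$out" -a UTF-8 -q; then
      echo "converted: $f -> $out"
    else
      echo "FAILED: $f" >&2
    fi
  done
}
```

Calling `convert_all_xls /path/to/spreadsheets` leaves one .csv file beside each .xls file and reports any file that xls2csv could not convert on standard error.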

Some known problems of the program are:

  • It probably will not work with spreadsheets that use formulas. You should first create a sheet with the static content of the formula fields copied as numbers, and then extract that sheet.
  • A line in the spreadsheet is assumed to be blank if there is nothing in the first column.
  • Some users have reported problems trying to convert a spreadsheet while it was opened in a different application. You should probably make sure that no other programs are working with the spreadsheet while you are converting it.

The script is free software; you can redistribute it and/or modify it under the same terms as the Perl interpreter itself.