The Random Provider

The simplest data provider is RandomProvider.

You can use RandomProvider to generate all sorts of series, drawn from a very large set of distributions.

The simplest use is to generate a series that is not random at all.

The Data URL:

random://constant/

generates a constant series of ten elements, each with value 1.0.

The RandomProvider fully supports the Data Instrument Language.

The complete list of base distributions and parameters supported by the RandomProvider is here.

Some examples:

constant random://constant/5.0
uniform random://uniform/low=10,high=20
normal random://normal/loc=.1,scale=.5

The new implementation is here.

coreutils are your friends

Once upon a time there was textutils, a small set of classic Unix utility programs to play with text files.

If you had to print out a file, reverse it, select lines, edit the text stream, paginate it, wrap it, pick the first or last lines, and so on, you could use those small commands from the Unix tradition, lasting from the good old days of Real Programmers.

Nowadays textutils are incorporated into a huge collection called coreutils, together with a lot of other tools.

They are powerful tools worth knowing, even for your data homework, so please make yourself acquainted with them. You’ll be grateful forever.

We’ve already seen a simple example, and many others will come. But remember: coreutils are your friends.
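As a small taste, here are a few one-liners you might try; the sample file below is made up just for this demo:

```shell
# Create a tiny sample file to play with.
printf 'alpha\nbeta\ngamma\ndelta\n' > sample.txt

head -n 2 sample.txt   # print the first two lines
tail -n 1 sample.txt   # print the last line
wc -l < sample.txt     # count the lines
tac sample.txt         # print the lines in reverse order (GNU coreutils)
```

Each of these programs reads a stream and writes a stream, so they compose freely with pipes.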

Mac OS X environment variables

I’ve tried hard to understand the environment variable mechanism in Mac OS X. It seems to me a bit of a mess.

The way Mac OS X applications are launched in Aqua is different from the way they are launched in other UNIX windowing environments.

In standard UNIX, all applications inherit their environment variables from the login shell.

In Mac OS X it is different. GUI applications, even if ported from Unix/Linux, do not run in the same process environment as an application launched in Terminal. This is an inheritance from NeXTStep. To bridge this difference there is a ‘strange’ file named ~/.MacOSX/environment.plist.

In a freshly installed system, neither the ~/.MacOSX directory nor the environment.plist file inside it exists. You can create them with:

defaults write ${HOME}/.MacOSX/environment PATH "${HOME}/bin:/usr/bin:/bin:/usr/local/bin"
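For reference, the file written by the defaults command above is an ordinary property list; its XML form looks roughly like this (the PATH value here is just a sample):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>PATH</key>
	<string>/Users/you/bin:/usr/bin:/bin:/usr/local/bin</string>
</dict>
</plist>
```

If defaults saves it in binary form instead, plutil -convert xml1 ~/.MacOSX/environment.plist turns it back into XML.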

The same mechanism can be used to make MANPATH, INFOPATH, LC_CTYPE, and other environment variables available.

But this can interfere with settings from the usual UNIX files, like the system’s /etc/profile and /etc/csh.login, or the user’s ~/.profile, ~/.login and the like. You have to decide which of the two systems your settings will come from. I strongly recommend ~/.MacOSX/environment.plist, because its key/value pairs, saved in XML, are so easy to edit and use: on the command line with the defaults and plutil commands or with PLTools from http://www.macorchard.com/PLTools, or “Mac like” with /Developer/Applications/Utilities/Property List Editor.app from Apple’s Mac OS X Developer Tools.

Even applications run under X11 are directly affected by this mechanism, because the X server itself now runs in Aqua, too.

From XLS to CSV

The XLS format is often used to transport data. That’s boorish behavior. The XLS format should never be used to exchange data.

Never.

Most of the time a simple CSV file suffices. A friend gave me a two-thousand-line dataset like this:

GIS006003 ARI00601P usr9 4
GIS00300G ATD00302V usr8 l
GIS006003 ATD006019 usr10 6
GIS00700V APC007016 usr11 2

In XLS the file was over 20 MB. Even allowing for the space lost to the ASCII representation of the data, it was a 1-to-250 waste.

But sadly, the XLS format doesn’t only waste space; it seriously compromises the meaning of the data. It’s not so uncommon for numeric fields to be misinterpreted as text and hence excluded from numeric operations like means or sums. You can’t know whether such an error has occurred unless you inspect the file cell by cell, and even then there are some nasty inner problems that cannot be spotted visually.

So please DON’T use XLS format to exchange data. This is the first good advice from your data char.

When you ask for data, please don’t accept XLS files unless you know how much care their author put into making them.

The received files can be dramatically wrong; their state can range from ‘difficult to work with’ to ‘completely useless’. And, sure, you can bet you’ll waste time just reaching the data you need, well before you can use it.

And, worst of all, if you don’t do that, you will never be sure of the quality of the data you’re working on.

Hence, less is better.

The CSV and TXT file formats don’t have these problems. (They surely have other problems, but ones much more compatible with your work, I mean.)

When the data is translated to text, it’s much simpler to find possible errors in the data fields and, even better, to be SURE that you’re working on a mistake-free file (at least as far as representation errors go).

For instance, to be SURE that the fourth column of the previous example doesn’t contain letters instead of digits, you can simply use the command

cut -c 33- example1.txt | xargs | sed 's/[ 0-9]\+//g'

That’s all. Now you know whether any non-digit exists in the last field. (You should obviously adapt the command case by case.)
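To see the check at work, here is a toy fixed-width file invented for the demo; its last field starts at column 17, so the cut offset changes accordingly:

```shell
# A made-up fixed-width file; the last field starts at column 17.
cat > demo.txt <<'EOF'
GIS006003 usr9  4
GIS00300G usr8  l
GIS006003 usr10 6
EOF

# Keep column 17 onward, join the values on one line,
# then delete spaces and digits: whatever is left is an intruder.
cut -c 17- demo.txt | xargs | sed 's/[ 0-9]\+//g'
# prints: l
```

Here the command reports the stray ‘l’ from the second line.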

A little explanation for the brave. The command is a filter that operates line by line.

The first part of the command line (before the first vertical bar, commonly known as a pipe) ‘cuts’ each line, keeping it from the 33rd position onward (I had to count the columns by hand). All the content of the last column is then lined up in a single row by the xargs command (a sort of side effect of this command, which is far more useful for other things too); then the sed part deletes spaces and digits from the string, leaving in the result only what should not be there (letters or other characters).

Hence, if the response is an empty string, I’m sure I have only digits, as is correct; but if I get a non-empty string, I have to search for the intruder line by line.

That isn’t difficult either. Just one more command. I can get the offending lines with

egrep -n '^.{32}[^ 0-9]' example1.txt

where the command simply finds a character that is neither a space nor a digit at the 33rd column of the file. The result could be something like:
2:GIS00300G ATD00302V usr8 l

where the line number is reported along with the full content of the line containing the found character.
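Trying this on a made-up fixed-width file whose last field starts at column 17 (so the regexp skips 16 characters instead of 32):

```shell
# A made-up fixed-width file; the last field starts at column 17.
cat > demo.txt <<'EOF'
GIS006003 usr9  4
GIS00300G usr8  l
GIS006003 usr10 6
EOF

# Report line number and content of every line whose character
# at column 17 is neither a space nor a digit.
egrep -n '^.{16}[^ 0-9]' demo.txt
# prints: 2:GIS00300G usr8  l
```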

Even if you use the unmentionable data software™ for your data work (which your data char heartily doesn’t recommend), you should always use CSV or TXT files to exchange data with your peers, and to make effective use of commands like the ones just seen.

Dragging off your data from XLS

Now that you are concerned about leaving your data in XLS files, it’s time to automate the extraction from that cage.

Do not use ‘Save As…’ in the unmentionable data software™, which is not by any means good at this. And it’s a drag to do it file by file, sheet by sheet.

So, please move your XLS files to a platform where you can use Perl (most likely Unix, Linux or Mac OS X, but even the unmentionable operating system™) and use the xls2csv program by Ken Prows.

The use of the command is very simple. The options are:
-x : filename of the source spreadsheet
-b : the character set the source spreadsheet is in (before)
-c : the filename to save the generated csv file as
-a : the character set the csv file should be converted to (after)
-q : quiet mode
-s : print a list of supported character sets
-h : print help message
-v : get version information
-W : list worksheets in the spreadsheet specified by -x
-w : specify the worksheet name to convert (defaults to the first worksheet)

The following example will convert a spreadsheet that is in the WINDOWS-1252 character set (WinLatin1) and save it as a csv file in the UTF-8 character set.

xls2csv -x "1252spreadsheet.xls" -b WINDOWS-1252 -c "ut8csvfile.csv" -a UTF-8

This example will convert the worksheet named “Users” in the given spreadsheet.

xls2csv -x "multi_worksheet_spreadsheet.xls" -w "Users" -c "users.csv"

The spreadsheet’s charset (-b) will default to UTF-8 if not set.

If the CSV’s charset (-a) is not set, the CSV file will be created using the same charset as the spreadsheet (which is not the best option, so always try to use UTF-8).

Some known problems of the program are:

  • It probably will not work with spreadsheets that use formulas. You should first create a sheet with the static content of the formula fields copied as values, and then extract that sheet.
  • A line in the spreadsheet is assumed to be blank if there is nothing in the first column.
  • Some users have reported problems trying to convert a spreadsheet while it was opened in a different application. You should probably make sure that no other programs are working with the spreadsheet while you are converting it.

The script is free software and you can redistribute it and/or modify it under the same terms as the Perl interpreter itself.