Mac CLI: Simple creation and management of disk images

From the Mac command line you can create a disk image with:

FOLDER=
hdiutil create $FOLDER.dmg -ov -volname "$FOLDER" -fs HFS+ -srcfolder "$FOLDER"

If you want to convert the Read-Only DMG image to a writable sparsebundle you can use:

hdiutil convert $FOLDER.dmg -format UDSB -o $FOLDER.sparsebundle

And you can set its size with:

hdiutil resize -size 4g $FOLDER.sparsebundle

Mac sparsebundles can be cloned with rsync as:

rsync -aNHAXEx --delete --protect-args --fileflags --force-change $FOLDER.sparsebundle /path/to/destination

The Random Provider

The simplest data provider is RandomProvider.

You can use RandomProvider to generate all sorts of series from a very large set of distributions.

The simplest use is to generate a series that is not random at all.

The Data URL:

random://constant/

generates a constant series of ten elements, each of value 1.0.

The RandomProvider fully supports the Data Instrument Language.

The complete list of base distributions and parameters supported by the RandomProvider is here.

Some examples:

constant random://constant/5.0
uniform random://uniform/low=10,high=20
normal random://normal/loc=.1,scale=.5
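As a rough illustration of how such URLs could map onto generators, here is a minimal Python sketch. It is not the RandomProvider implementation: the function name random_series, the size and seed parameters, and the use of Python's standard random module are all assumptions made for the example. Only the URL shapes, the ten-element length and the default constant value of 1.0 come from the text above.

```python
import random
from urllib.parse import urlsplit


def random_series(url, size=10, seed=None):
    """Generate a series from a random:// Data URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "random":
        raise ValueError(f"not a random:// URL: {url!r}")

    distribution = parts.netloc          # e.g. "constant", "uniform", "normal"
    arguments = parts.path.lstrip("/")   # e.g. "5.0" or "low=10,high=20"

    if distribution == "constant":
        # An empty instrument falls back to the value 1.0.
        value = float(arguments) if arguments else 1.0
        return [value] * size

    # Other distributions take comma-separated key=value parameters.
    params = {}
    for item in arguments.split(","):
        if item:
            key, text = item.split("=", 1)
            params[key] = float(text)

    rng = random.Random(seed)  # seed is an assumption, for reproducibility
    if distribution == "uniform":
        return [rng.uniform(params["low"], params["high"]) for _ in range(size)]
    if distribution == "normal":
        return [rng.gauss(params["loc"], params["scale"]) for _ in range(size)]
    raise ValueError(f"unsupported distribution: {distribution!r}")
```

For example, random_series("random://uniform/low=10,high=20", seed=1) returns ten floats between 10 and 20.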

The new implementation is here.

The Data Instrument Language

In e4t a Data URL is divided into four parts:

scheme://source/instrument?modifiers
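The four parts can be pulled apart with Python's standard urllib.parse module. This is only a sketch of the idea, not e4t's own parser: the DataURL tuple and the split_data_url function are names invented for this example.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class DataURL(NamedTuple):
    """The four parts of a Data URL (hypothetical container)."""
    scheme: str
    source: str
    instrument: str
    modifiers: str


def split_data_url(url: str) -> DataURL:
    parts = urlsplit(url)
    # urlsplit calls the source "netloc", the instrument "path"
    # (with a leading slash) and the modifiers "query".
    return DataURL(parts.scheme, parts.netloc, parts.path.lstrip("/"), parts.query)
```

Applied to random://uniform/low=10,high=20 this gives the scheme "random", the source "uniform", the instrument "low=10,high=20" and empty modifiers.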

The Data Instrument Language expresses the target and the qualifiers for the third part of a Data URL (the instrument).

The target is the mnemonic code (or the expression, if supported by the source) that will be sent to the data provider, whereas the qualifiers modify the nature of the query.

It closely resembles the Datastream symbology, with some minor differences.

TARGET~QUALIFIER~QUALIFIER~QUALIFIER

Request options can be set by adding qualifiers to the instrument string. All qualifiers begin with the tilde (~) character. Where order is important, parameters are parsed from the end of the request. Where the request is a Datastream expression that includes the tilde character (for example, for currency conversion), this is indicated as two adjacent tildes (NYI).

Qualifiers

The following qualifiers are single options which may be combined (with some restrictions) within a single request:

  • ~D Make request a daily time series. This is the default for requests shorter than 5 years.
  • ~W Make request a weekly time series. This is the default for requests longer than 5 years.
  • ~M Make request a monthly time series. This is the default for requests longer than 10 years.
  • ~-nY Set start date for time series request to relative number of years ago.
  • ~-nQ Set start date for time series request to relative number of quarters ago.
  • ~-nM Set start date for time series request to relative number of months ago.
  • ~-nW Set start date for time series request to relative number of weeks ago.
  • ~-nD Set start date for time series request to relative number of days ago.
  • ~YYYY-MM-DD Set start date for time series request to absolute date.

  • ~:-nY Set end date for time series request to relative number of years ago.
  • ~:-nD Set end date for time series request to relative number of days ago.
  • ~:YYYY-MM-DD Set end date for time series request to absolute date.
  • ~@-nD Set single point time series request to relative number of days ago,
    i.e. the start date and end date of the time series request are the same point.

  • ~@YYYY-MM-DD Set single point time series request to absolute date,
    i.e. the start date and end date of the time series request are the same point.

  • ~NA=na-value Sets the value to be substituted for NaN
    (not a number, or not available condition) within the time series.
    Can be either a number, for example NA=1,
    or NA=NaN to set the IEEE standard value for NaN.
    By default the DIL handler substitutes 0 for NaN in time series
    (which are returned as arrays of doubles).
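To make the qualifier syntax concrete, here is a small Python sketch that splits an instrument string into its target and qualifiers. It is an illustration under stated assumptions, not the e4t code: the names parse_when and parse_instrument and the shape of the returned dictionary are invented here, relative units are accepted after ~: and ~@ even where the list above does not mention them, qualifiers are read left to right, and the doubled-tilde escape is not handled.

```python
import re
from datetime import date

RELATIVE = re.compile(r"-(\d+)([YQMWD])")   # e.g. -5Y, -3M, -10D
ABSOLUTE = re.compile(r"\d{4}-\d{2}-\d{2}")  # e.g. 2012-04-01


def parse_when(text):
    """Return ('relative', n, unit) or ('absolute', date) for a date spec."""
    match = RELATIVE.fullmatch(text)
    if match:
        return ("relative", int(match.group(1)), match.group(2))
    if ABSOLUTE.fullmatch(text):
        return ("absolute", date.fromisoformat(text))
    raise ValueError(f"unrecognised date spec: {text!r}")


def parse_instrument(instrument):
    """Split TARGET~QUALIFIER~... into a dictionary (hypothetical layout)."""
    target, *qualifiers = instrument.split("~")
    query = {"target": target}
    for qualifier in qualifiers:
        if qualifier in ("D", "W", "M"):
            query["frequency"] = qualifier           # daily, weekly, monthly
        elif qualifier.startswith("NA="):
            query["na"] = float(qualifier[3:])       # float("NaN") gives IEEE NaN
        elif qualifier.startswith(":"):
            query["end"] = parse_when(qualifier[1:])
        elif qualifier.startswith("@"):
            # Single point: start and end are the same.
            query["start"] = query["end"] = parse_when(qualifier[1:])
        else:
            query["start"] = parse_when(qualifier)
    return query
```

With a made-up target code, parse_instrument("MSFT~W~-5Y~:2012-04-01") yields a weekly request starting five years ago and ending on 1 April 2012.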

The code is here.

Python course at Codecademy

Today they wrote to me:

Today, we’re pleased to announce the arrival of a new programming language: Python!

Python is a great language with applications in many different fields. Its clean, readable syntax makes it a favorite for beginning programmers – say goodbye to all of those braces and semicolons.

Python is currently in use at places like Google, NASA, and Disney Animation. Also, it has an active community of developers and offers great module support – this means you can easily use code that others have written to accomplish all kinds of tasks!


Loading a shapefile into MySQL with ogr2ogr

With ogr2ogr you can load a shapefile directly into an existing database. The syntax is quite simple:

ogr2ogr -f "MySQL" MySQL:"ogr,user=root,host=localhost,password=root" -lco engine=MYISAM Comuni_01.shp

Patstat import scripts for MySQL (201204 version)

I have finally released the import scripts for MySQL. You can find them on GitHub, along with a little documentation on the (beautiful) GitHub page here.

Please note that this is for the April 2012 version. If you are importing another release of Patstat you will have to adapt the scripts accordingly.

coreutils are your friends

Once upon a time there was textutils, a small set of classic Unix utility programs to play with text files.

If you had to print out a file, or reverse it, or select lines, or edit the text stream, paginate it, wrap it, select the first or the last lines and so on, you could use those small commands of the Unix tradition, dating back to the good old days of Real Programmers.

Nowadays textutils are incorporated in a huge set called coreutils, along with a lot of other stuff.

They are powerful tools to know, even just for doing your data homework, so please get acquainted with them. You’ll be grateful forever.

We’ve already seen a simple example, and many others will come. But remember that coreutils are your friends.

Mac OS X environment variables

I’ve tried hard to understand the environment variables mechanism of Mac OS X. It seems a bit of a mess to me.

The way Mac OS X applications are launched in Aqua is different from the way they are launched in other UNIX windowing environments.

In standard UNIX all applications inherit their environment variables from the login shell.

In Mac OS X it is different. GUI applications, even those ported from Unix/Linux, do not run in the same process environment as an application launched in Terminal. This is inherited from NeXTSTEP. To compensate for this difference there is a ‘strange’ file named ~/.MacOSX/environment.plist.

In a freshly installed system neither the ~/.MacOSX directory nor the environment.plist file in it exists. You can create them with:

defaults write ${HOME}/.MacOSX/environment PATH "${HOME}/bin:/usr/bin:/bin:/usr/local/bin"

The same mechanism can be used to make MANPATH, INFOPATH, LC_CTYPE, and other environment variables available.

But this can interfere with settings from the usual UNIX files, such as the system-wide /etc/profile and /etc/csh.login, and the user’s ~/.profile, ~/.login and the like. You have to decide which of the two systems your settings will come from. I strongly recommend ~/.MacOSX/environment.plist, because its key/value pairs are saved in XML and are easy to edit and use: on the command line with the defaults and plutil commands or with PLTools from http://www.macorchard.com/PLTools, or “Mac like” with /Developer/Applications/Utilities/Property List Editor.app from Apple’s Mac OS X Developer Tools.
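Since the file is just a property list of key/value pairs, it can also be edited from a script. The following is a minimal sketch using Python's standard plistlib module. The function name set_environment_variable is made up for this example, and the path is passed in explicitly so that you can try it on a scratch file before touching ~/.MacOSX/environment.plist.

```python
import plistlib
from pathlib import Path


def set_environment_variable(plist_path, name, value):
    """Add or update one key in an environment plist and return all pairs."""
    plist_path = Path(plist_path)
    # Create the containing directory (e.g. ~/.MacOSX) if it is missing.
    plist_path.parent.mkdir(parents=True, exist_ok=True)

    environment = {}
    if plist_path.exists():
        with plist_path.open("rb") as handle:
            environment = plistlib.load(handle)  # reads XML or binary plists

    environment[name] = value
    with plist_path.open("wb") as handle:
        plistlib.dump(environment, handle)       # writes XML by default
    return environment
```

A call such as set_environment_variable(Path.home() / ".MacOSX" / "environment.plist", "MANPATH", "/usr/share/man") would then add a MANPATH entry next to the existing ones.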

Even applications run in X11 are directly affected by this mechanism because the X server itself now runs in Aqua, too.

From XLS to CSV

XLS format is often used to transport data. That’s a boorish behavior. XLS format should never be used to exchange data.

Never.

Most of the time a simple CSV file suffices. A friend gave me a two-thousand-line dataset like this:

GIS006003 ARI00601P usr9 4
GIS00300G ATD00302V usr8 l
GIS006003 ATD006019 usr10 6
GIS00700V APC007016 usr11 2

In XLS the file was over 20 MB. Even allowing for the half byte lost in the ASCII data representation, that is a 250-to-1 waste.

But sadly, the XLS format doesn’t only waste space, it seriously compromises the meaning of the data. It is not uncommon for numeric fields to be misinterpreted as text and hence left out of numeric operations like means or sums. You cannot know whether such an error has occurred unless you inspect the file cell by cell, and even then there are some nasty internal problems that cannot be spotted visually.

So please DON’T use the XLS format to exchange data. This is the first piece of good advice from your data char.

When you ask for data, please don’t accept XLS files unless you know how much care the sender puts into preparing such files.

The files you receive can be dramatically wrong: their state can range from ‘difficult to work out’ to ‘completely useless’. And you can bet you’ll waste time just getting to the data you need, well before you can use them.

And, worst of all, if you don’t do that work, you will never be sure of the quality of the data you’re working on.

Hence, less is better.

The CSV and TXT file formats don’t have these problems. (They surely have other problems, but ones that are much more compatible with your work.)

Once the data are translated to text it is much simpler to find possible errors in data fields and, even better, to be SURE that you’re working on a mistake-free file (at least as far as representation errors go).

For instance, to be SURE that the fourth column of the previous example doesn’t contain letters instead of digits, you can simply use the command

cut -c 33- example1.txt | xargs | sed 's/[ 0-9]\+//g'

That’s all. Now you know whether any non-number exists in the last field. (Obviously you should adapt the command from case to case.)

A little explanation for the brave. The command is a filter that operates line by line.

The first part of the command line (before the first vertical bar, commonly known as a pipe) ‘cuts’ each line, keeping it from the 33rd character onwards (I had to count the columns by hand). The contents of the last column are then lined up on a single line by the xargs command (a sort of side effect of this command, which is far more useful for other things), and the sed part deletes spaces and digits from the string, leaving in the result only what should not be there (letters or other characters).

Hence if the response is an empty string I’m sure I have only digits, as it should be, but if I get a non-empty string I have to search for the intruder line by line.

That isn’t difficult either. Just one more command. I can get the offending lines with

egrep -n '^.{32}[^ 0-9]' example1.txt

where the command simply finds a character that is neither a digit nor a space in the 33rd column of the file. The result could be something like:

2:GIS00300G ATD00302V usr8 l

where the line number is reported along with the full line containing the offending character.
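The same check can be sketched in Python for those who prefer a script to a pipeline. Note that this is a variation, not a translation: instead of counting character positions it splits each line on whitespace and tests the fourth field, and the function name non_numeric_rows is invented for this example.

```python
import re


def non_numeric_rows(lines, column=3):
    """Return (line number, line) pairs whose field at `column` is not all digits.

    `column` is zero-based, so 3 means the fourth whitespace-separated field.
    """
    offenders = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        # A missing field counts as an offender too.
        if len(fields) <= column or not re.fullmatch(r"[0-9]+", fields[column]):
            offenders.append((number, line.rstrip("\n")))
    return offenders
```

Called as non_numeric_rows(open("example1.txt")), it returns an empty list when every value in the fourth column is made of digits only.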

Even if you use the unmentionable data software™ for your data work (and your data char heartily recommends that you don’t), you should always use CSV or TXT files to exchange data with your peers and to make effective use of commands like the ones seen above.

Dragging off your data from XLS

Now that you are concerned about leaving your data in XLS files, it’s time to automate their extraction from that cage.

Do not use ‘Save As…’ in the unmentionable data software™, which is not, by any means, good at this. And it’s a drag to do, file by file, sheet by sheet.

So please move the XLS files to a platform where you can use Perl (more likely Unix, Linux or Mac OS X, but even the unmentionable operating system™) and use the xls2csv program by Ken Prows.

Using the command is very simple. The options are
-x : filename of the source spreadsheet
-b : the character set the source spreadsheet is in (before)
-c : the filename to save the generated csv file as
-a : the character set the csv file should be converted to (after)
-q : quiet mode
-s : print a list of supported character sets
-h : print help message
-v : get version information
-W : list worksheets in the spreadsheet specified by -x
-w : specify the worksheet name to convert (defaults to the first worksheet)

The following example will convert a spreadsheet that is in the WINDOWS-1252 character set (WinLatin1) and save it as a csv file in the UTF-8 character set.

xls2csv -x "1252spreadsheet.xls" -b WINDOWS-1252 -c "ut8csvfile.csv" -a UTF-8

This example will convert the worksheet named “Users” in the given spreadsheet.

xls2csv -x "multi_worksheet_spreadsheet.xls" -w "Users" -c "users.csv"

The spreadsheet’s charset (-b) will default to UTF-8 if not set.

If the csv’s charset (-a) is not set, the CSV file will be created using the same charset as the spreadsheet (which is not the best option, so always try to use UTF-8).
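To avoid typing the command file by file, the invocation can be wrapped in a few lines of Python. This is only a sketch: the function names xls2csv_command and convert_folder and the output naming scheme are assumptions of this example, while the -x, -c, -a, -b and -w options are the ones listed above. It assumes xls2csv is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def xls2csv_command(xls_path, worksheet=None, before=None, after="UTF-8"):
    """Build the xls2csv argument list for one spreadsheet (hypothetical helper)."""
    xls_path = Path(xls_path)
    # Name the output after the spreadsheet, plus the worksheet if one is given.
    stem = f"{xls_path.stem}_{worksheet}" if worksheet else xls_path.stem
    csv_path = xls_path.with_name(f"{stem}.csv")

    command = ["xls2csv", "-x", str(xls_path), "-c", str(csv_path), "-a", after]
    if before:
        command += ["-b", before]      # charset of the source spreadsheet
    if worksheet:
        command += ["-w", worksheet]   # otherwise the first worksheet is used
    return command


def convert_folder(folder, **options):
    """Run xls2csv on every .xls file in a folder, stopping at the first failure."""
    for xls_path in sorted(Path(folder).glob("*.xls")):
        subprocess.run(xls2csv_command(xls_path, **options), check=True)
```

For instance, convert_folder("data", before="WINDOWS-1252") would convert the first worksheet of each spreadsheet in a folder named data into a UTF-8 CSV file beside it.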

Some known problems of the program are:

  • It probably will not work with spreadsheets that use formulas. You should first create a sheet with the static content of the formula fields copied as numbers, and then extract that sheet.
  • A line in the spreadsheet is assumed to be blank if there is nothing in the first column.
  • Some users have reported problems trying to convert a spreadsheet while it was opened in a different application. You should probably make sure that no other programs are working with the spreadsheet while you are converting it.

The script is free software, and you can redistribute it and/or modify it under the same terms as the Perl interpreter itself.