I thought I would put my two cents into this game by describing a microbenchmark inspired by an actual task of mine: convert a file of delimited values to LIBSVM's data format. That is transform each line of
num1 num2 num3 num4 ...
to
num1 1:num2 2:num3 3:num4 ...
I'm omitting details about (LIB)SVM here, but this isn't relevant now. The point is that the first field is kept in the output, but the rest of them are prefixed with 'number of field in line minus one' and a colon. Note that columns may be omitted in the output, but field indexes must be ascending within a line (this is the so-called sparse matrix format).
Of course my first solution is in Python:
for line in file:
cols = line.strip().split()
print " ".join(["%i:%s" %(i, x.strip())
for (i, x) in enumerate(cols)])[2:]
For a file with 100k lines and 128 columns, this runs in 38 seconds on my WINDOWS box.
Not too bad, but there is an itching disturbance, which does not let the urge to rewrite it in awk escape me. I never wrote a single line in awk, so this should be a good start.
And lo, my first awk program:
{
printf("%s", $1)
for (i = 2; i <= NF; i++)
{
printf(" %i:%s", i-1, $i)
}
printf("\n")
}
Let's run it on my WINDOWS box, under CYGWIN, with the same 100k/128cols file as above.
...
Ouch, this took more than 5 and a half minutes!! Shame for awk, go Python. Still it sounds like nonsense. Let's see. Both Python and awk are interpreted, so iterations are supposed to be slow(?) The Python code iterates over the columns with a library string function (join), which is implemented in C so it's blazing. awk uses 'for' extensively, so let's optimize it by unwinding the loop:
{
print $1, "1:"$2 , "2:"$3 , "3:"$4 , [...]
[...] , "125:"$126 , "126:"$127 , "127:"$128
}
This is a common technique used in compiler optimization, see how it works in this case. It WORKS, it runs MUCH faster. Same test file, same WINDOWS box and CYGWIN, 45 seconds. Wow, a 7-fold speedup, but still slower than Python. Is this for real? I don't think so.
Enter GNU/Linux, just for kicks, another box. The Python program runs in 31 seconds, and hey, wait a minute, we didn't have to wait a minute: the awk program is done in 17 seconds. I say, wow...
In a problem this little the use of Python was not too justified. In awk you even get mixable standard input / file handling for free, whereas in Python you must implement it by yourself. But when script portability is a concern, you still should consider Python even for a trivial task like this, because you could get more balanced performance. Of course you could give a try to Psyco, the Python optimizer, but you will have another dependency, and it may as well turn out that you're performance will even be much worse! More on this later.
Right tool for the right task, but know you environment too.
2 comments:
Not a fair comparison, everything under cygwin is slooow.
cygwin probably isn't the issue: they didn't say - or check? - whether they were even using the same implementation and version of awk between the 2 platforms
Post a Comment