Friday, December 12, 2008

Linux From Scratch -- Revisited

Stuff me in a pink gown and call me Paris Hilton.
Last time my LFS build failed it wasn't even binutils but glibc! After 15 minutes of RMTFing I realized that it's called configparms not configparams, dummy.

Tuesday, December 2, 2008

Linux From Scratch -- Failure

Oh well, binutils build has failed. I suppose it has something to do with GCC Pass 1? The wording about the first GCC compilation in the book is a bit misleading, because they are talking about bootstrapping but the make target nowhere tells it to bootstrap.
I guess I will have to come back to this one and contact the good guys on the mailing list. Of course I will do my RTFM before flooding public channels.

Saturday, November 29, 2008

Linux From Scratch -- Host System

So I have downloaded the minimal LFS LiveCD and popped it into my virtual machine under VMWare Player.

As it is advertised it is definitely a valid host environment for the build process, as checked by version-check.sh

After setting up my main file systems, we need to get the source packages. There is a wget input file in the stable book download directory which comes very handy. My slow internet connection makes me have to wait for all of them to arrive.

Linux From Scratch -- Introduction

I've been a long time casual Linux/Unix user. Back before Ubuntu was hype, I got into Gentoo, but never really got too far beyond installing it and emerging some packages. At my work we use Debian, and on my Vista I sure have Cygwin cause bash and the command line stuff is awesome.

But it always bothered me that I'm not really into Linux. Not into like "I use Ubuntu for my daily browsing and document editing chores" or "I'm the sysadmin of a 50-server heterogeneous *nix farm" but somewhere in between.

I'm pretty much tied to Windows as my main working environment, and can't really afford the hassle to migrate all my stuff to Linux (yeah, going from XP 32-bit to Vista 64-bit was a big enough deal already, thank you), so what do I do?

This is where Linux From Scratch comes into play. I mean. If I just really could build a whole Linux system from scratch, well it would really mean at least something, right?

Well. Jump right into it.

Because I'm a newbie, I start with the latest stable version of the LFS book.

First of all I need a host systems which is a simple 2.6 generic Linux VMware Image created by EasyVMX! and booted up with the LFS LiveCD. Since the latest book version is 6.4 and the latest LiveCD version is 6.3 and I don't need no X and stuff, I'm gonna pick the 32-bit "-min" image.

Since I have a slow internet connection, I guess I will have to hook up on things later.

Thursday, September 25, 2008

Sling Blade Runner

ITA Software could be an interesting place to work at. They sure have some interesting puzzles on their career site serving as tools for resume selection / recruitment. One of such puzzles has just gone to their archives which is a big green flag for an open discussion.

The puzzle is called Sling Blade Runner, and it is specified as follows:

"How long a chain of overlapping movie titles, like Sling Blade Runner, can you find?"

Use the following listing of movie titles: MOVIES.LST. Multi-word overlaps, as in "License to Kill a Mockingbird," are allowed. The same title may not be used more than once in a solution. Heuristic solutions that may not always produce the greatest number of titles will be accepted: seek a reasonable tradeoff of efficiency and optimality.

Let's go and try to solve it!

The General Algorithm

The task suggests a reduction to the graph-traversal domain. By translating the specification to a bit more technical level, we can lay ground for an actual implementation. The highest level description of the proposed solution is as follows:

  1. Take the input of a set S of sentences

  2. Build the overlap-graph D of S which is a directed graph D(V, A); where

    • V=S,
    • A = {arc(s1, s2) if s2 overlaps s1 and both elements of S}
  3. Find the longest (vertex disjoint) path P in D

  4. Output P

Breaking down above steps takes us more closer to an actual realization. Following are the main steps in greater detail:

1. Take the input

The input is given as a stream of sentences (titles) separated by line breaks, where

  • a sentence is a non-empty sequence of words separated by non-breaking whitespace, and
  • a word is a non-empty sequence of non-whitespace characters.

For simplicity we won't care too much about different character sets and encodings and casing, and let the implementation environment decide about the proper meaning of whitespaces and characters.

What we do care about is that the input can possibly containt empty sentences that we ignore and multiple instances of the same sentence which we treat as a single value. Note that a sentence does not contain any of the whitespaces which appeared in the original line of input.

2. Build the overlap-graph

The meaning of "overlapping" is outlined in the original description of the task. A more formal definition would be that the sentence s2 overlaps sentence s1 iff any of the (non-empty) suffixes of s1 (including s1) is the prefix of s2 (including s2), where s1 is not s2.

The overlap-graph D then is made up by enumerating each possible p=(s1, s2) pair of S and putting an arc (s1, s2) into D for each p if s2 overlaps s1.

3. Find longest path

An L-long vertex disjoint path in the directed graph G is a sequence of vertices v[1], v[2], ..., v[L-1], v[L] with the following properties:

  • for each 1 <= i <= L, i != j it holds thats v[i] != v[j]
  • for each 1 <= i < L it holds that v[i] is the predecessor of v[i+1] in G (ie. there is an arc (v[i], v[i+1]) in G)

The longest vertex disjoint path P from a source vertex v in G is found by the following BFS-like recursive algorithm:

let n = the source of the search, initially n = v
let visited = vertices already visited by the algorithm, initially empty
define function longest_path(G, n, visited) as
if n is in visited then
return empty_list
for each successor m of n do
path_candidates[m] = longest_path(G, m, visited ++ n)
let mpc = one of the longest paths in path_candidates
return v ++ mpc // the list of nodes on one of the longest paths from n in G

To get the global longest path GLP in D, one of the best results of longest_path(D, n, empty_list) over each n in D must be chosen.

The "one of..." clauses are present because two distinct paths can have the same length.

4. Write output

Print vertices of GLP each vertex on a line.

5. Test

Distinct parts of the algorithm can be tested as separate units: overlap-detection, termination-of-traversal in a cyclic graph, proper graph-building, etc. -- but I'm gonna skip describing those. As far as testing the whole algorithm goes I'm only interested in two things:

  • That any results fed back to the input are given back unchanged.
  • That I would pick the most naive benchmark implementation and every optimization is subject give the same results as this benchmark.

These are the main concerns regarding the evolution of my implementations. I will present actual implementations in future posts, but I won't talk about testing: publishing of any code will imply a that it has passed above tests.

Friday, September 12, 2008

Google Chrome -- The Evil User Interface

I really can't believe things went so far that I'm going to put this kind of rant in ink. I really have to state that I'm still gonna take the Google pill over Microsoft on any day, but heck, maybe they don't learn from the master that well after all. I'm talking about Google Chrome. It may well be that Google did revolutionize things on the server side and took on very well on the web-client side, but God are they unprofessional on the desktop. Unprofessional may not be the most suitable word here, but if they are doing those amateurish things on the desktop on purpose, well, that just makes it sound downright evil.

First of all, I didn't even keep Google Desktop on my machine for more than an hour because frankly, having a web server to let me get easier access to my desktop kinda sounds freaky to me. Turns out that I wasn't overly paranoid at all.

Now they came out with a browser, the user interface of which isn't a very likely candidate to be served by, well a web server -- and I was truly frantic about it. Even got sold on the Chrome comic in a snap. Soon as it was available to download I gave it a run, even took a look at the source code (though not as deep as Scott Hanselman here -- more on this in a minute). I was truly excited only to find out in five minutes that this stuff is not for me, and I believe it's not even for the casual user. Of course there is at least one solid conspiracy theory stating that this browser is not meant for the users. Anyhow, I think that messing up the user experience won't really help a product to be embraced by anybody regardless of the creators' original intent.

All my enthusiasm just vaporized in 30 seconds after the download. Why, the installer shoves the damn thing under "%USERPROFILE%\Local Settings\Application Data". I mean they put the program into a data folder, which not only makes it friggin counterintuitive but raises a bunch of management issues in any sanely managed environment, be it a shared family computer or a corporate environment. (Yeah I know Google support's this von Neumann mindset of everything being just data, but hey...)

Next thing is the implementation of the user interface. Chrome gives you a bigger client area which is good, but this obviously means the removal of the nonclient stuff like the title bar, which is nice for an instant messenger's buddy window or a media player but is kinda frivolous for a web browser which you constantly have in the foreground. It kills interoperability with a whole slew of applications like WindowBlinds or GridMove, and not playing nice with other applications won't make good to your reputation in the hood.

The last thing might sound like nitpicking -- that is even more nitpicking than the previous ones :) but as I've said I went to see the Chrome source through the eyes of Scott Hanselman and found out that Google went down the too-clever path and released a client software which relies on undocumented features of an operating system. Call me a old-fashion but I think this wouldn't even be appropriate for a tiny 2-man shop, let alone for GOOG. It's just bad karma.

Tuesday, September 9, 2008

Microsoft Cloud

The Microsoft Cloud is to be seen drifting by. There has been a lot of buzz about clouds in general and about Microsoft's upcoming cloud strategies (see this very neat overview) Of course lot of these conversations have been vague, which is understandable in light of MS being very quiet about technical details regarding the Cloud. I mean, Their Cloud, because it very well seems like they're coming late to the game just to try and pull an IBM on the other players -- which I think would come together for them very nicely.
Why do I think so?
Well. Coupla months into the PDC and David Chappell comes up with a Clouding Platforms Paper and it finally makes sense. MS is going to give a flying damn about cloud services. They're building a platform, baby. You see? What's better than locking you into a development environment than ultimately locking you into a deployment environment? Hell, they may even give away free cookies, developer-developer-developer t-shirts and "Microsoft Patterns and Practices" handouts, oh just come and run your application straight on their infrastructure.
And oh why, they're pushing DLR like crazy which would make enterprise development even more opaque: expect hardcore rubysts go crazy when their code is ultimately put into the same appdomain with some cheesy vb.net code -- and communicating.

Nevertheless. I guess I've never been looking forward this much to the PDC (if ever) -- mainly because Raymond Chen and secondly because of the unveiling of the MS cloud stuff.

Monday, June 9, 2008

You still have to sell me on WPF

There is an interesting tendency visible on ITJobsWatch regarding Windows Presentation Foundation. It's that the demand for WPF workforce has grown significantly in the past 6 months (as of 2008Q2). I don't really got around peeking into the actual job descriptions there but three quarters of a year has grown the demand for WPF-developer up to 1,2%. Compare this do WinForms' (plumetting) 3,9% you may wonder if WPF is really ready to replace WinForms or maybe it has so many pioneering features that it generates a whole lot of new expectations about user interfaces. I must say that I Don't Thnik So. Of course I'm biased towards Line Of Business applications, I'm really hesitant about WPF when it comes to LOB because I can't really believe that people won't miss WinForm's DataGridView, the tabbed interfaces, the Masked Textbox or the MDI. No sir. It's true that there has been a lot of debate on WPF LOB usage and WPF vs. Winforms and WPF credibility and a lot of promoting of WPF (albeit some spectators say not enough of it) from Microsoft, and I believe that WPF is still sorrounded by a lot of FUD, and I don't really want to name the guilty parties. Let me just highlight one of the highlight featues of WPF-based development: the fact that designers can work more closely with programmers on a WPF project. I can't see why and how this could be a very big achievement in a time where people have learned that designers don't give a heck about functionality and programmer's don't give a heck about user interface design. Not because they are evil, simply because there is a whole impendance mismatch between being the designer-world and the developer-world (and the software usability world before we forget what the game is about) so unless you can spend a lot of money on designing and usability testing your user interface, you're effectively busted. And if you're really busted, it doesn't make too much of a difference which technology you actually use.

Of course there's a lot of value in WPF which has soon to be explored, but I think that we're not there yet. It's really cool tho to have a framework which encourages separation of concerns in an area really really prone to produce ugly and smelly code.

UI design for programmers

I've long been wanting to read User Interface Desing For Programmers by Joel Spolsky. Which is absolutely positively a must-read for every developer who is concerned about developing great user interfaces. Start by learning that "great" == "highly usable" not "bells and whistles all over the place".

The only issue I had with the book was that somehow I thought it indicates that copying an already estabilished design could do you more good than trying to invent something really cool. The notable example here is the Microsoft Office user interface which is the product of hundreds thousand dollars of careful design and usability screening at Microsoft, which magically justifies of Office UI as great. While this might be true, I really don't want people to believe that The One Microsoft Way is always the great way, because Microsoft employees are people too and they can produce shitty user interaction design too.

Anyways, UIDfP is a great book, and if you're interested even more in UI design, then you should definitely also read About Face too.

Friday, May 23, 2008

I Love Emacs

I bought a new laptop last week, and when migrating my data over from the old one, I did realize that information vital to manage my stuff (like time and money and projects and stuff) is scattered around in so many places (Outlook, plain text files, ToDoList, AceMoney, various timesheets in Excel, etc) that it's starting to look unprofessional. :-)
I must admit that I was content with AceMoney, but my project/time-tracking system was what bothered me:
- Use a ToDoList to decompose my projects to a tree of tasks.
- Track my project-time in a Word table (I know it's cheesy, but Word let's you enter current time very quick)
- Cross-reference ToDoList's autogenerated ID's with Word's table
- Report with a little WinForms program, using AntiWord to extract the Word table.
Boy, there's a whole lot of dependencies there, and a whole lot of clumsyness, but well, that's what I got used to.
And that's all past now, because I've found out about Emacs's Org Mode and I'm the happiest person alive to do organize *ALL* my stuff and clock my time with it.

Plus, becoming the Emacs-enthusiast I always only hoped I would become, I'm even migrating my personal accounting from AceMoney to Ledger which has an Emacs mode.

I must say again, that I was content with AceMoney itself, and I think it's a very handy and clean application. For personal finance, it's far better I my book than jGnash or GnuCash, BUT Ledger is the ultimate lean-and-mean solution for you if you live by "less dependencies and more openness the better"-code. And I think you should.

Thursday, May 15, 2008

Jython StringTemplate un(shallow)copyable

I've started porting to Jython a Python application, which uses the Python version of StringTemplate. Right after the start I've bumped into an error when using StringTemplateGroup.
The Python code looks like this:

def go():
g = stringtemplate.StringTemplateGroup("A","B")
g.getInstanceOf("A_TEMPLATE")

When calling go() from Java/Jython, the following exception is raised:

Exception in thread "main" Traceback (innermost last):
...
...
File "C:\DEV\stringtemplate3\groups.py", line 363, in getInstanceOf
File "C:\DEV\stringtemplate3\templates.py", line 391, in getInstanceOf
File "C:\DEV\stringtemplate3\templates.py", line 369, in dup
File "C:\DEV\jython2.2.1\Lib\copy.py", line 78, in copy
Error: un(shallow)copyable object of type

Others have bumped into this error too. It's because Jython doesn't fully implement object copying (in the same sense that CPython doesn't implement it fully -- you cannot just copy arbitraty C objects). It turns out, that this isn't a problem for me, because as soon as I've changed the StringTemplateGroup error listener to my own (copyable) listener, the problem went away.

But there's one caveat. I actually choose to NOT use any custom error listener, because console errors are okay for me, so I've tried and put errors=None in the StringTemplateGroup constructor, which is a no-no, because None means "use the default one", which is exactly what I didn't want right now.

So the final solution was to redefine go like so:

def go():
g = stringtemplate.StringTemplateGroup("A","B")
g.errorListener = None
g.getInstanceOf("A_TEMPLATE")

Works like a charm, so far.

Sunday, May 4, 2008

Know (not) One Editor

I've just started to read the original Pragmatic Programmer book (talk about late-adopting, when did it just came out? last decade??) and there truly is some clever stuff therein. But the nitpicker I am, I've immedieately found something I don't really agree with: the One Editor Tip (#22) The authors say they think is better to know one editor very well, and use it for all editing tasks, which I personally don't think is too clever. At least for me. Because you already know that I Love Emacs, you should also know that every tool has it's place and I don't really want to give up Eclipse's or Visual Studio's services so well tailored to Java/C#. Pragmatic Programmer comes up with this very contrived example, where they indicate that being able to alphabetically sort import statements automatically is a major programmer productivity booster. Give me a break. Use Emacs to organizer your life but use a Java IDE that understands your code, refactors it, generates stubs for you, and such.

Of course I'm just totally mean here, because back in '99 Eclipse and VS weren't even conceived. What I really wanted to point out was, that the more advanced technology becomes, the more specialized knowledge you have to embrace. Back in the days, you were wicked cool with Emacs as an IDE. Nowadays you should switch if you want to avoid unneeded complexity when working with enterprise(y) stuff. And you better know some nano/joe/vi cause it's damn sure not all of your clients will have emacs installed on their machine you have to quickfix *now*

Friday, April 25, 2008

awk vs. Python performance

There are a lot of debates where people (try to) compare the relative performance of programming languages. These conversations range from producing extensive benchmarks to coming up with esoteric metrics about code complexity/maintainability to deriving even more esoteric metrics from (perceived) language/platform popularity.

I thought I would put my two cents into this game by describing a microbenchmark inspired by an actual task of mine: convert a file of delimited values to LIBSVM's data format. That is transform each line of

num1 num2 num3 num4 ...

to

num1 1:num2 2:num3 3:num4 ...

I'm omitting details about (LIB)SVM here, but this isn't relevant now. The point is that the first field is kept in the output, but the rest of them are prefixed with 'number of field in line minus one' and a colon. Note that columns may be omitted in the output, but field indexes must be ascending within a line (this is the so-called sparse matrix format).

Of course my first solution is in Python:

for line in file:
cols = line.strip().split()
print " ".join(["%i:%s" %(i, x.strip())
for (i, x) in enumerate(cols)])[2:]

For a file with 100k lines and 128 columns, this runs in 38 seconds on my WINDOWS box.

Not too bad, but there is an itching disturbance, which does not let the urge to rewrite it in awk escape me. I never wrote a single line in awk, so this should be a good start.
And lo, my first awk program:

{
printf("%s", $1)
for (i = 2; i <= NF; i++)
{
printf(" %i:%s", i-1, $i)
}
printf("\n")
}

Let's run it on my WINDOWS box, under CYGWIN, with the same 100k/128cols file as above.

...

Ouch, this took more than 5 and a half minutes!! Shame for awk, go Python. Still it sounds like nonsense. Let's see. Both Python and awk are interpreted, so iterations are supposed to be slow(?) The Python code iterates over the columns with a library string function (join), which is implemented in C so it's blazing. awk uses 'for' extensively, so let's optimize it by unwinding the loop:

{
print $1, "1:"$2 , "2:"$3 , "3:"$4 , [...]
[...] , "125:"$126 , "126:"$127 , "127:"$128
}

This is a common technique used in compiler optimization, see how it works in this case. It WORKS, it runs MUCH faster. Same test file, same WINDOWS box and CYGWIN, 45 seconds. Wow, a 7-fold speedup, but still slower than Python. Is this for real? I don't think so.

Enter GNU/Linux, just for kicks, another box. The Python program runs in 31 seconds, and hey, wait a minute, we didn't have to wait a minute: the awk program is done in 17 seconds. I say, wow...

In a problem this little the use of Python was not too justified. In awk you even get mixable standard input / file handling for free, whereas in Python you must implement it by yourself. But when script portability is a concern, you still should consider Python even for a trivial task like this, because you could get more balanced performance. Of course you could give a try to Psyco, the Python optimizer, but you will have another dependency, and it may as well turn out that you're performance will even be much worse! More on this later.

Right tool for the right task, but know you environment too.

Monday, February 25, 2008

Shake Ya Tailfeathers (A Haskell Puzzle)

Deciding that the Scala bandwagon is far too hot right now (in the sense that I should wait a bit till I'll be eligible to late-adopt it) I've started learning Haskell instead.

After reading and processing the first some chapters of YAHT and still not getting CPS, I nevertheless wanted to produce something, so came up with this:

It's a puzzle I've learned a long time ago called Tailfeather.
Tailfeather is a number-sequence, where the very first element is lress than the last one, the second element is lress than the second to last element and so on.
The task is to write a function with the specification:

tailfeather :: Int -> Int
{- tailfeather n = b
where 'b' > 1 is the smallest radix in which
the digits written of 'n'
form a Tailfeather sequence of numbers
-}

A very nice, very entry-level stuff. And the very beauty of it all is that implementing this solution was absolutely straighforward: I was able to focus on the actual problem without worrying about the underlying details (like how to represent a number-sequence -- is that an ArrayList or and array?)
Actually there are two sub-problems: A) rewriting a decimal number to another radix and B) determining whether a sequence is a Tailfeather.

Starting with the latter subtask, all you have to do is break a list into three parts, and solve the task recursively, like so:

tailfeather0 (x:xs) =
(x < last xs) &&
(tailfeather0 (init xs))

First we check if the Tailfeather property holds for the first and the last elements, then cut those elements from the list (that is take the init of the tail) and check the remaining part.

Of course the recursion must be terminated (and this next code MUST appear before the previous fragment):

tailfeather0 [] = True
tailfeather0 [x] = True

DONE.

Let's see problem A. If you don't know the proper algorithm for converting from decimal to another radix, it is given here (if the site is down use Wikipedia).
My first naive rewrite of the example provided on the linked page yielded the following:

convert2base2 0 = []
convert2base2 n = (mod n 2) : convert2base2 (div n 2)

Looks tight, the only problem is that it returns the digits in reverse order (starting with the least significant bit). Reversing the list after the whole computation would be an option, instead I do it in place:

convert2base2 n = convert2base2 (div n 2) ++ [mod n 2]

Generalizing the converter to accept any base (don't use this one with b <= 1):

convert2base _ 0 = []
convert2base b n = convert2base b (div n b) ++ [mod n b]

Now that we have our building blocks, let's combine them:

tailfeather n = tailfeather1 2 n

tailfeather1 b n =
if (b == n)
then b+1
else if tailfeather0 (convert2base b n)
then b
else tailfeather1 (b+1) n

That's it.