Thursday, September 25, 2008

Sling Blade Runner

ITA Software could be an interesting place to work at. They sure have some interesting puzzles on their career site serving as tools for resume selection / recruitment. One of such puzzles has just gone to their archives which is a big green flag for an open discussion.

The puzzle is called Sling Blade Runner, and it is specified as follows:

"How long a chain of overlapping movie titles, like Sling Blade Runner, can you find?"

Use the following listing of movie titles: MOVIES.LST. Multi-word overlaps, as in "License to Kill a Mockingbird," are allowed. The same title may not be used more than once in a solution. Heuristic solutions that may not always produce the greatest number of titles will be accepted: seek a reasonable tradeoff of efficiency and optimality.

Let's go and try to solve it!

The General Algorithm

The task suggests a reduction to the graph-traversal domain. By translating the specification to a bit more technical level, we can lay ground for an actual implementation. The highest level description of the proposed solution is as follows:

  1. Take the input of a set S of sentences

  2. Build the overlap-graph D of S which is a directed graph D(V, A); where

    • V=S,
    • A = {arc(s1, s2) if s2 overlaps s1 and both elements of S}
  3. Find the longest (vertex disjoint) path P in D

  4. Output P

Breaking down above steps takes us more closer to an actual realization. Following are the main steps in greater detail:

1. Take the input

The input is given as a stream of sentences (titles) separated by line breaks, where

  • a sentence is a non-empty sequence of words separated by non-breaking whitespace, and
  • a word is a non-empty sequence of non-whitespace characters.

For simplicity we won't care too much about different character sets and encodings and casing, and let the implementation environment decide about the proper meaning of whitespaces and characters.

What we do care about is that the input can possibly containt empty sentences that we ignore and multiple instances of the same sentence which we treat as a single value. Note that a sentence does not contain any of the whitespaces which appeared in the original line of input.

2. Build the overlap-graph

The meaning of "overlapping" is outlined in the original description of the task. A more formal definition would be that the sentence s2 overlaps sentence s1 iff any of the (non-empty) suffixes of s1 (including s1) is the prefix of s2 (including s2), where s1 is not s2.

The overlap-graph D then is made up by enumerating each possible p=(s1, s2) pair of S and putting an arc (s1, s2) into D for each p if s2 overlaps s1.

3. Find longest path

An L-long vertex disjoint path in the directed graph G is a sequence of vertices v[1], v[2], ..., v[L-1], v[L] with the following properties:

  • for each 1 <= i <= L, i != j it holds thats v[i] != v[j]
  • for each 1 <= i < L it holds that v[i] is the predecessor of v[i+1] in G (ie. there is an arc (v[i], v[i+1]) in G)

The longest vertex disjoint path P from a source vertex v in G is found by the following BFS-like recursive algorithm:

let n = the source of the search, initially n = v
let visited = vertices already visited by the algorithm, initially empty
define function longest_path(G, n, visited) as
if n is in visited then
return empty_list
for each successor m of n do
path_candidates[m] = longest_path(G, m, visited ++ n)
let mpc = one of the longest paths in path_candidates
return v ++ mpc // the list of nodes on one of the longest paths from n in G

To get the global longest path GLP in D, one of the best results of longest_path(D, n, empty_list) over each n in D must be chosen.

The "one of..." clauses are present because two distinct paths can have the same length.

4. Write output

Print vertices of GLP each vertex on a line.

5. Test

Distinct parts of the algorithm can be tested as separate units: overlap-detection, termination-of-traversal in a cyclic graph, proper graph-building, etc. -- but I'm gonna skip describing those. As far as testing the whole algorithm goes I'm only interested in two things:

  • That any results fed back to the input are given back unchanged.
  • That I would pick the most naive benchmark implementation and every optimization is subject give the same results as this benchmark.

These are the main concerns regarding the evolution of my implementations. I will present actual implementations in future posts, but I won't talk about testing: publishing of any code will imply a that it has passed above tests.

No comments: