Saturday, November 29, 2008

An Approach to Algorithm Parallelisation

The other day I came across the paper Parallelizing Complex Scans and Reductions lying on a colleague's desk. The first part of the paper discussed how to make a certain family of algorithms run faster on parallel machines and the second half of the paper went on to show how, with some work, the method could be stretched to a wider class of algorithm. What the authors seemed to miss was that the extra work really wasn't necessary and the methods of the first half apply, with no change, to the second half. But don't take this as a criticism! I learnt a whole new way to approach algorithm design, and the trick to making the second half easy uses methods that have become popular in more recent years. Doing a web search I found lots of papers describing something similar to what I did.

This is also a nice example of how the notion of abstraction in computing and the notion of abstraction in mathematics are exactly the same thing. But I'm getting ahead of myself.

So here a direct translation from he paper of some procedural code to find the largest sum that can be made from a subsequence of a sequence. This will be our first implementation of the problem examined in the second half of the paper:


> f1 [] (sofar,max) = (sofar,max)
> f1 (b:bs) (sofar,max) =
> let sofar' = if sofar+b<0
> then 0
> else sofar+b
> max' = if max<sofar'
> then sofar'
> else max
> in f1 bs (sofar',max')


sofar is a running sum that is reset each time it dips below zero, and max keeps track of the best sum so far. We initialise sofar and sum with 0 and -infinity. Here's an example of how to ue it:


> b :: [Double]
> b = [1..5] ++ [5,4..(-10)] ++ [(-2)..6]

> infinity :: Double
> infinity = 1/0
> test1 b = snd $ f1 b (0,-infinity)


Notice how we prime the algorithm with a starting vector. The 0 corresponds to the fact that at the start we've summed over 0 elements and the -infinity corresponds to the fact that we want the first sum computed to be the highest so far at that point.

Test the code with test1 b. We'll use a similar pattern all the way through this code.

You may see the problem with making this parallelisable: we are maintaining running sums so that the final values of sofar and max all depend on what was computed earlier. It's not obvious that we can break this up into pieces. sofar computes sums of subsequences between resets but chopping the array b into pieces might split such subsequences between processors. How can we handle this cleanly?

The first step is to write version two of the above function using max instead of conditionals:


> f2 [] (sofar,max) = (sofar,max)
> f2 (b:bs) (sofar,max) =
> let sofar' = Prelude.max (sofar+b) 0
> max' = Prelude.max max sofar'
> in f2 bs (sofar',max')

> test2 b = snd $ f2 b (0,-infinity)


But that doesn't appear to make things any easier, we've just buried the conditionals inside max, it doesn't make the serial dependency go away.

So let's solve another problem instead. In f2 I'll replace max with addition and addition with multiplication. 0 is the identity for addition so we should replace it with the identity for multiplication, 1. Similarly, -infinity is the identity for max so we should replace that with 0. We get:


> f3 [] (sofar,max) = (sofar,max)
> f3 (b:bs) (sofar,max) =
> let sofar' = sofar*b+1
> max' = max+sofar'
> in f3 bs (sofar',max')

> test3 b = snd $ f3 b (1,0)


That's all very well but (1) it looks no more parallelisable and (2) it's solving the wrong problem. Let's ignore problem (2) for now.

The thing that makes f3 easier to work with is that it's almost a linear function acting on the vector (sofar,max). Linear functions have one very nice property. If f and g are linear then we can compute f(g(x)) by acting with g first, and then applying f. But we can also compose f and g without reference to x giving us another linear function. We only have to know how f and g act on basis elements and we can immediately compute how f . g acts on basis elements. This is usually expressed by writing f and g as matrices. So let's tweak f3 so it's linear in its last argument:


> f4 [] (sofar,max,i) = (sofar,max,i)
> f4 (b:bs) (sofar,max,i) =
> let sofar' = (sofar * b) + i
> max' = max + sofar'
> i' = i
> in f4 bs (sofar',max',i')

> test4 b = let (_,max,_) = f4 b (1,0,1) in max


So now I need to write some code to work with linear functions. I'll do it in a very direct style. Here are some tuples representing basis vectors. (I could have written some fancy vector/matrix code but I don't want to distract from the problem in hand.)


> x,y,z :: Num a => (a,a,a)
> x = (1,0,0)
> y = (0,1,0)
> z = (0,0,1)


And here's some code that computes how a function acts on a basis, in effect finding the matrix for our function with respect to the basis x,y,z:


> matrix f = (f x,f y,f z)


Some simple operations on vectors:


> (a,b,c) .+ (d,e,f) = (a + d,b + e,c + f)
> a .* (b,c,d) = (a * b,a * c,a * d)


And now a little function that, given how f acts on basis elements, can apply f to any vector:


> apply (mx,my,mz) (sofar,max,i) = (sofar .* mx) .+ (max .* my) .+ (i .* mz)


Now we can redo the calculation with f4 by first making the matrix for f4, and then applying that to our starting vector.


> test5 b = let m = matrix (f4 b)
> (_,max,_) = apply m (1,0,1)
> in max


Note how by time we've computed m we've done almost all of the work even though the code hasn't yet touched (1,0,1).

But now comes the first bit of magic. We can split our list b into pieces. Compute the corresponding matrix for each piece on a separate processor, and then apply the resulting matrices to our starting vector.

Let's chop our list of reals into pieces of size n:


> chop n [] = []
> chop n l = let (a,b) = splitAt n l in a : chop n b


We'll use pieces of size 10:


> test6 b = max where
> (_,max,_) = foldr ($) (1,0,1) (reverse b_functions) where
> b_pieces = chop 10 b


The following maps should be replaced with a parallel version. It's easy to do this.


> b_matrices = map (matrix . f4) b_pieces
> b_functions = map apply b_matrices


Great, we've successfully parallelised the code, but it's the wrong algorithm. How can we use this to solve the correct problem? Remember how we replaced max with addition and addition with multiplication. We just have to undo that. That's all! Everything required to prove that the above parallelisation is valid applies over any semiring. At no point did we divide or subtract, and we only used elementary properties of numbers like a*(b+c) = a*b+a*c. That property holds for max and addition. In fact a+max(b,c) = max(a+b,a+c). We don't even have to modify the code. We can just define the max-plus semiring as follows:


> newtype MaxPlus = M Double deriving (Eq,Show)

> instance Num MaxPlus where
> fromInteger 0 = M (-infinity)
> fromInteger 1 = M 0
> fromInteger _ = error "no conversion from general integer"
> M a + M b = M (max a b)
> M a * M b = M (a+b)
> abs _ = error "no abs"
> signum _ = error "no signum"
> negate _ = error "no negate"


And now all we need is


> test7 b = test6 (fmap M b)


(I wonder if ghc is smart enough to completely eliminate that fmap M, after all, on a newtype, M should do zero work.)

And that's a completely parallelised version of the original algorithm.

There is a ton of optimisation that can be performed here. In particular, matrix applies a function to a fixed basis. For the particular function we're using here there's a big win from constant folding. The same constant folding applies in any semiring.

And back to the point I made at the beginning. By abstracting from the reals to a general semiring we are able to make the same code perform multiple duties: it can work on functions linear over many semirings, not just the reals. Mathematicians don't work with abstractions just to make life hell for students - they do so because working with abstract entities allows the same words to be reused in a valid way in many different contexts. This benefits both mathematicians and computer scientists.

Here's a link you can use to find out more on this technique.

Note that in real world usage you wouldn't use lists. -|chop|- would take longer than the rest of the algorithm.

PS A curious aside. I spent ages trying to get ghc to compile this code and getting my homebrew HTML markup code to work reliably on it. But eventually I located the problem. I've just returned from Hawai`i where I wrote most of this code. Just for fun I'd put my keyboard into Hawai`ian mode and forgot about it. When I did that, the single quote key started generating the unicode symbol for the Hawai`ian glottal stop. It looks just like a regular single quote in my current terminal font so it was hard to see anything wrong with the code. But, quite reasonably, ghc and many other tools barf if you try to use one where a regular quote is expected.

6 comments:

Anonymous said...

This reminds me of how the Floyd-Warshall algorithm can be used to compute shortest path, maximum flow, number of paths, probability of reaching destination, and many other things, depending on the weight semiring you use.

sigfpe said...

Yes, the Floyd-Warshall algorithm fits into the same general framework.

Incidentally, I wrote about how you can use a monad to implement a very general version of Floyd-Warshall here. This current problem can be implemented with that monad too - but it might be very hard to keep it efficient.

jkff said...

This is extraordinarily cool, thanks for posting this. Regretfully, I have nothing else substantial to add, but still :)

Cristian said...

Very nice article. Thank you so much!

It reminds me of Guy Blelloch, Prefix Sums and Their Applications

http://www.cs.cmu.edu/~guyb/papers/Ble93.pdf

sigfpe said...

Cristian,

That prefix sum article is great. There's a bunch of stuff in there I'm interested in. Thanks.

Edward Kmett said...

If you are looking for a more typeful derivation of this sort of thing, Dana Xu of ESC/Haskell fame wrote something similiar for her Masters' thesis. A paper based, which I believe is based on that thesis or formed the basis for it:

http://citeseer.ist.psu.edu/old/xu03typebased.html

Blog Archive