Parallel algorithms

Let's begin with a divide-and-conquer algorithm for finding the sum of the elements of a vector of numbers:

(define vector-sum
  (lambda (vec)
    (let ((len (vector-length vec)))
      (if (zero? len)
          0
          (let helper ((start 0)
                       (finish len))
            (if (= (+ start 1) finish)
                (vector-ref vec start)
                (let ((mid (quotient (+ start finish) 2)))
                  (let ((left (helper start mid))
                        (right (helper mid finish)))
                    (+ left right)))))))))

The null vector is split off as a special case, so that the main divide-and-conquer procedure helper can take it as a precondition that start is strictly less than finish. The helper procedure finds the sum of the elements in the section of the vector that lies between position start (inclusive) and position finish (exclusive). If there is only one element in that section, helper returns its value; otherwise, helper bisects the section, finds the sum of each half, and returns the sum of those two partial sums. When helper is initially invoked, start is 0 and finish is the length of the vector, so the value finally returned is the sum of all of the elements of the vector.

In standard Scheme, there is no particular advantage in using the bisection approach: To find the sum of the elements of a vector of length n, the divide-and-conquer algorithm still has to do n - 1 additions. This is not a great improvement over the n additions that are performed when we make a linear pass over the vector to accumulate a sum:

(define vector-sum
  (lambda (vec)
    (let ((len (vector-length vec)))
      (let loop ((index 0)
                 (total 0))
        (if (= index len)
            total
            (loop (+ index 1) (+ total (vector-ref vec index))))))))

However, imagine that we have an extended implementation of Scheme, designed for use on computers with processors that can perform different computations at the same time. This implementation supports an additional kind of Scheme expression, the parallel-let expression, which is just like the let expression except that the values of the expressions in its binding specifications are computed simultaneously, in parallel, on separate processors. (The body of a parallel-let expression is not evaluated, of course, until all of the processors have reported back and the values they return bound to the corresponding identifiers.)

What happens to the running time of the first version of vector-sum if we change the final let-expression into a parallel-let expression?

(define vector-sum
  (lambda (vec)
    (let ((len (vector-length vec)))
      (if (zero? len)
          0
          (let helper ((start 0)
                       (finish len))
            (if (= (+ start 1) finish)
                (vector-ref vec start)
                (let ((mid (quotient (+ start finish) 2)))
                  (parallel-let ((left (helper start mid))
                                 (right (helper mid finish)))
                    (+ left right)))))))))

Now the values of (helper start mid) and (helper mid finish) be computed at the same time instead of one after the other. The total time for the evaluation of both of these expressions is the larger of their separate running times, rather than the sum of their separate running times. Since, we get this advantage at each level of recursion, the overall running time is now proportional not to the length of the vector but to the number of bisections required to reduce that size to 1 -- in other words, to the base-two logarithm of the length of the vector. If the vector is long enough (and if there are enough processors to accommodate the demand -- in theory, we might need one for each element of the vector), this technique yields a significant reduction in running time.

Vector-sum computes only the overall sum of the elements of the vector. Suppose that we want to compute and store the partial sums along the way, so that (for example) if we start with the vector #(3 1 4 1 5 9), we get not just the overall sum, 23, but the vector of partial sums, #(3 4 8 9 14 23). Of course, it's easy to do this in an amount of time proportional to the length of the vector:

(define accumulate-vector
  (lambda (vec)
    (let* ((len (vector-length vec))
           (result (make-vector len)))
      (let loop ((index 0)
                 (total 0))
        (if (= index len)
            result
            (let ((new-total (+ total (vector-ref vec index))))
              (vector-set! result index new-total)
              (loop (+ index 1) new-total)))))))

It might seem obvious in this case that one can't improve on this running time, but in fact parallelism again allows us to speed up the process.

Let gap be the greatest power of 2 less than the length of the vector. First, construct a new vector that is obtained from vec by adding to each element the element that is gap positions to its left. (An element that is in one of the first gap positions of vec, so that there is no element gap positions to its left, should simply be copied into the new vector.)

Here's how this step looks for our sample vector:

gap = 4
vec                                     result of first step
+-----+-----+-----+-----+-----+-----+   +-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |   |     |     |     |     |     |     |
|  3  |  1  |  4  |  1  |  5  |  9  |   |  3  |  1  |  4  |  1  |  8  |  10 |
|     |     |     |     |     |     |   |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+   +-----+-----+-----+-----+-----+-----+

Now halve the gap and repeat the process, starting from the result of the first step:

gap = 2
result of first step                    result of second step
+-----+-----+-----+-----+-----+-----+   +-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |   |     |     |     |     |     |     |
|  3  |  1  |  4  |  1  |  8  |  10 |   |  3  |  1  |  7  |  2  |  12 |  11 |
|     |     |     |     |     |     |   |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+   +-----+-----+-----+-----+-----+-----+

Halve the gap again and repeat, starting from the result of the previous step:

gap = 1
result of second step                   result of third step
+-----+-----+-----+-----+-----+-----+   +-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |   |     |     |     |     |     |     |
|  3  |  1  |  7  |  2  |  12 |  11 |   |  3  |  4  |  8  |  9  |  14 |  23 |
|     |     |     |     |     |     |   |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+   +-----+-----+-----+-----+-----+-----+

In this particular case, we can stop after the third step; in general, we'd keep repeating the process until the step in which gap is 1 has been completed. The number of repetitions is the base-two logarithm of the length of the vector (rounded up to the next integer).

This may seem kind of magical, but there is an invariant that is preserved from step to step and guarantees the correctness of the final result. Express each of the (zero-based) positions in the vector as a binary numeral in i bits, where i is the base-two logarithm of the initial gap. After k steps, the element at position j in the current vector is the sum of all of the elements at positions less than or equal to j of vec whose position numbers differ from j only in their first k bits.

Initially, this invariant holds trivially (with k = 0), since any two distinct position numbers have different binary numerals, so the invariant requires only that each element is the sum of itself. At each subsequent step, the invariant is preserved, since what is added to the element at position j in the kth step is exactly the sum of the elements farther to the left whose position numbers differ from j in their kth bit.

Even though the number of steps is the base-two logarithm of the length of the vector, it might seem that this approach is bound to be slower, since each step involves building an entirely new vector, of the same length as the original. But the construction of each new vector can be parallelized: We can build a new vector of length n in an amount of time that is proportional to the base-two logarithm of n, using the same divide-and-conquer technique as in vector-sum. Here's what the entire process looks like, in our extended implementation of Scheme (we'll assume as a precondition that vec is not the null vector):

(define accumulate-vector
  (lambda (vec)
    (let* ((len (vector-length vec))
           (initial-gap (let seek ((trial 1))
                          (let ((double (* trial 2)))
                            (if (< double len)
                                (seek double)
                                trial)))))
      (let loop ((gap initial-gap)
                 (current vec))
        (if (zero? gap)
            current
            (let ((new (make-vector len)))
              (let helper ((start 0)
                           (finish len))
                (if (= (+ start 1) finish)
                    (vector-set! new start
                      (if (< start gap)
                          (vector-ref current start)
                          (+ (vector-ref current start)
                             (vector-ref current (- start gap)))))
                    (let ((mid (quotient (+ start finish) 2)))
                      (parallel-let ((left (helper start mid))
                                     (right (helper mid finish)))
                        (loop (quotient gap 2) new)))))))))))

Provided that each call to make-vector takes a constant amount of time regardless of the length of its result (which is plausible, since no initialization is needed), the running time of this algorithm as a whole is proportional to the square of the base-two logarithm of the length of vec. Once again, this is an improvement on the straightforward linear version if vec is long enough.


This document is available on the World Wide Web as

http://www.cs.grinnell.edu/~stone/courses/operating-systems/parallel-algorithms.xhtml

created October 26, 2000
last revised October 26, 2000

John David Stone (stone@cs.grinnell.edu)