Let's begin with a divide-and-conquer algorithm for finding the sum of the elements of a vector of numbers:
(define vector-sum
(lambda (vec)
(let ((len (vector-length vec)))
(if (zero? len)
0
(let helper ((start 0)
(finish len))
(if (= (+ start 1) finish)
(vector-ref vec start)
(let ((mid (quotient (+ start finish) 2)))
(let ((left (helper start mid))
(right (helper mid finish)))
(+ left right)))))))))
The null vector is split off as a special case, so that the main
divide-and-conquer procedure helper can take it as a
precondition that start is strictly less than
finish. The helper procedure finds the sum of
the elements in the section of the vector that lies between position
start (inclusive) and position finish
(exclusive). If there is only one element in that section,
helper returns its value; otherwise, helper
bisects the section, finds the sum of each half, and returns the sum of
those two partial sums. When helper is initially invoked,
start is 0 and finish is the length of the
vector, so the value finally returned is the sum of all of the elements of
the vector.
In standard Scheme, there is no particular advantage in using the bisection approach: To find the sum of the elements of a vector of length n, the divide-and-conquer algorithm still has to do n - 1 additions. This is not a great improvement over the n additions that are performed when we make a linear pass over the vector to accumulate a sum:
(define vector-sum
(lambda (vec)
(let ((len (vector-length vec)))
(let loop ((index 0)
(total 0))
(if (= index len)
total
(loop (+ index 1) (+ total (vector-ref vec index))))))))
However, imagine that we have an extended implementation of Scheme,
designed for use on computers with processors that can perform different
computations at the same time. This implementation supports an additional
kind of Scheme expression, the parallel-let expression, which
is just like the let expression except that the values of the
expressions in its binding specifications are computed simultaneously, in
parallel, on separate processors. (The body of a parallel-let
expression is not evaluated, of course, until all of the processors have
reported back and the values they return bound to the corresponding
identifiers.)
What happens to the running time of the first version of
vector-sum if we change the final let-expression
into a parallel-let expression?
(define vector-sum
(lambda (vec)
(let ((len (vector-length vec)))
(if (zero? len)
0
(let helper ((start 0)
(finish len))
(if (= (+ start 1) finish)
(vector-ref vec start)
(let ((mid (quotient (+ start finish) 2)))
(parallel-let ((left (helper start mid))
(right (helper mid finish)))
(+ left right)))))))))
Now the values of (helper start mid) and (helper
mid finish) be computed at the same time instead of one after the
other. The total time for the evaluation of both of these expressions is
the larger of their separate running times, rather than the sum of their
separate running times. Since, we get this advantage at each level of
recursion, the overall running time is now proportional not to the length
of the vector but to the number of bisections required to reduce that size
to 1 -- in other words, to the base-two logarithm of the length of the
vector. If the vector is long enough (and if there are enough processors
to accommodate the demand -- in theory, we might need one for each element
of the vector), this technique yields a significant reduction in running
time.
Vector-sum computes only the overall sum of the elements of
the vector. Suppose that we want to compute and store the partial sums
along the way, so that (for example) if we start with the vector
#(3 1 4 1 5 9), we get not just the
overall sum, 23, but the vector of partial sums,
#(3 4 8 9 14 23). Of course, it's
easy to do this in an amount of time proportional to the length of the
vector:
(define accumulate-vector
(lambda (vec)
(let* ((len (vector-length vec))
(result (make-vector len)))
(let loop ((index 0)
(total 0))
(if (= index len)
result
(let ((new-total (+ total (vector-ref vec index))))
(vector-set! result index new-total)
(loop (+ index 1) new-total)))))))
It might seem obvious in this case that one can't improve on this running time, but in fact parallelism again allows us to speed up the process.
Let gap be the greatest power of 2 less than the length of the
vector. First, construct a new vector that is obtained from
vec by adding to each element the element that is
gap positions to its left. (An element that is in one of the
first gap positions of vec, so that there is no
element gap positions to its left, should simply be copied
into the new vector.)
Here's how this step looks for our sample vector:
gap = 4 vec result of first step +-----+-----+-----+-----+-----+-----+ +-----+-----+-----+-----+-----+-----+ | | | | | | | | | | | | | | | 3 | 1 | 4 | 1 | 5 | 9 | | 3 | 1 | 4 | 1 | 8 | 10 | | | | | | | | | | | | | | | +-----+-----+-----+-----+-----+-----+ +-----+-----+-----+-----+-----+-----+
Now halve the gap and repeat the process, starting from the result of the first step:
gap = 2 result of first step result of second step +-----+-----+-----+-----+-----+-----+ +-----+-----+-----+-----+-----+-----+ | | | | | | | | | | | | | | | 3 | 1 | 4 | 1 | 8 | 10 | | 3 | 1 | 7 | 2 | 12 | 11 | | | | | | | | | | | | | | | +-----+-----+-----+-----+-----+-----+ +-----+-----+-----+-----+-----+-----+
Halve the gap again and repeat, starting from the result of the previous step:
gap = 1 result of second step result of third step +-----+-----+-----+-----+-----+-----+ +-----+-----+-----+-----+-----+-----+ | | | | | | | | | | | | | | | 3 | 1 | 7 | 2 | 12 | 11 | | 3 | 4 | 8 | 9 | 14 | 23 | | | | | | | | | | | | | | | +-----+-----+-----+-----+-----+-----+ +-----+-----+-----+-----+-----+-----+
In this particular case, we can stop after the third step; in general, we'd
keep repeating the process until the step in which gap is 1
has been completed. The number of repetitions is the base-two logarithm of
the length of the vector (rounded up to the next integer).
This may seem kind of magical, but there is an invariant that is preserved
from step to step and guarantees the correctness of the final result.
Express each of the (zero-based) positions in the vector as a binary
numeral in i bits, where i is the base-two logarithm of the
initial gap. After k steps, the element at position j in the
current vector is the sum of all of the elements at positions less than or
equal to j of vec whose position numbers differ from
j only in their first k bits.
Initially, this invariant holds trivially (with k = 0), since any two distinct position numbers have different binary numerals, so the invariant requires only that each element is the sum of itself. At each subsequent step, the invariant is preserved, since what is added to the element at position j in the kth step is exactly the sum of the elements farther to the left whose position numbers differ from j in their kth bit.
Even though the number of steps is the base-two logarithm of the length of
the vector, it might seem that this approach is bound to be slower, since
each step involves building an entirely new vector, of the same length as
the original. But the construction of each new vector can be parallelized:
We can build a new vector of length n in an amount of time that is
proportional to the base-two logarithm of n, using the same
divide-and-conquer technique as in vector-sum. Here's what
the entire process looks like, in our extended implementation of Scheme
(we'll assume as a precondition that vec is not the null
vector):
(define accumulate-vector
(lambda (vec)
(let* ((len (vector-length vec))
(initial-gap (let seek ((trial 1))
(let ((double (* trial 2)))
(if (< double len)
(seek double)
trial)))))
(let loop ((gap initial-gap)
(current vec))
(if (zero? gap)
current
(let ((new (make-vector len)))
(let helper ((start 0)
(finish len))
(if (= (+ start 1) finish)
(vector-set! new start
(if (< start gap)
(vector-ref current start)
(+ (vector-ref current start)
(vector-ref current (- start gap)))))
(let ((mid (quotient (+ start finish) 2)))
(parallel-let ((left (helper start mid))
(right (helper mid finish)))
(loop (quotient gap 2) new)))))))))))
Provided that each call to make-vector takes a constant amount
of time regardless of the length of its result (which is plausible, since
no initialization is needed), the running time of this algorithm as a whole
is proportional to the square of the base-two logarithm of the length of
vec. Once again, this is an improvement on the
straightforward linear version if vec is long enough.
This document is available on the World Wide Web as
http://www.cs.grinnell.edu/~stone/courses/operating-systems/parallel-algorithms.xhtml
created October 26, 2000
last revised October 26, 2000