Merge Sort
Summary:
In a recent
reading and the
corresponding laboratory, we've explored the basics of sorting
using insertion sort. In this reading, we turn to another, faster,
sorting algorithm, merge sort.
The Costs of Insertion Sort
In looking at algorithms, we often ask ourselves how many
steps
the algorithm typically uses. Rather than
looking at every kind of step, we tend to focus on particular
kinds of steps, such as the number of times we have to call
vector-set! or the number of values we look at.
Let's try to look at how much effort the insertion sort algorithm
expends in sorting a list of n values, starting from a
random initial arrangement. Recall that insertion sort uses two lists:
a growing collection of sorted values and a shrinking collection of
values left to examine. At each step, it inserts a value into the
collection of sorted values.
In the worst case, the value we're inserting should be preceded by all the
values in the list we're inserting it into. In that case, we end up
comparing the value to all the elements in the list. Since we insert
into lists from size 1 to n-1, the average list
size is n/2. We do n such
insertions, so the number of comparisons is approxmately
n2/2. (More precisely, it's
the sum 0 + 1 + 2 + ... + n-1, which ends up
being n*(n-1)/2.)
We get this effect, for example, when sorting a list of integers that
is already in order. We insert the smallest, then the next smallest (which
goes to the end, then the next smallest (which goes to the end), and so
on and so forth.
In the best case, the value we're inserting precedes all the
values in the list we're inserting it into. In that case, we end
up only comparing it to the first element of the list. Since there
are n-1 times we insert into a non-empty list,
there are approximately n-1 comparisons.
We get this effect, for example, when sorting a list of integers that
is arranged from smallest to largest, and putting them in the order
largest to smallest.
Since there's such a big difference between the worst case and the best case,
we should probably consider the average
case. There are,
unfortunately, a variety of definitions of average. We'll chose a simple
one. In particular, we'll say that, on average, the insert routine needs
to look through about half of the
elements in the sorted part of the data structure to find the
correct insertion point for each new value it places. The size of
that sorted part increases linearly from 0 to n, so
its average size is n/2 and the average number of
comparisons needed to insert one element is n/4.
Taking all the insertions together, then, the insertion sort
performs about n2/4
comparisons to sort the entire set. That is, we do an
average of n/4 comparisons for each
insert, and we do n inserts, giving
n2/4.
This function grows much more quickly than the size of the input list.
For example, if we have 10 elements, we do about 25 comparisons. If
we have 20 elements, we do about 100 comparisons. If we have 40 elements,
we do about 400 comparisons. And, if we have 100 elements, we do about
2500 comparisons.
Does that really happen? Let's try it with some random
lists.
So, it's not exactly n2/4 steps,
for some lists (and we wouldn't expect it to be) but it's close enough for us
to be confident that it has a shape fairly similar to
n2/4 steps.
Because this function grows so quickly, it becomes quite slow to sort larger
lists (say, with more than 1000 values). Hence, it is valuable to find a
sorting method that performs fewer comparisons per value in the list, even
if it takes more effort to preprocess the list or to write the procedure.
In this reading, we explore one such procedure.
Divide and Conquer
What techinques do we know for making algorithms faster?
As we saw in the case of binary search, it is often profitable to divide
an input in half. We call this technique divide-and-conquer.
The strategy works somewhat differently for different domains. For binary search, once
dividing the list in half, we could throw away half and then
recurse on the other half. Clearly, for sorting, we cannot throw away
part of the list. However, we can still rely on the idea of dividing in
half. That is, we'll divide the list into two halves, sort them, and
then do something with the two result lists.
Here's a sketch of the algorithm in Scheme:
(define new-sort
(lambda (stuff may-precede?)
; If there are only zero or one elements in the list,
; the list is already sorted.
(if (or (null? stuff) (null? (cdr stuff)))
stuff
; Otherwise, split the list in half
(let* ((halves (split stuff))
(firsthalf (car halves))
(secondhalf (cadr halves))
; And sort each half
(sortedfirst (new-sort firsthalf))
(sortedsecond (new-sort secondhalf)))
; Do some more stuff
???))))
Merging
But what do we do once we've sorted the two sublists? We need to
put them back into one list. Traditionally, we refer to the process
of joining two sorted lists as merging. It is
relatively easy to merge two lists: You repeatedly take whichever
element of the two lists should come first. When do you stop?
When you run out of elements in one of the lists, in which case you
use the elements of the remaining list. Putting it all together,
we get the following:
Splitting
We know how to sort if we can split a list into two parts, sort the smaller
lists, and merge the sorted lists. We can sort the smaller lists recursively.
We've just figured out how to merge the two sorted lists.
All that we have left to do is to figure out how to split a list into
two parts. One easy way is to get the length of the list and then
cdr down it for half the elements, accumulating the skipped elements
as you go. Since it's easiest to accumulate a list in reverse order,
we re-reverse it when we're done. (Merge sort doesn't really care
that they're in the original order, but perhaps we want to use
split for other purposes.)
In the corresponding lab,
you'll have an opportunity to consider other ways to split the list. In
that lab, you'll work with a slightly changed version of the code.
Putting It Together
We saw most of the merge-sort procedure above,
but with a bit of code left to fill in. Here's a new version, with
that code filled in (and a few other changes).
An Alternative: From Small Lists to Large Lists
There's an awful lot of recursion going on in merge sort as we
repeatedly split the list again and again and again until we reach
lists of length one. Rather than doing all that recursion, we can
start by building all the lists of length one and then repeatedly
merging pairs of neighboring lists. For example, suppose we start
with sixteen values, each in a list by itself.
((20) (42) (35) (10) (69) (92) (77) (27) (67) (62) (1) (66) (5) (45) (25) (90))
When we merge neighbors, we get sorted lists of two elements. At some
places such as when we merge (20) and (42),
the elements stay in their respective order. At other places, such
as when we merge (35) and (10), we need to
swap order to build ordered lists of two elements.
((20 42) (10 35) (69 92) (27 77) (62 67) (1 66) (5 45) (25 90))
Now we can merge these sorted lists of two elements into sorted lists
of four elements. For example, when we merge (20 42)
and (10 35), we first take the 10 from the second list,
then the 20 from the first list, then the 35 from the second list,
then the 42 that is all that's left.
((10 20 35 42) (27 69 77 92) (1 62 66 67) (5 25 45 90))
We can merge these sorted lists of four elements into sorted lists
of eight elements.
((10 20 27 35 42 69 77 92) (1 5 25 45 62 66 67 90))
Finally, we can merge these sorted lists of eight elements into one
sorted list of sixteen elements.
((1 5 10 20 25 27 35 42 45 62 66 67 69 77 90 92))
Now we have a list of one list, so we take the car to extract the list.
(1 5 10 20 25 27 35 42 45 62 66 67 69 77 90 92)
Translating this technique into code is fairly easy.
We use one helper, merge-pairs to merge neighboring pairs.
We use a second helper, repeat-merge to repeatedly call
merge-pairs until there are no more pairs to merge.
The Costs of Merge Sort
At the beginning of this reading, we saw that insertion sort takes
approximately n2/4
steps to sort a list of n elements. How long does
merge sort take? We'll look at new-merge-sort,
since it's easier to analyze. However, since it does essentially the
same thing as the original merge-sort, just in
a slightly different order, the running time will be similar.
We'll do our analysis in a few steps. First, we will consider the
number of steps in each call to merge-pairs. Next,
we will consider the number of times repeat-merge
calls merge-pairs. Finally, we'll put the two
together. To make things easier, we'll assume that n
(the number of elements in the list) is a power of two.
Initially, repeat-merge calls
merge-pairs on n lists
of length 1 to merge them into n/2 lists of
length 2. Building a list of length 2 takes approximately two
steps, so merge-pairs takes approximately
n steps to do its
first set of merges.
Next, repeat-merge calls
merge-pairs on n/2 lists
of length 2 to merge them into n/4 lists of
length 4. Building a merged list of length 4 takes approximately
four steps, so merge-pairs takes approximately
n steps to build n/4 list of
length 4.
Each time, repeat-merge
calls repeat-merge to merge
n/2k
lists of length 2k into
n/2k+1
lists of length 2k+1.
A little math suggests that this once again takes approximately
n steps.
So far, so good. Now, how many times do we call
merge-pairs? We go from lists of length 1, to lists
of length 2, to lists of length 4, to lists of length 8, ...,
to lists of length n/4, to lists of length
n/2, to one list of length n.
How many times did we call merge-pairs?
The number of times we need to multiply 2 by itself to get
n. As we've noted before, the formal name for that
value is log2n.
To conclude, merge sort repeats a step of nsteps
log2n
times. Hence, it takes approximately
nlog2n
steps.
Is this much better than insertion sort, which took approximately
n2/4?
Here's a chart that will help you compare various running times.
| n |
log2n |
n2 |
n2/4 |
nlog2n |
| 10 | 3.3 | 100 | 25 | 33 |
| 20 | 4.3 | 400 | 100 | 86 |
| 30 | 4.9 | 900 | 225 | 147 |
| 40 | 5.3 | 1,600 | 400 | 212 |
| 80 | 6.3 | 6,400 | 1,600 | 506 |
| 100 | 6.6 | 10,000 | 2,500 | 660 |
| 500 | 9.0 | 250,000 | 62,500 | 4,483 |
| 1000 | 10.0 | 1,000,000 | 250,000 | 10,000 |
As you can see, although the two sorting algorithms start out taking
approximately the same time, as the length of the list grows, the
relative cost of using insertion sort becomes a bigger and bigger ratio
of the cost of using merge sort.
Does merge sort really take about nlog2n? Let's see.
So, it does a bit better than predicted for already sorted (or
backwards-sorted) lists. (Can
you figure out why?) For random lists, the estimate is pretty good. For
large lists, merge sort clearly beats insertion sort.
Documenting Merge Sort
You may have noted that we have not yet written the documentation for
merge sort. Why not? Because it's basically the same as the documentation
for any other sorting routine.