Sorting
Summary:
We consider a variety of techniques used to put a list or vector in
order, using a binary comparison operation to determine the ordering
of pairs of elements.
The Problem of Sorting
Sorting a collection of values -- arranging them in a fixed order, usually
alphabetical or numerical -- is one of the most common computing
applications. When the number of values is even moderately large, sorting
is such a tiresome, error-prone, and time-consuming process for human
beings that the programmer should automate it whenever possible. For this
reason, computer scientists have studied this application with extreme care
and thoroughness.
One of the clear results of their investigations is that no one
algorithm for sorting is best in all cases. Which approach is best
depends on whether one is sorting a small collection or a large
one, on whether the individual elements occupy a lot of storage
(so that moving them around in memory is time-consuming), on how
easy or hard it is to compare two elements to figure out which
one should precede the other, and so on. In this course we'll be
looking at two of the most generally useful algorithms for sorting:
insertion sort, which is the subject of this
reading, and merge sort, which we will consider
in another reading. In our general exploration of sorting, we may
also discuss other sorting algorithms.
Imagine first that we're given a collection of values and a rule
for arranging them. The values might actually be stored in a
list, vector, or file. Let's assume first that they are in a list.
The rule for arranging them typically takes the form of a predicate with
two parameters that can be applied to any two values in the collection
to determine whether the first of them could precede the second when
the values have been sorted. (For example, if one wants to sort a
collection of real numbers into ascending numerical order, the rule
should be the predicate <=; if one wants to
sort a collection of strings into alphabetical order, ignoring case,
the rule should be string-ci<=?; if one wants
to sort a collection of real numbers into descending numerical order,
the rule should be >=; and so on.)
The Insertion Sort Algorithm
As one might guess, the key operation in insertion sort is that
of insertion. We envision separating the
elements to be sorted into two collections: those that are not yet
sorted, and those that are already sorted. We repeatedly
insert one value from the unsorted collection into
the proper place in the sorted collection.
The way in which we represent the unsorted and sorted collections
has an effect on the way in which we implement the insertion
operation. Let's start by representing collections as lists.
Midway through the sorting process, we have one list of unsorted
values (the values we have not yet processed) and a second, sorted,
list. To insert a value into the sorted list, we step over the
elements that should precede the value, stopping when we find an
element that should follow the value (or when we run out of elements).
We then cons the new value onto that section of the list and rebuild
the prior elements.
Inserting Elements
We'll start by considering the specific case of sorting a list of
real numbers in ascending order.
We begin with the procedure used to insert a new value into a list that
is already in order. Because we're inserting the numbers in ascending
order, we can use <= as the ordering predicate.
In English: If the list into which the new element is to be inserted
is empty, return a list containing only the new element. If the new
element can precede the first element of the existing list, then,
since the existing list is assumed to be sorted already, it must also
be able to precede every element of the existing
list, so attach the new element onto the front of the existing list
and return the result. Otherwise, we haven't yet found the place,
so issue a recursive call to insert the new element into the cdr of
the current list, then re-attach its car at the beginning of the result.
Sorting a list
Now let's return to the overall process of sorting an entire list.
The insertion sort algorithm simply takes up the elements of the
list to be sorted one by one and inserts each one into a new list,
which is initially empty:
Insertion Sort, Generalized
Okay, let's now implement the two forms of generalized insertion sort.
This time, we'll make the insert procedure local,
since it need not be used outside of insertion sort.
What changes do we need to make for keyed insertion sort? Not many.
The only differences are (a) we need to add the get-key
parameter and (b) every time we used may-precede?, we
must now add calls to get-key. Fortunately, there's
only one call to may-precede?, so the update is
small.
Of course, this code is quite close to that of list-insertion-sort.
Hence, rather than doing a cut-paste-edit on that code, we might instead
call list-insertion-sort, providing
it with a predicate that extracts keys and then compares.
Why use the second strategy? One obvious reason is that
it's smaller. A second is that it may actually be less work
to write this version than to figure out what to change in
list-insertion-sort. But the best reason to use
this strategy (that is, calling another function with a well-designed
parameter, rather than cutting, pasting, and changing) is that it
makes updates much easier. If we realize that we made a mistake in
list-insertion-sort or simply find a better way
to do some part of it, we only have one procedure to update, rather
than two.
Sorting a Vector
Of course, as we saw in our exploration of binary search, in order to do
binary search, we need a sorted vector, rather than
a sorted list. To sort a vector, we could turn
that vector into a list, sort the list, and then build a new vector.
However, in sorting a vector, the goal of sorting is often to
overwrite the old arrangement of those values with
a new, sorted arrangement of the same values. This type of sorting
is often called in-place sorting. Let's consider
how we might use the ideas of insertion sort to sort a vector in place.
As you may recall, insertion sort requires us to have two collections:
one of which is sorted and the other of which is unsorted. We will
partition the original vector into two sub-vectors: a sorted sub-vector,
in which all of the elements are in the correct order relative to one
another, and an unsorted sub-vector in which the elements are still
in their original positions. The two sub-vectors are not actually
separated; instead, we just keep track of a boundary between them
inside the original vector. Items to the left of the boundary are
in the sorted sub-vector; items to its right, in the unsorted one.
Initially the boundary is at the left end of the vector. The plan is
to shift the boundary, one position at a time, to the right end. When it arrives,
the entire vector has been sorted.
Here's the plan for the main algorithm.
The insert! procedure takes four parameters: an element
to be inserted into the sorted part of the vector, the vector itself,
the current boundary position, and the comparison procedure. The new
element can be inserted at any position up to and including the current
boundary position, but it must be placed in the correct order relative
to elements to the left of that boundary. This means that any elements
that should follow the new one should be shifted one position to the
right in order to make room for the new one. (Elements that precede
the new one can keep their current positions.)
How does this work? We assume that there's a space
at position
pos of the vector. (That is, that we can safely insert
something there without removing anything from the vector.) We know
that the condition holds at the beginning from the description. That is,
the postcondition specifically ignores the value that was in the
boundary position. We also know that the conditional holds from the way
insert! was called from insertion-sort!, since
the boundary is initially the position of the value we insert.
Now, what do we do? If the position is at the left end of the vector,
there's nothing smaller in the vector, so we just put the new value there.
If the thing to the left of the current position is smaller, we know we've
reached the right place, so we put the value there. In every other case,
the value to the left is larger than the value we want to insert, so we
shift that value right (into the pos position) and continue
working one position to the left. Since we've copied the value to the right,
it is remains safe to insert something in the position just vacated
(that is, pos-1).
You will have a chance to explore this procedure further in
the laboratory.