Search Algorithms
Summary:
We consider a typical problem of computing
and a variety of algorithms used to solve that problem.
Introduction
To search a data structure is to examine its elements
one-by-one until either (a) an element that has a desired property is
found or (b) it can be concluded that the structure contains no element
with that property. For instance, one might search a vector of integers
for an even element, or a vector of pairs for a pair having the string
"elephant" as its cdr.
You've already encountered a number of forms of searching in Scheme.
For example, you've searched lists using assoc.
You've also written more general procedures that find multiple elements
with particular properties or that find elements based on more than
just the car of an element list.
We're now reading to think about a more general form of searching,
one in which we specify the criterion for searching as a procedure
value, rather than hard-coding the particular criterion in the
structure of the search.
Sequential Search
In a linear data structure -- such as a flat list, a vector, or a
file -- there is an obvious algorithm for conducting a search: Start at
the beginning of the data structure and traverse it, testing each element.
Eventually one will either find an element that has the desired property
or reach the end of the structure without finding such an element,
thus conclusively proving that there is no such element. We used such
a strategy for searching association lists. Here are a
few alternate versions of the algorithm.
> (define sample (vector 1 3 5 7 8 11 13))
> (vector-sequential-search sample even?)
4 ; The position of 8
> (vector-sequential-search sample (right-section = 12))
#f
> (vector-sequential-search sample (left-section < 9))
5 ; The position of 11
Alternative Return Values
These search procedures return #f if the search is
unsuccessful. The first returns the matched value if the search is
successful. The second returns returns the position in the specified
vector at which the desired element can be found. There are many variants
of this idea: One might, for instance, prefer to signal an error or display
a diagnostic message if a search is unsuccessful. In the case of a
successful search, one might simply return #t, if all that is
needed is an indication of whether an element having the desired property
is present in or absent from the list.
Searching For Keyed Values
One of the most common real-world
searching problems is that
of searching a collection of compound values for one which matches a
particular portion of the value, known as the key.
For example, we might search a phone book for a phone number using a
person's name as the key or we might search a phone book for a person
using the number as key. As you've probably noted, association lists
implement this kind of searching, as long as we use the first value of a list
as the key for that list.
If each value in the list or vector to search is a list, and the key
is the first element of that list, and we are searching for strict
equality, then we can use assoc to search the
list. However, if we might want to search using the second element
as the key, or a combination of elements as the key, then we might
want to make a get-key procedure a parameter
to our search procedure.
For example, consider the named objects from the lab on association
lists. As you may recall, we represent each object as a list of values,
the first of which is the name, the second of which is the type (ellipse
or rectangle), the third of which is the color, the remaining specify
the boundaries (left, right, top, bottom). To search by name, we use
car for get-key. To search by
type, we use cadr for get-key.
To search by color, we use caddr for
get-key.
; Search drawing for something named "vc"
> (keyed-list-sequential-search drawing car "vc")
; Search drawing for an ellipse
> (keyed-list-sequential-search drawing cadr "ellipse")
; Search drawing for something white
> (keyed-list-sequential-search drawing caddr "white")
Binary Search
The sequential search algorithms just described can be quite slow if the
data structure to be searched is large. If one has a number of searches
to carry out in the same data structure, it is often more efficient to
pre-process
the values, sorting them and transferring
them to a vector, before starting those searches. The reason is that one
can then use the much faster binary search algorithm.
Binary search is a more specialized algorithm than sequential search.
It requires a random-access structure, such as a vector, as opposed to
one that offers only sequential access, such as a list. Binary search
is limited to the kind of test in which one is looking for a particular
value that has a unique relative position in some ordering. For instance,
one could use a binary search to look for an element equal to 12 in a
vector of integers ordered from smallest to largest, since 12 is uniquely
located between integers less than 12 and integers greater than 12;
but one wouldn't use binary search to look for an even integer, since
the even integers don't have a unique position in any natural ordering
of the integers.
Note that this means that we have to organize the vector based on the
kind of value we want to search for. If we want to search a vector
of named objects by name, we organize it alphabetically by name.
If we want to search a vector of named objects by color, we organize
it alphabetically by color. If we want to search a vector of named
objects by width, we organize it numerically by width.
In binary search, we keep track of the vector, the value searched for,
and the lower and upper bounds of the region still of interest. The key
idea is to divide the region of interest of the sorted vector into two
approximately equal parts, examining the element at the point of division
to determine which of the parts must contain the value sought.
There are usually three possibilities for the relationship between the
value sought and the element at the point of division.
The key sought is the key of the element at
the point of division. The search has succeeded.
The key sought cannot follow the key of the element at the point of
division in the ordering that was used to sort the vector. In this
case, the value sought must be in a position with a lower index that
the element at the point of division (if it is present at all) -- in
other words, it must be in the left half of the region of interest.
The search procedure invokes itself recursively to search just the
left half of that region.
The value sought cannot precede the element at the point of division.
In this case, the value sought must be in a higher-indexed position
-- in the right half of the region -- if it is present at all.
The search procedure invokes itself recursively to search just the
right half of the region.
There is one other way in which the recursion can terminate: If, in some
recursive call, the region to be searched contains no elements at all,
then the search obviously cannot succeed and the procedure should take
the appropriate failure action.
Here, then, is the basic binary-search algorithm. The identifiers
lower-bound and upper-bound
denote the starting and ending positions of the region of the vector
within which the value sought must lie, if it is present at all.
(We use the convention that the starting and ending positions are
inclusive in that they are positions within the
vector that we must include in the search.)
An Example
So, how do we use binary search to search a sorted vector? It depends
on what the vector contains. Let's suppose it contains a list of
named objects, sorted by name. Here's one such vector.
Now, binary-search has four parameters: a vector to search,
the procedure that extracts a key from each element in the vector,
the procedure used to compare keys, and a key to search for. For this
example, the vector to search will be objects-by-name
and the name to search for will be whatever name we want. To get the
name from an entry, we use car. To compare two names,
we use string-ci<=?.
So, to find out the index of the entry for the object named "Heather",
we would write something like the following:
> (binary-search objects-by-name car string-ci<=? "Heather")
8
> (vector-ref objects 8)
("Heather" "rectangle" "white" 100 140 35 50)
To make it easier for people who don't want to write so much, we might
wrap that instruction into a more-specific procedure that looks up
objects by name, returning objects, rather than indices.
Let's watch it work.
> (lookup-object objects-by-name "Heather")
("Heather" "rectangle" "white" 100 140 35 50)
> (lookup-object objects-by-name "Sam")
("Sam" "ellipse" "white" 20 120 35 40)
> (lookup-object objects-by-name "Janet")
("Janet" "ellipse" "black" 60 70 5 20)
Another Example: Searching for Primes
Let's take a detour into a traditional mathematical problem:
Given a number, n, how do you decide if n is prime?
As you might expect, there are a number of ways to determine whether or
not a value is prime. Since we know a lot of primes, for small primes
the easiest technique is to search through a vector of known primes.
We could, of course, use a sequential search technique to look for a value
in this vector. However, binary search is much more efficient. What procedure
should we use for get-key? Well, each value is its own key, so
we use (lambda (x) x). The values are ordered numerically, so
we use < for less-than.
For example,
; Is 231 a prime?
> (binary-search small-primes (lambda (x) x) < 231)
-1 ; No
; Is 241 a prime?
> (binary-search small-primes (lambda (x) x) < 241)
52 ; Yes, it's prime number 52
; How many primes are there less than 1000?
> (vector-length small-primes)
168
In procedure form, we might write
(define is-small-prime?
(lambda (candidate)
(not (= -1 (binary-search small-primes (lambda (x) x) < candidate)))))
Now, how many recursive calls do we do in determining whether or not a
candidate value is a small prime? If we were doing a sequential search,
we'd need to look at all 168 primes less than 1000, so approximately
168 recursive calls would be necessary. In binary search, we split the
168 into two groups of approximately 84 (one step), split one of those
groups of 84 into two groups of 42 (another step), split one of those
groups into two groups of 21 (another step), split one of those groups
of 21 into two groups of 20 (we'll assume that we don't find the value),
split 10 into 5, 5 into 2, 2 into 1, and then either find it or don't.
That's only about six recursive calls. Much better than the 168.
Now, suppose we listed another 168 or so primes. In sequential search,
we would now have to do 336 recursive calls. With binary search,
we'd only have to do one more recursive call (splitting the 336 or so
primes into two groups of 168).
This slow growth in the number of recursive calls (that is, when you double
the number of elements to search, you double the number of recursive calls
in sequential search, but only add one to the number of recursive calls in
binary search) is one of the reasons that computer scientists love binary
search.
Verifying That a Vector is Sorted
For binary-search to work correctly, we need to
have a sorted vector. Checking that a vector is sorted will require
looking at every neighboring pair of values, so it is not something we
want to do every time we call binary search. However, it is helpful
to have such a procedure available.
Here are some tests for the vectors we defined earlier.
> (vector-sorted? small-primes id <)
#t
> (vector-sorted? objects-by-name car string-ci<=?)
#t
> (vector-sorted? objects-by-name cadr string-ci<=?)
#f