Exercise #4: Preparing a word index

A word index for a text file is an alphabetical list of the words that occur in that file, each accompanied by the numbers of the lines in which it occurs. A word, for this purpose, is a non-empty string of adjacent alphabetic characters preceded by a non-alphabetic character (or the beginning of the file) and followed by a non-alphabetic character (or the end of the file).
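The definition of a word can be illustrated with a minimal sketch, written here in Python rather than the Scheme the exercise calls for. The name words_in is hypothetical, and the sketch assumes that "alphabetic" means the ASCII letters A-Z and a-z.

```python
import re

def words_in(line):
    """Return the words in `line`: maximal runs of alphabetic characters."""
    # Assumption: "alphabetic" means the ASCII letters A-Z and a-z.
    # The pattern matches the longest possible run of letters, so every
    # match is bounded on both sides by a non-letter or by the line's edge.
    return re.findall(r"[A-Za-z]+", line)

print(words_in("And mute's the midland navel-stone beside the singing fountain,"))
# prints ['And', 'mute', 's', 'the', 'midland', 'navel', 'stone', 'beside', 'the', 'singing', 'fountain']
```

Under this definition an apostrophe or a hyphen ends a word, so "navel-stone" contributes the two words "navel" and "stone", and "mute's" contributes "mute" and "s".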

Since the texts for which word indices are compiled are fairly long, an unabridged word index would include very large and cumbersome lists for common function words such as "the" and "of." Usually, therefore, there is a list of stop words that are to be ignored, or at least handled differently, during the indexing process. The file /home/stone/courses/scheme/examples/stop-list is an alphabetical list of seventy-five stop words for English (each one on a separate line).
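One way to handle a stop list is to read it into a set and discard any word found there. The following Python sketch illustrates the idea only; the names load_stop_words and without_stop_words are hypothetical, and the two-word sample set stands in for the real seventy-five-word list.

```python
def load_stop_words(path):
    """Read a stop list (one word per line) into a set of lower-case words."""
    with open(path, encoding="utf-8") as source:
        # Skip blank lines and surrounding whitespace; fold case so that
        # membership tests can ignore capitalization.
        return {line.strip().lower() for line in source if line.strip()}

def without_stop_words(words, stop_words):
    """Keep only the words that are not stop words, ignoring case."""
    return [word for word in words if word.lower() not in stop_words]

# Illustrative stop set only; the real list is in the stop-list file.
sample_stops = {"the", "of"}
print(without_stop_words(["The", "cave", "of", "oracles"], sample_stops))
# prints ['cave', 'oracles']
```

A set is used because membership tests are fast and each word of the text must be checked against the list once.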

For instance, suppose that we have a text file containing A. E. Housman's poem "The oracles":

'Tis mute, the word they went to hear on high Dodona mountain
    When winds were in the oakenshaws and all the cauldrons tolled,
And mute's the midland navel-stone beside the singing fountain,
    And echoes list to silence now where gods told lies of old.
 
I took my question to the shrine that has not ceased from speaking,
    The heart within, that tells the truth and tells it twice as plain;
And from the cave of oracles I heard the priestess shrieking
    That she and I should surely die and never live again.
 
Oh priestess, what you cry is clear, and sound good sense I think it;
    But let the screaming echoes rest, and froth your mouth no more.
'Tis true there's better boose than brine, but he that drowns must drink it;
    And oh, my lass, the news is news that men have heard before.
 
The King with half the East at heel is marched from lands of morning;
    Their fighters drink the rivers up, their shafts benight the air.
And he that stands will die for nought, and home there's no returning.
    The Spartans on the sea-wet rock sat down and combed their hair.

Here is a word index for this file, omitting the stop words:

again       9
air         17
before      14
benight     17
beside      3
better      13
boose       13
brine       13
cauldrons   2
cave        8
ceased      6
clear       11
combed      19
cry         11
die         9, 18
dodona      1
drink       13, 17
drowns      13
east        16
echoes      4, 12
fighters    17
fountain    3
froth       12
gods        4
good        11
hair        19
half        16
has         6
have        14
hear        1
heard       8, 14
heart       7
heel        16
high        1
home        18
is          11, 14, 16
king        16
lands       16
lass        14
let         12
lies        4
list        4
live        9
marched     16
men         14
midland     3
morning     16
mountain    1
mouth       12
must        13
mute        1, 3
navel       3
never       9
news        14
no          12, 18
not         6
nought      18
oakenshaws  2
oh          11, 14
old         4
oracles     8
plain       7
priestess   8, 11
question    6
rest        12
returning   18
rivers      17
rock        19
sat         19
screaming   12
sea         19
sense       11
shafts      17
should      9
shrieking   8
shrine      6
silence     4
singing     3
sound       11
spartans    19
speaking    6
stands      18
stone       3
surely      9
tells       7
think       11
tis         1, 13
told        4
tolled      2
took        6
true        13
truth       7
twice       7
went        1
were        2
wet         19
will        18
winds       2
within      7
word        1

Notice that, as a consequence of the way words are identified, punctuation and whitespace act only as separators and do not appear in the index. Line numbers count every line of the file, including the blank lines between stanzas, which is why the first line of the second stanza is line 6. The difference between upper- and lower-case letters is ignored; for instance, both the occurrence of the word "Oh" at the beginning of line 11 and the occurrence of "oh" in line 14 are included in the entry for "oh". If a word occurs more than once in a line, as does the word "news" in line 14, the line number is nevertheless listed only once in the index entry for that word.
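The rules just described (case folding, counting every line, and listing each line number only once per word) can be sketched in Python as follows. This is an illustration under stated assumptions, not the required Scheme solution: build_index is a hypothetical name, words are taken to be runs of ASCII letters, and the stop set in the example is made up.

```python
import re

def build_index(lines, stop_words=frozenset()):
    """Map each lower-cased non-stop word to the sorted line numbers it occurs in."""
    index = {}
    # Number every line from 1, blank lines included.
    for number, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            key = word.lower()            # ignore upper/lower case
            if key in stop_words:
                continue
            numbers = index.setdefault(key, [])
            # Lines are visited in increasing order, so a repeat within the
            # same line always shows up as the last number already recorded.
            if not numbers or numbers[-1] != number:
                numbers.append(number)
    # Return the entries in alphabetical order of the words.
    return dict(sorted(index.items()))

sample = ["Oh priestess, the news", "", "And oh, the news is news"]
print(build_index(sample, stop_words={"the", "and"}))
# prints {'is': [3], 'news': [1, 3], 'oh': [1, 3], 'priestess': [1]}
```

In the example, "news" occurs twice in the third line but that line is recorded once, and the blank second line still advances the line count.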

The assignment is to write a Scheme procedure that, given the name of a text file, creates a new file containing an abridged word index (that is, one containing no entries for the stop words) for that file. The procedure should take as its argument a string that identifies the text file to be indexed. The name of the new file should be the same as that of the specified file, except with the extension .index attached at the end; for instance, if our sample file above is called oracles.txt, the index file should be oracles.txt.index.
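The naming rule and the layout of the index file can be sketched in Python, again only as an illustration. The names index_file_name, format_entry and write_index are hypothetical, and the twelve-column word field is an assumption modeled on the sample index above, not a stated requirement.

```python
def index_file_name(text_file_name):
    """Name of the index file: the original name with .index attached."""
    return text_file_name + ".index"

def format_entry(word, line_numbers, width=12):
    """One index line: the word, padding, then comma-separated line numbers."""
    # Assumption: a 12-column word field, as in the sample index; longer
    # words still get at least one space before the numbers.
    field = word.ljust(max(width, len(word) + 1))
    return field + ", ".join(str(n) for n in line_numbers)

def write_index(text_file_name, index):
    """Write the entries of `index` alphabetically; return the new file's name."""
    target = index_file_name(text_file_name)
    with open(target, "w", encoding="utf-8") as out:
        for word in sorted(index):
            out.write(format_entry(word, index[word]) + "\n")
    return target

print(index_file_name("oracles.txt"))
# prints oracles.txt.index
print(format_entry("die", [9, 18]))
```

Here write_index expects a dictionary that maps each word to its list of line numbers, such as {"die": [9, 18]}.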

For each test run that you submit, include the original text file and the index file along with your log of the interaction with DrScheme.

This exercise will be due at 2:15 p.m. on Friday, April 29.