A word index for a text file is an alphabetical list of the words that occur in that file, together with an indication, for each word, of the number of each line in which the word occurs. A word, for this purpose, is defined as a non-null string of adjacent alphabetic characters preceded by a non-alphabetic character (or the beginning of the file) and followed by a non-alphabetic character (or the end of the file). For instance, suppose that we have a text file containing one of Percy Bysshe Shelley's sonnets:
I met a traveller from an antique land
Who said: Two vast and trunkless legs of stone
Stand in the desert. Near them, on the sand,
Half sunk, a shattered visage lies, whose frown,
And wrinkled lip, and sneer of cold command,
Tell that its sculptor well those passions read
Which yet survive, stamped on these lifeless things,
The hand that mocked them, and the heart that fed;
And on the pedestal these words appear:
``My name is Ozymandias, king of kings:
Look on my works, ye Mighty, and despair!''
Nothing beside remains. Round the decay
Of that colossal wreck, boundless and bare
The lone and level sands stretch far away.
Here is the word index for this file:
a 1, 4
an 1
and 2, 5, 8, 9, 11, 13, 14
antique 1
appear 9
away 14
bare 13
beside 12
boundless 13
cold 5
colossal 13
command 5
decay 12
desert 3
despair 11
far 14
fed 8
from 1
frown 4
half 4
hand 8
heart 8
i 1
in 3
is 10
its 6
king 10
kings 10
land 1
legs 2
level 14
lies 4
lifeless 7
lip 5
lone 14
look 11
met 1
mighty 11
mocked 8
my 10, 11
name 10
near 3
nothing 12
of 2, 5, 10, 13
on 3, 7, 9, 11
ozymandias 10
passions 6
pedestal 9
read 6
remains 12
round 12
said 2
sand 3
sands 14
sculptor 6
shattered 4
sneer 5
stamped 7
stand 3
stone 2
stretch 14
sunk 4
survive 7
tell 6
that 6, 8, 13
the 3, 8, 9, 12, 14
them 3, 8
these 7, 9
things 7
those 6
traveller 1
trunkless 2
two 2
vast 2
visage 4
well 6
which 7
who 2
whose 4
words 9
works 11
wreck 13
wrinkled 5
ye 11
yet 7
Notice that, as a consequence of the way words are identified, punctuation and whitespace act only as separators and do not appear in the index. The difference between upper- and lower-case letters is ignored; for instance, the occurrence of the word `And' at the beginning of line 5 is included in the list of occurrences of `and'. If a word occurs more than once in a line, as does the word `that' in line 8, the line number is nevertheless listed only once in the index entry for that word.
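The rules above can be illustrated with a short sketch. It is written in Python rather than Scheme, so it shows the specification at work without prescribing a solution. The names extract_words and build_index are hypothetical, and the sketch assumes that "alphabetic" means the ASCII letters A-Z and a-z, which the definition above does not settle. Because a line break is itself a non-alphabetic character, no word can span two lines, so processing the file one line at a time agrees with the definition.

```python
import re
from collections import defaultdict


def extract_words(line):
    """Return the words in one line: maximal runs of letters, lower-cased.

    Assumption: only ASCII letters count as alphabetic characters.
    """
    return [word.lower() for word in re.findall(r"[A-Za-z]+", line)]


def build_index(lines):
    """Map each word to the sorted list of 1-based line numbers it occurs on."""
    occurrences = defaultdict(set)  # a set keeps each line number at most once
    for number, line in enumerate(lines, start=1):
        for word in extract_words(line):
            occurrences[word].add(number)
    # Build the result in alphabetical order of the words.
    return {word: sorted(numbers) for word, numbers in sorted(occurrences.items())}


sample = [
    "And wrinkled lip, and sneer of cold command,",
    "The hand that mocked them, and the heart that fed;",
]
for word, numbers in build_index(sample).items():
    print(word, ", ".join(str(n) for n in numbers))
```

In this two-line sample the lines are numbered 1 and 2 rather than 5 and 8. The entry for `and' lists both lines, regardless of capitalization, and the entry for `that' lists line 2 only once even though the word occurs there twice.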
The assignment is to write a Scheme procedure that, as a side effect, creates a new file containing a word index for a specified text file. The procedure should take as its argument a string that identifies the text file to be indexed. The name of the new file should be the same as that of the specified file, except with the extension .index attached at the end; for instance, if our sample file above is called Ozymandias.txt, the index file should be Ozymandias.txt.index.
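The file-handling side of the task might look like the following sketch, again in Python rather than Scheme and not a model solution. The name write_word_index is hypothetical. The sketch also assumes ASCII letters, UTF-8 text, and an output layout of one entry per line, and it returns the name of the new file purely as a convenience; the assignment itself asks only for the side effect.

```python
import re


def write_word_index(path):
    """Write a word index for the text file at `path` to `path + ".index"`."""
    occurrences = {}  # word -> set of 1-based line numbers
    with open(path, encoding="utf-8") as source:
        for number, line in enumerate(source, start=1):
            # Assumption: a word is a maximal run of ASCII letters.
            for word in re.findall(r"[A-Za-z]+", line):
                occurrences.setdefault(word.lower(), set()).add(number)

    # The extension is attached at the end of the full name:
    # Ozymandias.txt becomes Ozymandias.txt.index.
    index_path = path + ".index"
    with open(index_path, "w", encoding="utf-8") as target:
        for word in sorted(occurrences):
            numbers = ", ".join(str(n) for n in sorted(occurrences[word]))
            target.write(f"{word} {numbers}\n")
    return index_path  # returned for convenience only
```

Calling write_word_index("Ozymandias.txt") would read that file and create Ozymandias.txt.index alongside it, replacing any existing file of that name.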
For each test run that you submit, include the original text file and the index file along with your log of the interaction with DrScheme.
This document is available on the World Wide Web as
http://www.cs.grinnell.edu/~stone/courses/scheme/exercises/word-index.xhtml
created November 20, 2001
last revised November 20, 2001