Laboratory Exercises For Computer Science 151

An Introduction to Files

Goals: This laboratory exercise introduces the motivation for using file storage and outlines a general approach for processing files.

Some Motivations for File Storage

Up to this point, all programs developed for this course have stored their data in the computer's main memory. This has the advantage that the computer can process data in main memory more quickly and efficiently than it can if data are stored elsewhere.

On the other hand, storing data in main memory also creates some difficulties. For example, all data must be provided to the program each time the program is run, and the amount of data is limited by the size of main memory. In addition, except for information specified in define statements, any data passed through parameters or computed during procedure execution are destroyed once the procedure has finished executing, and no data are preserved from one use of Scheme to the next.

This lab introduces files as a mechanism to overcome some of these disadvantages. Specifically, disk files provide a way to store information in bulk over a long period of time. In applications, we may give up some efficiency, as data in files normally must be read into main memory before they can be used. Similarly, any results we want stored permanently must be written out explicitly to a disk file. While this storing and retrieving of data requires some specific work, the use of files does overcome the disadvantages mentioned above.

To introduce some basic ideas, this lab begins by developing solutions for two problems. In each case, we assume that data have been placed in a file using an editor (such as xemacs). That is, we assume that we have typed data into an editor and saved the data in a file which we have named. As with other files we edit, we could go back to our editor to revise or expand the data whenever we wanted.

For the problems that follow, the lab first describes a solution by using pseudocode -- a step-by-step outline of what we need to do to solve the problem. Then, the lab develops the Scheme implementations of those solutions.


Problem 1: Find the sum of the numbers in a given file.

Solution Outline: The solution to this problem follows an approach common to much file processing. First, we open the file; then, we read and process data; finally, we close the file and print the results. The pseudocode solution that follows adds a few more details:

  1. Open the file containing the numbers.
  2. Initialize a running total to 0.
  3. Until the end of the file is reached:
    1. Read in the next number from the file.
    2. Add it to the running total.
  4. Close the file.
  5. Report the final value of the running total.

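Solution in Scheme: Four built-in Scheme procedures help us with the various parts of this job. The open-input-file procedure takes a string naming a file and returns an input port through which the file's contents can be read. The read procedure takes an input port and returns the next value (here, the next number) available through that port, or a special end-of-file object if nothing is left. The eof-object? predicate tests whether a value is that end-of-file object. Finally, close-input-port takes an input port and closes it once we are done with the file.

Here is the Scheme implementation of the pseudocode:

(define sum-of-file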
  (lambda (source-file-name)
    (let ((source (open-input-file source-file-name)))
                                        ; Open the file.
      (let loop ((total 0)              ; Initialize the running total.
                 (next (read source)))  ; Try to read a number.
        (if (eof-object? next)          ; If you get the end-of-file object,
            (begin
              (close-input-port source) ; close the file
              total)                    ; and report the final total.
            (loop (+ next total)        ; Otherwise, add the number to
                                        ;    the running total,
                  (read source)))))))   ; try to read another number,
                                        ;    and repeat the loop.

A typical interaction using this procedure would look like this:

> (sum-of-file "/home/walker/151s/labs/file1.dat")
200

The file /home/walker/151s/labs/file1.dat contains the four numbers
50
50
75
25
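
To see how read and eof-object? cooperate without creating a data file, we can experiment with a string port. The following is only a minimal sketch: it assumes that your Scheme provides open-input-string (Chez Scheme and R7RS implementations do, but it is not part of R5RS), and the names read-demo-port, first-value, second-value, and third-value are made up for this illustration.

```scheme
; A string port stands in for a small data file.
; (Assumption: open-input-string is available in your Scheme.)
(define read-demo-port (open-input-string "50 75"))

(define first-value  (read read-demo-port))  ; the number 50
(define second-value (read read-demo-port))  ; the number 75
(define third-value  (read read-demo-port))  ; nothing left: the end-of-file object

(display (+ first-value second-value))       ; prints 125
(newline)
(display (eof-object? second-value))         ; prints #f
(newline)
(display (eof-object? third-value))          ; prints #t
(newline)
```

Each call to read consumes one value, so the third call has nothing left to return and produces the end-of-file object instead of a number.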


  1. Compare this program with the pseudocode solution given earlier. Identify how each part of the pseudocode is reflected in the Scheme code.

  2. Trace this program for the data file given. Be sure you can explain how it works.

  3. Explain why the expression (read source) appears twice in the above code. What is the purpose of each appearance of this expression?


Problem 2: Given a file containing numbers, with one or more numbers on each line, compute the average of the numbers on each line and write it in a new file. (In other words, the new file should contain one number for each line of the given file -- the average of the numbers on that line.)

Solution Outline: The pseudocode solution to this problem is:

  1. Open the input file.
  2. Open the output file.
  3. Until the end of the file is reached:
    1. Initialize a running total of the numbers on the current line to 0.
    2. Initialize a tally of those numbers to 0.
    3. Until the end of a line is reached:
      1. Read in a number from the input file.
      2. Add it to the running total.
      3. Add 1 to the tally.
    4. Compute the average.
    5. Write it to the output file.
  4. Close the input file.
  5. Close the output file.

Solution in Scheme: To detect the end of a line in Scheme, we need a procedure that has not yet been introduced: peek-char. The peek-char procedure takes one argument, an input port, and returns the first unread character that can be accessed through that port. It does not actually read that character or extract it from the port -- it just peeks at it to see what it will be when and if it is (subsequently) read in.

If the value returned by (peek-char source) is the newline character, #\newline, then we know that we're at the end of the line and can proceed to average the numbers that we've encountered. (This is also a good time to read in and discard the newline character, so that we can start into the next line of numbers without encountering it again.)
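
The difference between peeking and reading can be checked with a short experiment. This is a sketch under stated assumptions, not part of the lab's solution: it uses a string port as a stand-in for a two-line file, so it assumes open-input-string is available (as in Chez Scheme and R7RS), and all of the names it defines are invented for the illustration.

```scheme
; The string imitates a file containing "7", a newline, and "8".
; (Assumption: open-input-string is available in your Scheme.)
(define peek-demo-port
  (open-input-string (string #\7 #\newline #\8)))

(define peeked-once  (peek-char peek-demo-port)) ; #\7, still unread
(define peeked-twice (peek-char peek-demo-port)) ; #\7 again: peeking consumes nothing
(define value-read   (read peek-demo-port))      ; 7: now the digit is consumed
(define at-line-end  (peek-char peek-demo-port)) ; #\newline: the line is finished
(define discarded    (read-char peek-demo-port)) ; #\newline: read and thrown away
(define next-line    (peek-char peek-demo-port)) ; #\8: first character of line two
```

Because peek-char leaves the character in place, we can test for #\newline first and only afterwards decide whether to call read-char (to discard the newline) or read (to take in another number).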

Like (read source) and (read-char source), (peek-char source) returns the end-of-file object if there are no more characters in the file. Here's a procedure that manages the inner loop of the pseudocode shown above. It reads in one line of numbers from the input port source and writes their average to the output port.

(define average-line
  (lambda (source target)
    (let loop2 ((total 0)                   ; Initialize the running total
                (tally 0)                   ; Initialize the tally.
                (ch (peek-char source)))    ; Peek at the next character.
      (if (or (eof-object? ch)              ; If the input has run out
              (char=? ch #\newline))        ;    or the line has ended,
          (begin
            (read-char source)              ; discard the newline (if any),
            (write (/ total tally) target)  ; compute the average and write
                                            ;    it to the target file,
            (newline target))               ; and terminate the line in the
                                            ;    target file.
          (let ((next (read source)))       ; Otherwise, read a number.
            (loop2 (+ total next)           ; Add it to the running total.
                   (+ tally 1)              ; Add 1 to the tally.
                   (peek-char source))))))) ; Peek at the next character
                                            ;    and repeat the loop.

The Scheme implementation of the solution to Problem 2 is now easy to write: open the files, call average-line once for each line of the input file, and finally close the files:

(define average-each-line
  (lambda (source-file-name target-file-name)
    (let ((source (open-input-file source-file-name))
                                               ; Open the input file.
          (target (open-output-file target-file-name)))
                                               ; Open the output file.
      (let loop1 ((ch (peek-char source)))     ; Peek at the next character.
        (if (eof-object? ch)                   ; If you get the eof-object,
            (begin
              (close-input-port source)        ; close the input file
              (close-output-port target))      ; and the output file.
            (begin                             ; Otherwise,
              (average-line source target)     ; read the numbers on one
                                               ;    line and compute and
                                               ;    write their average.
              (loop1 (peek-char source)))))))) ; Peek at the next character
                                               ;    and repeat the loop.

Here's what a typical invocation of this procedure looks like:

> (average-each-line "/home/walker/151s/labs/file2.dat" "lab.output")

Nothing shows up on screen, because the last operation performed by average-each-line is the call to close-output-port, which returns an unspecified value (and Chez Scheme doesn't bother to print unspecified values). All the action takes place off stage, in the files: If the input file contains three lines of numbers -- say, for instance,

40 90 100 60 90
50 25 75
30 90 60 10 80 50 70 40 20

-- then the program will create the output file lab.output, looking like this:

76
50
50

but the creation of this file is invisible to the interactive Scheme user.
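
If we would rather stay inside Scheme, a short helper can echo a file's contents to the screen. The sketch below is one possible way to do this; show-file is a hypothetical name introduced here, not a built-in procedure and not part of the lab's code. It follows the same open, loop, close pattern as sum-of-file, but works one character at a time with the standard read-char and write-char procedures.

```scheme
; show-file is a hypothetical helper written for this illustration.
(define show-file
  (lambda (file-name)
    (let ((source (open-input-file file-name)))  ; Open the file.
      (let loop ((ch (read-char source)))        ; Try to read a character.
        (if (eof-object? ch)                     ; At the end of the file,
            (close-input-port source)            ; close the file.
            (begin                               ; Otherwise,
              (write-char ch)                    ; echo the character
              (loop (read-char source))))))))    ; and repeat the loop.
```

After running average-each-line as shown above, the call (show-file "lab.output") should display the three averages, one per line.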


  1. After running the procedure as shown above, look at the data printed in lab.output to be sure it contains what you expect. This can be done in either of two ways: you could open the file with an editor such as XEmacs, or you could view the file in an hpterm window with the cat or more commands.

  2. Explain how each part of the pseudocode is reflected in the Scheme code.

  3. Trace this program for the data file given. Be sure you can explain how it works.

  4. Explain in your own words what peek-char does and why it is used here.


  1. Write and test a Scheme procedure that takes two arguments -- the name of an input file containing zero or more integers, and the name of an output file to be created by the procedure -- and copies each integer from the input file to the output file if it is in the range from 0 to 99. Values outside of this range should be read in but not copied out again. The idea is that this procedure will act as a filter, ensuring that only the values that are in the correct range will make it into the output file.

    Line breaks in the input file should be ignored. In the output file, arrange for each integer to be printed on a line by itself.


This document is available on the World Wide Web as

http://www.math.grin.edu/~walker/courses/151.fa98/lab-file-intro.html

created March 11, 1997 by John D. Stone
last revised October 8, 1998 by Henry M. Walker