Laboratory Exercises For Computer Science 151

Processing File Data Line-By-Line

Goals: This laboratory exercise introduces strategies for processing file data, when the data are organized line-by-line.


Text Files

In contrast to files treated as data streams, text files organize their data into lines. For example, the file /home/walker/151s/labs/ia-cities.dat contains information about the sixty largest cities and towns in Iowa: their names and populations, as determined by the 1990 and 1980 censuses. A portion of the file looks like this:


Fairfield         9768    9428
Fort Dodge       25894   29423
Fort Madison     11614   13520
Grinnell          8902    8868
Independence      5972    6392
Indianola        11340   10843
Iowa City        59735   50508

To clarify this format, the town name appears first on a line, followed by the population given by the 1990 census, and then by the 1980 census population. As is typical in text files, data for each city are organized by line.


Problem 1:

As a second example, file /home/walker/151s/labs/file2.dat has the form

40 90 100 60 90
50 25 75
30 90 60 10 80 50 70 40 20

If the numbers on a particular line are related to each other, then it is natural to ask the following type of question:

Problem Statement: Given a file containing numbers, with one or more numbers on each line, compute the average of the numbers on each line and write it in a new file. (In other words, the new file should contain one number for each line of the given file -- the average of the numbers on that line.)

For example, given the above data, the program should create an output file looking like this:

76
50
50
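As a quick check of these expected values, the three averages can be computed directly at the Scheme prompt. The name line-averages is introduced here purely for illustration and is not part of the lab's code.

```scheme
; Averages of the three sample lines, with the sums written out by hand.
(define line-averages
  (list (/ (+ 40 90 100 60 90) 5)
        (/ (+ 50 25 75) 3)
        (/ (+ 30 90 60 10 80 50 70 40 20) 9)))

line-averages   ; => (76 50 50)
```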

High-Level Solution Outline: When data are organized line-by-line, a natural solution is given by the following pseudocode:

  1. Open the input file.
  2. Open the output file.
  3. Until the end of the file is reached:
    1. Read and process a line of input
  4. Close the input file.
  5. Close the output file.

While steps 1-2 and 4-5 are quite similar to what we have encountered with file streams, step 3 may be done in several ways.

Solution Outline -- Version 1: In this approach, we read and process a line in three separate steps, as shown in the following, somewhat more detailed pseudocode:

  1. Open the input file.
  2. Open the output file.
  3. Until the end of the file is reached:
    A. Read a line of input, putting the numbers onto a list.
    B. Average the numbers in the list.
    C. Write the computed average to the output file.
  4. Close the input file.
  5. Close the output file.

Solution in Scheme: To detect the end of a line in Scheme, we need a procedure that has not yet been introduced: peek-char. The peek-char procedure takes one argument, an input port, and returns the first unread character that can be accessed through that port. It does not actually read that character or extract it from the port -- it just peeks at it to see what it will be when and if it is (subsequently) read in.

If the value returned by (peek-char source) is the newline character, #\newline, then we know that we're at the end of the line and can proceed to average the numbers that we've encountered. (This is also a good time to read in and discard the newline character, so that we can start into the next line of numbers without encountering it again.)
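The difference between peeking and reading can be seen in a small experiment. The sketch below assumes that open-input-string is available (it is in Chez Scheme and most other implementations, although it is not part of R5RS); demo-port and the other names are hypothetical and simply stand in for a port opened on a real file.

```scheme
; An in-memory port holding two "lines": 40 90 and 50.
(define demo-port
  (open-input-string (string-append "40 90" (string #\newline) "50")))

(define first-peek  (peek-char demo-port))  ; #\4, nothing is consumed
(define second-peek (peek-char demo-port))  ; still #\4
(define first-num   (read demo-port))       ; 40
(define after-40    (peek-char demo-port))  ; #\space, so more may follow
(define second-num  (read demo-port))       ; 90, read skips the space itself
(define after-90    (peek-char demo-port))  ; #\newline, the line is finished
```

Only read and read-char advance through the port; repeated calls to peek-char keep returning the same character.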

Like (read source), (peek-char source) returns the end-of-file object if there are no more characters in the file. Here's a procedure that implements step 3A of the pseudocode shown above. It reads in one line of numbers from the input port and returns that line as a list of numbers.

; Return a list of the numbers found on a line of a file
(define read-line
  (lambda (source)
    (let readloop (             ; tail recursive procedure
                   (lst '())    ; Initialize return list
                  )
      (let ((ch (peek-char source)))
        (cond ((eof-object? ch)     ; last line may lack a newline
                 (reverse lst))
              ((char=? ch #\newline)
                 (read-char source) ; Read and discard newline
                 (reverse lst))     ; return our list in order of reading
              ((char-whitespace? ch)
                 (read-char source) ; Skip blanks so read cannot run past the newline
                 (readloop lst))
              ; Else add the current number to the front of list and recurse
              (else (readloop (cons (read source) lst)))
        )
      )
    )
  )
)

For step 3B, we write a tail recursive procedure to average the numbers in a list -- following what has become a familiar pattern. One implementation looks like this:

; Return the average of a non-empty list of numbers.

(define average-list
  (lambda (lst)
    (let loop ((ls lst) (sum 0) (count 0))
      (if (null? ls) 
          (/ sum count)
          (loop (cdr ls) (+ sum (car ls)) (+ 1 count))
      )
    )
  )
)
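One detail worth keeping in mind: in implementations with exact rational arithmetic, such as Chez Scheme, dividing exact integers yields an exact result, so an average that is not a whole number is written as a fraction. The sketch below uses a hypothetical line "50 25 76" (not taken from the lab's data file) to illustrate this.

```scheme
; Sum 151 over 3 numbers does not divide evenly.
(define exact-average (/ (+ 50 25 76) 3))                ; the exact rational 151/3
(define decimal-average (exact->inexact exact-average))  ; roughly 50.33
```

If decimal output is preferred in the output file, one option is to apply exact->inexact to the average before writing it.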

The Scheme implementation of the solution to problem 1 is now easy to write: Open up the files, call read-line and average-list once for each line of the input file, and finally close the files:

(define average-each-line
  (lambda (source-file-name target-file-name)
    (let ((source (open-input-file source-file-name))   ; Open the input file.
          (target (open-output-file target-file-name))) ; Open the output file.
      (let loop1 ((ch (peek-char source)))     ; Peek at the next character.
        (if (eof-object? ch)                   ; If you get the eof-object,
            (begin
              (close-input-port source)        ; close the input file
              (close-output-port target))      ; and the output file.
            (begin                             ; Otherwise,
              (write (average-list (read-line source)) target)
              (newline target)
              (loop1 (peek-char source))
            )
        )
      )
    )
  )
)

Here's what a typical invocation of this procedure looks like:

> (average-each-line "/home/walker/151s/labs/file2.dat" "lab.output")

Nothing shows up on screen, because the last operation performed by average-each-line is the call to close-output-port, which returns an unspecified value (and Chez Scheme doesn't bother to print unspecified values). All the action takes place off stage, in the files: If the input file contains the three lines of numbers described earlier, then the program will create the output file lab.output with the required averages.

Note that the creation of this file is invisible to the interactive Scheme user.
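The file can also be inspected without leaving Scheme. The following is a minimal sketch, not part of the lab's code: port-contents and show-file are hypothetical helper names, built only from the standard procedures read-char, eof-object?, and display.

```scheme
; Collect every character remaining on an input port into one string.
(define port-contents
  (lambda (source)
    (let loop ((chars '()) (ch (read-char source)))
      (if (eof-object? ch)
          (list->string (reverse chars))
          (loop (cons ch chars) (read-char source))))))

; Display the contents of the named file in the interaction window.
(define show-file
  (lambda (file-name)
    (let* ((source (open-input-file file-name))
           (text (port-contents source)))
      (close-input-port source)
      (display text))))

; Possible use, once lab.output exists:
; (show-file "lab.output")
```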


  1. After running the procedure as shown above, look at the data written to lab.output to be sure the file contains what you expect. This might be done in either of two ways: you could open the file with an editor such as XEmacs, or you could view the file in an hpterm window with the cat or more commands.

  2. Explain how each part of the pseudocode is reflected in the Scheme code.

  3. Trace this program for the data file given. Be sure you can explain how it works.

  4. Explain in your own words what peek-char does and why it is used here.

  5. In the above discussion, procedures read-line, average-list, and average-each-line are defined globally as separate procedures. Rewrite procedure average-each-line, so that read-line and average-list are defined locally within average-each-line.


Solution Outline -- Version 2: A second approach involves reading an entire line and then extracting the parts relevant to the problem. The main steps are given in the following pseudocode, which expands step 3 of the high-level solution outline shown earlier.

  1. Open the input file.
  2. Open the output file.
  3. Until the end of the file is reached:
    A. Read a line of input as a string of characters.
    B. Extract successive numbers from the line, maintaining a running sum and item count.
    C. Compute and write the average to the output file.
  4. Close the input file.
  5. Close the output file.

Solution in Scheme: This approach requires the introduction of no new Scheme procedures. Thus, we focus on steps 3A and 3B for reading a line and extracting successive numbers.

To read a line, we read character-by-character until reaching the #\newline character or the end of the file:

(define read-line-string
   (lambda (source)
      (let loop ((ch-list '()) (ch (read-char source)))
         (if (or (eof-object? ch) (char=? ch #\newline))
              (list->string (reverse ch-list))
              (loop (cons ch ch-list) (read-char source))
         )
      )
   )
)

For step 3B, the main work is to retrieve a number or word from the beginning of the string which has been read. One way to accomplish this is to first scan the start of the string, throwing away any white space (e.g., spaces or tabs). Then, we can continue scanning until we find the end of the word. In writing out the details, we may scan positions within the string read, or we may change the string to a list and extract the relevant parts of the list.
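Both approaches rest on a handful of standard string and character procedures. The expressions below sketch how each behaves; the string bound to sample is a hypothetical input line chosen only for illustration.

```scheme
(define sample "  42 apples")   ; two leading spaces, then a number and a word

(char-whitespace? (string-ref sample 0))   ; #t: position 0 holds a space
(char-whitespace? (string-ref sample 2))   ; #f: position 2 holds #\4
(substring sample 2 4)                     ; "42": positions 2 and 3
(string->list "42 a")                      ; (#\4 #\2 #\space #\a)
(list->string (list #\4 #\2))              ; "42"
```

Note that string positions count from 0 and that the end position given to substring is not included in the result.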

When we work within a string, it is convenient to have one procedure, find-word-start, which scans character-by-character to find the beginning of a number or word. A second procedure, find-word-end, then continues the search to find the end of that word. The following code illustrates this approach:

(define get-word
   (lambda (str)
      (letrec ((find-word-start  ; find starting index of first word in str
                  (lambda ()
                    (let loop ((index 0) (len (string-length str)))
                       (cond ((= index len) index)
                             ((char-whitespace? (string-ref str index))
                                   (loop (+ index 1) len))
                             (else index)
                        )
                    )
                  ))
               (find-word-end   ; find ending index of first word in str
                  (lambda (start-index)
                    (let loop ((index start-index) (len (string-length str)))
                        (cond ((= index len) index)
                              ((char-whitespace? (string-ref str index)) index)
                              (else (loop (+ index 1) len))
                        )
                    )
                  ))
                )
        (let* ((first-word-index (find-word-start))
               (end-word-index (find-word-end first-word-index)))
           (substring str first-word-index end-word-index)
        )
      )
   )
)

When using lists, an outer named let expression removes initial white space. Once the first character of the number or word is found, we collect characters until the end of the number or word is reached.

(define get-word
   (lambda (str)
      (let find-first-nonwhite ((lst (string->list str)))
         (cond ((null? lst) "")
               ((char-whitespace? (car lst))
                    (find-first-nonwhite (cdr lst)))
               (else (let getword ((old-lst lst) (new-lst '()))
                         (if (or (null? old-lst)
                                 (char-whitespace? (car old-lst)))
                             (list->string (reverse new-lst))
                             (getword (cdr old-lst) 
                                      (cons (car old-lst) new-lst))
                         )
                     ))
         )
      )
   )
)


  1. Check that each of these versions of get-word works correctly by testing them on the strings "12345 67890 a bc def" and " 3.141592 pi 2.71828 ".


Analogously to get-word, we may define a procedure chop-word which takes a string as parameter and which returns the string with the first word removed. Thus, chop-word would yield the following results:

(chop-word "12345 67890 a bc def") ==> " 67890 a bc def"
(chop-word "   3.141592    pi 2.71828  ") ==> "    pi 2.71828  "


  1. Write two versions of chop-word, using the two approaches to get-word as a model.

    Note: To get chop-word from the string version of get-word requires a change in only one line. Also, the list version of chop-word can be considerably simpler than the corresponding version of get-word -- at least for the getword named let expression.


Performing steps 3A, 3B, and 3C combines read-line-string, get-word and chop-word, as follows:

(define process-line
   (lambda (source target)
      (let ((str (read-line-string source)))
         (let loop ((number (get-word str))
                    (rest (chop-word str))
                    (count 0) 
                    (sum 0))
            (if (zero? (string-length number))
                (begin
                   (write (/ sum count) target)
                   (newline target)
                )
                (loop (get-word rest) (chop-word rest) 
                      (+ count 1) (+ sum (string->number number)))
            )
         )
      )
   )
)
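
The conversion step in process-line relies on string->number, which presumes that every word on the line is a valid number. The expressions below sketch its behavior; the sample strings are illustrative only.

```scheme
(string->number "90")        ; 90
(string->number "3.141592")  ; 3.141592
(string->number "pi")        ; #f: the string is not a number
```

Because a non-numeric word yields #f rather than a number, passing that result on to + would signal an error, so this version of process-line is suited to files that contain numbers only.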

With all of this work done by process-line and the helping procedures already discussed, the main procedure is quite straightforward:

(define average-each-line
  (lambda (source-file-name target-file-name)
    (let ((source (open-input-file source-file-name))   ; Open the input file.
          (target (open-output-file target-file-name))) ; Open the output file.
      (let loop1 ((ch (peek-char source)))     ; Peek at the next character.
        (if (eof-object? ch)                   ; If you get the eof-object,
            (begin
              (close-input-port source)        ; close the input file
              (close-output-port target))      ; and the output file.
            (begin
              (process-line source target)
              (loop1 (peek-char source)))
        )
      )
    )
  )
)

  1. Check that the resulting program works correctly by using the command:

    (average-each-line "/home/walker/151s/labs/file2.dat"  "new-output")
    
  2. Modify process-line to find and print the maximum number on each line of the file, rather than the average.


In developing the second solution to Problem 1, we identified several useful procedures, including read-line-string, get-word and chop-word. With minor changes in process-line and average-each-line, a similar approach may be used to solve a wide range of file-based problems. The following provides another example:

Problem 2:

As noted at the beginning of this lab, file /home/walker/151s/labs/ia-cities.dat contains information about the sixty largest cities and towns in Iowa, including the following lines:


Fairfield         9768    9428
Fort Dodge       25894   29423
Fort Madison     11614   13520
Grinnell          8902    8868
Independence      5972    6392
Indianola        11340   10843
Iowa City        59735   50508

More precisely, each line of the file contains information about a specific town or city. Within a line, the city name appears left-justified within the first 16 characters of the line, the population from the 1990 census appears right justified in columns 17 through 22, and the population from the 1980 census appears in columns 25-30. Columns 23 and 24 are always blank.
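Since substring counts positions from 0 and excludes its end position, column k of a line corresponds to string index k - 1. The sketch below applies this to a hypothetical line ("Sample City" and its figures are made up for illustration and do not come from the data file).

```scheme
; A line assembled field by field so that the column layout is explicit.
(define sample-line
  (string-append "Sample City     "   ; columns 1-16: name, left-justified
                 "123456"             ; columns 17-22: 1990 figure
                 "  "                 ; columns 23-24: always blank
                 " 98765"))           ; columns 25-30: 1980 figure

(define name-field (substring sample-line 0 16))   ; the 16-character name field
(define field-1990 (substring sample-line 16 22))  ; "123456"
(define field-1980 (substring sample-line 24 30))  ; " 98765"
```

A six-digit figure fills columns 17 through 22 completely, so the 1990 field must begin at index 16 to keep its leading digit.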

Write a program which reads each city name and which determines the percentage increase or decrease of the population from 1980 to 1990. Print the results in a table as follows:


                Percent
City             Change

.
.
.

Fairfield       3.61
Fort Dodge      -11.99
Fort Madison    -14.1
Grinnell        0.38

Solution Outline: Since this problem asks for results to be printed to the screen rather than placed in a file, a solution need not consider an output file. Otherwise, a general solution can follow the same overall form as used previously for text files. The pseudocode follows:

  1. Open the input file.
  2. Print titles for the table.
  3. Until the end of the file is reached:
    1. Read a line of input as a string of characters
    2. From the line, extract:
      1. city name (first 16 characters)
      2. 1990 census figure (next number)
      3. 1980 census figure (final number)
    3. Compute amount of population change.
    4. Compute percentage population change.
    5. Write the city name and percentage change.
  4. Close the input file.

Solution in Scheme: The implementation in Scheme requires changes only in process-line and average-each-line. Procedure process-line is revised to follow step 3 of the above outline for a specific city. We remove the output file material from average-each-line, add the printing of a title, and change the procedure's name to the more appropriate compute-city-changes.

The revised procedures follow:

(define process-line
   (lambda (source)
      (let* ((str (read-line-string source))
             (city-name (substring str 0 16))
             (after-city (substring str 16 (string-length str))) ; from column 17 on
             (pop-1990 (string->number (get-word after-city)))
             (after-90 (chop-word after-city))
             (pop-1980 (string->number (get-word after-90)))
             (pop-change (- pop-1990 pop-1980))
             (percent-change (* 100.0 (/ pop-change pop-1980)))
             (formatted-per-chg (/ (round (* 100.0 percent-change)) 100.0))
            )
        (display city-name)
        (display formatted-per-chg)
        (newline)
      )
   )
)

(define compute-city-changes
   (lambda (source-file-name)
    (let ((source (open-input-file source-file-name)))
      (newline)
      (display "               Percent")   ; Print table titles
      (newline)
      (display "City            Change")
      (newline)

      (let loop1 ((ch (peek-char source))) ; Peek at the next character.
        (if (eof-object? ch)               ; If you get the eof-object,
            (close-input-port source)      ; close the input file
            (begin                         ; Otherwise, process line
              (process-line source)
              (loop1 (peek-char source)))
        )
      )
    )
  )
)

  1. Check that the resulting program works correctly by using the command:

    (compute-city-changes "/home/walker/151s/labs/ia-cities.dat")
    
  2. Within process-line, the computation for city-name uses the substring procedure, rather than get-word. Could get-word be used instead? Briefly justify your answer.

  3. Write a brief description of how process-line handles the computations for a city. Be sure your discussion includes a comment about the computation for variable formatted-per-chg.

  4. Modify compute-city-changes and/or process-line, so that, for each city, it prints the city, the 1990 and 1980 census figures, and the percentage population change.

    Hint: This is extremely easy!

  5. Modify compute-city-changes and/or process-line, so that the program prints the names of only those cities whose populations increased between 1980 and 1990.


This document is available on the World Wide Web as

http://www.math.grin.edu/~walker/courses/151.sp99/lab-files-as-lines.html

created April 4, 1999 by Clif Flynt and Henry M. Walker
last revised October 20, 2000 by Henry M. Walker