When an experimenter has made several observations of the values of two quantities x and y and believes that the relation between them is described by some linear equation y = mx + b, she often applies the method of least squares to determine the choice of m and b that best fits her observations. Here's a brief explanation of this method with an example of its application:
We begin with thirty-two observations about the weight and measured fuel consumption of automobiles. Each of the observations involves a different model of automobile.
| Model | Weight (in kg) | Fuel consumption (in ml/km) |
|---|---|---|
| Buick Century | 1337 | 80 |
| Buick Park Avenue | 1603 | 92 |
| Buick Regal | 1575 | 92 |
| Buick Skylark | 1262 | 86 |
| Cadillac Allante | 1708 | 119 |
| Cadillac Fleetwood | 1981 | 101 |
| Chevrolet Beretta | 1202 | 83 |
| Chevrolet Cavalier | 1146 | 80 |
| Chevrolet Corsica | 1209 | 73 |
| Chevrolet Lumina | 1530 | 85 |
| Chrysler Le Baron | 1299 | 91 |
| Chrysler Le Baron GTC | 1365 | 87 |
| Dodge Daytona | 1261 | 88 |
| Dodge Intrepid | 1504 | 89 |
| Dodge Spirit | 1265 | 87 |
| Eagle Vision | 1492 | 89 |
| Ford Escort | 1070 | 75 |
| Ford Mustang | 1259 | 103 |
| Ford Probe | 1188 | 81 |
| Ford Taurus | 1476 | 83 |
| Geo Metro | 748 | 48 |
| Lincoln Continental | 1646 | 97 |
| Mercury Grand Marquis | 1716 | 96 |
| Mercury Topaz | 1180 | 90 |
| Oldsmobile Achieva | 1232 | 85 |
| Oldsmobile Cutlass Calais | 1357 | 95 |
| Oldsmobile Cutlass Supreme | 1521 | 84 |
| Oldsmobile 98 | 1677 | 92 |
| Plymouth Acclaim | 1263 | 87 |
| Pontiac Grand Am | 1272 | 78 |
| Pontiac Sunbird | 1280 | 89 |
| Saturn SL | 1217 | 71 |
Here's a plot of the observed weight and fuel consumption for each car:

[Plot: fuel consumption (in ml/km) against weight (in kg) for the thirty-two cars]
That there is a linear relationship between an automobile's weight and its fuel consumption is at least plausible. One could draw a line on the preceding plot that would be on or near most of the data points:

[Plot: the same data points, with a straight line drawn through them]
The method of least squares is a way of choosing which line to draw. It calculates the slope and y-intercept (the coefficients m and b in the equation y = mx + b) of the line that minimizes the sum of the squares of the vertical distances between the data points and the line. (The vertical distances are squared so that even one large mismatch between the observed value of y and the value that would have been predicted from the equation of the line and the observed value of x is heavily penalized.)
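To make the quantity being minimized concrete, here is a minimal Scheme sketch. The procedure name sum-of-squared-residuals and the choice to hold the observations in two parallel lists are assumptions made for illustration only.

```scheme
;; Sum of the squares of the vertical distances between the points
;; (x, y) and the line y = m x + b.
;; XS and YS are lists of equal length holding the observed values.
(define (sum-of-squared-residuals m b xs ys)
  (apply +
         (map (lambda (x y)
                ;; vertical distance from the observed y to the line
                (let ((residual (- y (+ (* m x) b))))
                  (* residual residual)))
              xs
              ys)))
```

For instance, (sum-of-squared-residuals 2 1 '(0 1 2) '(1 3 8)) evaluates to 9: the first two points lie on the line y = 2x + 1, and the third misses it by 3. The least-squares line is the choice of m and b for which this sum is as small as possible.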
The derivation of the formulas for calculating m and b involves calculus, so I'm not going to present it here. The formulas themselves are relatively easy to understand. One begins by computing five values directly from the observations:
n, the number of observations that were made (32, in the example above).
xsum, the sum of the observed x values -- the weights of the automobiles, in the example: 1337 + 1603 + ... + 1217 = 43841.
ysum, the sum of the observed y values -- the fuel consumption statistics: 80 + 92 + ... + 71 = 2776.
xsqsum, the sum of the squares of the observed x values: 1337² + 1603² + ... + 1217² = 61805287.
xysum, the sum of the products of corresponding x and y values: (1337)(80) + (1603)(92) + ... + (1217)(71) = 3865857.
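The five values just listed can be sketched in Scheme as follows. The names sum and observation-sums, and the decision to return the results as a list, are assumptions made for this illustration. A program that reads observations one pair at a time could equally well keep five running totals instead of storing the lists.

```scheme
;; Add up all the numbers in a list.
(define (sum lst)
  (apply + lst))

;; Given parallel lists XS and YS of observed values, return the list
;; (n xsum ysum xsqsum xysum).
(define (observation-sums xs ys)
  (list (length xs)                          ; n
        (sum xs)                             ; xsum
        (sum ys)                             ; ysum
        (sum (map (lambda (x) (* x x)) xs))  ; xsqsum
        (sum (map * xs ys))))                ; xysum
```

Applied to just the first three rows of the table, (observation-sums '(1337 1603 1575) '(80 92 92)) yields (3 4515 264 6837803 399336).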
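The slope m of the desired line is

$$m = \frac{n \cdot \mathit{xysum} - \mathit{xsum} \cdot \mathit{ysum}}{n \cdot \mathit{xsqsum} - \mathit{xsum}^2}$$

and the y-intercept b of the desired line is

$$b = \frac{\mathit{ysum} - m \cdot \mathit{xsum}}{n}$$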
In the example, m = (32 * 3865857 - 43841 * 2776)/(32 * 61805287 - 43841²) = 2004808/55735903, or approximately 0.036; then b = (2776 - (2004808/55735903) * 43841) / 32, which works out to be about 37.5. So the equation of the desired line is y = 0.036 x + 37.5. (This is the line shown on the second plot above.)
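The arithmetic in this example can be checked with a short Scheme sketch. The procedure names least-squares-slope and least-squares-intercept are assumptions chosen for illustration; the numbers are the five values computed from the table. Both procedures assume the observed x values are not all equal, since otherwise the denominator of the slope is zero.

```scheme
;; Slope of the least-squares line, computed from the five sums.
(define (least-squares-slope n xsum ysum xsqsum xysum)
  (/ (- (* n xysum) (* xsum ysum))
     (- (* n xsqsum) (* xsum xsum))))

;; y-intercept of the least-squares line, given the slope M.
(define (least-squares-intercept n xsum ysum m)
  (/ (- ysum (* m xsum)) n))

;; The automobile example: exact integer inputs give exact rational results.
(define m (least-squares-slope 32 43841 2776 61805287 3865857))
(define b (least-squares-intercept 32 43841 2776 m))

;; Convert to inexact numbers only for printing.
(display "y = ")
(display (exact->inexact m))
(display " x + ")
(display (exact->inexact b))
(newline)
```

Because the inputs are exact integers, m comes out as the exact rational 2004808/55735903; the printed coefficients are approximately 0.036 and 37.5.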
The exercise is to write a stand-alone Scheme program that prompts the user for any positive number of observations of the values of two related quantities, then calculates and prints out the equation of the line that best fits those observations, according to the method of least squares.
In collecting the data from the user, your program should prompt the user appropriately at each step. It should recognize the symbol end as a sentinel indicating that no more observations are available. It should refuse to accept any input other than a real number or the symbol end; if it receives anything else, it should print a warning message and repeat the prompt. It should signal an error if the user supplies the symbol end before providing any observations or after supplying the first value in a pair.
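One possible shape for the validate-and-reprompt step is sketched below. It is only a fragment, not a full solution: the procedure name read-value, its port argument, and the treatment of end-of-file as equivalent to end are all assumptions made for illustration, and the error cases for a premature end are left out.

```scheme
;; Hypothetical helper: display PROMPT, read one datum from PORT, and
;; return it if it is a real number or the symbol end. On any other
;; input, print a warning and repeat the prompt.
(define (read-value prompt port)
  (display prompt)
  (let ((datum (read port)))
    (cond ((eof-object? datum) 'end)  ; assumption: end of input acts like end
          ((real? datum) datum)
          ((eq? datum 'end) 'end)
          (else
           (display "The input must be in the form of a numeral.")
           (newline)
           (read-value prompt port)))))
```

In an interactive program the call might look like (read-value "x[1]: " (current-input-port)). Taking the port as a parameter also makes it possible to try the procedure on a string port, for example (read-value "y[2]: " (open-input-string "ninety-two 92")), which rejects the symbol ninety-two and then returns 92.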
Here is what a typical run of the program might look like:
    bourbaki% scheme least-squares.ss
    Chez Scheme Version 5.0c
    Copyright (c) 1994 Cadence Research Systems
    Type in real numbers giving the observed values of two related quantities.
    x[1]: 1337
    y[1]: 80
    x[2]: 1603
    y[2]: ninety-two
    The input must be in the form of a numeral.
    y[2]: 92
    ...
    x[32]: 1217
    y[32]: 71
    x[33]: end
    Calculating the coefficients of the linear equation ...
    y = 0.03596977696763969 x + 37.47028149880338
This document is available on the World Wide Web as
http://www.math.grin.edu/courses/Scheme/fall-1997/exercise-6.html
created October 15, 1997
last revised October 15, 1997