Introduction to Statistics (MAT/SST 115.03 2008S)
R provides the lm function to compute the
least-squares regression line. (The “lm” stands
for “linear model”.) You describe the model with a
formula, which you create with the ~ operator: the
response variable goes on the left and the explanatory
variable on the right.
lm(response ~ explanatory)
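As a minimal sketch of the formula syntax, here is a small made-up data frame (the name Toy and all of its values are invented for illustration, not course data). The sketch also shows the data argument of lm, which lets you write bare column names in the formula instead of repeating the data frame name.

```r
# Hypothetical data: values invented purely for illustration.
Toy = data.frame(explanatory = c(1, 2, 3, 4, 5),
                 response    = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Two equivalent ways to ask for the least-squares line.
fit1 = lm(Toy$response ~ Toy$explanatory)
fit2 = lm(response ~ explanatory, data = Toy)

# Both fits give the same intercept and slope (only the labels differ).
coef(fit1)
coef(fit2)
```

Either form works for the exercises on this page. The data form mainly saves typing.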
For example, if we had a data frame called People with
one column called FootLength and another called
Height, we might compute the coefficients as follows.
(We get somewhat different values than given in Activity 28-1 because
we're not working with exactly the same data set.)
> lm(People$Height ~ People$FootLength)

Call:
lm(formula = People$Height ~ People$FootLength)

Coefficients:
      (Intercept)  People$FootLength
           38.668              1.022
That's a lot of text, and not in a particularly usable format.
Fortunately, we can use the coef function to grab
the values from the result.
> coef(lm(People$Height ~ People$FootLength))
      (Intercept) People$FootLength
        38.668071          1.022173
We can even grab and name the two coefficients.
> ab = coef(lm(People$Height ~ People$FootLength))
> a = ab[1]
> b = ab[2]
> a
(Intercept)
   38.66807
> b
People$FootLength
         1.022173
We can then use those values in predictions, such as predicting the height (in inches) of someone with a foot length of 29 centimeters. (Your guess is as good as mine as to why they switch units.)
> a + 29*b
(Intercept)
   68.31109
Yeah, that “(Intercept)” is annoying. Ignore it for now.
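If the label bothers you, here is one possible way to get rid of it. The label appears because a carries the name “(Intercept)” and the sum keeps the name of its first operand. The sketch below re-enters the coefficients reported above by hand (so it runs without the People data) and uses R's unname function to strip the label.

```r
# Coefficients as reported above, with the labels R attaches to them.
a = c("(Intercept)" = 38.66807)
b = c("People$FootLength" = 1.022173)

# The sum inherits the label of its first operand.
a + 29*b

# unname() drops the label and leaves just the number.
unname(a + 29*b)
```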
We can even plot the line, using abline(a,b).
In this case, we want to put it on a scatterplot of height vs. foot
length.
> plot(People$Height ~ People$FootLength, main="Height (in Inches) vs. Foot Length (in cm)")
> abline(a,b)
You can load the data with
HousePrices = read.csv("/home/rebelsky/Stats115/Data/HousePricesAG.csv")
The columns are Address, Price,
Bedrooms, Bathrooms, and Size.
You can plot house price vs. size (without the regression line) with
plot(HousePrices$Price ~ HousePrices$Size, ylab = "House Price (in $)", xlab = "House Size (in sq. ft.)")
For this problem, you should simply use R as a calculator, entering the values in the formulae.
Since this is your first time using lm, we'll go
through all of the steps. First, we just ask R for the summary.
lm(HousePrices$Price ~ HousePrices$Size)
That summary should be enough to confirm your answer. However, you may find it helpful to have the intercept (a) and slope (b) in variables, so we'll do that, too.
ab = coef(lm(HousePrices$Price ~ HousePrices$Size))
a = ab[1]
b = ab[2]
Now we see why it was useful to put a and b
in variables.
a + b*1242
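R also has a predict function that does this arithmetic for you. The following is a sketch only: the data frame Houses and its values are invented stand-ins, not the contents of HousePricesAG.csv. Note that predict with new values works cleanly when the model is fit with the data argument, so that the new data frame can supply a column with the same name.

```r
# Hypothetical stand-in for the HousePrices data (values invented).
Houses = data.frame(Size  = c(900, 1100, 1300, 1500, 1700),
                    Price = c(150000, 180000, 205000, 240000, 265000))

fit = lm(Price ~ Size, data = Houses)
ab = coef(fit)
a = ab[1]
b = ab[2]

# Prediction by hand, as in the text.
byHand = unname(a + b*1242)

# The same prediction from predict(); newdata must use the column name Size.
byPredict = unname(predict(fit, newdata = data.frame(Size = 1242)))

byHand
byPredict
```

The two values should agree. Doing it by hand is fine for one prediction, while predict is handier when you have many new sizes.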
In case you missed it, the description of the proportion of variability explained by the least squares line is given in the text on the top of p. 579.
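As a rough illustration, one common way to compute that proportion is to compare the variability left over after fitting the line (the sum of squared residuals) with the total variability in the response. The sketch below assumes that definition, which may be worded differently in the text, and uses invented values rather than the house price data.

```r
# Hypothetical data (values invented for illustration).
x = c(900, 1100, 1300, 1500, 1700)
y = c(150000, 180000, 205000, 240000, 265000)

fit = lm(y ~ x)
predicted = coef(fit)[1] + coef(fit)[2]*x

# Variability left over after fitting the line, and total variability.
SSE = sum((y - predicted)^2)
SST = sum((y - mean(y))^2)

# Proportion of variability explained by the line.
propExplained = 1 - SSE/SST
propExplained

# With a single explanatory variable, this equals the squared correlation.
cor(x, y)^2
```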
Let's start by gathering the data, building the scatterplot, computing the parameters of the least-squares line, and plotting that line. Since we're using the plot to explore data, and not for presentations, we won't worry about labels.
TrotSpeeds = read.csv("/home/rebelsky/Stats115/Data/TrotSpeeds.csv")
plot(TrotSpeeds$Trot.Speed ~ TrotSpeeds$Body.Mass)
ab = coef(lm(TrotSpeeds$Trot.Speed ~ TrotSpeeds$Body.Mass))
a = ab[1]
b = ab[2]
abline(a,b)
We'll also compute the r^2 value.
r = cor(TrotSpeeds$Trot.Speed, TrotSpeeds$Body.Mass)
r^2
Okay, the first thing we have to do is compute the residuals. So, we need to predict the values and subtract those predicted values from the observed values.
predicted = a + b*TrotSpeeds$Body.Mass
residuals = TrotSpeeds$Trot.Speed - predicted
Now, we're ready to plot. You should be able to figure out the plot command yourself. Remember, the form is
plot(response~explanatory)
You may find it useful to add a horizontal line for the residual of 0.
abline(h=0)
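If you want to check residuals computed by hand, R's residuals function will extract them from a fitted model. Here is a minimal sketch using invented values (the data frame Toy is a hypothetical stand-in, not the TrotSpeeds data). It leaves the plot itself to you.

```r
# Hypothetical stand-in for the TrotSpeeds data (values invented).
Toy = data.frame(Body.Mass  = c(5, 20, 60, 150, 400),
                 Trot.Speed = c(1.2, 1.9, 2.6, 3.0, 3.3))

fit = lm(Trot.Speed ~ Body.Mass, data = Toy)
a = coef(fit)[1]
b = coef(fit)[2]

# Residuals by hand: observed minus predicted.
predicted = a + b*Toy$Body.Mass
byHand = Toy$Trot.Speed - predicted

# Residuals as computed by R.
fromR = unname(residuals(fit))

# The two agree up to rounding error, and least-squares
# residuals sum to (essentially) zero.
all.equal(byHand, fromR)
sum(byHand)
```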
We'll start by computing the logs.
log10BodyMass = log10(TrotSpeeds$Body.Mass)
You can create the plot with
plot(TrotSpeeds$Trot.Speed ~ log10BodyMass)
The R is fairly straightforward.
lm(TrotSpeeds$Trot.Speed ~ log10BodyMass)
ab = coef(lm(TrotSpeeds$Trot.Speed ~ log10BodyMass))
a = ab[1]
b = ab[2]
abline(a,b)
The value of r^2 is computed by
r = cor(TrotSpeeds$Trot.Speed, log10BodyMass)
r^2
This plot is a bit subtle, since the residuals are computed from the log (base 10) of the body mass, but the X axis should still be the original body mass.
predicted = a + b * log10BodyMass
residuals = TrotSpeeds$Trot.Speed - predicted
plot(residuals ~ TrotSpeeds$Body.Mass)
abline(h=0)
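As an aside, R formulas can apply a transformation such as log10 directly, so a separate log variable is optional. The sketch below uses invented values (Toy is a hypothetical stand-in for the TrotSpeeds data) to show that both approaches give the same intercept and slope.

```r
# Hypothetical stand-in for the TrotSpeeds data (values invented).
Toy = data.frame(Body.Mass  = c(5, 20, 60, 150, 400),
                 Trot.Speed = c(1.2, 1.9, 2.6, 3.0, 3.3))

# Approach used in the text: compute the logs first, then fit.
log10BodyMass = log10(Toy$Body.Mass)
fit1 = lm(Toy$Trot.Speed ~ log10BodyMass)

# Alternative: take the log inside the formula.
fit2 = lm(Trot.Speed ~ log10(Body.Mass), data = Toy)

# Same coefficients either way (only the labels differ).
unname(coef(fit1))
unname(coef(fit2))
```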
We'll load the data using our standard strategy.
TBP = read.csv("/home/rebelsky/Stats115/Data/TextbookPrices.csv")
R is happy to make you a grid of scatterplots, one for each pair of variables in the data frame.
plot(TBP)
If you'd rather do the individual scatterplots, you can write
X11()
plot(TBP$Price ~ TBP$Pages)
X11()
plot(TBP$Price ~ TBP$Year)
Since this is a self-check exercise, you should figure out how to do this and the remaining problems using the prior answers.
Copyright (c) 2007-8 Samuel A. Rebelsky.
This work is licensed under a Creative Commons
Attribution-NonCommercial 2.5 License. To view a copy of this
license, visit http://creativecommons.org/licenses/by-nc/2.5/
or send a letter to Creative Commons, 543 Howard Street, 5th Floor,
San Francisco, California, 94105, USA.