Introduction to Statistics (MAT/SST 115.03 2008S)
Primary: [Front Door] [Syllabus] [Current Outline] [R] - [Academic Honesty] [Instructions]
Groupings: [Applets] [Assignments] [Data] [Examples] [Handouts] [Labs] [Outlines] [Projects] [Readings] [Solutions]
External Links: [R Front Door] [SamR's Front Door]
In R, you use the cor function to
find the correlation between two variables. You can call
cor on a data frame with two columns, in which case the result
is a matrix of correlations. You can also call cor on two vectors,
in which case the result is a single number. (Since the correlation
coefficient is symmetric, it doesn't matter which vector you
enter first.)
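As a quick illustration of the two calling styles, here is a minimal sketch. The vectors x and y are made-up toy data, not values from any course data set.

```r
# Hypothetical toy data (an assumption for illustration only).
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)

# Two vectors: the result is a single number.
r_xy = cor(x, y)
r_yx = cor(y, x)   # same value, because the coefficient is symmetric

# A two-column data frame: the result is a 2-by-2 matrix.
m = cor(data.frame(x, y))

print(r_xy)
print(m)
```

The off-diagonal entries of m hold the same value as r_xy.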
Unfortunately, cor does not cope with NA values by default: if
either input contains an NA, the result is NA. So you have to
remove them from the frame before calling cor. (Removing them
from two separate vectors is harder, since you need to remove values
from the same positions in both vectors. If either vector has an NA
value, combine the vectors into a data frame first.) The na.omit
function does the hard work for you: it drops every row that
contains an NA.
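The following sketch shows the problem and the fix on made-up data. The vectors a and b are hypothetical. The last call uses cor's use argument, which is an alternative to na.omit that this handout does not otherwise rely on.

```r
# Hypothetical toy vectors with NA values in different positions.
a = c(1, 2, NA, 4, 5, 6)
b = c(2, NA, 6, 8, 11, 12)

# With the default settings, any NA in the input makes the result NA.
r_default = cor(a, b)

# Combine into a data frame, then drop every row that contains an NA.
clean = na.omit(data.frame(a, b))
r_clean = cor(clean$a, clean$b)

# Alternative: ask cor to use only the complete pairs.
r_use = cor(a, b, use = "complete.obs")

print(r_default)
print(nrow(clean))   # rows 2 and 3 were dropped, so 4 rows remain
print(r_clean)
```

Both approaches keep only the pairs in which neither value is missing, so r_clean and r_use agree.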
In this activity, you will be working with data from the file
Cars99.csv. As you should be able to guess
by now, you can load those data with
Cars99 = read.csv("/home/rebelsky/Stats115/Data/Cars99.csv")
Let's learn a little about the data set.
> summary(Cars99)
           Model      Page.Number       City.MPG       Highway.MPG    Fuel.Capacity
 Acura Integra:  1   Min.   : 61.0   Min.   :17.00   Min.   :23.00   Min.   :10.30
 Acura RL     :  1   1st Qu.:100.0   1st Qu.:19.00   1st Qu.:26.00   1st Qu.:14.50
 Acura TL     :  1   Median :160.0   Median :20.50   Median :28.50   Median :16.20
 Audi A4      :  1   Mean   :153.9   Mean   :20.96   Mean   :28.74   Mean   :16.38
 Audi A6      :  1   3rd Qu.:207.0   3rd Qu.:23.00   3rd Qu.:30.75   3rd Qu.:18.43
 Audi A8      :  1   Max.   :249.0   Max.   :30.00   Max.   :38.00   Max.   :23.70
 (Other)      :103                   NA's   : 3.00   NA's   : 3.00   NA's   : 1.00

     Weight      Front.Weight   Acceleration.0.to.30   Acceleration.0.to.60
 Min.   :1845   Min.   :46.00   Min.   : 2.400         Min.   : 5.600
 1st Qu.:2845   1st Qu.:59.00   1st Qu.: 3.300         1st Qu.: 8.800
 Median :3175   Median :62.00   Median : 3.500         Median : 9.500
 Mean   :3186   Mean   :60.41   Mean   : 3.548         Mean   : 9.733
 3rd Qu.:3545   3rd Qu.:63.00   3rd Qu.: 3.900         3rd Qu.:10.900
 Max.   :4145   Max.   :65.00   Max.   : 4.500         Max.   :12.500
                NA's   : 1.00   NA's   :36.000         NA's   :36.000

 Time.for.Quarter.Mile      Type
 Min.   :14.10          family :28
 1st Qu.:16.80          large  :12
 Median :17.40          luxury :13
 Mean   :17.40          small  :25
 3rd Qu.:18.20          sports :16
 Max.   :19.10          upscale:15
 NA's   :36.00
Note that many variables have some NA values.
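If you only want the NA counts rather than the full summary, one option is to combine is.na with colSums. The sketch below uses a made-up miniature data frame (named cars here purely for illustration) so that it runs without the course data file.

```r
# Hypothetical miniature stand-in for a data set with missing values.
cars = data.frame(
  Weight = c(2500, 3100, NA, 3600),
  Time.for.Quarter.Mile = c(16.5, NA, NA, 18.0)
)

# is.na gives TRUE/FALSE for every cell; colSums counts the TRUEs per column.
na_counts = colSums(is.na(cars))
print(na_counts)
```

With the real data loaded, the same idea would be colSums(is.na(Cars99)).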
Because the data set has NA values, we'll need to do a bit of cleanup first. (Yay!) First, we'll extract the two columns of interest.
tmp = data.frame(TfQM = Cars99$Time.for.Quarter.Mile, Weight=Cars99$Weight)
Next, we'll remove the rows with an NA value.
tmp = na.omit(tmp)
Finally, we'll compute the correlation coefficients.
cor(tmp)
We could also express that more concisely as
cor(na.omit(data.frame(Cars99$Time.for.Quarter.Mile, Cars99$Weight)))
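Because cor receives a data frame in these calls, it returns a matrix rather than a single number. The sketch below, on made-up values (the frame toy is hypothetical, not real Cars99 data), shows how to read that matrix and pull out the one coefficient of interest.

```r
# Hypothetical toy data standing in for a cleaned two-column frame.
toy = data.frame(TfQM   = c(15.2, 16.8, 17.4, 18.9),
                 Weight = c(2400, 3000, 3300, 3900))

m = cor(toy)

# The diagonal is always 1 (each column correlates perfectly with itself).
# The two off-diagonal entries are equal and hold the correlation we want.
r = m["TfQM", "Weight"]

print(m)
print(r)
```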
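That is, we build a data frame from the two columns, drop the rows that contain an NA with na.omit, and call cor on the result. We'll use a similar strategy in future activities.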
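If you find yourself repeating this pattern, you could wrap it in a small function. The sketch below is one possible way to do that. The name cor_complete is hypothetical (it is not part of R or of this course's materials), and the vectors u and v are made-up data.

```r
# cor_complete is a hypothetical helper name, not a built-in R function.
# It pairs up two vectors, drops pairs with a missing value, and
# returns the correlation of what remains as a single number.
cor_complete = function(x, y) {
  pairs = na.omit(data.frame(x = x, y = y))
  cor(pairs$x, pairs$y)
}

# Hypothetical toy vectors with NA values in different positions.
u = c(1, 2, NA, 4, 5, 7)
v = c(3, 1, 4, NA, 9, 8)

r = cor_complete(u, v)
print(r)
```

With Cars99 loaded, a call such as cor_complete(Cars99$Weight, Cars99$City.MPG) would follow the same steps as the longer expressions above.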
Here are the first few computations (for B, C, and D). You should be able to figure out the rest. Remember to look at p. 536 to figure out what columns to use.
cor(na.omit(data.frame(Cars99$Acceleration.0.to.60, Cars99$Time.for.Quarter.Mile)))
cor(na.omit(data.frame(Cars99$Page.Number, Cars99$Fuel.Capacity)))
cor(na.omit(data.frame(Cars99$Weight, Cars99$City.MPG)))
Copyright (c) 2007-8 Samuel A. Rebelsky.
This work is licensed under a Creative Commons
Attribution-NonCommercial 2.5 License. To view a copy of this
license, visit http://creativecommons.org/licenses/by-nc/2.5/
or send a letter to Creative Commons, 543 Howard Street, 5th Floor,
San Francisco, California, 94105, USA.