Lab 1: R Refresher
Objective:
1) Reminder of the basics of R
2) Moving data from Excel to R
3) Simple linear model – fitting and prediction
Moving Data from Excel to R and Back
The tools we’re going to need to deal with statistical models are in R. What we need, what all statisticians need, is data, and lots of it; and lots of the data in the world comes in Excel spreadsheets.
R can read CSV or tab-separated files with the read.table(), read.delim(), read.csv() family of functions. (It can also import and export .XLS and .XLSX files directly using add-on packages like xlsx, and it can read other formats, such as data in SQL databases like Access or Oracle, with yet other packages.) For small spreadsheets, it’s straightforward to bring them into R via the Windows clipboard: read.table() and write.table() can both be used with file = "clipboard".
Try this: open the beer spreadsheet, copy it to the clipboard, then use this:
beer <- read.delim ("clipboard")
That command creates beer, but it doesn’t display it. To display it we can just type beer (or, if we want fewer rows, beer[1:10,] or head(beer).)
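The clipboard route only works where a clipboard connection is available. As a rough sketch of the file-based alternative, the following writes a small data frame out as a CSV file and reads it back with read.csv(). The data frame toy and its values are invented purely for illustration (they are not the beer data), and the file goes to a temporary location.

```r
# A tiny made-up data frame; NOT the beer data from the lab.
toy <- data.frame (Name = c("a", "b", "c"),
                   Value = c(1.5, 2.5, 3.5))

# Write it out as a CSV file in a temporary location.
# row.names = FALSE keeps R's row numbers out of the file.
csvfile <- tempfile (fileext = ".csv")
write.csv (toy, csvfile, row.names = FALSE)

# Read it back in, just as we would read a sheet that had
# been saved from Excel in CSV format.
toy2 <- read.csv (csvfile)
toy2
```

The same pattern with write.table() and read.delim() handles tab-separated files.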
Exploration
When you get data, your first step is often to look at it in whatever ways are possible. plot(beer) plots all pairs of variables; there may be too many variables for this to be useful here. But we can learn something from summary (beer$Alcohol) and cor (beer$Alcohol, beer$Calories) and table (beer$Original, beer$Light) and sapply (beer, class). In this example we’re interested in the relationship between Calories and Alcohol. To plot those two things, any of these will work:
plot (beer$Alcohol, beer$Calories) # x, comma, y
with (beer, plot (Alcohol, Calories)) # save on typing
plot (Calories ~ Alcohol, data = beer) # "formula" form
with (beer, plot (Calories ~ Alcohol))
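If the beer spreadsheet isn’t to hand, the same exploration commands can be tried on a data set that ships with R. This sketch uses the built-in mtcars data frame as a stand-in (an assumption for illustration; it has nothing to do with beer).

```r
# mtcars ships with R, so this runs without the spreadsheet.
summary (mtcars$mpg)              # quartiles, mean, min and max
cor (mtcars$wt, mtcars$mpg)       # correlation of weight and mpg
table (mtcars$cyl, mtcars$am)     # cross-tabulation of two variables
sapply (mtcars, class)            # the class of every column

# The same scatterplot in positional form and in formula form.
plot (mtcars$wt, mtcars$mpg)      # x, comma, y
plot (mpg ~ wt, data = mtcars)    # y ~ x
```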
Linear Model with the lm() Function
We saw the lm() function in the first course. Let us run the linear model in which Calories is modelled by Alcohol in the data frame beer. Here Calories is our response (Y) variable and Alcohol is a predictor (X) variable. I’ve added an extra pair of parentheses here, so that the result is not only assigned to beer.lm but also printed.
(beer.lm <- lm (Calories ~ Alcohol, data = beer))
What is the output telling you? Answer: the coefficients of the least-squares line – but that’s just the part that prints out.
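To see what else is stored beyond the part that prints, here is a small sketch using the built-in cars data as a stand-in for beer (the names fit, dist and speed belong to this illustration, not to the lab data).

```r
# cars is a built-in data set, used here as a stand-in for beer.
fit <- lm (dist ~ speed, data = cars)

names (fit)               # the components stored in the object
coef (fit)                # intercept and slope
head (residuals (fit))    # observed minus fitted, first few rows
summary (fit)$r.squared   # one number pulled out of the summary
```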
What happens when you call plot (beer.lm, ask=TRUE)? You get a series of plots that can be useful as diagnostics, to decide whether the model is believable. These are useful, but that’s not what we’re looking for right now. The important point is that different kinds of objects react differently to the plot() command. The same is true of some other generic functions like summary() and predict().
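The following sketch illustrates that behaviour with summary(), again using the built-in cars data as a stand-in: the function called is the same, but what it does depends on the class of its argument.

```r
fit <- lm (dist ~ speed, data = cars)

class (cars$speed)    # "numeric"
class (fit)           # "lm"

# summary() is generic: R picks a method based on the class.
summary (cars$speed)  # quartiles and the mean
summary (fit)         # coefficient table, R-squared, and so on

# The methods that plot() knows about include one for "lm".
head (methods (plot))
```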
More complicated models: We can add other terms to this model as well. For example, the command lm (Calories ~ Alcohol + Light, data = beer) adds the categorical variable Light to the model. The resulting coefficient shows the predicted difference in Calories between Light and Nonlight beers, holding Alcohol constant.
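How a categorical predictor shows up in the coefficients can be sketched with built-in data. Here mtcars stands in for beer, and the 0/1 column am is turned into a labelled factor; the names cars2 and fit2 and the labels are choices made for this illustration only.

```r
# Stand-in data: mtcars, with am (0/1) turned into a labelled factor.
cars2 <- mtcars
cars2$am <- factor (cars2$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

fit2 <- lm (mpg ~ wt + am, data = cars2)
coef (fit2)

# The first level ("automatic") is the baseline, so the coefficient
# named "ammanual" is the predicted difference in mpg between manual
# and automatic cars of the same weight.
levels (cars2$am)
```

Which level acts as the baseline decides the sign of the coefficient, so it is worth checking levels() before interpreting it.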
Adding the Response Surface:
Let’s go back to the simple regression model. The “response surface” is the value of Y (Calories) predicted by the model for each value of X (Alcohol). There’s a simple way to draw this, but for my purposes I want to use a slightly more complicated setup here. Let’s create a new data set with zillions of possible values of Alcohol. Then we will use the predict() function to generate the predicted values of Calories, and plot them.
newbeer <- data.frame (Alcohol = seq (2.4, 5.5, length.out=500))
#
# That command created a 500x1 data frame with a
# column named Alcohol. predict() requires that column
# names in the new data match the names in the old.
#
# lines() adds to the current plot, so one of the plot()
# commands above must have been run first.
#
lines (newbeer$Alcohol, predict (beer.lm, newbeer))
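For comparison, one simple way to draw the fitted line (possibly the one alluded to above) is abline(), which takes the line directly from the model object. This sketch uses the built-in cars data as a stand-in and draws the line both ways; the names fit, newcars and pred are assumptions of the example.

```r
fit <- lm (dist ~ speed, data = cars)   # built-in stand-in data
plot (dist ~ speed, data = cars)

# Simple way: abline() draws the fitted line straight from the model.
abline (fit)

# Longer way, as in the lab: predict on a grid, then lines().
newcars <- data.frame (speed = seq (min (cars$speed), max (cars$speed),
                                    length.out = 500))
pred <- predict (fit, newcars)
lines (newcars$speed, pred, lty = 2, col = "red")
```

The two lines coincide for a straight-line model. The predict() route has the advantage that it carries over unchanged to models whose response surface is not a straight line.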
Notes on R Libraries:
Lots of stuff in R comes in the form of packages or libraries. Packages need to be installed (one time) before they can be used. If you are connected to the internet, the install.packages() command will fetch packages for you. By default it will install them in the proper place, but if you don’t have permission to write to that place, you can specify another place with the lib= argument. So, for example,
install.packages ("ipred", lib="c:/mydir")
will install ipred (and any packages it depends on) into c:\mydir. (Remember that R uses the forward slash, or a double backslash, as the directory separator!)
To use the package you need to attach it. You will have to do this every time you start R (although there’s a way to make it happen automatically; see the help for .First().) Normally we will use a command like library ("ipred"), but when the package was installed elsewhere, as above, we specify that location with the lib.loc= argument, in a command like
library ("ipred", lib.loc="c:/mydir")
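An alternative to giving library() the location every time, sketched here under the assumption that you keep a personal library directory, is to add that directory to R’s library search path with .libPaths(). The directory below is a temporary stand-in for something like c:/mydir.

```r
# A temporary directory standing in for a personal library
# such as c:/mydir.
mylib <- file.path (tempdir (), "mylib")
dir.create (mylib, showWarnings = FALSE)

# Put it at the front of the search path. .libPaths() silently drops
# directories that do not exist, which is why we created it first.
.libPaths (c (mylib, .libPaths ()))
.libPaths ()

# For the rest of this session, install.packages() installs into the
# first entry by default, and library() searches every entry.
```

The change lasts only for the current session, so it would need to be repeated (or placed in a startup file) each time R is started.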