Introduction to R

Statistics Outreach Center

Short Course

Topics Covered:

  • R software and programming language
  • R Interface
  • Basic R syntax
  • Entering Data via .csv files
  • Descriptive Statistics, Graphs, and Charts
  • Inferential Statistics
  • T-tests
  • One-way ANOVA
  • Correlation and Regression
  • Brief Regression Diagnostics
  • Additional R Resources

Introduction

This short course is designed as an introduction to R for those who have never used it. It provides basic instruction in the topics listed on the cover page. You will not be an expert in R at the end of this course; rather, the hope is that this presentation will motivate you to pursue R further. It also points to many resources for learning additional features of R. This short course uses an example dataset that can be downloaded from the SOC website (http://www.education.uiowa.edu/centers/statoutreach/short-courses).

Getting Started

Open R by clicking its icon on the desktop, searching for it in the Start menu, or launching it through Citrix Virtual Desktop. You can also download R and install it on your own computer.

Section 1: R Software and Programming Language

R is an open-source implementation of the S language (S stands for Statistics). R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. Its name comes from the first names of its two original authors, and it is partly a play on the name of S (Wikipedia). It first appeared in 1993 and has had an active development community ever since. Currently, there are over 4,000 user-contributed packages for R. Pace (2012) lists five reasons to learn and use R:

  1. R is open source and completely free. It is the de facto standard and preferred program of many professional statisticians and researchers in a variety of fields. R community members regularly contribute packages to increase R’s functionality.
  2. R is as good as (often better than) commercially available statistical packages like SPSS, SAS, and Minitab.
  3. R has extensive statistical and graphing capabilities. R provides hundreds of built-in statistical functions as well as its own built-in programming language.
  4. R is used in teaching and performing computational statistics. It is the language of choice for many academics who teach computational statistics.
  5. Getting help from the R user community is easy. There are readily available online tutorials, data sets, and discussion forums about R.

I have also seen reports claiming that R can do more than any other statistics program. In principle this is plausible, since you have the flexibility to write your own program for whatever function you need. There are also a number of higher-level statistical analyses that, to my knowledge, are available only in R, for example, Student Growth Percentiles (SGPs). Also, in my experience, knowing R signals a high level of ability; I’ve had employers tell me that because I know R, they assume I can use any other statistics program.

Section 2: R Interface

R is code driven. Code is either directly entered into the console or run from a script file.

This is your workspace. You type code directly into the R Console.

After typing your line(s) of code, press Enter, and your output will appear directly below your code. All of your output appears in the R console, except graphs, which appear in their own windows.

Script files are very useful because they allow you to write a lot of code in a text file and save it for later use.

Create Script File

To make a new script file: Go to File --> New Script and then a text box will appear.

Run Code from Script

To run code that is in your Script file, you must highlight all the lines of code you want to run, then you can:

Right click on your mouse, then select “Run line or selection.”

OR

Press Control and R at the same time.

OR

Select the “Run line or selection” icon.

Saving a Script

To save a script file: Go to File --> “Save” or “Save as” or click on the Disk Icon

To open a saved script file: Go to File --> Open Script or click the Open Folder Icon

Section 3: Basic R Syntax

This section will walk the user through some basic R syntax, demonstrating R’s built in functions and flexibility, and give the user an opportunity to run some simple R code.

Basic Commands in R

R can do everything a graphing calculator can do (I like to think of it as a calculator on steroids!). The following are a number of basic computations in R. Note that multiplication uses the asterisk (*). Unlike SAS, R does not require a semicolon at the end of each statement.

Arithmetic functions

SYMBOL / OPERATION
+ / Addition
- / Subtraction
* / Multiplication
/ / Division
^ / Power

Sample: What is 8(4/6^3) - 2 in R?

Solution: 8*(4/(6^3))-2 OR 8*(4/6^3)-2

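NOT 8(4/6^3)-2 (R needs an explicit * for multiplication; without it, R tries to call 8 as a function and returns an error)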
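The sample solution above can be checked directly in the console. This is only a small sketch; the names with_parens and without_parens are illustrative and not part of the course materials.

```r
# The power operator (^) is evaluated before division and multiplication,
# so the inner parentheses around 6^3 are optional.
with_parens    <- 8 * (4 / (6^3)) - 2
without_parens <- 8 * (4 / 6^3) - 2

with_parens                      # prints the value of the expression
with_parens == without_parens    # TRUE: both forms give the same value
```

When in doubt about the order of operations, adding extra parentheses never hurts.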

YOUR TURN!

Use R to answer the following questions:

What is 7x9x3?

Nine to the 3rd power divided by 2 ?

What is (2/4)x6 ?

What is 24 divided by 6 ?

Built in functions

FUNCTION / OPERATION
log(x) / Natural log of x
exp(x) / e^x
log(x,n) / Log to base n of x
log10(x) / Log to base 10 of x
pi / π (3.141593)
sqrt(x) / Square root of x
factorial(x) / x!
choose(n,x) / Binomial coefficient (“n choose x”)
floor(x) / Greatest integer ≤ x
ceiling(x) / Smallest integer ≥ x
round(x, digits=0) / Rounds x to nearest integer
cos(x) / Cosine of x in radians
sin(x) / Sine of x in radians
tan(x) / Tangent of x in radians
abs(x) / Absolute value of x
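A few of the functions in the table can be tried on simple inputs to see how they behave. The numbers below are arbitrary examples chosen for illustration, not values from the course dataset.

```r
floor(2.7)                   # 2: rounds down
ceiling(2.1)                 # 3: rounds up
floor(-2.5)                  # -3: "down" means toward negative infinity
round(3.14159, digits = 2)   # 3.14
log(100, 10)                 # 2: log of 100 to base 10
log10(1000)                  # 3
exp(1)                       # e, approximately 2.718282
sqrt(16)                     # 4
factorial(4)                 # 24
choose(5, 2)                 # 10
abs(-7)                      # 7
```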

Use R to answer the following questions:

What is the log of 29?

What is the square root of 256?

What is the tangent of 7.25?

Data Structure and Types in R

Vectors

Create a numerical vector in R using the following code:

samplevector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

“c” is used to concatenate the numbers into a vector. Numbers are separated by a comma.

“<-” is used to assign the numbers into the vector named “samplevector.” The equal sign (=) can also be used for assignment.

samplevector is an arbitrary name used to ‘hold’ our data. It is our ‘data bucket.’
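Once a vector exists, you can pull out individual elements or do arithmetic on the whole thing at once. The sketch below goes slightly beyond what the course covers (square-bracket indexing and the colon operator are not introduced above), and the name doubled is just illustrative.

```r
samplevector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

samplevector[3]               # third element: 3
samplevector[2:4]             # elements 2 through 4: 2 3 4
doubled <- samplevector * 2   # arithmetic is applied to every element
doubled
1:10                          # the colon operator builds the same sequence of integers
```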

Character data

Vectors can also hold character data, e.g.,

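charactervector <- c("one", "two", "three")

Character values require quotation marks around them; otherwise the statement will not run. Use plain straight quotes, since curly “smart” quotes pasted from a word processor cause an error.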

Data Frames

A data frame contains multiple columns and rows (similar to a matrix). Columns can be of different data types (e.g., numeric vs. character) and can be given names (e.g., weight, height, etc.).

Weight / Height / Gender
150 / 6 / "M"
165 / 5.5 / "F"
145 / 5.7 / "F"

Syntax to create above dataframe:

a <- c(150, 165, 145)

b <- c(6, 5.5, 5.7)

c <- c("M", "F", "F")

mydataframe <- data.frame(a, b, c) # this syntax creates the dataframe

names(mydataframe) <- c("Weight", "Height", "Gender") # these are the variable names

The hash symbol (#) is used to comment in R, which means that anything following it on the same line will not be run.

Referencing a variable in a dataframe

To reference a variable in a dataframe, you type the dataframe name, followed by a dollar sign ($), and then the variable name, for example, typing:

mydataframe$Gender

…will return the Gender column of the dataframe named ‘mydataframe.’
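As an alternative to creating the columns first and renaming them afterwards, the same dataframe can be built with the column names given directly inside data.frame. This is just one possible way to write it, not the approach used in the course code above.

```r
mydataframe <- data.frame(
  Weight = c(150, 165, 145),
  Height = c(6, 5.5, 5.7),
  Gender = c("M", "F", "F")
)

mydataframe$Gender         # the Gender column
mean(mydataframe$Weight)   # mean of the Weight column
str(mydataframe)           # compact overview of column names and types
```

The str function is a quick way to confirm that each column was read as the type you expected.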

See for more information on variable creation and dataset manipulation and management.

Basic Statistic Functions

FUNCTION / OPERATION
mean(x) / Mean of a vector x
median(x) / Median of x
sd(x) / Standard deviation of x
var(x) / Variance of x
IQR(x) / Interquartile range
range(x) / Min and max values of x
min(x) / Minimum of x
max(x) / Maximum of x
sum(x) / Sums all the elements of x
fivenum(x) / Five number summary of x
summary(x) / Five number summary plus mean of x
nrow(x) / Gives the number of rows
length(x) / Gives the length of the vector x
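To see these functions in action before trying the questions below, here is a short sketch on a made-up vector. The name scores and its values are hypothetical and chosen only so the results are easy to check by hand.

```r
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data

mean(scores)      # 5
median(scores)    # 4.5
sd(scores)        # sample standard deviation
var(scores)       # sample variance (the square of sd)
range(scores)     # 2 9
sum(scores)       # 40
length(scores)    # 8
summary(scores)   # minimum, quartiles, median, mean, and maximum
```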

Using our dataset ‘samplevector’ and ‘mydataframe’ answer the following questions:

What is the mean of samplevector? Height in mydataframe?

Length of samplevector?

Five number summary?

Bonus question: Perform a log transformation on samplevector and save it to a vector named logsamplevector.

Section 4: Importing Data

This section will walk the user through entering data through a .csv file. In the previous example, we entered data manually in the form of a vector through the ‘c’ function. In normal practice, you will almost always enter data from an external file, often in the form of an Excel spreadsheet. However, base R has no built-in function for reading .xls files directly, so the simplest approach is to convert them to .csv files first.

4.1 Downloading and formatting data to be ‘R friendly’

  1. Go to SOC short course website (http://www.education.uiowa.edu/centers/statoutreach/short-courses)
  2. Download dataset from SPSS course (NOT R course)
  3. R does not like .xls files and Excel formatting, so we have to format this file to be R friendly. This is incredibly common in practice, as you will often get data from different sources. I’ve seen some sources say that up to 90% of a statistician’s time is spent prepping and cleaning data!
  4. First, we need to save the file as a .csv file. Go to “Save As” and save as a “.csv comma delimited” file. Save it on the desktop.
  5. Second, we need to remove the Excel formatting (i.e., the dollar signs). Select the two columns with dollar signs ($).
  6. Then select the “Home” tab, go to the “Editing” section (on the upper right), and use the dropdown “Clear” menu to select “Clear Formats.”
  7. Re-save the file. Now we are ready to move to R!
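If you end up with dollar signs in R anyway, the formatting can also be removed after import. The sketch below is an alternative to the Excel steps above, not part of the course procedure, and raw_salary is a hypothetical stand-in for a column that was read in as text.

```r
# Hypothetical values as they might look if the Excel formatting were left in
raw_salary <- c("$57,000", "$40,200", "$21,450")

# Remove dollar signs and thousands separators, then convert text to numbers
clean_salary <- as.numeric(gsub("[$,]", "", raw_salary))
clean_salary
mean(clean_salary)
```

Inside the square brackets, the dollar sign and comma are treated as literal characters to delete.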

4.2 Entering the data file into R

  1. Change the directory of R to match the desktop.
  2. Go to “File” then “Change dir…” and then “C” -> “User” -> “Desktop” OR use the R function setwd: setwd("C:/Users/User/Desktop/") (you’ll need to change the “User” part to match the user on your computer)
  3. Once the directory is set, you can load the data into R.
  4. Use the following code to load the data into R:

practicedata <- read.csv("employee_data.csv")

attach(practicedata)

head(practicedata)

If this was entered correctly, you should see the first few lines of the dataset (the head function does this). The first line reads the data from the external file into a dataframe that we have arbitrarily named “practicedata.” The attach function simplifies future syntax in that you do not have to reference the dataframe when you run analyses, and this is recommended when you only use a single data source. For a more extended discussion on the use of attach, see the following link:
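If you do not have the course file handy, the import steps can be rehearsed with a small made-up file. In the sketch below, demo, csv_path, and demo_in are hypothetical names, and the three columns are only stand-ins; the real employee_data.csv has its own columns.

```r
# Hypothetical stand-in data
demo <- data.frame(id = 1:3,
                   salary = c(57000, 40200, 21450),
                   gender = c("m", "f", "f"))

csv_path <- tempfile(fileext = ".csv")        # temporary file, not the desktop
write.csv(demo, csv_path, row.names = FALSE)  # write the data out as a .csv

demo_in <- read.csv(csv_path)                 # read it back in
head(demo_in)     # first rows of the imported data
str(demo_in)      # column names and types as R read them
dim(demo_in)      # number of rows and columns
```

Checking str and dim right after importing is a cheap way to catch columns that came in as text when you expected numbers.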

Congratulations! You have loaded data into R!

A quick note on missing values: R codes missing values as NA, null objects as NULL, and undefined numerical results (such as 0/0) as NaN. By default, read.csv treats the text “NA” in an incoming file as a missing value, even when it was meant as a real value. The reason I bring this up is that I once had a dataset where the individual data points were identified by letter codes, and one of them had the ID “NA”, which R automatically read as missing, so when I ran the analysis, I got an error message! So be mindful of these values when you import data.
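The “NA” problem described above can be reproduced on a tiny made-up file. This sketch uses the na.strings argument of read.csv to control which text counts as missing; the data and the names csv_text, default_read, and kept_read are hypothetical.

```r
# A small made-up file: one ID is literally "NA", and one score is blank
csv_text <- "id,score\nNA,10\nAB,12\nCD,\n"

default_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
default_read$id          # the first ID has become a missing value
is.na(default_read$id)

# Treat only empty fields as missing, so the text "NA" is kept as an ID
kept_read <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)
kept_read$id
is.na(kept_read$score)   # the blank score is still flagged as missing
```

Whether changing na.strings is appropriate depends on how missing values are recorded in your own file, so check a few rows after importing.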

Section 5: Descriptive Statistics

This section will walk the user through using R functions to yield simple descriptive statistics. Some of the functions used in section 3 will be used.

5.1 Descriptive Statistics for the Whole Dataset

Now that the data is entered (if you haven’t entered data, see section 4), we will calculate statistics such as the mean, standard deviation, five number summary, and some graphs and charts. Also, remember that our data is attached. If you can’t remember what this means, see the previous section.

Run the following code:

mean(salary)

sd(salary)

hist(salary)

boxplot(salary)

plot(salary~educ)

boxplot(salary~educ)

Your Turn!

Now try getting the mean, standard deviation, and histogram for “salbegin.”

Try using some of the functions from section 3 to get more descriptive info.

5.2 Descriptive Statistics for Subsets of the Data

The previous example only provided information for the entire dataset. This next example will show you how to select subsets of the data to find more detailed information for variables such as gender or job experience. This code is also more syntactically advanced and will be explained in further detail.

Remember that our data is still attached. Now we will practice with detached data. If you try to run the code from section 5.1 after detaching, you will notice that it no longer works, because R can no longer find the variables by name alone. In exchange, the code is now explicit about which dataframe each variable comes from. Many R purists will only work with detached data.

Run the following code:

detach(practicedata)

mean(practicedata$salary) # this is how you would get the mean for detached data.

maledata <- subset(practicedata, gender=="m")

mean(maledata$salary) # this gets the mean salary for males

femaledata <- subset(practicedata, gender=="f")

mean(femaledata$salary) # this gets the mean salary for females

boxplot(salary~gender, data=practicedata)

Notice here that when the data is detached, the syntax becomes more complicated; that is, you have to specify the dataframe for every function. This is done either with the syntax “dataframe$variable” (as in the mean salary code for the entire dataset) or with the data= argument (as in the boxplot call). The subset function takes two components separated by a comma: first the dataset it takes data from (e.g., practicedata), and then the condition that selects the rows (e.g., gender=="m").
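If typing the dataframe name over and over gets tedious, the with function offers a middle ground between attached and detached data. This is an optional alternative that the course code does not use, shown here on a small hypothetical dataframe named demo.

```r
# Hypothetical stand-in for practicedata
demo <- data.frame(salary = c(57000, 40200, 21450, 30300),
                   gender = c("m", "f", "f", "m"))

mean(demo$salary)           # explicit dataframe$variable form
with(demo, mean(salary))    # same result; salary is looked up inside demo
```

Unlike attach, with only applies to the one expression inside it, so nothing needs to be detached afterwards.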

The condition in the subset function can take a number of different forms. For example, the following code creates a subset of the individuals who earn $40,000 or less:

lowearners <- subset(practicedata, salary <= 40000)
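A few more forms the subset condition can take are sketched below on a small made-up dataframe. The name demo and its values are hypothetical, and tapply is included only as one convenient way to get a mean for every group without creating separate subsets.

```r
# Hypothetical stand-in for practicedata
demo <- data.frame(salary = c(57000, 40200, 21450, 30300, 45000),
                   gender = c("m", "f", "f", "m", "f"),
                   educ   = c(16, 15, 12, 12, 16))

# Two conditions combined with & (and)
low_f <- subset(demo, salary <= 40000 & gender == "f")

# Keep only some columns with the select argument
high_educ <- subset(demo, educ >= 15, select = c(salary, educ))

# Mean salary for every gender in one call
group_means <- tapply(demo$salary, demo$gender, mean)
group_means
```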

You can also apply transformations to your data and view the results, e.g.,

plot(log(salary)~educ, data=practicedata)

Section 6: Inferential Statistics

This section will guide the user through some inferential techniques in R, namely t-tests, one way ANOVA, and regression problems.

6.1 T-tests

When given two groups, R’s t.test function defaults to an independent two-sample t-test that does not assume equal variances (the Welch test).

For our research question, “Does the population mean salary differ for males and females?”, we will use the independent-samples t-test.

t.test(salary~gender, data=practicedata)

As you can see, the difference is significant.
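The pieces of the t-test output can also be pulled out individually, and the equal-variance version can be requested with var.equal = TRUE. The sketch below uses simulated data, so the numbers are not those of the course dataset; demo, welch, and pooled are hypothetical names.

```r
set.seed(1)   # makes the simulated numbers reproducible
demo <- data.frame(
  salary = c(rnorm(30, mean = 40000, sd = 8000),
             rnorm(30, mean = 30000, sd = 5000)),
  gender = rep(c("m", "f"), each = 30)
)

welch  <- t.test(salary ~ gender, data = demo)                    # default: unequal variances
pooled <- t.test(salary ~ gender, data = demo, var.equal = TRUE)  # classic equal-variance test

welch$p.value      # p-value of the Welch test
welch$conf.int     # confidence interval for the difference in means
pooled$parameter   # degrees of freedom: n1 + n2 - 2 for the pooled test
```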

6.2 One-way ANOVA

For the following research question, we ask if there is a significant difference in mean salary across the levels of education.

summary(aov(salary~factor(educ), data=practicedata))

Again, we notice here that a significant difference exists across levels of education. This could be followed up with a post-hoc analysis, such as the Tukey test (TukeyHSD in R). Note that R needs the grouping variable to be defined as a factor rather than as a number: if educ is stored as numbers, aov treats it as a continuous predictor instead of a set of groups. Wrapping it in factor(), as above, handles this inside R, so the dataset does not need to be re-coded in Excel.
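One way to run the Tukey follow-up without leaving R is to convert the numeric education codes to a factor first. The sketch below uses simulated data with three made-up education levels, so the results do not reflect the course dataset; demo, educ_f, and fit_aov are hypothetical names.

```r
set.seed(2)   # makes the simulated numbers reproducible
demo <- data.frame(
  salary = c(rnorm(20, 25000, 4000),
             rnorm(20, 32000, 4000),
             rnorm(20, 45000, 4000)),
  educ   = rep(c(12, 16, 19), each = 20)   # education stored as numbers
)

demo$educ_f <- factor(demo$educ)           # convert the numeric codes to a factor

fit_aov <- aov(salary ~ educ_f, data = demo)
summary(fit_aov)       # overall F-test across the three levels
TukeyHSD(fit_aov)      # pairwise comparisons between levels
```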

6.3 Correlation and Regression

Here we can evaluate the correlation between starting salary and current salary:

Detached code

cor(practicedata$salary,practicedata$salbegin)

plot(practicedata$salary,practicedata$salbegin)

Attached code (re-run attach(practicedata) first, since the data was detached in section 5.2)

cor(salary,salbegin)

plot(salary,salbegin)

Then we can create a linear model to obtain further information, such as predictive coefficients and R-squared values.

fit <- lm(salary ~ salbegin, data=practicedata)

summary(fit)

fit <- lm(salary ~ salbegin + educ, data=practicedata)

summary(fit)

Notice that education, once added to the model, positively predicts current salary and uniquely accounts for ~3% of the variance above and beyond starting salary.
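The change in R-squared between the two models can be computed directly if each model is saved under its own name. The sketch below uses simulated data, so the numbers will not match the course dataset; fit1, fit2, and r2_change are hypothetical names.

```r
set.seed(3)   # makes the simulated numbers reproducible
n <- 100
salbegin <- rnorm(n, 17000, 4000)
educ     <- round(rnorm(n, 14, 2))
salary   <- 2000 + 1.8 * salbegin + 500 * educ + rnorm(n, 0, 3000)
demo <- data.frame(salary, salbegin, educ)

fit1 <- lm(salary ~ salbegin, data = demo)
fit2 <- lm(salary ~ salbegin + educ, data = demo)

r2_change <- summary(fit2)$r.squared - summary(fit1)$r.squared
r2_change            # variance explained by educ beyond salbegin
anova(fit1, fit2)    # F-test of whether adding educ improves the model
coef(fit2)           # estimated coefficients of the larger model
```

Keeping both fits around, rather than overwriting one with the other, makes this kind of comparison easier.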

The following code provides some plots that help assess the regression assumptions of linearity, normality, and equal variance, and help identify outliers.

par(mfrow=c(2,2))

plot(fit)
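One small caution: par(mfrow=c(2,2)) stays in effect for later plots until it is changed back. The sketch below shows one way to restore the previous layout, using the cars dataset that ships with R rather than the course data; fit_cars and old_par are hypothetical names.

```r
fit_cars <- lm(dist ~ speed, data = cars)   # cars is a built-in example dataset

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid; par() returns the old settings
plot(fit_cars)                    # the four default diagnostic plots
par(old_par)                      # restore one plot per window
```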

Additional R Resources

The following is a list of resources that I’ve found useful. One thing I’ve noticed while paging through R resources is that some authors/bloggers are notoriously difficult to understand. What I’ve listed here are resources I’ve found to be fairly understandable. The books are all free to download through the uiowa library system. I’m sure there are materials you can pay for, but honestly, I believe you can learn whatever you want about R through free materials. Additionally, some of the websites I’ve listed are incredibly helpful and can be used as quick reference guides, or for especially tricky/esoteric questions (most notably stats.stackexchange).

E-books (free through lib.uiowa.edu)

Albert, J. & Rizzo, M. (2012). R by example. Springer. New York.

Cowles, M. K. (2013). Applied Bayesian statistics. Springer. New York.

Cowpertwait, P. S. P., & Metcalfe, A. V. (2009). Introductory time series with R. Springer. New York.

Lafaye de Micheaux, P., Drouilhet, R. & Liquet, B. (2013). The R software. Springer. New York.

Pace, L. (2012). Beginning R: An introduction to statistical programming. Apress. New York.

Zuur, A. F., Ieno, E. N., & Meesters, E. H. W. G. (2009). A beginner’s guide to R. Springer. New York.

Internet Resources

Comprehensive guide:

List of all packages:

List of packages for psychometrics:

Blogs and Other Internet Resources I frequently use

SOC Contact Info:

www.education.uiowa.edu/soc