R for MATH 1530 - All sections and the special section

R is a very powerful programming language for statistical analysis. Here we restrict ourselves to commands that are useful for the topics we learn in any section of MATH 1530. R is free and students can download it to their own computers at home. R is also available on all university computers and through Citrix. There is usually more than one way to do something with statistical software; we will focus on the simplest way in each case.

Data files used (they are ANSI text files): pulserate.txt, CHCH2014.txt, drugsurv.txt

available from http://faculty.etsu.edu/seier/IntroStatsBioBook.htm

Right-click on the file names and save them to your computer or to your Z: or Q: drive.

STARTING

1.  Where can I get R from? http://www.r-project.org. There are many free books and manuals available there too. Updated versions are released frequently.

2.  How to make my life easier?

a)  With File > Change directory, indicate where your data files (if any) are. That way you will not need to specify the drive later when you read files.

b)  If you want to keep the commands in a document, save them in a text file (you could use the extension .txt or .R) instead of a Word .doc file, because Word sometimes changes the font of the quotes and then R does not recognize them. Remember that R is case sensitive.

READING DATA

3.  How to read data? You create an object (with any name you want) and assign the data to it. The data can be entered in several ways:

a)  Typing the data in the session window. This is practical when we have few observations:

nbschips<-c(27,15,15,16,16,24,27,23,26,22,22,18,22,22,20,20,20,24,24,25,30,27,20)

b)  Reading the data from a file with a single column of numbers. As an example we will use the pulse rates of 210 students stored in pulserate.txt. Save the file in the directory that you indicated to R as your data directory, then read it from R:

pulse<-scan('pulserate.txt')
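
A quick check that the file was read correctly (the file contains the pulse rates of 210 students):

length(pulse) ## should print 210

head(pulse) ## shows the first few values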

c)  Reading the data from a file with several variables (you can use single or double quotes):

cookies<-read.table("CHCH2014.txt",header=TRUE)

attach(cookies) ## so we can refer to the variables by name

names(cookies) ## to see the names of the variables

d)  Typing the data in a worksheet. First create an empty data frame

mydata<-data.frame()

then use Edit > Data editor, type the name of the data frame, and the worksheet will appear; type in the data.

When done, close the data window. To save the data in a file type

write.table(mydata,'nameoffile')
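
By default write.table also stores the row numbers; if you prefer not to keep them, one option (nameoffile is just a placeholder here) is:

write.table(mydata,'nameoffile',row.names=FALSE)

mydata<-read.table('nameoffile',header=TRUE) ## reads the file back in later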

PLOTTING AND CALCULATING BASIC STATISTICS

4.  Histograms, boxplots and stem-and-leaf displays

hist(pulse)

boxplot(pulse)

stem(pulse)

You can even make them prettier by adding color:

hist(pulse,col='salmon')
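
You can also add a title and axis labels; for example (the wording of the title and label here is just illustrative):

hist(pulse,col='salmon',main='Pulse rates of 210 students',xlab='beats per minute')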

5.  Descriptive statistics

mean(pulse)

median(pulse)

sd(pulse)

var(pulse)

summary(pulse)
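
If you need specific percentiles, quantile() computes them; for example:

quantile(pulse,c(0.25,0.5,0.75)) ## quartiles of the pulse data

IQR(pulse) ## interquartile range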

6.  Comparing groups (cookies data)

boxplot(chips~brand)

by(chips,brand,summary) ## gets the mean and five-number summary per group

by(chips,brand,sd) ## calculates standard deviations per group

by(chips,brand,mean) ## calculates the mean per group

7.  Scatter plot, correlation & regression (example: altitude of residence & red blood cells)

x<-c(0,1840,2200,2200,5000,5200,5750,7400,8650,10740,12000,12200,12300,14200,14800,14900,17500)

y<-c(4.93,4.75,5.40,4.65,5.42,6.55,5.99,5.39,5.44,5.82,7.50,5.67, 6.31,7.05,6.46,6.66,7.37)

plot(x,y)

cor(x,y)

lm(y~x)

abline(lm(y~x))

Of course we can make the plot prettier by controlling the size, type and color of the plotting symbols, the title of the plot, etc. When you want to know all the options, type help(plot).
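
For instance, one possible prettier version of the scatter plot (the symbol, colors and labels here are just one choice among many):

plot(x,y,pch=19,col='blue',main='Red blood cells vs altitude',xlab='altitude',ylab='red blood cell count')

abline(lm(y~x),col='red')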

8.  Tables and plots for one categorical variable

table(brand)

pie(table(brand))

barplot(table(brand))

9.  Tables and plots for two categorical variables

mydrugs<-read.table('drugsurv.txt',header=TRUE)

attach(mydrugs)

table(GENDER,Marijuana)

barplot(table(GENDER,Marijuana))

barplot(table(GENDER,Marijuana),beside=TRUE)
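
Adding legend.text=TRUE labels the bars with the row names of the table:

barplot(table(GENDER,Marijuana),beside=TRUE,legend.text=TRUE)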

PROBABILITY DISTRIBUTIONS

There are 4 magic letters with regard to probability distributions in R: d, p, q, r

d calculates f(x) (or p(x) in the case of discrete distributions)

p calculates the cumulative probability

q calculates the quantile for a given probability

r generates random numbers from a distribution

10.  Normal Distribution

dnorm(x,u,s) calculates f(x) for a normal with mean u and standard deviation s

pnorm(x,u,s) calculates P(X≤x)

qnorm(p,u,s) calculates the value of x such that P(X≤x)=p

rnorm(n,u,s) generates n values from the N(u,s) distribution
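
For example, if heights follow a normal distribution with mean 68 and standard deviation 3 (values chosen just for illustration):

pnorm(72,68,3) ## P(X≤72)

1-pnorm(72,68,3) ## P(X>72)

qnorm(0.90,68,3) ## the 90th percentile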

11.  Binomial Distribution

dbinom(x,n,p) calculates p(x) for a binomial with parameters n and p

pbinom(x,n,p) calculates P(X≤x)

qbinom(q,n,p) calculates the smallest value of x such that P(X≤x)≥q

rbinom(m,n,p) generates m values from the B(n,p) distribution

If you want to create a binomial table, for example for n=10 and p=0.37, you just type:

x<-0:10 ## to create the sequence of values from 0 to 10

px<-dbinom(x,10,0.37) ## to calculate the probabilities for each value of x

mytable<-cbind(x,px) ## to form a table by binding the 2 columns

mytable ## to see the table
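
You can also check that the probabilities add up to 1 and plot the distribution:

sum(px) ## should print 1

barplot(px,names.arg=x,xlab='x',ylab='p(x)')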

TESTING HYPOTHESES AND CONFIDENCE INTERVALS

12.  Selecting a random sample

k<-1:1000 ## creating a sampling frame for 1000 individuals

mysample<-sample(k,20) ## selecting a random sample of size 20

13.  T-test for the mean. Example: Ho: u=25 Ha: u≠25

nbschips<-c(27,15,15,16,16,24,27,23,26,22,22,18,22,22,20,20,20,24,24,25,30,27,20)

t.test(nbschips,alternative=c("two.sided"),mu=25)

14.  Paired T-test (example: pressure tolerated before and after treatment with cherry juice)

before<-c(2.3,2.6,2.5,2,2.4,2.4,2.1,2.5,2,2.2)

after<-c(4.3,4.6,4.9,3.8,4.3,4.2,4.1,4.0,3.9,4.3)

t.test(before,y=after,alternative=c("less"),paired=TRUE)

15.  Two-sample t-test. Ho: u1=u2 Ha: u1≠u2

cookies<-read.table("CHCH2014.txt",header=TRUE)

attach(cookies) ## so we can refer to the variables by name

t.test(chips~brand, alternative=c("two.sided"),mu=0,paired=FALSE)
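
Note that by default t.test does not assume equal variances (it uses the Welch approximation). If you want the pooled-variance version, add var.equal=TRUE:

t.test(chips~brand,alternative=c("two.sided"),mu=0,var.equal=TRUE)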

16.  Test for one proportion. Assume that you want to test Ho: p=0.25 vs Ha: p<0.25, and that in a sample of 2466 you find 574 successes:

binom.test(574,2466,p=0.25,alternative= c("less"))

17.  Test for two proportions. Ho: p1=p2 Ha: p1>p2; the data are from the polio vaccine example:

prop.test(x=c(142,52),n=c(200000,200000), alternative= c("greater"))

18.  Goodness of fit test

obs<-c(315,108,101,32) # read observed frequencies

prob<-c(9/16,3/16,3/16,1/16) ## model probabilities

chisq.test(obs,p=prob)

19.  Chi-square test of independence or homogeneity (for raw and tabulated data)

chisq.test(table(GENDER,Marijuana)) ## for raw data

thetable<-matrix(c(315,108,101,32),ncol=2) ## if given a table, enter the counts by columns

chisq.test(thetable)

20.  Test of normality

shapiro.test(nbschips)

qqnorm(nbschips)

qqline(nbschips)

In the special sections of MATH 1530 we use R for other topics as well, in particular for bootstrapping and the randomization test. Note: R code for bootstrapping and the randomization test is available from

http://faculty.etsu.edu/seier/RcommCh3.txt

Here is a short review of the bootstrap and randomization test material, including the R code.

Randomization test

We introduce the randomization test to compare two populations using a hands-on activity, and then we apply it with R code that mimics the hands-on activity. To apply it to other data sets, students simply change the data. Randomization tests are used in introductory statistics courses in several parts of the USA, and there is even user-friendly software, StatKey, developed by the authors of 'Statistics: Unlocking the Power of Data' (five members of the Lock family), which they generously put on the internet to be used by anybody.

The phytate story

The data

The activity

We cannot base our conclusions on only 20 regroupings; we can do many more using the computer.

R-code available at http://faculty.etsu.edu/seier/RcommCh3.txt

The code:

## RANDOMIZATION TEST to compare groups A and B

## when the alternative is Ha:u(A)>u(B)

A<-c(6.07,7.20,6.61,9.69, 9.45,8.95,8.72,6.07)

B<-c(2.57,2.39,2.51,2.57,1.80,2.37,6.28,2.91)

nrep<-10000 # number of regroupings

## NO MORE INPUT IS NECESSARY

n1<-length(A) ## number of observations in A

n2<-length(B) ## number of observations in B

n<-n1+n2 ## total number of observations

meanA<-mean(A) ; meanB<-mean(B)

truedif<-meanA-meanB ## difference of the two means

## performs the randomization test

alldata<-append(A,B) ## merges the two groups

chips<-1:n ## creates n chips

difmeans<-double(nrep) ## creates storage space

for (i in 1:nrep){ ## commands will be executed 10000 times

chipsgroup1<-sample(chips,n1,replace=FALSE) ## selects group 1

chipsgroup2<-chips[-chipsgroup1] ## rest goes to group 2

group1<-(alldata[chipsgroup1]) ## values in group 1

group2<-(alldata[chipsgroup2]) ## values in group 2

difmeans[i]<-mean(group1)-mean(group2) ## difference in means

}

## assigns the value TRUE to the regroupings in which the

## difference of means is equal to or greater than the

## difference between the original A and B groups

numgreater<-difmeans>=truedif ## for the one-sided alternative Ha:u(A)>u(B)

## for a two-sided alternative use abs(difmeans)>=abs(truedif) in the previous line

## calculates the proportion or approximated p-value

pvalue<-mean(numgreater) ; pvalue

stripchart(difmeans, method='stack',pch=1,at=0,main=" ",ylim=c(-0.3,2))

arrows(truedif,-0.2,truedif,0,col="red",lwd=2)

text(truedif,-0.25,c(truedif))

Bootstrapping

The rhododendron story

The sample

Sampling with replacement (from the sample, using the same n) and ‘bootstrap replicates’ of the statistic

The idea of the percentile confidence interval

With a table of intervals at several confidence levels, it is very easy for the students to understand why the interval becomes wider when you increase the confidence (see the example after the code below).

There are R packages for bootstrapping, and you can also use StatKey, but we use our code at

http://faculty.etsu.edu/seier/RcommCh3.txt

The beauty of the bootstrap is that you can build confidence intervals for statistics other than the mean without the need of formulas; just change the statistic in the code. You can also change the data to use the code for another example.

The code

## PERCENTILE BOOTSTRAP CONFIDENCE INTERVAL FOR THE MEAN

## INPUT :

x<-c(9.70,10.40,8.90,10.70,9.85,10.30,9.00,10.00,10.45,8.65)

nboot<-2000 ## number of bootstrap replicates

confidence<-0.95

## NO MORE INPUT IS NEEDED

n<-length(x)

sampm<-double(nboot) ## prepares storage space

for (i in 1:nboot){ ## repeat 2000 times

subsam<-sample(x,n,replace=TRUE) ## bootstrap sample

sampm[i]<-mean(subsam) ## stores the sample mean

}

## calculates the endpoints of the confidence interval

low<-quantile(sampm,(1-confidence)/2)

hi<-quantile(sampm,confidence+(1-confidence)/2)

hist(sampm, xlab= "bootstrap replicates ", main=" ")

## marks the endpoints of the interval

arrows(low, 200,low,0,col="red")

arrows(hi, 200,hi,0,col="red")

## prints the interval

low; hi

## prints mean and sd for sample and bootstrap replicates

mean(x); sd(x)

mean(sampm); sd(sampm)
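
To see how the interval widens as the confidence increases, you can compute intervals at several confidence levels from the same bootstrap replicates; for example:

quantile(sampm,c(0.05,0.95)) ## 90% interval

quantile(sampm,c(0.025,0.975)) ## 95% interval

quantile(sampm,c(0.005,0.995)) ## 99% interval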

Using confidence intervals to test hypotheses

If we use the bootstrap to calculate a confidence interval for a correlation, in the bootstrap samples we need to sample pairs (x,y), not the variables y and x separately.

## BOOTSTRAPPING THE CORRELATION COEFFICIENT

## INPUT

x<-c(0,1840,2200,2200,5000,5200,5750,7400,8650,10740,12000,
12200,12300,14200,14800,14900,17500)

y<-c(4.93,4.75,5.40,4.65,5.42,6.55,5.99,5.39,5.44,5.82,7.50,5.67,
6.31,7.05,6.46,6.66,7.37)

nboot<-2000 ## number of bootstrap replicates

confidence<-0.95

## NO MORE INPUT IS NEEDED

n<-length(x) # counts number of observations

who<-seq(1,n,by=1) # creates the labels 1 to n

rboot<-double(nboot) # creates storage space

## now we select random samples from the labels

## and identify the corresponding values of x and y

for (i in 1:nboot){

subsam<-sample(who,n,replace=TRUE)

xwho<-(x[subsam])

ywho<-(y[subsam])

rboot[i]<-cor(xwho,ywho)

}

low<-quantile(rboot,(1-confidence)/2) ## lower end of interval

high<-quantile(rboot,confidence+(1-confidence)/2) ## upper end

par(mfcol=c(1,2)) ## 2 plots in figure

plot(x,y,pch=19) ## scatterplot

hist(rboot, main= ' ',xlab= 'bootstrap replicates of r',xlim=c(0,1))

arrows(low,-100,low ,0,col="red")

arrows(high,-100, high ,0,col="red")

low;high ## prints the interval

Remember that a hypothesis about the value of one parameter can be tested using a confidence interval for that parameter. In the example below, Ho: ud=0; the confidence interval calculated for the mean difference obviously does not include the value 0, and thus we reject the null hypothesis.
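
As a sketch of that idea, here is a percentile bootstrap interval for the mean difference, reusing the cherry juice data from item 14 and the same code pattern as above:

before<-c(2.3,2.6,2.5,2,2.4,2.4,2.1,2.5,2,2.2)

after<-c(4.3,4.6,4.9,3.8,4.3,4.2,4.1,4.0,3.9,4.3)

d<-after-before ## the differences

nboot<-2000 ## number of bootstrap replicates

bootd<-double(nboot) ## storage space

for (i in 1:nboot){

bootd[i]<-mean(sample(d,length(d),replace=TRUE)) ## bootstrap replicate of the mean difference

}

quantile(bootd,c(0.025,0.975)) ## 95% interval; it does not include 0, so we reject Ho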

Edith Seier – September 18-2015