R for MATH1530 - All sections and the special section
R is a very powerful programming language for statistical analysis. Here we restrict ourselves to commands that are useful for the topics we cover in any section of MATH1530. R is free, and students can download it to their own computers at home. R is also available on all university computers and through Citrix. There is usually more than one way to do something with statistical software; we will focus on the simplest way in each case.
Data files used (ANSI text files): pulserate.txt, CHCH2014.txt, drugsurv.txt
available from http://faculty.etsu.edu/seier/IntroStatsBioBook.htm
Right-click on the names of the files and save them on your computer or on your Z: or Q: drive.
STARTING
1. Where can I get R? From http://www.r-project.org ; many free books and manuals are available there too. Updated versions are released frequently.
2. How to make my life easier?
a) With File>Change directory indicate where your data files (if any) are. That way you don't need to specify the drive later when you read files; the setwd() command shown after this list does the same thing.
b) If you want to keep the commands in a document, save them in a text file (you could use the extension .txt or .R) instead of a Word .doc file, because Word sometimes changes the font of the quotes and then R does not recognize them. Remember that R is case sensitive.
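If you prefer a command to the menu, setwd() changes the working directory just like File>Change directory; here is a minimal example (the path is only a placeholder, use your own folder):
setwd('Z:/math1530data') ## points R to the folder with the data files (placeholder path)
getwd() ## prints the current working directory, to check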
READING DATA
3. How to read data? Create an object (with any name you want) and assign the data to it. The data can be entered in several ways:
a) Typing data in the session window. This is practical if we have few observations
nbschips<-c(27,15,15,16,16,24,27,23,26,22,22,18,22,22,20,20,20,24,24,25,30,27,20)
b) Reading the data from a file with a single column of numbers. As an example we will use the pulse rates of 210 students stored in pulserate.txt. Save the txt file in the directory that you indicated to R and read it from R:
pulse<-scan('pulserate.txt')
c) Reading the data from a file with several variables (you can use simple or double quotes)
cookies<-read.table("chch2014.dat",header=TRUE)
attach(cookies) ## to attach the names of variables to the data
names(cookies) ## to look at the names of the variables
d) Typing the data in a worksheet. First create an empty data frame
mydata<-data.frame()
then use Editor>Data editor, type the name of the data frame, and the worksheet will appear; type in the data.
When done, close the data window. To save the data in a file type
write.table(mydata,'nameoffile')
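To read the saved data back in a later session, assuming it was saved with the command above, you could type:
mydata<-read.table('nameoffile',header=TRUE) ## reads the file saved with write.table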
PLOTTING AND CALCULATING BASIC STATISTICS
4. Histograms, boxplots and stem and leaf displays
hist(pulse)
boxplot(pulse)
stem(pulse)
you can even make them prettier by adding color
hist(pulse,col='salmon')
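The main, xlab and ylab arguments add a title and axis labels; the text below is just an illustration:
hist(pulse,col='salmon',main='Pulse rates of 210 students',xlab='beats per minute')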
5. Descriptive statistics
mean(pulse)
median(pulse)
sd(pulse)
var(pulse)
summary(pulse)
6. Comparing groups (cookies data)
boxplot(chips~brand)
by(chips,brand,summary) ## gets the mean and five-number summary per group
by(chips,brand,sd) ## calculates standard deviations per group
by(chips,brand,mean) ## calculates the mean per group
7. Scatter plot, correlation & regression (example: altitude of residence & red blood cells)
x<-c(0,1840,2200,2200,5000,5200,5750,7400,8650,10740,12000,12200,12300,14200,14800,14900,17500)
y<-c(4.93,4.75,5.40,4.65,5.42,6.55,5.99,5.39,5.44,5.82,7.50,5.67, 6.31,7.05,6.46,6.66,7.37)
plot(x,y)
cor(x,y)
lm(y~x)
abline(lm(y~x))
Of course we can make the plot prettier by controlling the size, type and color of the plotting symbols, the title of the plot, etc. To see all the options available, type help(plot).
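For instance (the symbol, colors and title below are arbitrary choices, just to show the arguments):
plot(x,y,pch=19,col='blue',cex=1.2,main='Red blood cells and altitude',xlab='altitude',ylab='red blood cells')
abline(lm(y~x),col='red') ## adds the regression line in red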
8. Tables and plots for one categorical variable
table(brand)
pie(table(brand))
barplot(table(brand))
9. Tables and plots for two categorical variables
mydrugs<-read.table('drugsurv.txt',header=TRUE)
attach(mydrugs)
table(GENDER,Marijuana)
barplot(table(GENDER,Marijuana))
barplot(table(GENDER,Marijuana),beside=TRUE)
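Adding a legend makes the side-by-side bars easier to read; legend.text=TRUE is a standard barplot() option (the colors below are arbitrary, assuming GENDER has two categories):
barplot(table(GENDER,Marijuana),beside=TRUE,legend.text=TRUE,col=c('skyblue','salmon'))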
PROBABILITY DISTRIBUTIONS
There are 4 magic letters for probability distributions in R: d, p, q, r
d calculates f(x) (or p(x) in the case of discrete distributions)
p calculates the cumulative probability
q calculates the quantile for a given probability
r generates random numbers from a distribution
10. Normal Distribution
dnorm(x,u,s) calculates f(x) for a normal with mean u and standard deviation s
pnorm(x,u,s) calculates P(X≤x)
qnorm(p,u,s) calculates the value of x such that P(X≤x)=p
rnorm(n,u,s) generates n values from the N(u,s) distribution
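For instance, suppose scores follow a normal distribution with mean 100 and standard deviation 15 (numbers chosen only for illustration):
pnorm(130,100,15) ## P(X<=130), about 0.977
qnorm(0.95,100,15) ## the 95th percentile, about 124.7
rnorm(5,100,15) ## five random values from N(100,15)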
11. Binomial Distribution
dbinom(x,n,p) calculates p(x) for a binomial with parameters n and p
pbinom(x,n,p) calculates P(X≤x)
qbinom(q,n,p) calculates the smallest value of x such that P(X≤x)≥q
rbinom(m,n,p) generates m values from the B(n,p) distribution
If you want to create a binomial table, for example, for n=10 and p=0.37 , you just type
x<-0:10 ## to create the sequence of values from 0 to 10
px<-dbinom(x,10,0.37) ## to calculate the probabilities for each value of x
mytable<-cbind(x,px) ## to form a table by binding the 2 columns
mytable ## to see the table
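You can also picture the distribution; type='h' draws a vertical line at each value of x:
plot(x,px,type='h',xlab='x',ylab='p(x)') ## plot of the binomial probabilities
sum(px) ## check: the probabilities add up to 1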
TESTING HYPOTHESES AND CONFIDENCE INTERVALS
12. Selecting a random sample
k<-1:1000 ## creating a sampling frame for 1000 individuals
mysample<-sample(k,20) ## selecting a random sample of size 20
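If the data are already in R, sample() can pick the values directly; for example, to select 20 of the 210 pulse rates at random (without replacement):
sample(pulse,20)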
13. T-test for one mean. Example: Ho: u=25 Ha: u≠25
nbschips<-c(27,15,15,16,16,24,27,23,26,22,22,18,22,22,20,20,20,24,24,25,30,27,20)
t.test(nbschips,alternative=c("two.sided"),mu=25)
14. Paired T-test (example: pressure tolerated before and after treatment with cherry juice)
before<-c(2.3,2.6,2.5,2,2.4,2.4,2.1,2.5,2,2.2)
after<-c(4.3,4.6,4.9,3.8,4.3,4.2,4.1,4.0,3.9,4.3)
t.test(before,y=after,alternative=c("less"),paired=TRUE)
15. Two-sample t-test. Ho: u1=u2 Ha: u1≠u2
cookies<-read.table("chch2014.dat",header=TRUE)
attach(cookies) ## to attach the names of variables to the data
t.test(chips~brand, alternative=c("two.sided"),mu=0,paired=FALSE)
16. Test for one proportion. Assume that you want to test Ho: p=0.25 vs Ha: p<0.25, and that in a sample of 2466 you find 574 successes:
binom.test(574,2466,p=0.25,alternative= c("less"))
17. Test for two proportions. Ho: p1=p2 Ha: p1>p2; the data are from the polio vaccine example:
prop.test(x=c(142,52),n=c(200000,200000), alternative= c("greater"))
18. Goodness of fit test
obs<-c(315,108,101,32) # read observed frequencies
prob<-c(9/16,3/16,3/16,1/16) ## model probabilities
chisq.test(obs,p=prob)
19. Chi-square test of independence or homogeneity (for raw and tabulated data)
chisq.test(table(GENDER,Marijuana)) ## for raw data
thetable<-matrix(c(315,108,101,32),ncol=2) ## if given a table, enter the counts by columns
chisq.test(thetable)
20. Test of normality
shapiro.test(nbschips)
qqnorm(nbschips)
qqline(nbschips)
In the special sections of MATH 1530 we use R for other topics as well, in particular for bootstrapping and randomization tests. Note: the R code for bootstrapping and randomization tests is available from
http://faculty.etsu.edu/seier/RcommCh3.txt
Here is a short review of the bootstrap and randomization test material, including the R code.
Randomization test
We introduce the randomization test to compare two populations using a hands-on activity and then we apply it by using R code that mimics the hands-on activity. To apply it to other data sets, students simply change the data. Randomization tests are used in introductory statistics courses in several parts of the USA, and there is even user-friendly software, StatKey, developed by the authors of 'Statistics: Unlocking the Power of Data' (five members of the Lock family), which they generously made available on the internet for anybody to use.
The phytate story
The data
The activity
We cannot base our conclusions on only 20 regroupings; we can do many more using the computer.
R-code available at http://faculty.etsu.edu/seier/RcommCh3.txt
The code:
## RANDOMIZATION TEST to compare groups A and B
## when the alternative is Ha:u(A)>u(B)
A<-c(6.07,7.20,6.61,9.69, 9.45,8.95,8.72,6.07)
B<-c(2.57,2.39,2.51,2.57,1.80,2.37,6.28,2.91)
nrep<-10000 # number of regroupings
## NO MORE INPUT IS NECESSARY
n1<-length(A) ## number of observations in A
n2<-length(B) ## number of observations in B
n<-n1+n2 ## total number of observations
meanA<-mean(A) ; meanB<-mean(B)
truedif<-meanA-meanB ## difference of the two means
## performs the randomization test
alldata<-append(A,B) ## merges the two groups
chips<-1:n ## creates n chips
difmeans<-double(nrep) ## creates storage space
for (i in 1:nrep){ ## commands will be executed 10000 times
chipsgroup1<-sample(chips,n1,replace=FALSE) ## selects group 1
chipsgroup2<-chips[-chipsgroup1] ## rest goes to group 2
group1<-(alldata[chipsgroup1]) ## values in group 1
group2<-(alldata[chipsgroup2]) ## values in group 2
difmeans[i]<-mean(group1)-mean(group2) ## difference in means
}
## flags (with value TRUE) the regroupings in which the
## difference of means is equal to or greater than the observed
## difference between the A and B groups
numgreater<-difmeans>= truedif ## for two-sided alternative hypothesis
## use abs(difmeans)>= abs(truedif) in previous line
## calculates the proportion or approximated p-value
pvalue<-mean(numgreater) ; pvalue
stripchart(difmeans, method='stack',pch=1,at=0,main=" ",ylim=c(-0.3,2))
arrows(truedif,-0.2,truedif,0,col="red",lwd=2)
text(truedif,-0.25,c(truedif))
Bootstrapping
The rhododendron story
The sample
Sampling with replacement (from the sample, using the same n) and ‘bootstrap replicates’ of the statistic
The idea of the percentile confidence interval
From such a table it is very easy for students to understand why the interval becomes wider when you increase the confidence level.
There are packages in R to do bootstrapping, and you can also use StatKey, but we use our code at
http://faculty.etsu.edu/seier/RcommCh3.txt
The beauty of the bootstrap is that you can build confidence intervals for statistics other than the mean without needing formulas: just change the statistic in the code (see the median example right after the code below). You can also change the data to use the code for another example.
The code
## PERCENTILE BOOTSTRAP CONFIDENCE INTERVAL FOR THE MEAN
## INPUT :
x<-c(9.70,10.40,8.90,10.70,9.85,10.30,9.00 ,10.00,10.45,8.65)
nboot<-2000 ## number of bootstrap replicates
confidence<-0.95
## NO MORE INPUT IS NEEDED
n<-length(x)
sampm<-double(nboot) ## prepares storage space
for (i in 1:nboot){ ## repeat 2000 times
subsam<-sample(x,n,replace=TRUE) ## bootstrap sample
sampm[i]<-mean(subsam) ## stores the sample mean
}
## calculates the endpoints of the confidence interval
low<-quantile(sampm,(1-confidence)/2)
hi<-quantile(sampm,confidence+(1-confidence)/2)
hist(sampm, xlab= "bootstrap replicates ", main=" ")
## marks the endpoints of the interval
arrows(low, 200,low,0,col="red")
arrows(hi, 200,hi,0,col="red")
## prints the interval
low; hi
## prints mean and sd for sample and bootstrap replicates
mean(x); sd(x)
mean(sampm); sd(sampm)
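As mentioned above, to bootstrap a statistic other than the mean, only the statistic inside the loop changes; for example, for the median (reusing x, n, nboot and the storage vector from the code above):
for (i in 1:nboot){
subsam<-sample(x,n,replace=TRUE) ## bootstrap sample
sampm[i]<-median(subsam) ## stores the sample median instead of the mean
}
quantile(sampm,c(0.025,0.975)) ## 95% percentile interval for the median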
Using confidence intervals to test hypotheses
If we use the bootstrap to calculate a confidence interval for a correlation, the bootstrap samples must consist of pairs: we sample (x, y) pairs, not the variables x and y separately.
## BOOTSTRAPPING THE CORRELATION COEFFICIENT
## INPUT
x<-c(0,1840,2200,2200,5000,5200,5750,7400,8650,10740,12000,
12200,12300,14200,14800,14900,17500)
y<-c(4.93,4.75,5.40,4.65,5.42,6.55,5.99,5.39,5.44,5.82,7.50,5.67,
6.31,7.05,6.46,6.66,7.37)
nboot<-2000 ## number of bootstrap replicates
confidence<-0.95
## NO MORE INPUT IS NEEDED
n<-length(x) # counts number of observations
who<-seq(1,n,by=1) #creates a list labels 1:n
rboot<-double(nboot) # creates storage space
## now we select random samples from the labels
## and identify the corresponding values of x and y
for (i in 1:nboot){
subsam<-sample(who,n,replace=TRUE)
xwho<-(x[subsam])
ywho<-(y[subsam])
rboot[i]<-cor(xwho,ywho)
}
low<-quantile(rboot,(1-confidence)/2) ## lower end of interval
high<-quantile(rboot,confidence+(1-confidence)/2) ## upper end
par(mfcol=c(1,2)) ## 2 plots in figure
plot(x,y,pch=19) ## scatterplot
hist(rboot, main= ' ',xlab= 'bootstrap replicates of r',xlim=c(0,1))
arrows(low,-100,low ,0,col="red")
arrows(high,-100, high ,0,col="red")
low;high ## prints the interval
Remember that a hypothesis about the value of one parameter can be tested using a confidence interval for the parameter. In the example below, Ho: ud=0; the confidence interval calculated for the mean difference clearly does not include the value 0, and thus we reject the null hypothesis.
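A minimal sketch of such a calculation, reusing the cherry juice paired data from item 14 (the number of replicates is an arbitrary choice):
d<-after-before ## the ten paired differences
nboot<-2000
bootd<-double(nboot) ## storage space
for (i in 1:nboot){
bootd[i]<-mean(sample(d,length(d),replace=TRUE)) ## bootstrap mean difference
}
quantile(bootd,c(0.025,0.975)) ## 95% interval; it does not include 0, so we reject Ho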
Edith Seier – September 18-2015