Mission 10. Fake Data and Power Analyses. 12 November

Didrik Vanhoenacker, nov 2009

Mission 10. Fake data and Power analyses. 12 November

Learning goals

· Realise what is possible to analyse and what is not.

· Be able to fake data in R.

· Be able to make a simple power analyses.

Data collection

· Fake your own data!

Mission

· Make an excel file with data you make up yourself.

· Fake data with the random number generators in R.

· Do power analyses with one anova type relationship and/or one regression type relationship. In principal you could also make power analyses with two explanatory variables at the same time. Or more. But you are not likely to have time for this today.

Hand in

· A Mini Report with an expected graph, how large sample you will take, and what power you have to detect your hypothesised (main) effects.

Mail it to me, , and to your feed-back pal.

Feedback pal

· Pick one that have tried something different from you.

2x2 type of study NEW – Web version only!

If your project will be a 2x2 type of study, try first some other data with a continuous response variable in this exercise (to get a feeling for power analyses). Then you can test the power of your study like this.

1. Use the M3 manual and fake reasonable data in the 2x2 matrix. Test and check p-value. Change data manually in the 2x2 matrix and test again a few times.

2. Use the built in R function power.prop.test(). Specify sample size (per group of the explanatory variable), proportion of for example fly attacked rose hips in the first group, and in the second group.

power.prop.test(n=20,p1=0.7,p2=0.3)

or calculate sample size directly given a specific level of power:

power.prop.test(power=0.8,p1=0.7,p2=0.3)

Binomial type of study NEW – Web version only!

First manually test some reasonable data with the binom.test() function.

binom.test(7,30)

or calculate sample size directly given a specific level of power:

n<-8

prop.Y<-0.6

p.values<-NULL

for (i in 1:1000){

myrandomsample<-sample(c("Y","N"),size=n,replace=T,prob=c(prop.Y,1-prop.Y))

p.values[i]<-binom.test(sum(myrandomsample=="Y"),n)$p.value

}

sum(p.values<0.05)/1000

1. Fake data in excel

1. Create a new excel document.

2. Write your variable names in the first row.

3. Enter data by hand.

4. Use values that seem appropriate. Perhaps 20-30 rows.

5. Save your excel file.

6. Open and attach your excel file in R.

7. Make appropriate graph(s).

8. Test your variables.

2. How to create data in R – a short introduction

A vector is a sequence of numbers, letters, or words. Like a column in excel.

You combine things to a vector with c().

c(3,4,34,2,1)

c("good","good","bad")

You save vectors with the "arrow": <- [ less than + dash ]

lichendiameter <- c(3,4,34,2,1)

The thing to the right is put in a "box" with the name/header to the left. If you want to look in the box, just write its name:

lichendiameter

It is easy to calculate with an entire vector. If you e.g., have measured the lichen diameter in cm, but want it in mm, you just multiply all values with 10:

lich.mm <- lichendiameter * 10

lich.mm

Or if you'd like to logtransform:

lich.log <- log(lich.mm)

lich.log

Easy as a tiny pancake you could also get some statistics of your vector/box:

mean(lich.mm) # mean

sd(lich.mm) # standard deviation

median(lich.mm) # median

sum(lich.mm) # the sum of all values

length(lich.mm) # how many numbers that are in the vector/box

By the way the mean is:

sum(lich.mm)/length(lich.mm)

3. Get random numbers from a normal distribution

Very many things in nature are normally distributed. Remember when we used the additive effect of herbivores and mykorrhiza to get a normally distributed effect on plant size. In R it is very easy to get random numbers from a normal distribution. The following code snip gives 100 random numbers from a normal distribution with mean 10 and standard deviation 2.

normal.randomnumbers <- rnorm(100,mean=10,sd=2)

Check the distribution with a histogram:

hist(normal.randomnumbers)

Realise that these are random numbers and the mean of your hundred numbers will not be exactly 10.

mean(normal.randomnumbers)

4. Fake data for a continuous response and a categorical explanatory (anova)

Consider to quit and restart R now so that your previous data doesn't interfere with your new data. And DON’T save workspace!

· Start with a vector for your categorical explanatory variable. For example:
treespecies <- as.factor(rep(c("oak","birch"),each=20))
The function rep() repeats c("oak","birch") 20 times each. The function as.factor() says that this should be a categorical variable (= a factor). Check the result:
treespecies

· Then create normally distributed lichensizes on the 20 oaks.
Use the function rnorm() and specify

o How many oaks you will measure lichens on (20, if you didn't change the number above)?

o What is the mean diameter for the lichen diameter on oaks?

o How large variation (in standard deviations) has the lichens. This is pretty hard to nail intuitively, but think like this: ± 2 sd (standard deviations) contains about 95 % of all lichens. If you skip the smallest and the largest (i.e., the 5 % most extreme lichen sizes), how large are then the smallest "normal" lichens and the largest "normal" lichens. sd≈(largest"normal"smallest"normal")/4.

o Example
If almost all normal men are between 160 and 200 cm tall, then the mean is about 180 cm and sd = (200-160)/4 = 10. So the code for 20 male lengths would be:
man <- rnorm(20, mean=180, sd=10)
or, if you don't want any decimals
man <- round(rnorm(20, mean=180, sd=10))

For oak lichens it'll be something like this (but add values for mean and sd!):

oak.lichens <- round(rnorm(20, mean= ,sd= ))

· Do the same things for lichens on birch, but change the mean if you think that the lichens are larger or smaller on birch compared to oak. You can use the same sd, or at least a similar sd. Like this (but add values for mean and sd!):
birch.lichens <- round(rnorm(20, mean= ,sd= ))

· Then use c() to combine your response variables and check the result:
lich.mm <- c(oak.lichens, birch.lichens)
lich.mm

· Make an anova-graph (stripchart) and check your fake data! Is there any difference?

stripchart(lich.mm~treespecies,vertical=T,method="jitter",jitter=0.1,pch=19)

· Test your fake data!

summary(aov(lich.mm ~ treespecies))

All right, now when you know how to do this, here is the entire code snip. Make it appropriate for your data and change relevant values. Then you can paste in to R many times and see what's happening with your graph and p-value. The red stuff (= per group n) must of course be the same number!

treespecies <- as.factor(rep(c("oak","birch"),each=20))

oak.lichens <- round(rnorm(20, mean=10 ,sd=2 ))

birch.lichens <- round(rnorm(20, mean=10 ,sd=2 ))

lich.mm <- c(oak.lichens, birch.lichens)

stripchart(lich.mm~treespecies,vertical=T,method="jitter",jitter=0.1,pch=19)
summary(aov(lich.mm ~ treespecies))

Test other combinations of mean, variation (sd), and sample sizes and see how this affects the graph and test results.

Then, if you want to make a power analysis with 1000 runs, see page 7-9.

Fake data for a continuous response and a continuous explanatory (regression)

Consider to quit and restart R now so that your previous data doesn't interfere with your new data.

· First specify the mean and variation (sd) for your response variable (e.g., lichsize) and your explanatory variable (e.g., treecircum), and the sample size (n), e.g, the number of trees you will check. n = 30 is a good starting point.
Remember: ± 2 sd (standard deviations) contains about 95 % of all lichens. If you skip the smallest and the largest (i.e., the 5 % most extreme lichen sizes), how large are then the smallest "normal" lichens and the largest "normal" lichens. sd≈(largest"normal"smallest"normal")/4. Example: If almost all normal men are between 160 and 200 cm tall, then the mean is about 180 cm and sd = (200-160)/4 = 10.
n <-
mean.x <-
sd.x <-
mean.y <-
sd.y <-

· Then decide how large proportion (from 0.0 to 1.0) of the response variable (e.g., lichen size) variation that is explained by the explanatory variable (e.g., tree size). This is the R2.
R2 <-

· Then copy-paste the entire code block below (or write EXACTLY the same). This code uses the package MASS to create a random response and a random explanatory with the properties you just specified.
library(MASS)
kovarians<- matrix(c(sd.y^2,rep(sqrt(R2)*sd.x*sd.y,2),sd.x^2),2,2)
y.and.x <- mvrnorm(n=n,mu=c(mean.y,mean.x),Sigma=kovarians)
y <- y.and.x[,1]
x <- y.and.x[,2]

· Check it with a graph.
plot(y~x)

· And test it.
summary(lm(y~x))

n <-
mean.x <-
sd.x <-
mean.y <-
sd.y <-
R2 <-
kovarians<- matrix(c(sd.y^2,rep(sqrt(R2)*sd.x*sd.y,2),sd.x^2),2,2)
y.and.x <- mvrnorm(n=n,mu=c(mean.y,mean.x),Sigma=kovarians)
y <- y.and.x[,1]
x <- y.and.x[,2]
plot(y~x)
summary(lm(y~x))

Test other combinations of mean, variation (sd), and sample sizes and see how this affects the graph and test results. Then if you want to make a power analysis with 6 or 1000 runs, see page 7-9.

Logistic regression NEW – WEB version only

If you have a logistic regression type of study, this is a more complicated. First use another data (regression type) for this exercise to get a feeling for power analyses. Then you can try this... A pretty easy (but not very solid) solution is to first fake a regression type data. You can then consider the response variable (y) to be for example the continuous variation in attractiveness of rose hips to flies. Then you can use a threshold value for whether this attractiveness will result in an attack.

library(MASS)

n <-
mean.x <-
sd.x <-
mean.y <-
sd.y <-
R2 <-
kovarians<- matrix(c(sd.y^2,rep(sqrt(R2)*sd.x*sd.y,2),sd.x^2),2,2)
y.and.x <- mvrnorm(n=n,mu=c(mean.y,mean.x),Sigma=kovarians)
y <- y.and.x[,1]

y <- factor(ifelse(y<median(y),"0","1"))
x <- y.and.x[,2]
stripchart(x~y,method="jitter",jitter=.1,pch=19)
anova(glm(y~x,binomial),test="Chisq")

as.numeric(unlist(anova(glm(y~x,binomial),test="Chisq"))[10])

However, specifying relevant values for mean.y and especially sd.y is of course not easy. Also the threshold value of the median is perhaps not reasonable. This will assume that half of your rose hips for example, will be attacked. Perhaps only 20 % is a more reasonable value. Then power will be lower...

The last courier row picks out the p-value from the anova output. Useful if you would like to make a 100 times for loop... (but as this is a pretty simple, but non-solid solution I don't really recommend that...)

Didrik Vanhoenacker nov 2008

featuring David Berger as six-pack graph-code-hacker

Graphical "power"-analyses with R for regression

Specify sample size (n), mean and variation (sd) as above.

library(MASS)

n <-
mean.x <-
sd.x <-
mean.y <-
sd.y <-
R2 <-

Then paste the entire code below. It will give you 6 random data sets each time and show a graph panel with all six graphs, so that you will be able to see how important by chance effects are. You can paste the code below over and over again (without changing the specifications above) and see how chance effects (= random sampling really) affect the relationship. Then you can also compare different sample sizes and realize when chance effects have the largest effect.

kovarians<- matrix(c(sd.y^2,rep(sqrt(R2)*sd.x*sd.y,2),sd.x^2),2,2)
y<-matrix(,n,6)
x<-matrix(,n,6)
par(mfrow=c(3,2))
for(i in 1:6){
y.and.x <- (mvrnorm(n=n,mu=c(mean.y,mean.x),Sigma=kovarians))
y[,i] <- y.and.x[,1]
x[,i] <- y.and.x[,2]

}

for(i in 1:6){
plot(x[,i],y[,i],pch=19,xlab="treecircum",ylab="lichsize",xlim=range(x), ylim=range(y))
abline(lm(y[,i]~x[,i]),lwd=2,col="red")

}

Power analyses with simulations in R for regression

Specify sample size (n), mean and variation (sd) as above.

library(MASS)

n <-
mean.x <-
sd.x <-
mean.y <-
sd.y <-
R2 <-

Then paste the entire code block below. This will give you 1000 randomised data sets. Every time it will make a regression-test and save the p-value in a box (vector) named power.pr.

kovarians<- matrix(c(sd.y^2,rep(sqrt(R2)*sd.x*sd.y,2),sd.x^2),2,2)

power.pr<-NULL
for (i in 1:1000){
y.and.x <- (mvrnorm(n=n,mu=c(mean.y,mean.x),Sigma=kovarians))

y<-y.and.x[,1]
x<-y.and.x[,2]
power.pr[i]<-summary(lm(y~x))$coef[8]
}

When R is done with the creating and testing of all 1000 random data sets, you can see how many times the test was significant, i.e, below 0.05.

sum(power.pr<0.05)
Divide by 1000 and you will see (in %) how likely you are to get a significant effect, given the means, variation (sd), and sample size you specified.

sum(power.pr<0.05)/1000

Power is the probability to get a significant result, given that there indeed is a certain difference. Change the sample size (n) and try again. How large sample size do you need?

# Shitty histogram code (only for code nerds, not for a statistical driving licence!)

powr<-length(power.pr[power.pr<0.05])/1000

hist(power.pr,breaks=20,col=c("red",rep("white",19)),xaxt="n", main="1000 simulations with my precious effect", xlab="p-values", ylim=c(0,1000))

axis(1,c(0,.5,1),labels=c(0,0.5,1))

axis(1,at=0.05,lwd=2,col="red",col.axis="red",font=2,tcl=-2,padj=2)

text(0.5,300,bquote(paste("Power ",phantom() %~~%phantom(.),.(powr))), adj=0)

Power analyses with simulations in R for anova

Here you gotta be careful. Change sample size in each group (the red stuff), variation (the blue stuff) and means in each group (de turquoise stuff). Then paste the entire code block. This will give you 1000 randomised data sets. Every time it will make an anova-test and save the p-value in a box (vector) named power.p.

power.p<-NULL
for (i in 1:1000){
x<-as.factor(rep(c("oak","birch"), each=10))
y<-c(rnorm(10,mean=20,sd=3),rnorm(10,mean=17,sd=3))
power.p[i]<-summary(lm(y~x))$coef[8]
}

When R is done with the creating and testing of all 1000 random data sets, you can see how many times the test was significant, i.e, below 0.05.

sum(power.p<0.05)
Divide by 1000 and you will see (in %) how likely you are to get a significant effect, given the means, variation (sd), and sample size you specified.

sum(power.p<0.05)/1000

Power is the probability to get a significant result, given that there indeed is a certain difference. Change the sample size (n) and try again. How large sample size do you need?

# Shitty histogram code (only for code nerds, not for a statistical driving licence!)

pow<-length(power.p[power.p<0.05])/1000