Mission 11. Problematic distributions. 17 nov

Learning goals

·  Be able to handle data sets with problematic distributions in different ways.

Data collection

IF your project data has a continuous response, you can use that.

·  Most non-parametric tests are designed for ONE explanatory variable.
→ Choose one explanatory variable.

IF your project data DOES NOT have a continuous response, use an old data set for this exercise, for example the xanthoria data set. BUT if you make a regression type study, use ONLY one of the species (delete the other in Excel). If you make an anova type study, EITHER use lichen species as the explanatory variable OR use ONLY one of the species (delete the other in Excel).

·  Most non-parametric tests are designed for ONE explanatory variable.
→ Choose one explanatory variable.

Mission

·  Test your data with a non-parametric test, a bootstrap test, and a test on a binarized response.

Hand in

·  One Mini Report on either a regression type study or an anova type study.

Mail the Mini-report to me, and perhaps to your feedback pal.

Feedback pal

·  Only if you want one today. If so, pick one yourself.

Regression type study

Import and attach your data set in R.

Make a regression type of graph!

Test the model assumptions. The better the assumption graphs look, the less the results of the different test methods will differ. This is why it may be important to use other methods when the assumptions are badly violated!

Non-parametric test

There are two common non-parametric tests for regression type studies: Kendall and Spearman rank correlation.

Kendall

Kendall, in principle, checks the slopes of all possible lines between all possible pairs of data points, and tests the line with the median slope. However, in practice, the test statistic tau is not the same as that slope.

cor.test(x,y, method="kendall")

To illustrate this graphically, you can do like this. Change x (3 places), y (2 places) and the data name (yourdata):

plot(y~x,pch=19)
# sort the data by x (decreasing) so all pairwise differences get the same sign
data.medianreg<-yourdata[order(x,decreasing=T),]
xxm<-data.medianreg$x
yym<-data.medianreg$y
# all pairwise differences in x
xm<-dist(xxm)
# all pairwise differences in y, in the same pair order as xm
ym<-as.dist(matrix(kronecker(yym,yym,"-"), ncol=length(yym)),upper=F,diag=F)
# the median of all pairwise slopes, and a matching intercept
b1<-median(ym/xm)
a1<-median(yym-xxm*b1)
# draw the median regression line
abline(a1,b1,col="red",lwd=2)

If you want to compare this median line with an ordinary parametric regression line, you may add one as well.

abline(lm(y~x),lwd=2,col="blue")

Spearman

Spearman instead calculates the ranks of all observations and correlates the ranked data.

cor.test(x,y, method="spearman")

To illustrate this graphically, you can do like this:

plot(rank(y)~rank(x),pch=19)

abline(lm(rank(y)~rank(x)),lwd=2,col="red")

Bootstrap

1.  Open and attach your data set.

2.  First check the relationship in your actual data set (exchange y and x for your variable names). coef() extracts the coefficients, and the second coefficient [2] is the slope.
slope<-coef(lm(y~x))[2]
slope
and graphically
plot(y~x)
abline(lm(y~x),col="blue",lwd=2)

3.  A bootstrap picks a random data point with both its y and x values. Then the data point is put back. This is repeated until you have the same number of data points as in the original data set. The bootstrapped data set will therefore "miss" some data points and contain duplicates or triplicates of others.
boot.d <- yourdata[sample(row.names(yourdata),replace=T),]
Check the slope of the relationship in the bootstrapped data (Estimate of your x).
summary(lm(boot.d$y~boot.d$x))
To extract the slope you want the second coefficient (=Estimate):
boot.lut <- coef(lm(boot.d$y~boot.d$x))[2]
boot.lut
Is it correct? Extracting values from summaries and models is important, but sometimes quite hard. boot.lut is just my name on the box; lutning means slope in Swedish…

4.  Well, you have to do this at least 1000 times. This requires a for loop. First create an empty vector (= column) where you can store the results. Then, in each of the thousand loops (i in 1:1000), bootstrap the data and save the slope of the relationship between the bootstrapped x and y (boot.d$x and boot.d$y).
boot.lut <- NULL
for (i in 1:1000) {
boot.d <- yourdata[sample(row.names(yourdata),replace=T),]
boot.lut[i] <- coef(lm(boot.d$y~boot.d$x))[2]
}

5.  You can take a look at all the bootstrapped slopes by just typing boot.lut or fix(boot.lut), but it is easier to look at a histogram.
hist(boot.lut)

6.  The 95 % of the slope values in boot.lut that are closest to the slope of the actual data set represent the confidence interval of the slope. If these values overlap zero, there is a risk that there is no REAL relationship. If they DO NOT overlap zero, it is very likely that there is a real relationship. The [c(25,975)] extracts the 25th and 975th of the 1000 sorted values, i.e. it cuts off the 2.5 % most extreme values in each tail.
sort(boot.lut)[c(25,975)]

7.  To get a nice graphical illustration of this, make a histogram and then add vertical lines to illustrate the bootstrapped 95 % confidence interval (blue dashed), the slope from your actual real data (blue), and the zero value (red).
hist(boot.lut, xlim=c(0.8*min(c(0,boot.lut)), 1.2*max(c(0,boot.lut))))
abline(v=sort(boot.lut)[c(25,975)],col="blue",lty=2,lwd=2)
abline(v=slope, col="blue", lwd=4)
abline(v=0, col="red", lwd=2)

What are your conclusions? Can you trust the result from your study?

Logistic regression on a binary transformed response

Binarize your data

Either use a biological meaningful limit for dividing your data in two categories. For example:

bin.y<-ifelse(y==0,0,1)

bin.y<-factor(bin.y)

Or divide your data above or below the median:

bin.y<-ifelse(y < median(y),0,1)

bin.y<-factor(bin.y)

Then make a logistic regression graph.

And test with a generalized linear model with family=binomial.
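A minimal sketch of both steps, with made-up example data (replace y and x with your own response and explanatory variable):

```r
# Hypothetical data: a continuous response y along a gradient x
set.seed(1)
x <- 1:30
y <- 2 * x + rnorm(30, sd = 20)
bin.y <- ifelse(y < median(y), 0, 1)   # binarize at the median

# Logistic regression graph: binary response against x with a fitted curve
plot(bin.y ~ x, pch = 19)
model <- glm(bin.y ~ x, family = binomial)
xx <- seq(min(x), max(x), length.out = 100)
lines(xx, predict(model, data.frame(x = xx), type = "response"),
      col = "red", lwd = 2)

# The test: the p-value of x in the summary tests the binarized response
summary(model)
```

Note that bin.y is left numeric (0/1) here so that glm and the plot both accept it directly.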

Permutation

You have already tried this. If you want to practice with this data set, get the manual from the second day.
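In case you do not have the day-2 manual at hand, a minimal permutation-test sketch for a regression type study could look like this (made-up data; replace y and x with your own). The idea is to shuffle y, which destroys any real relationship, and see how often a random slope is at least as extreme as the observed one:

```r
# Hypothetical data with a weak positive relationship
set.seed(1)
x <- 1:20
y <- 0.5 * x + rnorm(20, sd = 3)
obs.slope <- coef(lm(y ~ x))[2]   # slope in the real data

perm.slope <- NULL
for (i in 1:1000) {
  perm.slope[i] <- coef(lm(sample(y) ~ x))[2]   # shuffle y, refit
}

# p-value: proportion of permuted slopes at least as extreme as the observed
p <- mean(abs(perm.slope) >= abs(obs.slope))
p
```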

Anova type study

Import and attach your data set in R.

Make an anova type of graph!

Test the model assumptions. The better the assumption graphs look, the less the results of the different test methods will differ. This is why it may be important to use other methods when the assumptions are badly violated!

Non-parametric test

For anova type studies you test whether the medians of the groups differ significantly. This is done by comparing ranked data. You can use either the Mann-Whitney U-test (only two groups; the non-parametric equivalent of a t-test) or the Kruskal-Wallis test (two or more groups).

Make the test

kruskal.test(y~x)
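If you have exactly two groups, the Mann-Whitney U-test mentioned above is run with wilcox.test instead. A minimal sketch with made-up data (replace y and x with your own):

```r
# Hypothetical two-group data
y <- c(3, 5, 4, 8, 9, 7, 10, 12)
x <- factor(rep(c("A", "B"), each = 4))

wilcox.test(y ~ x)   # Mann-Whitney U-test (two groups only)
```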

Make a graph

stripchart(y~x, method="jitter",jitter=.1,vertical=T,pch=19)
med<-tapply(y,x,median)
a<-length(levels(x))
q25<-tapply(y,x,quantile,probs=0.25)
q75<-tapply(y,x,quantile,probs=0.75)
for (i in 1:a) {rect(i-0.2,q25[i],i+0.2,q75[i], border="red")}
for (i in 1:a) {lines(c(i-0.2,i+0.2), c(med[i],med[i]), col="red", lwd=3)}

This graph shows the median (thick red line), and each box contains half of all data points (the middle ones).

Bootstrap

This code only works for 2 groups.

1.  Open and attach your data set.

2.  First check the differences between groups in your actual data set (exchange y and x, and "A" and "B"). The "A" and "B" are just the group names; yours might be "R.crispus", "Shade", or whatever.

difference<-mean(y[x=="A"]) - mean(y[x=="B"])

difference

3.  A bootstrap picks a random data point with both its y and x values. Then the data point is put back. This is repeated until you have the same number of data points as in the original data set. The bootstrapped data set will therefore "miss" some data points and contain duplicates or triplicates of others.
boot.d <- yourdata[sample(row.names(yourdata),replace=T),]
Check the difference between your two groups.
boot.diff <- mean(boot.d$y[boot.d$x=="A"]) - mean(boot.d$y[boot.d$x=="B"])
boot.diff
This should be the same as the Estimate of your explanatory variable in a linear model (with the opposite sign, since lm reports the mean of B minus the mean of A):
summary(lm(boot.d$y~boot.d$x))
Is it correct? Extracting values from summaries and models is important, but sometimes quite hard. In this case we just calculate the difference ourselves.

4.  Well, you have to do this at least 1000 times. This requires a for loop. First create an empty vector (= column) where you can store the results. Then, in each of the thousand loops (i in 1:1000), bootstrap the data and save the difference between the two groups.
boot.diff <- NULL
for (i in 1:1000) {
boot.d <- yourdata[sample(row.names(yourdata),replace=T),]
boot.diff[i] <- mean(boot.d$y[boot.d$x=="A"]) - mean(boot.d$y[boot.d$x=="B"])
}

5.  You can take a look at all the bootstrapped differences by just typing boot.diff or fix(boot.diff), but it is easier to look at a histogram.
hist(boot.diff)

6.  The 95 % of the difference values in boot.diff that are closest to the difference in the actual data set represent the confidence interval of the difference. If these values overlap zero, there is a risk that there is no REAL difference. If they DO NOT overlap zero, it is very likely that there is a real difference. The [c(25,975)] extracts the 25th and 975th of the 1000 sorted values, i.e. it cuts off the 2.5 % most extreme values in each tail.
sort(boot.diff)[c(25,975)]

7.  To get a nice graphical illustration of this, make a histogram and then add vertical lines to illustrate the bootstrapped 95 % confidence interval (blue dashed), the difference from your actual real data (blue), and the zero value (red).
hist(boot.diff, xlim=c(0.8*min(c(0,boot.diff)), 1.2*max(c(0,boot.diff))))
abline(v=sort(boot.diff)[c(25,975)],col="blue",lty=2,lwd=2)
abline(v=difference, col="blue", lwd=4)
abline(v=0, col="red", lwd=2)


What are your conclusions? Can you trust the result from your study?

Fisher's exact test on a binary transformed response

This is only easily interpretable for 2 groups, but in principle you can test 2x3 or 5x4 tables as well.

Binarize your data

Either use a biological meaningful limit for dividing your data in two categories. For example:

bin.y<-ifelse(y==0,0,1)

Or divide your data above or below the median:

bin.y<-ifelse(y < median(y),0,1)

Create a matrix (2x2 table) with your binarized data. The tapply function counts the number of ones and zeroes in each group of x.

mat<-tapply(bin.y,list(bin.y,x),length)

mat

Make a barplot

barplot(mat,beside=T)

And test with Fisher's exact test.
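A minimal sketch of the test, using a hypothetical 2x2 table of counts (replace mat with your own table from tapply above):

```r
# Hypothetical 2x2 table: counts of zeroes and ones in groups A and B
mat <- matrix(c(9, 1, 3, 7), ncol = 2,
              dimnames = list(bin.y = c("0", "1"), x = c("A", "B")))

fisher.test(mat)   # tests whether the 0/1 proportions differ between groups
```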

Permutation

You have already tried this. If you want to practice with this data set, get the manual from the second day.
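In case you do not have the day-2 manual at hand, a minimal permutation-test sketch for an anova type study could look like this (made-up data; replace y and x with your own). Shuffling y relative to the group labels destroys any real group difference:

```r
# Hypothetical data: two groups with different means
set.seed(1)
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 8))
x <- factor(rep(c("A", "B"), each = 10))
obs.diff <- mean(y[x == "A"]) - mean(y[x == "B"])   # observed difference

perm.diff <- NULL
for (i in 1:1000) {
  yy <- sample(y)                                   # shuffle y over the groups
  perm.diff[i] <- mean(yy[x == "A"]) - mean(yy[x == "B"])
}

# p-value: proportion of permuted differences at least as extreme as observed
p <- mean(abs(perm.diff) >= abs(obs.diff))
p
```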