EMBO 2005. Short Tutorial in MADE4 – Ordination of Microarray Data
The files you will need for this tutorial are in
This document is on that web site too. It’s a good idea to open a copy of this so you can copy (and paste) commands between this document and R.
For the first part of this tutorial we will use a subset of the primatefibroblast gene expression from Karaman et al., Genome Research 2004. This study examines 3 groups, human, bonobo and gorilla expression profiles on Affymetrix HG_U95Av2 chips. This dataset contains 46 chips and is available in the Bioconducor library fibroEset (MAS5.0 data), and the web site (raw cel files).
But for this tutorial we will look at 11 chips which have been normalised using vsn. For information I have included details of how I normalised these, at the end of the tutorial.
Task 1. Initial data Exploration
Download the vsn normalised 11 gene expression profiles from the web site. Read this into R. The data are stored as comma separated files, which is readable by Excel. To load in R:
data.vsn<- read.csv(“data.vsn.csv”)
library(affy)
library(made4)
overview(data.vsn)
The function ord simplifies the running of ordination methods such as principal component, correspondence or non-symmetric correspondence analysis. It provides a wrapper which can call each of these methods in ade4.
data.coa <-ord(data.vsn, type=”coa”)
Have a look at data.coa. The ordination results are in $ord. The row, column coordinates are in $li and $co. The eigenvalues are in $eig.
plot(data.coa)
heatplot(data.coa$ord$li, dend=FALSE, lowcol="blue")
We have run a Correspondence Analysis, Compare these results to a PCA.
data.pca <-ord(data.vsn, type=”pca”)
Compare the difference plots from PCA and COA.
plotarrays(data.pca)
plotarrays(data.pca, graph=”s.var”)
plotgenes(data.pca)
At this stage.. it would be useful to colour the plots to help our interpretation
Task 2: Interpretation – labelling with covariates
Although the overview shows that we have 2 major groups within the data, it is very difficult to know what we are looking at. Lets read a text file (tab delimited) with some sample information into R.
annt<-read.table("annt.txt", header=TRUE)
annt
read.table reads in a table as a data.frame. This is just a data matrix with labels. To view the data in the Age column, use the $ symbol before the column label.
annt$Age
This file contains short names for the samples, information about the Donor (Gorilla, Bonobo, Human), Age (years), gender (male/female), doubling time of the cell lines, and information about whether cells where established from the same cell lines. The column heading are;
Cels / short.names / Donor / Age / Gender / DT / estb.sameAG_04659_AS.cel / AG_04659 / Hsa / 65 / M / 1.7 / F
AG_05283_AS.CEL / AG_05283 / Hsa / 69 / M / 1.7 / F
AG_05414_AS.cel / AG_05414 / Hsa / 73 / M / 2.3 / D
AG_11745_AS.cel / AG_11745 / Hsa / 43 / F / 1.8 / D
AG_13927_AS.cel / AG_13927 / Hsa / 45 / F / 2.8 / -
KB_5047_2070_2_AS.CEL / KB_5047 / Ggo / 19 / F / 2 / -
KB_5275_2_AS.CEL / KB_5275 / Ppa / 2 / M / 2.4 / -
KB_5828_AS.cel / KB_5828 / Ppa / 12 / M / 2.7 / -
KB_6268_2_AS.cel / KB_6268 / Ggo / 19 / F / 2 / -
KB_8025_AS.cel / KB_8025 / Ppa / 19 / M / 2 / -
KB_8840_AS.cel / KB_8840 / Ggo / 2 / F / 2.5 / -
Lets review the overview graph, but this time label it with information about the Donor.
overview(data.vsn, label=annt$Donor)
This looks better, human are distinct from other primates. But BEFORE we go ahead and search for genes distinguishing these… .CHECK the other co variants (sample info).
Do the samples also group by Age, Gender, DT, estb.same ?
What do you think of the experimental design?
How could it be improved?
Also plot the arrays projections from the COA.
plotarrays(data.coa, classvec=annt$Donor)
plotarrays(data.coa, classvec=annt$Gender)
Have a look at the different plots,
Task 3: Getting out the Gene Information
Do heatmap of top genes from axis 1.
If you look at data.ord, you will see the gene co-ordinates are in $co. To get the top genes from axis 1 in the ordination.
ax1<- topgenes(data.coa$ord$co, ends="both", axis=1)
ax1
You will see that R sometimes has the habit of putting X in front of rownames if they start with a number. Hence the names in ax1 don’t agree with data. Its easier to sort this an remove the “X” in the names,
ax1<-sub("X", "", ax1)
To make a heatmap and dendrogram of these:
heatplot(data.vsn[ax1,])
savePlot("heatplot_COA")
Task 4: Annotating the plots with gene information
GO Annotation: Get the annotation for data (this can be done using either the annaffy or biomaRt packages which will both be demonstrated during the course. For this practical we will use annaffy)
To get the Gene Symbol for all genes, and use these in plots
library(annaffy)
If annaffy is not installed, download annaffy, GO and KEGG from Bioconductor or use nstall.packages2() if reposTools is installed.
affy.id <-rownames(data)
affy.symbols<-aafSymbol(affy.id, “hgu95av2”)
affy.symbols <-getText(affy.symbols)
plotgenes(data.coa, varlabels= affy.symbols)
To get lots Gene Information for a subset of genes
anncols<-aaf.handler()
anncols
anntable <- aafTableAnn(ax1, "hgu95av2", anncols)
saveHTML(anntable, "example1.html", title = "Example")
A copy of this output is available on the web page
Have a look. Are there any genes of interest to you?
Advanced Tasks
If you have time, there are extra tasks. If not the code may be useful to you in your own data analysis.
Advanced Task 1: 3D Plots
Using scatterplot3D library
do3d(data.coa$ord$li, classvec=annt$Donor, cex.symbols=3)
rotate3d(data.coa$ord$li, classvec=annt$Donor)
html3D(data.coa$ord$li, classvec=annt$Donor)
html3D produces a plot which can be rotated using chime or jmol. For an example see the course website.
Advanced Task 2: Use SAM to select genes significant with a 1% FDA and examine these using ordination and clustering
library(siggenes)
# SAM of human versus non-humanoid primate
sam.c1<-as.character(annt$Donor)
sam.c1
sam.c1[!sam.c1==”Hsa”]<-“other primate”
sam.c1
# Perform a SAM analysis for the two class unpaired case assuming unequal variances. Perform 100 permutations (B).
sam.out<-sam(data.vsn, sam.c1, B=100)
sam.out
plot(sam.out)
savePlot("Sam.plot1")
# Select a delta to give a good a 1 %FDR and reasonable number of genes. For example to obtain the SAM plot for Delta = 2
plot(sam.out,2)
# Get information about the significant genes at delta =4, use:
plot(sam.out, 4)
sam.res <-summary(sam.out, 4)
names(sam.res)
length(sam.res$row.sig.genes)
sam.genes<-data.vsn[sam.res$row.sig.genes,]
heatplot(t(sam.genes))
plotarrays(ord(sam.genes), classvec=sam.c1)
Advanced Task 3: Supervised Ordination – Between Group Analysis
For this example, we will look at the Khan et al., (2001) datasets which measured the genes expression of four types of small round blue cell tumours dataset, these are EWS, RMS, BL and NB. This example was described by Culhane et al., 2002 in their paper in which they apply BGA to the class prediction of these tumours.
data(khan)
names(khan)
?khan
khan.bga<-bga(khan$train, classvec=khan$train.classes)
khan.bga
plot(khan.bga, genelabels=khan$annotation$Symbol)
Provide a view of the principal components (axes) of the bga heatplot(khan.bga$bet$ls, dend=FALSE,lowcol="blue")
In supervised an analysis such as this it is essential to test the discrimination using cross-validation. To perform jack-knife one leave out cross –validation use
bga.jackknife(khan$train, khan$train.classes)
To project new test data use
khan.supp<- suppl(khan.bga, supdata=khan$test, supvec=khan$test.classes)
It will output something like
projected.Axis1 projected.Axis2 closest.centre predicted true.class
TEST.9 0.1400 -0.038 RMS RMS Normal
TEST.11 -0.1400 -0.170 EWS EWS Normal
TEST.5 -0.0600 -0.120 EWS EWS Normal
TEST.8 0.0420 -0.094 NB NB NB
TEST.10 0.4600 0.160 RMS RMS RMS
Which gives the class predicted by BGA (predicted), and the closest group centroid in Elucidean space (closest centre). If you supplied a supvec, which gives the true classes of the test data, this will be attached to the output (last column true.class).
To perform training and test in one use:
khan.bga<-bga.suppl(khan$train, supdata=khan$test, classvec=khan$train.classes)
Advanced Task 3: Comparing datasets (meta-analysis) using Coinertia Analysis
Coinertia analysis has been applied to the cross-platform comparison(meta-analysis) of microarray gene expression datasets (Culhane et al., 2003). CIA is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the samesamples. That is either the rows or the columns of a matrix must be "matchable". CIAcan be applied to datasets where the number of variables (genes) far exceeds the numberof samples (arrays) such is the case with microarray analyses. cia calls coinertia in the ade4 package. See the CIA vignette for more details on this method.
To run this. Lets look at two gene expression datasets of the same 60 cell lines. The NCI60 cells lines are a set of 60 cell lines with different tumour phenotypes (eg Breast, Colon, Leukemia, Prostate, CNS, lung cancer, ovarian, renal cancer etc). The gene expression of these cell lines have been examined by a number of groups. We will compare one from Affymetrix (Staunton et al., 2001) and one that was obtained using Stanford spotted cDNA arrays (Ross et al., 2000) using CIA.
data(NCI60)
summary(NCI60)
names(NCI60)
NCI60$classes
coin <- cia(NCI60$Ross, NCI60$Affy)
names(coin)
coin$coinertia$RV
The RV coefficient $RV which is 0.78 in this instance, is a measure of “global” similarity between the datasets. The closer to 1, in the scale 0-1the greater the correlation between the two datasets.
To visually examine the cell lines that have similar or different gene expression profiles in these datasets.
plotarrays(coin, classvec=NCI60$classes[,2])
plotarrays(coin, classvec=NCI60$classes[,2], lab=NULL, cpoint=3)
plot(coin, classvec=NCI60$classes[,2])
For each of the 60 cell lines, there are two coordinates ($coinertia$mX and $coinertia$mY). On the plot, these are visually shown as a closed circle and an arrow. These are joined by a line. If the profiles are similar they will be projected close together in the new space (ie joined by a short line). For more information see Culhane et al., BMC bioinformatics 2003.
For information only, please don’t repeat in this today, this is ONLY background information. Vsn normalisation of data.
To produce this. I downloaded the cel files, Started R, Used File -> Change directory to select the directory with the cel files in it. Then
getwd()
dir()
library(affy)
library(vsn)
cels <- list.celfiles()
data <- ReadAffy(filenames= cels)
normalize.AffyBatch.methods <- c(normalize.AffyBatch.methods, "vsn")
data.vsn <- es1 = expresso(data, bg.correct = FALSE, normalize.method = "vsn", pmcorrect.method = "pmonly", summary.method = "medianpolish")
exprs2excel(data.vsn, file="data.vsn.csv")
A tab delimited text file could also be saved using
write.exprs(data.vsn, file="data.rma.txt")