Supplementary tutorial: functions for G-matrix comparisons.
Comparing G: Multivariate Analysis of Genetic Variation in Multiple Populations
J. David Aguirre, Emma Hine, Katrina McGuigan and Mark W. Blows
Notes on the structure of the tutorial
Below we presentthe R script we used to generate the results presented in Aguirre et al..The output files for this worked example (Supplementary data: SI_output_and_ped)have a smaller number of MCMC samples than those in the original manuscript to limit the computational time required to complete this tutorial. An understanding of the basic R commands is assumed, and experienced users are encouraged toimprove the script at their discretion.For less experienced users, however, the matrix operations will work so long as the lines in Courier Newfont are copied in to the R window. It is important that the entire function is copied into the R window, thus, we have flagged the #START and #END of each function. Furthermore, some example results are presented in text. These sections are not intended to be copied into the R window and are flagged by #< and #>. At the end of the tutorial we provide the script to produce the figures. However, we stress that script for the figures should be considered examples, as there in many different and likely better, ways to draw these figures. If users have any problems implementing the methods described in this tutorial please do not hesitate to contact the corresponding author ().
Importing the model output
The files are in the format typical of MCMCglmm (Hadfield, 2010).An example syntax is provided along with the original data in Dryad;however, these models can take some time to run, so, in the interest of expedience, we move forward using the output files provided. The output files are structured such that each matrix element of each variance component is a column and each row is an MCMC sample of the posterior distribution of that matrix element. Thus, because there were eight traits, and we specified an unstructured covariance structure for each trait at the animal, vial and residual levels, our output file for each population has 192 columns (i.e. 8 x 8 + 8 x 8 + 8 x 8) and 1000 rows.
First, we will define a number of terms that will be used repeatedly.
MCMCsamp <- 1000
#number of MCMC samples
n <- 8
#number of traits
m <- 6
#number of matrices to compare
r <- 3
#number of random effects specified in the model. In our analyses these were animal, vial and residual effects.
traitnames <- c("lc2","lc3","lc4","lc5","lc6","lc7","lc8","lc9")
#trait names
Gnames <- c("b1","b2","c1","c2","m1","m2")
#matrixlabels
Then we will set up an empty array, and fill this array by referencing the correct file path for the output files. Note that the file path will need to be customized to the directory where the files have been stored on your computer.
MCMCarray <- array(,c(MCMCsamp,(n^2)*r,m))
#empty array
MCMCarray[,,1] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_b1_3MCMC.csv",header =T))
#G1 stored as the 1st element of dim[3]
MCMCarray[,,2] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_b2_3MCMC.csv",header =T))
#G2 stored as the 2nd element of dim[3]
MCMCarray[,,3] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_c1_3MCMC.csv",header =T))
#G3 stored as the 3rd element of dim[3]
MCMCarray[,,4] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_c2_3MCMC.csv",header =T))
#G4 stored as the 4th element of dim[3]
MCMCarray[,,5] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_m1_3MCMC.csv",header =T))
#G5 stored as the 5th element of dim[3]
MCMCarray[,,6] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_m2_3MCMC.csv",header =T))
#G6 stored as the 6th element of dim[3]
Reshaping the array and standardizingG
We now need to reshape the MCMCarray to get the G and P arrays in the format we need to run the functions below.
Garray <- array(,c(n,n,m,MCMCsamp))
dimnames(Garray) <- list(traitnames,traitnames,Gnames)
Parray <- array(,c(n,n,m,MCMCsamp))
dimnames(Parray) <- list(traitnames,traitnames,Gnames)
for (i in 1:m){
for (j in 1:MCMCsamp){
G <- matrix(MCMCarray[j,1:(n^2),i],ncol= n)
CE <- matrix(MCMCarray[j,((n^2)+1):((n^2)*2),i],ncol= n)
R <- matrix(MCMCarray[j,(((n^2)*2)+1):((n^2)*3),i],ncol= n)
Garray[,,i,j] <- G
Parray[,,i,j] <- G + CE + R
}
}
The Garray and Parray objects have the MCMC samples ofG and P stored in the first two dimensions of the array. The third dimension identifies the population and the fourth dimension identifies the MCMC sample. For example, if we wanted to view the genetic variance in the lc2 trait (i.e. matrix element 1, 1) of the c1 population for the 5th MCMC sample we would use the following syntax,
Garray[1,1,3,5]
Similarly, if we wanted to view the G matrix of the b2population for the 13th MCMC sample we would use the following syntax.
Garray[,,2,13]
Last, if we wanted to view the G matrices for all populations for the 42nd MCMC sample we would use the following syntax.
Garray[,,,42]
Next, we provide a method to apply the multivariate standardization () presented inHansen and Houle(2008) to the Garray. We will proceed in our example, without applying the standardization as our traits are measured and analyzed on the same scale. Nevertheless, to apply the standardization we first define a function to calculate then we use this function to apply the standardization to each MCMC sample of each replicate line.
inv.rootP <- function (P){
rootP <- matrix(0,n, n)
for (i in 1:n){
val <- eigen(P)$values
vec <- eigen(P)$vectors
rootP <- rootP + (vec[,i] %*% t(vec[,i]))*sqrt(val[i])
}
solve(rootP)
}
HHGarray <- array(,c(n,n,m,MCMCsamp))
for (k in 1:MCMCsamp){
for (j in 1:m){
P <- inv.rootP(Parray[,,j,k])
HHGarray[,,j,k] <- P %*% Garray[,,j,k] %*% P
}
}
The HHGarray object contains the standardizedGarray.
Generating randomisedG matrices for hypothesis tests
The approach we use to generate the randomisedG matrices uses the rbv function in the MCMCglmm package. rbv uses the pedigree structure of a population to generate breeding values for individuals by sampling from a multivariate normal distribution with a mean = 0 and a variance = G. Therefore, the first step in generatingrandomisedG matrices is to import the pedigree files for each population. Again, the file path will need to be customized for your computer.
Ped.b1 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.b1.csv",header =T))
Ped.b2 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.b2.csv",header =T))
Ped.c1 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.c1.csv",header =T))
Ped.c2 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.c2.csv",header =T))
Ped.m1 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.m1.csv",header =T))
Ped.m2 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.m2.csv",header =T))
To generate the randomisedG matrices we will write a loop to calculate breeding values for eachindividual given the populations pedigree and the correspondingMCMC sample of G.These vectors of breeding values are archived and then randomly allocated to populations. Last we construct 6 randomisedG for each MCMC sample.
library(MCMCglmm)
rand.Garray <- array(,c(n,n,m,MCMCsamp))
dimnames(rand.Garray) <- list(traitnames,traitnames,Gnames)
for (i in 1:MCMCsamp){
b1.bv<-rbv(Ped.b1,Garray[,,1,i])
b2.bv<-rbv(Ped.b2,Garray[,,2,i])
c1.bv<-rbv(Ped.c1,Garray[,,3,i])
c2.bv<-rbv(Ped.c2,Garray[,,4,i])
m1.bv<-rbv(Ped.m1,Garray[,,5,i])
m2.bv<-rbv(Ped.m2,Garray[,,6,i])
a.pop <- cumsum(c(dim(Ped.b1)[1],dim(Ped.b2)[1],dim(Ped.c1)[1],dim(Ped.c2)[1],dim(Ped.m1)[1],dim(Ped.m2)[1]))
pop.bv <- rbind(b1.bv,b2.bv,c1.bv,c2.bv,m1.bv,m2.bv)
rand.pop.bv <- pop.bv[sample(dim(pop.bv)[1],replace=F),]
rand.Garray[,,1,i] <- cov(rand.pop.bv[1:a.pop[1],])
rand.Garray[,,2,i] <- cov(rand.pop.bv[(a.pop[1] + 1):a.pop[2],])
rand.Garray[,,3,i] <- cov(rand.pop.bv[(a.pop[2] + 1):a.pop[3],])
rand.Garray[,,4,i] <- cov(rand.pop.bv[(a.pop[3] + 1):a.pop[4],])
rand.Garray[,,5,i] <- cov(rand.pop.bv[(a.pop[4] + 1):a.pop[5],])
rand.Garray[,,6,i] <- cov(rand.pop.bv[(a.pop[5] + 1):a.pop[6],])
}
This approach simulates a set of Gthat have been sampled from the same population, and thus the only dissimilarity among them is random sampling error. Furthermore, because rand.Garrayis of the same order as the Garray(i.e. n x n x m x MCMCsamp), the matrix comparison functions work for both the observed and randomised arrays.
Method 1. Random projections through G
Matrix projection is a simple but effective tool for uncovering similarities or dissimilarities among matrices. Thefunction below uses the projection of random,normal vectors through MCMC samples of G matrices to identify regions of the genetic space where matrices differ significantly in variance.
#START
R.proj <- function(Gs,p,vec){
if (dim(Gs)[[1]] != dim(Gs)[[2]]){
stop("G array must be of order n x n x m x MCMCsamp")
}
if (is.na(dim(Gs)[4])) {
stop("There are no MCMCsamples")
}
n <- dim(Gs)[[1]]
m <- dim(Gs)[[3]]
MCMCsamp <- dim(Gs)[[4]]
rand.vec <-matrix(,vec,n)
for (i in 1:vec){
b <- runif(n,-1,1)
rand.vec[i,] <- b/(sqrt(sum(b^2)))
}
#generate unit length random vectors
proj<- function(G,b) t(b) %*% G %*% (b)
#internal function to do projection
G.proj <- array(,c(MCMCsamp,m,vec))
colnames(G.proj) <- dimnames(Gs)[[3]]
for (i in 1:vec){
G.proj[,,i]<- t(apply(Gs,3:4,proj,b = rand.vec[i,]))
}
#project each random vector through each MCMC sample of each G
prs <- cbind(rep(1:m, each = m), 1:m)
prs.comp <- prs[prs[,1] < prs[,2], , drop = FALSE]
#setting up an index for HPD comparisons
proj.score <-matrix(,vec,((m^2 - m)/2))
for (k in 1:vec){
HPD.int <- HPDinterval(as.mcmc(G.proj[,,k]), prob = p)
proj.score[k,] <- ifelse(HPD.int[prs.comp[,1],1] > HPD.int[prs.comp[,2],2] | HPD.int[prs.comp[,2],1] > HPD.int[prs.comp[,1],2],1,0)
}
#for a given random vector, examine if the HPDintervals of any pair of G matrices overlap
vec.score <-cbind(rand.vec,proj.score)
colnames(vec.score) <- c(1:n,paste(dimnames(Gs)[[3]][prs.comp[, 1]], ".vs.", dimnames(Gs)[[3]][prs.comp[, 2]], sep = ""))
#collate the random vectors and the outcome of their projection on the G matrices
sig.vec <- subset(vec.score,rowSums(vec.score[,(n+1):(n+((m^2 - m)/2))]) > 0)
#collate just the random vectors that resulted in significant differences in variance
if(dim(sig.vec)[1] <= 1) {warning("There were <= 1 significant vectors, try a larger vec or lower p"); eig.R <- "Na"}
else{
eig.R <- eigen(cov(sig.vec[,1:n]))
rownames(eig.R$vectors) <- dimnames(Gs)[[1]]
colnames(eig.R$vectors) <- c(paste("e", 1:n, sep = ""))
}
#eigen analysis of the R matrix
list(G.proj = G.proj, vec.score = vec.score, eig.R = eig.R)
}
#END
The arguments passed toR.proj are:
GsThe G array must be of the order n x n x m x MCMCsamp
pThe probability density interval for assessing significant differences among matrices
vecThe number of random vectors
The line below applies the R.proj function to the observed Garray, assuming a probability of 0.95 and 1000 random vectors,then stores the results in MCMC.R.proj.This calculation will take a few minutes to complete, but will vary depending on the computer specifications, as well as the number of MCMC samples and the number of random vectors.
MCMC.R.proj <- R.proj(Garray, p = 0.95, vec = 1000)
MCMC.R.proj is a list with 3 slots:
$G.projGenetic variance in the direction of each random vector for each MCMC samples of G. The rows of $G.proj identify the MCMC samples, the columns identify the replicate line and third dimension of $G.proj identifies the random vector.
$vec.scoreFor each row of $vec.score, the first n columns show the random vector,and the remaining columns show the results of pairwise HPD comparisons of genetic variance in the direction of therandom vector. Zeros indicate the HPD intervals overlapped (i.e. non-significant differences in variance), whereas ones indicate the HPD intervals did not overlap (i.e. a significant difference in variance)
$eig.RThe eigenanalysis of the R(co)variance matrix
Note, because vectors are generated randomly, the vector ID’s and summary statistics presented below will differ slightly to those on your computer. Thus, it is important that you customise the row indexes in the plot script for the correct rows on your computer. We can visualize what is happening internally in M.proj by plotting the posterior distribution of genetic variances for a vector where there were significant differences among replicate lines and a vector for which there were no significant differences among replicate lines (Figure 1). To find candidate vectors for the example in this appendix we can examine the first 8 rows of columns 9 to 23 (i.e the columns for the pair wise hypothesis tests)ofMCMC.M.proj$vec.score
MCMC.R.proj$vec.score[1:n,(n+1):(n+((m^2 - m)/2))]
#<
b1.vs.b2 b1.vs.c1 b1.vs.c2 b1.vs.m1 b1.vs.m2 b2.vs.c1 b2.vs.c2
[1,] 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0
[3,] 0 0 0 0 1 0 0
[4,] 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0
[7,] 0 0 0 0 0 0 0
[8,] 0 0 0 0 0 0 0
b2.vs.m1 b2.vs.m2 c1.vs.c2 c1.vs.m1 c1.vs.m2 c2.vs.m1 c2.vs.m2
[1,] 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0
[3,] 0 1 0 0 0 0 0
[4,] 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0
[7,] 0 0 0 0 0 0 0
[8,] 0 0 0 0 0 0 0
m1.vs.m2
[1,] 0
[2,] 0
[3,] 0
[4,] 0
[5,] 0
[6,] 0
[7,] 0
[8,] 0
#>
Recalling that vectors finding significant differences are indexed with 1’s and those not finding significant difference are indexed with 0’s. The table above shows that for direction defined by vector 3 the genetic variance in m2 differed from the genetic variance in the b1 and b2 populations (Figure 1A ) whereas for the remaining vectors, there were no significant differences in genetic variance among populations (Figure 1B).
Figure 1. Posterior means and 95% HPD intervals of the genetic variance in the direction of random vectors#3 (panel B) and #5(panel A)for each population. In this example the vector in panel A would contribute to theRmatrix whereas the vector in panel B would not contribute to the Rmatrix.
To examine the proportion of random vectors that found significant differences in genetic variance among populationswe can calculate the number of vectors that found significant (TRUE) and non-significant (FALSE) pairwise differences in genetic variance.
table(rowSums(MCMC.R.proj$vec.score[,(n+1):(n+((m^2 - m)/2))]) > 0 )
#<
FALSE TRUE
844 156
#>
The 156 vectors that identified differences (at 95% HPD) in genetic variation among matrices summarize the directions in multivariate space where matrices differ significantly in variance. The next step is then to examine the eigenstructure of the R matrix. To visualize the eigenstructure of theRmatrix we can use,
lapply(MCMC.R.proj$eig.R,round,digits = 3)
#<
$values
[1] 0.300 0.156 0.137 0.125 0.118 0.080 0.058 0.028
$vectors
e1 e2 e3 e4 e5 e6 e7 e8
lc2 0.356 -0.616 0.193 -0.004 0.138 0.602 0.275 0.003
lc3 -0.885 -0.012 -0.006 -0.015 0.004 0.429 0.179 -0.024
lc4 -0.267 -0.645 -0.014 0.194 0.377 -0.533 -0.105 -0.197
lc5 -0.008 -0.238 -0.466 -0.698 -0.064 0.131 -0.456 -0.098
lc6 0.103 0.126 -0.636 0.572 0.293 0.310 -0.221 -0.105
lc7 -0.036 -0.270 0.054 0.360 -0.808 0.047 -0.287 -0.236
lc8 0.087 0.158 -0.089 -0.136 -0.007 -0.037 0.396 -0.885
lc9 -0.011 0.188 0.575 0.017 0.310 0.229 -0.620 -0.318
#>
Then to examine the of eigenvectors R that result in significant differences in genetic variance among populations we can use.
proj<- function(G,b) t(b) %*% G %*% (b)
#Function to do projection
R.vec.proj <- array(,c(MCMCsamp,m,n))
for (i in 1:n){
R.vec.proj[,,i] <- t(apply(Garray, 3:4, proj, b = MCMC.R.proj$eig.R$vectors[,i]))
}
#Genetic variance in each population in the direction of the eigenvectors of R
HPD.R.vec.proj <- array(,c(m,2,n))
for (i in 1:n){
HPD.R.vec.proj[,,i] <- HPDinterval(as.mcmc(R.vec.proj[,,i]), prob = 0.95)
}
#HPD intervals for the genetic variance in each population in the direction of the eigenvectors of R
Figure 2Genetic variance in the direction of each of the eigenvectors of Rfor each population.
In figure 2 we see that only the first eigenvector of Rresults in significant differences in genetic variance among populations.
Method 2. Krzanowski’s common subspaces
Krzanowski(1979)provides a method for identifying thecommonsubspaces for a representative subset of vectors of two or more matrices.In our first example we will use first four (i.e. n/2) eigenvectors of each G, but we highlight that this is a decision the reader needs to consider carefully. Because, as we will show, the number of eigenvectors included for each matrix can influence our final conclusions.
#START
kr.subspace <- function(Gs,vec){
if (dim(Gs)[[1]] != dim(Gs)[[2]]){
stop("G array must be of order n x n x m x MCMCsamp")
}
if (is.na(dim(Gs)[4])) {
stop("There are no MCMCsamples")
}
n <- dim(Gs)[[1]]
m <- dim(Gs)[[3]]
MCMCsamp <- dim(Gs)[[4]]
if(length(vec) != m){stop("vec must have length = m")}
h <- function (g,v){
AA <- array(,c(n, n, m))
for (k in 1:m){
g.vec <- eigen(g[,,k])$vectors[,1:(v[k])]
AA[,,k] <- g.vec %*% t(g.vec)
}
H <- apply(AA,1:2,sum)
list(H = H, AA = AA)
}
#internal function to calculate AA and H
MCMC.H <- array(,c(n,n,MCMCsamp))
dimnames(MCMC.H) <- list(dimnames(Gs)[[1]],dimnames(Gs)[[1]],dimnames(Gs)[[4]])
MCMC.AA <- array(,c(n,n,m,MCMCsamp))
dimnames(MCMC.AA) <- list(dimnames(Gs)[[1]],dimnames(Gs)[[1]],dimnames(Gs)[[3]],dimnames(Gs)[[4]])
for (i in 1:MCMCsamp){
kr <- h(Gs[,,,i], v = vec)
MCMC.H[,,i] <- kr$H
MCMC.AA[,,,i] <- kr$AA
}
#calculate AA and H for the ith MCMC sample of the G array
avH <- apply(MCMC.H,1:2,mean)
rownames(avH) <- dimnames(Gs)[[1]]
colnames(avH) <- dimnames(Gs)[[1]]
#calculate the posterior mean H
avAA <- apply(MCMC.AA,1:3,mean)
dimnames(avAA) <- list(dimnames(Gs)[[1]],dimnames(Gs)[[1]],dimnames(Gs)[[3]])
#calculate the posterior mean AA
avH.vec <- eigen(avH)$vectors
#eigenanalysis of posterior mean H
proj<- function(a,b) t(b) %*% a %*% b
#internal function to do projection
avH.theta <- matrix(,n,m)
for (i in 1:n){
for (i in 1:n){
avH.theta[i,] <- acos(sqrt(apply(avAA, 3, proj, b = avH.vec[,i]))) * (180/pi)
}
}
#angles between the eigenvectors posterior mean H and the posterior mean subspaces of each population
MCMC.H.val <- matrix(,MCMCsamp,n)
colnames(MCMC.H.val) <- paste("h",1:n,sep="")
for (i in 1:n){
MCMC.H.val[,i] <- apply(MCMC.H, 3, proj, b = avH.vec[,i])
}
#posterior distribution of the genetic variance for the eigenvectors of posterior mean H
MCMC.H.theta <- array(,c(n,m,MCMCsamp))
rownames(MCMC.H.theta) <- paste("h",1:n,sep="")
colnames(MCMC.H.theta) <- dimnames(Gs)[[3]]
for(i in 1:n){
for(j in 1:MCMCsamp){
MCMC.H.theta[i,,j] <- acos(sqrt(apply(MCMC.AA[,,,j], 3, proj, b = avH.vec[,i]))) * (180/pi)
}
}
#posterior distribution of the angles between the eigenvectors of posterior mean H and the MCMC samples of the subspaces of each population
list(avAA = avAA, avH = avH, MCMC.AA = MCMC.AA, MCMC.H = MCMC.H, MCMC.H.val = MCMC.H.val, MCMC.H.theta = MCMC.H.theta)
}
#END
The arguments passed to kr.subspace are:
GsThe G array.Must be of the order n x n x m x MCMCsamp
vecA vector indicating the number of eigenvectors of each G to include in the construction of H.Must have the same number of elements as there are matrices to compare
The line below applies the kr.subspace function to the first four eigenvectors of the observed Garray, then stores the results in MCMC.kr.
MCMCG.kr <- kr.subspace(Garray, vec = rep(n/2,m))
MCMC.kris a list with 3 slots:
$avAAPosterior mean AAT
$avHPosterior mean H
$MCMC.AAMCMC samples of AAT
$MCMC.HMCMC samples of H
$MCMC.H.valMCMC samples of the eigenvalues of H
$MCMC.H.thetaMCMC samples of the angles between the eigenvectors of posterior mean Hand the MCMC samples of AAT
Because the eigenvalues of H are bounded between zero and m, to provide a hypothesis test that identifies significant common subspaces we need to generate a sensible null model. The null model we use assumes the G are sampled from the same population, and hence the randomised G will have all subspaces in common up to sampling error. Thus, to compare our observed and null model we will apply the kr.subspace function to thefirst four eigenvectors of therandomised Garray, and plot the eigenvalues of H for the observed and randomised data.
MCMCG.kr.rand <- kr.subspace(rand.Garray, vec = rep(n/2,m))
Figure3.Posterior mean and 95% HPD intervals for the eigenvalues of H for the first four eigenvectors of the observed and randomisedG arrays.