Supplementary Tutorial: Functions for G-Matrix Comparisons

Supplementary tutorial: functions for G-matrix comparisons.

Comparing G: Multivariate Analysis of Genetic Variation in Multiple Populations

J. David Aguirre, Emma Hine, Katrina McGuigan and Mark W. Blows

Notes on the structure of the tutorial

Below we presentthe R script we used to generate the results presented in Aguirre et al..The output files for this worked example (Supplementary data: SI_output_and_ped)have a smaller number of MCMC samples than those in the original manuscript to limit the computational time required to complete this tutorial. An understanding of the basic R commands is assumed, and experienced users are encouraged toimprove the script at their discretion.For less experienced users, however, the matrix operations will work so long as the lines in Courier Newfont are copied in to the R window. It is important that the entire function is copied into the R window, thus, we have flagged the #START and #END of each function. Furthermore, some example results are presented in text. These sections are not intended to be copied into the R window and are flagged by #< and #>. At the end of the tutorial we provide the script to produce the figures. However, we stress that script for the figures should be considered examples, as there in many different and likely better, ways to draw these figures. If users have any problems implementing the methods described in this tutorial please do not hesitate to contact the corresponding author ().

Importing the model output

The files are in the format typical of MCMCglmm (Hadfield, 2010).An example syntax is provided along with the original data in Dryad;however, these models can take some time to run, so, in the interest of expedience, we move forward using the output files provided. The output files are structured such that each matrix element of each variance component is a column and each row is an MCMC sample of the posterior distribution of that matrix element. Thus, because there were eight traits, and we specified an unstructured covariance structure for each trait at the animal, vial and residual levels, our output file for each population has 192 columns (i.e. 8 x 8 + 8 x 8 + 8 x 8) and 1000 rows.

First, we will define a number of terms that will be used repeatedly.

MCMCsamp <- 1000

#number of MCMC samples

n <- 8

#number of traits

m <- 6

#number of matrices to compare

r <- 3

#number of random effects specified in the model. In our analyses these were animal, vial and residual effects.

traitnames <- c("lc2","lc3","lc4","lc5","lc6","lc7","lc8","lc9")

#trait names

Gnames <- c("b1","b2","c1","c2","m1","m2")

#matrixlabels

Then we will set up an empty array, and fill this array by referencing the correct file path for the output files. Note that the file path will need to be customized to the directory where the files have been stored on your computer.

MCMCarray <- array(,c(MCMCsamp,(n^2)*r,m))

#empty array

MCMCarray[,,1] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_b1_3MCMC.csv",header =T))

#G1 stored as the 1st element of dim[3]

MCMCarray[,,2] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_b2_3MCMC.csv",header =T))

#G2 stored as the 2nd element of dim[3]

MCMCarray[,,3] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_c1_3MCMC.csv",header =T))

#G3 stored as the 3rd element of dim[3]

MCMCarray[,,4] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_c2_3MCMC.csv",header =T))

#G4 stored as the 4th element of dim[3]

MCMCarray[,,5] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_m1_3MCMC.csv",header =T))

#G5 stored as the 5th element of dim[3]

MCMCarray[,,6] <- as.matrix(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/VC_m2_3MCMC.csv",header =T))

#G6 stored as the 6th element of dim[3]

Reshaping the array and standardizingG

We now need to reshape the MCMCarray to get the G and P arrays in the format we need to run the functions below.

Garray <- array(,c(n,n,m,MCMCsamp))

dimnames(Garray) <- list(traitnames,traitnames,Gnames)

Parray <- array(,c(n,n,m,MCMCsamp))

dimnames(Parray) <- list(traitnames,traitnames,Gnames)

for (i in 1:m){

for (j in 1:MCMCsamp){

G <- matrix(MCMCarray[j,1:(n^2),i],ncol= n)

CE <- matrix(MCMCarray[j,((n^2)+1):((n^2)*2),i],ncol= n)

R <- matrix(MCMCarray[j,(((n^2)*2)+1):((n^2)*3),i],ncol= n)

Garray[,,i,j] <- G

Parray[,,i,j] <- G + CE + R

}

The Garray and Parray objects have the MCMC samples ofG and P stored in the first two dimensions of the array. The third dimension identifies the population and the fourth dimension identifies the MCMC sample. For example, if we wanted to view the genetic variance in the lc2 trait (i.e. matrix element 1, 1) of the c1 population for the 5th MCMC sample we would use the following syntax,

Garray[1,1,3,5]

Similarly, if we wanted to view the G matrix of the b2population for the 13th MCMC sample we would use the following syntax.

Garray[,,2,13]

Last, if we wanted to view the G matrices for all populations for the 42nd MCMC sample we would use the following syntax.

Garray[,,,42]

Next, we provide a method to apply the multivariate standardization () presented inHansen and Houle(2008) to the Garray. We will proceed in our example, without applying the standardization as our traits are measured and analyzed on the same scale. Nevertheless, to apply the standardization we first define a function to calculate then we use this function to apply the standardization to each MCMC sample of each replicate line.

inv.rootP <- function (P){

rootP <- matrix(0,n, n)

for (i in 1:n){

val <- eigen(P)$values

vec <- eigen(P)$vectors

rootP <- rootP + (vec[,i] %*% t(vec[,i]))*sqrt(val[i])

}

solve(rootP)

}

HHGarray <- array(,c(n,n,m,MCMCsamp))

for (k in 1:MCMCsamp){

for (j in 1:m){

P <- inv.rootP(Parray[,,j,k])

HHGarray[,,j,k] <- P %*% Garray[,,j,k] %*% P

}

The HHGarray object contains the standardizedGarray.

Generating randomisedG matrices for hypothesis tests

The approach we use to generate the randomisedG matrices uses the rbv function in the MCMCglmm package. rbv uses the pedigree structure of a population to generate breeding values for individuals by sampling from a multivariate normal distribution with a mean = 0 and a variance = G. Therefore, the first step in generatingrandomisedG matrices is to import the pedigree files for each population. Again, the file path will need to be customized for your computer.

Ped.b1 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.b1.csv",header =T))

Ped.b2 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.b2.csv",header =T))

Ped.c1 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.c1.csv",header =T))

Ped.c2 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.c2.csv",header =T))

Ped.m1 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.m1.csv",header =T))

Ped.m2 <- as.data.frame(read.csv(file="C:/Users/uqjaguir/Desktop/Heredity/SI_output_and_ped/Ped.m2.csv",header =T))

To generate the randomisedG matrices we will write a loop to calculate breeding values for eachindividual given the populations pedigree and the correspondingMCMC sample of G.These vectors of breeding values are archived and then randomly allocated to populations. Last we construct 6 randomisedG for each MCMC sample.

library(MCMCglmm)

rand.Garray <- array(,c(n,n,m,MCMCsamp))

dimnames(rand.Garray) <- list(traitnames,traitnames,Gnames)

for (i in 1:MCMCsamp){

b1.bv<-rbv(Ped.b1,Garray[,,1,i])

b2.bv<-rbv(Ped.b2,Garray[,,2,i])

c1.bv<-rbv(Ped.c1,Garray[,,3,i])

c2.bv<-rbv(Ped.c2,Garray[,,4,i])

m1.bv<-rbv(Ped.m1,Garray[,,5,i])

m2.bv<-rbv(Ped.m2,Garray[,,6,i])

a.pop <- cumsum(c(dim(Ped.b1)[1],dim(Ped.b2)[1],dim(Ped.c1)[1],dim(Ped.c2)[1],dim(Ped.m1)[1],dim(Ped.m2)[1]))

pop.bv <- rbind(b1.bv,b2.bv,c1.bv,c2.bv,m1.bv,m2.bv)

rand.pop.bv <- pop.bv[sample(dim(pop.bv)[1],replace=F),]

rand.Garray[,,1,i] <- cov(rand.pop.bv[1:a.pop[1],])

rand.Garray[,,2,i] <- cov(rand.pop.bv[(a.pop[1] + 1):a.pop[2],])

rand.Garray[,,3,i] <- cov(rand.pop.bv[(a.pop[2] + 1):a.pop[3],])

rand.Garray[,,4,i] <- cov(rand.pop.bv[(a.pop[3] + 1):a.pop[4],])

rand.Garray[,,5,i] <- cov(rand.pop.bv[(a.pop[4] + 1):a.pop[5],])

rand.Garray[,,6,i] <- cov(rand.pop.bv[(a.pop[5] + 1):a.pop[6],])

}

This approach simulates a set of Gthat have been sampled from the same population, and thus the only dissimilarity among them is random sampling error. Furthermore, because rand.Garrayis of the same order as the Garray(i.e. n x n x m x MCMCsamp), the matrix comparison functions work for both the observed and randomised arrays.

Method 1. Random projections through G

Matrix projection is a simple but effective tool for uncovering similarities or dissimilarities among matrices. Thefunction below uses the projection of random,normal vectors through MCMC samples of G matrices to identify regions of the genetic space where matrices differ significantly in variance.

#START

R.proj <- function(Gs,p,vec){

if (dim(Gs)[[1]] != dim(Gs)[[2]]){

stop("G array must be of order n x n x m x MCMCsamp")

}

if (is.na(dim(Gs)[4])) {

stop("There are no MCMCsamples")

}

n <- dim(Gs)[[1]]

m <- dim(Gs)[[3]]

MCMCsamp <- dim(Gs)[[4]]

rand.vec <-matrix(,vec,n)

for (i in 1:vec){

b <- runif(n,-1,1)

rand.vec[i,] <- b/(sqrt(sum(b^2)))

}

#generate unit length random vectors

proj<- function(G,b) t(b) %*% G %*% (b)

#internal function to do projection

G.proj <- array(,c(MCMCsamp,m,vec))

colnames(G.proj) <- dimnames(Gs)[[3]]

for (i in 1:vec){

G.proj[,,i]<- t(apply(Gs,3:4,proj,b = rand.vec[i,]))

}

#project each random vector through each MCMC sample of each G

prs <- cbind(rep(1:m, each = m), 1:m)

prs.comp <- prs[prs[,1] < prs[,2], , drop = FALSE]

#setting up an index for HPD comparisons

proj.score <-matrix(,vec,((m^2 - m)/2))

for (k in 1:vec){

HPD.int <- HPDinterval(as.mcmc(G.proj[,,k]), prob = p)

proj.score[k,] <- ifelse(HPD.int[prs.comp[,1],1] > HPD.int[prs.comp[,2],2] | HPD.int[prs.comp[,2],1] > HPD.int[prs.comp[,1],2],1,0)

}

#for a given random vector, examine if the HPDintervals of any pair of G matrices overlap

vec.score <-cbind(rand.vec,proj.score)

colnames(vec.score) <- c(1:n,paste(dimnames(Gs)[[3]][prs.comp[, 1]], ".vs.", dimnames(Gs)[[3]][prs.comp[, 2]], sep = ""))

#collate the random vectors and the outcome of their projection on the G matrices

sig.vec <- subset(vec.score,rowSums(vec.score[,(n+1):(n+((m^2 - m)/2))]) > 0)

#collate just the random vectors that resulted in significant differences in variance

if(dim(sig.vec)[1] <= 1) {warning("There were <= 1 significant vectors, try a larger vec or lower p"); eig.R <- "Na"}

else{

eig.R <- eigen(cov(sig.vec[,1:n]))

rownames(eig.R$vectors) <- dimnames(Gs)[[1]]

colnames(eig.R$vectors) <- c(paste("e", 1:n, sep = ""))

}

#eigen analysis of the R matrix

list(G.proj = G.proj, vec.score = vec.score, eig.R = eig.R)

}

#END

The arguments passed toR.proj are:

GsThe G array must be of the order n x n x m x MCMCsamp

pThe probability density interval for assessing significant differences among matrices

vecThe number of random vectors

The line below applies the R.proj function to the observed Garray, assuming a probability of 0.95 and 1000 random vectors,then stores the results in MCMC.R.proj.This calculation will take a few minutes to complete, but will vary depending on the computer specifications, as well as the number of MCMC samples and the number of random vectors.

MCMC.R.proj <- R.proj(Garray, p = 0.95, vec = 1000)

MCMC.R.proj is a list with 3 slots:

$G.projGenetic variance in the direction of each random vector for each MCMC samples of G. The rows of $G.proj identify the MCMC samples, the columns identify the replicate line and third dimension of $G.proj identifies the random vector.

$vec.scoreFor each row of $vec.score, the first n columns show the random vector,and the remaining columns show the results of pairwise HPD comparisons of genetic variance in the direction of therandom vector. Zeros indicate the HPD intervals overlapped (i.e. non-significant differences in variance), whereas ones indicate the HPD intervals did not overlap (i.e. a significant difference in variance)

$eig.RThe eigenanalysis of the R(co)variance matrix

Note, because vectors are generated randomly, the vector ID’s and summary statistics presented below will differ slightly to those on your computer. Thus, it is important that you customise the row indexes in the plot script for the correct rows on your computer. We can visualize what is happening internally in M.proj by plotting the posterior distribution of genetic variances for a vector where there were significant differences among replicate lines and a vector for which there were no significant differences among replicate lines (Figure 1). To find candidate vectors for the example in this appendix we can examine the first 8 rows of columns 9 to 23 (i.e the columns for the pair wise hypothesis tests)ofMCMC.M.proj$vec.score

MCMC.R.proj$vec.score[1:n,(n+1):(n+((m^2 - m)/2))]

b1.vs.b2 b1.vs.c1 b1.vs.c2 b1.vs.m1 b1.vs.m2 b2.vs.c1 b2.vs.c2

[1,] 0 0 0 0 0 0 0

[2,] 0 0 0 0 0 0 0

[3,] 0 0 0 0 1 0 0

[4,] 0 0 0 0 0 0 0

[5,] 0 0 0 0 0 0 0

[6,] 0 0 0 0 0 0 0

[7,] 0 0 0 0 0 0 0

[8,] 0 0 0 0 0 0 0

b2.vs.m1 b2.vs.m2 c1.vs.c2 c1.vs.m1 c1.vs.m2 c2.vs.m1 c2.vs.m2

[1,] 0 0 0 0 0 0 0

[2,] 0 0 0 0 0 0 0

[3,] 0 1 0 0 0 0 0

[4,] 0 0 0 0 0 0 0

[5,] 0 0 0 0 0 0 0

[6,] 0 0 0 0 0 0 0

[7,] 0 0 0 0 0 0 0

[8,] 0 0 0 0 0 0 0

m1.vs.m2

[1,] 0

[2,] 0

[3,] 0

[4,] 0

[5,] 0

[6,] 0

[7,] 0

[8,] 0

Recalling that vectors finding significant differences are indexed with 1’s and those not finding significant difference are indexed with 0’s. The table above shows that for direction defined by vector 3 the genetic variance in m2 differed from the genetic variance in the b1 and b2 populations (Figure 1A ) whereas for the remaining vectors, there were no significant differences in genetic variance among populations (Figure 1B).

Figure 1. Posterior means and 95% HPD intervals of the genetic variance in the direction of random vectors#3 (panel B) and #5(panel A)for each population. In this example the vector in panel A would contribute to theRmatrix whereas the vector in panel B would not contribute to the Rmatrix.

To examine the proportion of random vectors that found significant differences in genetic variance among populationswe can calculate the number of vectors that found significant (TRUE) and non-significant (FALSE) pairwise differences in genetic variance.

table(rowSums(MCMC.R.proj$vec.score[,(n+1):(n+((m^2 - m)/2))]) > 0 )

FALSE TRUE

844 156

The 156 vectors that identified differences (at 95% HPD) in genetic variation among matrices summarize the directions in multivariate space where matrices differ significantly in variance. The next step is then to examine the eigenstructure of the R matrix. To visualize the eigenstructure of theRmatrix we can use,

lapply(MCMC.R.proj$eig.R,round,digits = 3)

$values

[1] 0.300 0.156 0.137 0.125 0.118 0.080 0.058 0.028

$vectors

e1 e2 e3 e4 e5 e6 e7 e8

lc2 0.356 -0.616 0.193 -0.004 0.138 0.602 0.275 0.003

lc3 -0.885 -0.012 -0.006 -0.015 0.004 0.429 0.179 -0.024

lc4 -0.267 -0.645 -0.014 0.194 0.377 -0.533 -0.105 -0.197

lc5 -0.008 -0.238 -0.466 -0.698 -0.064 0.131 -0.456 -0.098

lc6 0.103 0.126 -0.636 0.572 0.293 0.310 -0.221 -0.105

lc7 -0.036 -0.270 0.054 0.360 -0.808 0.047 -0.287 -0.236

lc8 0.087 0.158 -0.089 -0.136 -0.007 -0.037 0.396 -0.885

lc9 -0.011 0.188 0.575 0.017 0.310 0.229 -0.620 -0.318

Then to examine the of eigenvectors R that result in significant differences in genetic variance among populations we can use.

proj<- function(G,b) t(b) %*% G %*% (b)

#Function to do projection

R.vec.proj <- array(,c(MCMCsamp,m,n))

for (i in 1:n){

R.vec.proj[,,i] <- t(apply(Garray, 3:4, proj, b = MCMC.R.proj$eig.R$vectors[,i]))

}

#Genetic variance in each population in the direction of the eigenvectors of R

HPD.R.vec.proj <- array(,c(m,2,n))

for (i in 1:n){

HPD.R.vec.proj[,,i] <- HPDinterval(as.mcmc(R.vec.proj[,,i]), prob = 0.95)

}

#HPD intervals for the genetic variance in each population in the direction of the eigenvectors of R

Figure 2Genetic variance in the direction of each of the eigenvectors of Rfor each population.

In figure 2 we see that only the first eigenvector of Rresults in significant differences in genetic variance among populations.

Method 2. Krzanowski’s common subspaces

Krzanowski(1979)provides a method for identifying thecommonsubspaces for a representative subset of vectors of two or more matrices.In our first example we will use first four (i.e. n/2) eigenvectors of each G, but we highlight that this is a decision the reader needs to consider carefully. Because, as we will show, the number of eigenvectors included for each matrix can influence our final conclusions.

#START

kr.subspace <- function(Gs,vec){

if (dim(Gs)[[1]] != dim(Gs)[[2]]){

stop("G array must be of order n x n x m x MCMCsamp")

}

if (is.na(dim(Gs)[4])) {

stop("There are no MCMCsamples")

}

n <- dim(Gs)[[1]]

m <- dim(Gs)[[3]]

MCMCsamp <- dim(Gs)[[4]]

if(length(vec) != m){stop("vec must have length = m")}

h <- function (g,v){

AA <- array(,c(n, n, m))

for (k in 1:m){

g.vec <- eigen(g[,,k])$vectors[,1:(v[k])]

AA[,,k] <- g.vec %*% t(g.vec)

}

H <- apply(AA,1:2,sum)

list(H = H, AA = AA)

}

#internal function to calculate AA and H

MCMC.H <- array(,c(n,n,MCMCsamp))

dimnames(MCMC.H) <- list(dimnames(Gs)[[1]],dimnames(Gs)[[1]],dimnames(Gs)[[4]])

MCMC.AA <- array(,c(n,n,m,MCMCsamp))

dimnames(MCMC.AA) <- list(dimnames(Gs)[[1]],dimnames(Gs)[[1]],dimnames(Gs)[[3]],dimnames(Gs)[[4]])

for (i in 1:MCMCsamp){

kr <- h(Gs[,,,i], v = vec)

MCMC.H[,,i] <- kr$H

MCMC.AA[,,,i] <- kr$AA

}

#calculate AA and H for the ith MCMC sample of the G array

avH <- apply(MCMC.H,1:2,mean)

rownames(avH) <- dimnames(Gs)[[1]]

colnames(avH) <- dimnames(Gs)[[1]]

#calculate the posterior mean H

avAA <- apply(MCMC.AA,1:3,mean)

dimnames(avAA) <- list(dimnames(Gs)[[1]],dimnames(Gs)[[1]],dimnames(Gs)[[3]])

#calculate the posterior mean AA

avH.vec <- eigen(avH)$vectors

#eigenanalysis of posterior mean H

proj<- function(a,b) t(b) %*% a %*% b

#internal function to do projection

avH.theta <- matrix(,n,m)

for (i in 1:n){

avH.theta[i,] <- acos(sqrt(apply(avAA, 3, proj, b = avH.vec[,i]))) * (180/pi)

}

#angles between the eigenvectors posterior mean H and the posterior mean subspaces of each population

MCMC.H.val <- matrix(,MCMCsamp,n)

colnames(MCMC.H.val) <- paste("h",1:n,sep="")

for (i in 1:n){

MCMC.H.val[,i] <- apply(MCMC.H, 3, proj, b = avH.vec[,i])

}

#posterior distribution of the genetic variance for the eigenvectors of posterior mean H

MCMC.H.theta <- array(,c(n,m,MCMCsamp))

rownames(MCMC.H.theta) <- paste("h",1:n,sep="")

colnames(MCMC.H.theta) <- dimnames(Gs)[[3]]

for(i in 1:n){

for(j in 1:MCMCsamp){

MCMC.H.theta[i,,j] <- acos(sqrt(apply(MCMC.AA[,,,j], 3, proj, b = avH.vec[,i]))) * (180/pi)

}

#posterior distribution of the angles between the eigenvectors of posterior mean H and the MCMC samples of the subspaces of each population

list(avAA = avAA, avH = avH, MCMC.AA = MCMC.AA, MCMC.H = MCMC.H, MCMC.H.val = MCMC.H.val, MCMC.H.theta = MCMC.H.theta)

}

#END

The arguments passed to kr.subspace are:

GsThe G array.Must be of the order n x n x m x MCMCsamp

vecA vector indicating the number of eigenvectors of each G to include in the construction of H.Must have the same number of elements as there are matrices to compare

The line below applies the kr.subspace function to the first four eigenvectors of the observed Garray, then stores the results in MCMC.kr.

MCMCG.kr <- kr.subspace(Garray, vec = rep(n/2,m))

MCMC.kris a list with 3 slots:

$avAAPosterior mean AAT

$avHPosterior mean H

$MCMC.AAMCMC samples of AAT

$MCMC.HMCMC samples of H

$MCMC.H.valMCMC samples of the eigenvalues of H

$MCMC.H.thetaMCMC samples of the angles between the eigenvectors of posterior mean Hand the MCMC samples of AAT

Because the eigenvalues of H are bounded between zero and m, to provide a hypothesis test that identifies significant common subspaces we need to generate a sensible null model. The null model we use assumes the G are sampled from the same population, and hence the randomised G will have all subspaces in common up to sampling error. Thus, to compare our observed and null model we will apply the kr.subspace function to thefirst four eigenvectors of therandomised Garray, and plot the eigenvalues of H for the observed and randomised data.

MCMCG.kr.rand <- kr.subspace(rand.Garray, vec = rep(n/2,m))

Figure3.Posterior mean and 95% HPD intervals for the eigenvalues of H for the first four eigenvectors of the observed and randomisedG arrays.