Data Sets for Evaluating Algorithms for Inducing Structural Equation Models from Data

NEO tutorial:

A multi-edge simulation model that involves 100 SNPs

Steve Horvath, Jason Aten

In this R software tutorial, we show how NEO can be used to orient the edges of a multi-trait network by automatically selecting markers for each trait separately. The

analyses proceed in a stepwise fashion: edges are oriented one at the time. For each edge, NEO computes edge orienting scores (LEO.NB.OCA, LEO.NB.CPA, etc). By thresholding these edge orienting scores, one can arrive at a globally oriented trait network.

The details of the simulation model and relevant R code is

presented presented below. Briefly, we simulated a causal network between five gene expressions (denoted by E1 through E5) and a trait (denoted by Trait). Each of the 6 traits was simulated to be under the causal influence of 3 SNPs. We added 82 noise SNPs so that the data contained 100 SNPs.

We simulated the following causal relationships between the

traits:

E1->E2

E1->E3

E3 <- HiddenConfounder -> E4

E4 -> Trait

Trait -> E5.

Note that the correlation between traits E3 and E4 is due a hidden confounder, i.e. this edge contains no causal signal. This example is shown in Figure 7 of our article (Aten et al, 2008).

To cite this tutorial and the R code, please use the following reference.

Aten JE, Fuller TF, Lusis AJ, Horvath S (2008) Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Systems Biology 2008, 2:34

Contents

1. Network Simulation

Here we simulate a causal network between 5 gene expressions labelled E1,E2,...E5,

a clinical trait called Trait, SNP marker data.

2. Next we describe how to use the NEO R functions to infer the causal relationships between the genes.

# read in the R libraries

library(MASS) # standard, no need to install

library(class) # standard, no need to install

library(cluster)

# Download the WGCNA package from the following webpage

# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/

library(WGCNA)

library(impute)# install it for imputing missing value

#Memory

# check the maximum memory that can be allocated

memory.size(TRUE)/1024

# increase the available memory

memory.limit(size=4000)

dir.create("C:/NEOMultiEdgeSimulation")

setwd("C:/NEOMultiEdgeSimulation")

# Please adapt the following paths so that the point to the directory where

# the causality function files are located. Note that we use / instead of \

source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/CausalityFunctions.txt")

source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/updatedneo.txt")

#Step 1: Simulate traits and SNP marker data

# Copy and paste the following functions into R

# The function SNPsignatureDependence outputs a data frame of SNP markers

# which are correlated with an underlying genetic signal (e.g. a haplotype).

# The number of columns equals the input parameter number of SNPs (no.SNPs).

# The length of the vector geneticsignal equals the number of samples (e.g. mice)

# The parameter cor.with.signal should lie between minus 1 and 1.

# All SNPs have minor allele frequency .5

if(exists("SNPsignatureDependence")) rm(SNPsignatureDependence);

SNPsignatureDependence=function(geneticsignal, cor.with.signal, no.SNPs,transformToOrdinal=T){

if (abs(cor.with.signal)>1 ) stop("Error in function SNPsignature: abs(cor.with.signal) should be smaller than or equal to 1")

no.samples=length(geneticsignal)

if ( !(no.samples>=1) ) stop("Error in the function SNP signature. The input vector geneticsignal has fewer than 1 observation.")

if ( !( no.SNPs >=1) ) stop("Error in the function SNP signature. The parameter no.SNPs should be larger than 1.")

if (var(geneticsignal,na.rm=T)>0 ) geneticsignal=as.numeric(scale(geneticsignal))

if (var(geneticsignal,na.rm=T)==0 & cor.with.signal==1 ) {

print("Warning: since var(geneticsignal,na.rm=T)==0 & cor.with.signal==1, I set cor.with.signal to 0.999 to avoid constant SNPs (minor allele freq=0).")

cor.with.signal=0.999

} # end of if

datSNP=matrix(geneticsignal, nrow=no.samples, ncol=no.SNPs)

datSNP= cor.with.signal*datSNP+sqrt(1-cor.with.signal^2)* matrix(rnorm(no.samples*no.SNPs), nrow=no.samples, ncol=no.SNPs)

datout=datSNP

if (transformToOrdinal ) {

LowerQuartile=quantile(c(as.matrix(datSNP)),prob=0.25)[[1]]

UpperQuartile= quantile(c(as.matrix(datSNP)),prob=0.75)[[1]]

datout[datSNP>UpperQuartile]=2

datout[datSNP<=LowerQuartile]=0

datout[datSNP> LowerQuartile & datSNP<=UpperQuartile ]=1

}

datout

} # end of the function SNPsignature

# The function SNPsignal takes a data frame of SNPs (columns) and outputs a vector

# that equals the row sum of the SNPs.

# The output vector has mean 0 and variance 1

# Thus, the function can be used to represents an additive genetic effect of multiple SNPs.

if(exists("SNPsignal")) rm(SNPsignal);

SNPsignal=function(datSNP){

no.SNPs=dim(as.matrix(datSNP))[[2]]

additive1=as.matrix(datSNP)%*% rep(1, no.SNPs)

var1=var(as.numeric(additive1) ,na.rm=T)

if (var1>0 ){additive1=as.numeric(scale(additive1))}

additive1

} # end of the function SNPsignal

# This function simulates a data frame of traits and SNPs

if(exists("SimulateNetwork") ) rm(SimulateNetwork);

SimulateNetwork=function(no.samples=200, no.SNPs=100){

# This data frame will contain SNP markers

datSNP=data.frame(matrix(sample(c(0,1,2), size=no.samples*no.SNPs, prob=c(.25,.5,.25) ,replace=T)

, nrow=no.samples, ncol=no.SNPs))

names(datSNP)=paste("SNP", c(1:no.SNPs), sep="")

# SNPs that are anchors of the Trait

indexvector=c(1:3)

GeneticSignalTrait= SNPsignal( datSNP[, indexvector] )

# SNPs that are anchors of expression E1

indexvector=c(10:12)

GeneticSignal1= SNPsignal( datSNP[, indexvector] )

# SNPs that are anchors of expression E2

indexvector=c(20:22)

GeneticSignal2= SNPsignal( datSNP[, indexvector] )

indexvector=c(30:32)

GeneticSignal3= SNPsignal( datSNP[, indexvector] )

indexvector=c(40:42)

GeneticSignal4= SNPsignal( datSNP[, indexvector] )

indexvector=c(50:52)

GeneticSignal5= SNPsignal( datSNP[, indexvector] )

# The following code determines the causal relationships between the traits

# Expression trait is determined by the genetic signal

E1= GeneticSignal1+rnorm(no.samples)

E2=.5* E1+GeneticSignal2+rnorm(no.samples)

Confounder= rnorm(no.samples)

E3= .5*E1+GeneticSignal3+2* Confounder

E4=2*Confounder+GeneticSignal4+rnorm(no.samples)

Trait=.5*E4+GeneticSignalTrait+rnorm(no.samples)

E5= .5*Trait+GeneticSignal5+rnorm(no.samples)

datout=data.frame(Trait, E1,E2,E3, E4, E5,datSNP)

datout

} # end of function

# here we set the seed of the randomization procedure.

set.seed(1)

# our data will contain 100 individuals (e.g mice)

datCombined= SimulateNetwork(no.samples=100)

# the first 6 columns of A contain the traits

A=data.frame(datCombined[,1:6] )

B= data.frame(datCombined[,1:6] )

# this data frame contains the SNPs

datSNP=datCombined[,-c(1:6)]

# Lets first study our simulated traits

signif(cor(A),2)

> signif(cor(A),2)

Trait E1 E2 E3 E4 E5

Trait 1.000 -0.031 0.059 0.34 0.62 0.480

E1 -0.031 1.000 0.410 0.39 0.17 -0.053

E2 0.059 0.410 1.000 0.22 0.14 -0.062

E3 0.340 0.390 0.220 1.00 0.73 0.210

E4 0.620 0.170 0.140 0.73 1.00 0.280

E5 0.480 -0.053 -0.062 0.21 0.28 1.000

Compare the correlation matrix above to the true simulated model

E1->E2

E1->E3

E3 <- HiddenConfounder -> E4

E4 -> Trait

Trait -> E5.

The relationship between SNPs and traits.

Recall that we simulated the following true causal anchor relationships between SNPs and traits.

SNP1+SNP2+SNP3 -> Trait

SNP10+SNP11+SNP12 -> E1

SNP20+SNP21+SNP22 -> E2

SNP30+SNP31+SNP32 -> E3

SNP40+SNP41+SNP42 -> E4

SNP50+SNP51+SNP52 -> E5

# the following vector specifies the true causal anchors

trueSNP=c(1:3, 10:12, 20:22,30:32, 40:42, 50:52)

corSNPA=signif(cor(datSNP[,trueSNP],A),2)

corSNPA

> corSNPA

Trait E1 E2 E3 E4 E5

SNP1 0.2500 -0.1800 -0.0240 -0.1400 -0.0870 0.2700

SNP2 0.4100 -0.1600 -0.0170 -0.0071 0.0700 0.1200

SNP3 0.4200 -0.1800 0.0630 -0.1100 0.0270 0.0780

SNP10 0.0340 0.3700 0.1100 0.3100 0.1400 0.0068

SNP11 0.0680 0.5200 0.2500 0.2900 0.1300 -0.1100

SNP12 0.0024 0.3700 0.1900 0.1100 -0.0076 0.1300

SNP20 -0.0190 -0.1500 0.4700 -0.1400 -0.1600 -0.1000

SNP21 -0.0680 0.1200 0.2800 -0.0700 -0.0720 0.0082

SNP22 0.0440 -0.0380 0.2900 0.0800 0.1000 -0.0230

SNP30 -0.0130 0.0960 -0.0075 0.2500 -0.0084 0.1100

SNP31 -0.0039 0.0510 -0.0720 0.3000 0.1600 0.0080

SNP32 -0.1000 -0.0610 0.1800 0.3400 0.1700 0.0750

SNP40 0.1100 -0.0440 -0.1400 0.0810 0.1800 0.2100

SNP41 0.2600 0.0630 0.0520 0.0940 0.3600 0.2000

SNP42 0.1500 0.0140 0.0460 0.0270 0.3000 0.0390

SNP50 -0.0390 0.0033 -0.1400 -0.0290 -0.1400 0.3200

SNP51 0.1100 0.0920 0.0140 0.0600 0.0430 0.3500

SNP52 -0.1700 0.0950 -0.0880 -0.0180 -0.1300 0.3100

Comment: True relationships (correlations) are colored in red.Note that the true signal SNPs have moderately high correlations with their respective traits.

Causal edges scores between the traits

# The following command computes edge scores between

# the traits (columns) specified in A and those in B.

neo1=NEO.scores(datSNP=datSNP ,A=A,B=B, detailed.output=F, testConfounding=F)

# The following is an excerpt of the output

neo1

$LEO.NB.OCA.AtoB

Trait E1 E2 E3 E4 E5

Trait NA NA NA -0.629 -4.7900 1.2500

E1 NA NA 2.6100 0.594 0.0225 NA

E2 NA -6.000 NA -1.530 -0.3920 NA

E3 -0.555 -2.100 0.0147 NA -0.0443 -0.0522

E4 0.739 -0.103 0.0222 -2.080 NA 0.2590

E5 -3.160 NA NA -0.104 -1.2000 NA

$LEO.NB.CPA.AtoB

Trait E1 E2 E3 E4 E5

Trait NA NA NA -2.290 -7.890 0.7050

E1 NA NA 2.2900 -0.224 -0.105 NA

E2 NA -8.3600 NA -2.580 -1.500 NA

E3 -2.17 -1.4200 -0.0394 NA 0.431 -0.0576

E4 2.28 -0.0363 0.3270 -3.350 NA 0.5840

E5 -4.98 NA NA -0.570 -3.010 NA

# Comment: Recall that we recommend a threshold of 0.3 for the LEO.NB.OCA score.

# This scores retrieves all true causal relationships (colored in red).

# The CPA score does not work as well in this example. Note that it misses the causal direction E1->E3.

Options for visualizing the true and the observed causal model.

# The following matrix specifies the true causal relationships (1 for causal otherwise NA)

LEOtruemodel=data.frame(matrix(NA , nrow=6, ncol=6))

dimnames(LEOtruemodel)[[1]]=names(data.frame(A))

dimnames(LEOtruemodel)[[2]]=names(data.frame(A))

LEOtruemodel["E1", "E2"]=1

LEOtruemodel["E1", "E3"]=1

LEOtruemodel["E4", "Trait"]=1

LEOtruemodel["Trait", "E5"]=1

# This adjacency matrix is defined as the absolute value of the correlation between the

# traits.

ADJ= data.frame(abs(cor(A)))

# The following visualizes the true causal model.

# note that traits are clustered according to the adjacency matrix.

par(mfrow=c(1,1))

DirectedNetworkPlot(LEOtruemodel, ADJ=ADJ, main="True Causal Model")

#postscript(file="TrueCausalModel.ps",horizontal=F)

Caption: The heatmap plot of this Figure corresponds to Figure 7(a) in our article.

The figure depicts the true causal model. Note that a red square in the i-th

row and j-th column indicates that trait i causally affects

trait j, e.g. E1->E2, E1->E3, etc. The rows and columns of the heatmap are ordered

according to a hierarchical clustering tree, which was constructed

using average linkage hierarchical clustering with the dissimilarity dissADJ=1-ADJ

# Now we visualize the observed (retrieved) causal network based on the

# LEO.NB.OCA score and a threshold of 0.3

LEOthresholded= I(neo1$LEO.NB.OCA.AtoB>.3)+0.0

# Next we create a plot for the observed causal relationships.

DirectedNetworkPlot( LEOthresholded, ADJ=ADJ, main="Observed LEO.NB.OCA")

Caption: This corresponds to Figure 7(b) in our article. The plot shows the

heatmap of the observed network that was reconstructed by thresholding the LEO.NB.OCA score.

#postscript(file="ObservedLEONBOCA.ps",horizontal=F)

# Now we re-create the "blue line plot" corresponding to our Figure 7c in the article.

pm=neo.get.param()

NEOoutput1=neo(datCombined,pm=pm)

# The function call results in the following output.

# To get the edges scores you can use the following GetEdgeScores

L1=GetEdgeScores(NEOoutput1,datTraits=A)

$LEO.NB.OCA.AtoB

Trait E1 E2 E3 E4 E5

Trait NA -0.289 -0.02200 -0.629 -4.7900 1.25000

E1 -0.3140 NA 2.61000 0.594 0.0225 0.00623

E2 -0.0416 -6.000 NA -1.530 -0.3920 -0.00188

E3 -0.5550 -2.100 0.01470 NA -0.0443 -0.05220

E4 0.7390 -0.103 0.02220 -2.080 NA 0.25900

E5 -3.1600 -0.146 -0.00184 -0.104 -1.2000 NA

$LEO.NB.CPA.AtoB

Trait E1 E2 E3 E4 E5

Trait NA -1.3400 -0.1960 -2.290 -7.890 0.7050

E1 -0.286 NA 2.2900 -0.224 -0.105 -0.8510

E2 -0.212 -8.3600 NA -2.580 -1.500 -0.4360

E3 -2.170 -1.4200 -0.0394 NA 0.431 -0.0576

E4 2.280 -0.0363 0.3270 -3.350 NA 0.5840

E5 -4.980 -0.5000 -0.8120 -0.570 -3.010 NA

Robustness analysis with respect to the number of anchored SNPs.

Since the edge orienting scores for an edge A-B critically depend on the input genetic marker sets M_A and M_B, we also recommend carrying out a robustness analysis with respect to different marker sets. In particular, the automated SNP selection

results should be carefully evaluated with regard to the threshold

parameters that were used to define the marker sets. For example,

when using a greedy SNP selection strategy, it is advisable to

study how the LEO.NB score is affected by altering the number of

most highly correlated SNPs. For a given edge and a given edge

orienting score (e.g. LEO.NB.OCA), NEO implements a robustness

analysis with respect to automatic marker selection

In the following R code, we will reproduce Figure 7(d) of our article.

Specifically, we report the LEO.NB.OCA and LEO.NB.CPA scores for the edge orientation E4->Trait.

pm=neo.get.param()

RobustnessOutput1=NEOrobustness(datSNP= datCombined[,7:106], datExpr= data.frame(E4=datCombined$E4, Trait= datCombined$Trait), TopGreedy=1:5, TopForward=1:5, no.log.quiet = TRUE, pm=pm)

LEOscoreOCAvector=data.frame(RobustnessOutput1[,c(1,21,11:20)])$LEO.NB.OCA

LEOscoreOCAvector=as.numeric(as.character(as.vector(LEOscoreOCAvector)))

LEOscoreCPAvector=data.frame(RobustnessOutput1[,c(1,21,11:20)])$LEO.NB.CPA

LEOscoreCPAvector=as.numeric(as.character(as.vector(LEOscoreCPAvector)))

range.y=range(c(LEOscoreOCAvector, LEOscoreCPAvector, 0.3, 0.8), na.rm=T)

plot(c(1:length(LEOscoreOCAvector)),LEOscoreOCAvector,type="l",ylab="LEO.score",xlab="Step, increasing SNPs",ylim=range.y,cex.lab=1.5)

abline(h=0.3, lwd=1.5, col="red")

abline(h=0.8, lwd=1.5, col="blue")

points(1:length(LEOscoreOCAvector),LEOscoreOCAvector,pch=21,bg="red",cex=1.5)

lines(1:length(LEOscoreCPAvector),LEOscoreCPAvector)

points(1:length(LEOscoreCPAvector),LEOscoreCPAvector,pch=22,cex=1.5,bg="yellow") # square

legend(length(LEOscoreOCAvector)-2,range.y[2]-.3, c("LEO.NB.OCA", "LEO.NB.CPA"), cex=1.2, pch=c(21,22,23),pt.bg=c("red","yellow","blue"))

title("Robustness Analysis",cex.main=1.5)

Caption: This corresponds to Figure 7d of our article. The red horizontal line corresponds to the threshold of 0.3 for the OCA score. The LEO.NB.OCA score exceeds this threshold of 0.3 for all steps of the robustness analysis, i.e.,~it

retrieves the orientation correctly. The blue horizontal line corresponds to the threshold of 0.8 for the CPA score. Note that the LEO.NB.CPA score exceeds this threshold for all steps of the robustness analysis, i.e.,~it retrieves the orientation correctly.

THE END