NEO tutorial:
A multi-edge simulation model that involves 100 SNPs
Steve Horvath, Jason Aten
In this R software tutorial, we show how NEO can be used to orient the edges of a multi-trait network by automatically selecting markers for each trait separately. The
analyses proceed in a stepwise fashion: edges are oriented one at the time. For each edge, NEO computes edge orienting scores (LEO.NB.OCA, LEO.NB.CPA, etc). By thresholding these edge orienting scores, one can arrive at a globally oriented trait network.
The details of the simulation model and relevant R code is
presented presented below. Briefly, we simulated a causal network between five gene expressions (denoted by E1 through E5) and a trait (denoted by Trait). Each of the 6 traits was simulated to be under the causal influence of 3 SNPs. We added 82 noise SNPs so that the data contained 100 SNPs.
We simulated the following causal relationships between the
traits:
E1->E2
E1->E3
E3 <- HiddenConfounder -> E4
E4 -> Trait
Trait -> E5.
Note that the correlation between traits E3 and E4 is due a hidden confounder, i.e. this edge contains no causal signal. This example is shown in Figure 7 of our article (Aten et al, 2008).
To cite this tutorial and the R code, please use the following reference.
Aten JE, Fuller TF, Lusis AJ, Horvath S (2008) Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Systems Biology 2008, 2:34
Contents
1. Network Simulation
Here we simulate a causal network between 5 gene expressions labelled E1,E2,...E5,
a clinical trait called Trait, SNP marker data.
2. Next we describe how to use the NEO R functions to infer the causal relationships between the genes.
# read in the R libraries
library(MASS) # standard, no need to install
library(class) # standard, no need to install
library(cluster)
# Download the WGCNA package from the following webpage
# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/
library(WGCNA)
library(impute)# install it for imputing missing value
#Memory
# check the maximum memory that can be allocated
memory.size(TRUE)/1024
# increase the available memory
memory.limit(size=4000)
dir.create("C:/NEOMultiEdgeSimulation")
setwd("C:/NEOMultiEdgeSimulation")
# Please adapt the following paths so that the point to the directory where
# the causality function files are located. Note that we use / instead of \
source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/CausalityFunctions.txt")
source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/updatedneo.txt")
#Step 1: Simulate traits and SNP marker data
# Copy and paste the following functions into R
# The function SNPsignatureDependence outputs a data frame of SNP markers
# which are correlated with an underlying genetic signal (e.g. a haplotype).
# The number of columns equals the input parameter number of SNPs (no.SNPs).
# The length of the vector geneticsignal equals the number of samples (e.g. mice)
# The parameter cor.with.signal should lie between minus 1 and 1.
# All SNPs have minor allele frequency .5
if(exists("SNPsignatureDependence")) rm(SNPsignatureDependence);
SNPsignatureDependence=function(geneticsignal, cor.with.signal, no.SNPs,transformToOrdinal=T){
if (abs(cor.with.signal)>1 ) stop("Error in function SNPsignature: abs(cor.with.signal) should be smaller than or equal to 1")
no.samples=length(geneticsignal)
if ( !(no.samples>=1) ) stop("Error in the function SNP signature. The input vector geneticsignal has fewer than 1 observation.")
if ( !( no.SNPs >=1) ) stop("Error in the function SNP signature. The parameter no.SNPs should be larger than 1.")
if (var(geneticsignal,na.rm=T)>0 ) geneticsignal=as.numeric(scale(geneticsignal))
if (var(geneticsignal,na.rm=T)==0 & cor.with.signal==1 ) {
print("Warning: since var(geneticsignal,na.rm=T)==0 & cor.with.signal==1, I set cor.with.signal to 0.999 to avoid constant SNPs (minor allele freq=0).")
cor.with.signal=0.999
} # end of if
datSNP=matrix(geneticsignal, nrow=no.samples, ncol=no.SNPs)
datSNP= cor.with.signal*datSNP+sqrt(1-cor.with.signal^2)* matrix(rnorm(no.samples*no.SNPs), nrow=no.samples, ncol=no.SNPs)
datout=datSNP
if (transformToOrdinal ) {
LowerQuartile=quantile(c(as.matrix(datSNP)),prob=0.25)[[1]]
UpperQuartile= quantile(c(as.matrix(datSNP)),prob=0.75)[[1]]
datout[datSNP>UpperQuartile]=2
datout[datSNP<=LowerQuartile]=0
datout[datSNP> LowerQuartile & datSNP<=UpperQuartile ]=1
}
datout
} # end of the function SNPsignature
# The function SNPsignal takes a data frame of SNPs (columns) and outputs a vector
# that equals the row sum of the SNPs.
# The output vector has mean 0 and variance 1
# Thus, the function can be used to represents an additive genetic effect of multiple SNPs.
if(exists("SNPsignal")) rm(SNPsignal);
SNPsignal=function(datSNP){
no.SNPs=dim(as.matrix(datSNP))[[2]]
additive1=as.matrix(datSNP)%*% rep(1, no.SNPs)
var1=var(as.numeric(additive1) ,na.rm=T)
if (var1>0 ){additive1=as.numeric(scale(additive1))}
additive1
} # end of the function SNPsignal
# This function simulates a data frame of traits and SNPs
if(exists("SimulateNetwork") ) rm(SimulateNetwork);
SimulateNetwork=function(no.samples=200, no.SNPs=100){
# This data frame will contain SNP markers
datSNP=data.frame(matrix(sample(c(0,1,2), size=no.samples*no.SNPs, prob=c(.25,.5,.25) ,replace=T)
, nrow=no.samples, ncol=no.SNPs))
names(datSNP)=paste("SNP", c(1:no.SNPs), sep="")
# SNPs that are anchors of the Trait
indexvector=c(1:3)
GeneticSignalTrait= SNPsignal( datSNP[, indexvector] )
# SNPs that are anchors of expression E1
indexvector=c(10:12)
GeneticSignal1= SNPsignal( datSNP[, indexvector] )
# SNPs that are anchors of expression E2
indexvector=c(20:22)
GeneticSignal2= SNPsignal( datSNP[, indexvector] )
indexvector=c(30:32)
GeneticSignal3= SNPsignal( datSNP[, indexvector] )
indexvector=c(40:42)
GeneticSignal4= SNPsignal( datSNP[, indexvector] )
indexvector=c(50:52)
GeneticSignal5= SNPsignal( datSNP[, indexvector] )
# The following code determines the causal relationships between the traits
# Expression trait is determined by the genetic signal
E1= GeneticSignal1+rnorm(no.samples)
E2=.5* E1+GeneticSignal2+rnorm(no.samples)
Confounder= rnorm(no.samples)
E3= .5*E1+GeneticSignal3+2* Confounder
E4=2*Confounder+GeneticSignal4+rnorm(no.samples)
Trait=.5*E4+GeneticSignalTrait+rnorm(no.samples)
E5= .5*Trait+GeneticSignal5+rnorm(no.samples)
datout=data.frame(Trait, E1,E2,E3, E4, E5,datSNP)
datout
} # end of function
# here we set the seed of the randomization procedure.
set.seed(1)
# our data will contain 100 individuals (e.g mice)
datCombined= SimulateNetwork(no.samples=100)
# the first 6 columns of A contain the traits
A=data.frame(datCombined[,1:6] )
B= data.frame(datCombined[,1:6] )
# this data frame contains the SNPs
datSNP=datCombined[,-c(1:6)]
# Lets first study our simulated traits
signif(cor(A),2)
> signif(cor(A),2)
Trait E1 E2 E3 E4 E5
Trait 1.000 -0.031 0.059 0.34 0.62 0.480
E1 -0.031 1.000 0.410 0.39 0.17 -0.053
E2 0.059 0.410 1.000 0.22 0.14 -0.062
E3 0.340 0.390 0.220 1.00 0.73 0.210
E4 0.620 0.170 0.140 0.73 1.00 0.280
E5 0.480 -0.053 -0.062 0.21 0.28 1.000
Compare the correlation matrix above to the true simulated model
E1->E2
E1->E3
E3 <- HiddenConfounder -> E4
E4 -> Trait
Trait -> E5.
The relationship between SNPs and traits.
Recall that we simulated the following true causal anchor relationships between SNPs and traits.
SNP1+SNP2+SNP3 -> Trait
SNP10+SNP11+SNP12 -> E1
SNP20+SNP21+SNP22 -> E2
SNP30+SNP31+SNP32 -> E3
SNP40+SNP41+SNP42 -> E4
SNP50+SNP51+SNP52 -> E5
# the following vector specifies the true causal anchors
trueSNP=c(1:3, 10:12, 20:22,30:32, 40:42, 50:52)
corSNPA=signif(cor(datSNP[,trueSNP],A),2)
corSNPA
> corSNPA
Trait E1 E2 E3 E4 E5
SNP1 0.2500 -0.1800 -0.0240 -0.1400 -0.0870 0.2700
SNP2 0.4100 -0.1600 -0.0170 -0.0071 0.0700 0.1200
SNP3 0.4200 -0.1800 0.0630 -0.1100 0.0270 0.0780
SNP10 0.0340 0.3700 0.1100 0.3100 0.1400 0.0068
SNP11 0.0680 0.5200 0.2500 0.2900 0.1300 -0.1100
SNP12 0.0024 0.3700 0.1900 0.1100 -0.0076 0.1300
SNP20 -0.0190 -0.1500 0.4700 -0.1400 -0.1600 -0.1000
SNP21 -0.0680 0.1200 0.2800 -0.0700 -0.0720 0.0082
SNP22 0.0440 -0.0380 0.2900 0.0800 0.1000 -0.0230
SNP30 -0.0130 0.0960 -0.0075 0.2500 -0.0084 0.1100
SNP31 -0.0039 0.0510 -0.0720 0.3000 0.1600 0.0080
SNP32 -0.1000 -0.0610 0.1800 0.3400 0.1700 0.0750
SNP40 0.1100 -0.0440 -0.1400 0.0810 0.1800 0.2100
SNP41 0.2600 0.0630 0.0520 0.0940 0.3600 0.2000
SNP42 0.1500 0.0140 0.0460 0.0270 0.3000 0.0390
SNP50 -0.0390 0.0033 -0.1400 -0.0290 -0.1400 0.3200
SNP51 0.1100 0.0920 0.0140 0.0600 0.0430 0.3500
SNP52 -0.1700 0.0950 -0.0880 -0.0180 -0.1300 0.3100
Comment: True relationships (correlations) are colored in red.Note that the true signal SNPs have moderately high correlations with their respective traits.
Causal edges scores between the traits
# The following command computes edge scores between
# the traits (columns) specified in A and those in B.
neo1=NEO.scores(datSNP=datSNP ,A=A,B=B, detailed.output=F, testConfounding=F)
# The following is an excerpt of the output
neo1
$LEO.NB.OCA.AtoB
Trait E1 E2 E3 E4 E5
Trait NA NA NA -0.629 -4.7900 1.2500
E1 NA NA 2.6100 0.594 0.0225 NA
E2 NA -6.000 NA -1.530 -0.3920 NA
E3 -0.555 -2.100 0.0147 NA -0.0443 -0.0522
E4 0.739 -0.103 0.0222 -2.080 NA 0.2590
E5 -3.160 NA NA -0.104 -1.2000 NA
$LEO.NB.CPA.AtoB
Trait E1 E2 E3 E4 E5
Trait NA NA NA -2.290 -7.890 0.7050
E1 NA NA 2.2900 -0.224 -0.105 NA
E2 NA -8.3600 NA -2.580 -1.500 NA
E3 -2.17 -1.4200 -0.0394 NA 0.431 -0.0576
E4 2.28 -0.0363 0.3270 -3.350 NA 0.5840
E5 -4.98 NA NA -0.570 -3.010 NA
# Comment: Recall that we recommend a threshold of 0.3 for the LEO.NB.OCA score.
# This scores retrieves all true causal relationships (colored in red).
# The CPA score does not work as well in this example. Note that it misses the causal direction E1->E3.
Options for visualizing the true and the observed causal model.
# The following matrix specifies the true causal relationships (1 for causal otherwise NA)
LEOtruemodel=data.frame(matrix(NA , nrow=6, ncol=6))
dimnames(LEOtruemodel)[[1]]=names(data.frame(A))
dimnames(LEOtruemodel)[[2]]=names(data.frame(A))
LEOtruemodel["E1", "E2"]=1
LEOtruemodel["E1", "E3"]=1
LEOtruemodel["E4", "Trait"]=1
LEOtruemodel["Trait", "E5"]=1
# This adjacency matrix is defined as the absolute value of the correlation between the
# traits.
ADJ= data.frame(abs(cor(A)))
# The following visualizes the true causal model.
# note that traits are clustered according to the adjacency matrix.
par(mfrow=c(1,1))
DirectedNetworkPlot(LEOtruemodel, ADJ=ADJ, main="True Causal Model")
#postscript(file="TrueCausalModel.ps",horizontal=F)
Caption: The heatmap plot of this Figure corresponds to Figure 7(a) in our article.
The figure depicts the true causal model. Note that a red square in the i-th
row and j-th column indicates that trait i causally affects
trait j, e.g. E1->E2, E1->E3, etc. The rows and columns of the heatmap are ordered
according to a hierarchical clustering tree, which was constructed
using average linkage hierarchical clustering with the dissimilarity dissADJ=1-ADJ
# Now we visualize the observed (retrieved) causal network based on the
# LEO.NB.OCA score and a threshold of 0.3
LEOthresholded= I(neo1$LEO.NB.OCA.AtoB>.3)+0.0
# Next we create a plot for the observed causal relationships.
DirectedNetworkPlot( LEOthresholded, ADJ=ADJ, main="Observed LEO.NB.OCA")
Caption: This corresponds to Figure 7(b) in our article. The plot shows the
heatmap of the observed network that was reconstructed by thresholding the LEO.NB.OCA score.
#postscript(file="ObservedLEONBOCA.ps",horizontal=F)
# Now we re-create the "blue line plot" corresponding to our Figure 7c in the article.
pm=neo.get.param()
NEOoutput1=neo(datCombined,pm=pm)
# The function call results in the following output.
# To get the edges scores you can use the following GetEdgeScores
L1=GetEdgeScores(NEOoutput1,datTraits=A)
$LEO.NB.OCA.AtoB
Trait E1 E2 E3 E4 E5
Trait NA -0.289 -0.02200 -0.629 -4.7900 1.25000
E1 -0.3140 NA 2.61000 0.594 0.0225 0.00623
E2 -0.0416 -6.000 NA -1.530 -0.3920 -0.00188
E3 -0.5550 -2.100 0.01470 NA -0.0443 -0.05220
E4 0.7390 -0.103 0.02220 -2.080 NA 0.25900
E5 -3.1600 -0.146 -0.00184 -0.104 -1.2000 NA
$LEO.NB.CPA.AtoB
Trait E1 E2 E3 E4 E5
Trait NA -1.3400 -0.1960 -2.290 -7.890 0.7050
E1 -0.286 NA 2.2900 -0.224 -0.105 -0.8510
E2 -0.212 -8.3600 NA -2.580 -1.500 -0.4360
E3 -2.170 -1.4200 -0.0394 NA 0.431 -0.0576
E4 2.280 -0.0363 0.3270 -3.350 NA 0.5840
E5 -4.980 -0.5000 -0.8120 -0.570 -3.010 NA
Robustness analysis with respect to the number of anchored SNPs.
Since the edge orienting scores for an edge A-B critically depend on the input genetic marker sets M_A and M_B, we also recommend carrying out a robustness analysis with respect to different marker sets. In particular, the automated SNP selection
results should be carefully evaluated with regard to the threshold
parameters that were used to define the marker sets. For example,
when using a greedy SNP selection strategy, it is advisable to
study how the LEO.NB score is affected by altering the number of
most highly correlated SNPs. For a given edge and a given edge
orienting score (e.g. LEO.NB.OCA), NEO implements a robustness
analysis with respect to automatic marker selection
In the following R code, we will reproduce Figure 7(d) of our article.
Specifically, we report the LEO.NB.OCA and LEO.NB.CPA scores for the edge orientation E4->Trait.
pm=neo.get.param()
RobustnessOutput1=NEOrobustness(datSNP= datCombined[,7:106], datExpr= data.frame(E4=datCombined$E4, Trait= datCombined$Trait), TopGreedy=1:5, TopForward=1:5, no.log.quiet = TRUE, pm=pm)
LEOscoreOCAvector=data.frame(RobustnessOutput1[,c(1,21,11:20)])$LEO.NB.OCA
LEOscoreOCAvector=as.numeric(as.character(as.vector(LEOscoreOCAvector)))
LEOscoreCPAvector=data.frame(RobustnessOutput1[,c(1,21,11:20)])$LEO.NB.CPA
LEOscoreCPAvector=as.numeric(as.character(as.vector(LEOscoreCPAvector)))
range.y=range(c(LEOscoreOCAvector, LEOscoreCPAvector, 0.3, 0.8), na.rm=T)
plot(c(1:length(LEOscoreOCAvector)),LEOscoreOCAvector,type="l",ylab="LEO.score",xlab="Step, increasing SNPs",ylim=range.y,cex.lab=1.5)
abline(h=0.3, lwd=1.5, col="red")
abline(h=0.8, lwd=1.5, col="blue")
points(1:length(LEOscoreOCAvector),LEOscoreOCAvector,pch=21,bg="red",cex=1.5)
lines(1:length(LEOscoreCPAvector),LEOscoreCPAvector)
points(1:length(LEOscoreCPAvector),LEOscoreCPAvector,pch=22,cex=1.5,bg="yellow") # square
legend(length(LEOscoreOCAvector)-2,range.y[2]-.3, c("LEO.NB.OCA", "LEO.NB.CPA"), cex=1.2, pch=c(21,22,23),pt.bg=c("red","yellow","blue"))
title("Robustness Analysis",cex.main=1.5)
Caption: This corresponds to Figure 7d of our article. The red horizontal line corresponds to the threshold of 0.3 for the OCA score. The LEO.NB.OCA score exceeds this threshold of 0.3 for all steps of the robustness analysis, i.e.,~it
retrieves the orientation correctly. The blue horizontal line corresponds to the threshold of 0.8 for the CPA score. Note that the LEO.NB.CPA score exceeds this threshold for all steps of the robustness analysis, i.e.,~it retrieves the orientation correctly.
THE END
1