NEO Mini Tutorial
Jason Aten, Steve Horvath
Here we describe the basic commands and output of the network edge orienting software (NEO) using a simple example. We simulate two expression traits E1 and E2 and anchor them to two SNP markers each. Next we compute edge orienting scores for evaluating the orientation E1 -> E2.
If you are interested in a simulated example involving multiple traits and edges, we recommend another NEO tutorial called "NEOTutorialMultiEdgeSimulation.doc".
To cite this tutorial and the R code, please use the following reference.
Aten JE, Fuller TF, Lusis AJ, Horvath S (2008) Using genetic markers to orient the edges in quantitative trait networks: the NEO software.BMC Systems Biology 2008, 2:34
# Copy and paste the following functions into R
# The following path should be rather small. A long pathy may result in an error.
dir.create("C:/NEOminiTutorial")
setwd("C:/NEOminiTutorial")
# Please adapt the following paths so that the point to the directory where
# the causality function files are located. Note that we use / instead of \
source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/CausalityFunctions.txt")
source("C:/Documents and Settings/Steve Horvath/My Documents/RFunctions/updatedneo.txt")
# this is the number of samples (e.g. number of mice or humans)
no.samples=100
# Here we set the random seed so that the simulation results become reproducible.
set.seed(1);
# here we simulate SNP markers will allele frequency 0.5.
SNP1=sample(c(0,1,2), size=no.samples, prob=c(.25,.5,.25) ,replace=T);
SNP2= sample(c(0,1,2), size=no.samples, prob=c(.25,.5,.25) ,replace=T);
SNP3= sample(c(0,1,2), size=no.samples, prob=c(.25,.5,.25) ,replace=T);
SNP4= sample(c(0,1,2), size=no.samples, prob=c(.25,.5,.25) ,replace=T);
environmental.variation=0.5;
# SNPs 1 and 2 additively cause variation in E1
E1=scale(scale(SNP1+SNP2)+rnorm(no.samples,sd=sqrt(environmental.variation)))
# Now we simulate E1 -> E2 and SNPs 3 and 4 additively cause variation in E2
E2=scale(scale(E1+SNP3+SNP4)+rnorm(no.samples,sd=sqrt(environmental.variation)))
# The following specifies a data frame of the SNP markers
datSNP=data.frame(SNP1,SNP2,SNP3,SNP4);
neo1=NEO.scores(datSNP=datSNP ,A=E1,B=E2, detailed.output=T)
# Let’s look at the values of the NEO.scores function
neo1
$LEO.NB.OCA.AtoB
B
A 6.94
$LEO.NB.CPA.AtoB
B
A 4.78
#Comment:
The edge orienting score LEO.NB.OCA(E1->E2) = 6.94 lies way above the recommended threshold of 0.3, i.e. there is very strong evidence for a causal relationship E1->E2.
Similarly, the score LEO.NB.CPA(E1->E2) = 4.78 is also above its recommended threshold of 0.8.
Let’s us evaluate the wrong orientation E2 -> E1
neoFalseDirection=NEO.scores(datSNP=datSNP ,A=E2,B=E1, detailed.output=T)
neoFalseDirection
$LEO.NB.OCA.AtoB
B
A -8.56
$LEO.NB.CPA.AtoB
B
A -5.88
Comment: note that both edge scores are highly negative, i.e. the wrong orientation
E2->E1 is not supported by NEO.
Robustness analysis with respect to the number of anchored SNPs.
Since the edge orienting scores for an edge A-B criticallydepend on the input genetic marker sets M_A and M_B, we alsorecommend carrying out a robustness analysis with respect todifferent marker sets. For a given edge and a given edge
orienting score (e.g. LEO.NB.OCA), NEO implements a robustness
analysis with respect to automatic marker selection
Specifically, we report the LEO.NB.OCA and LEO.NB.CPA scores for the edge orientation E1-> E2 as a function of automatically assigned SNP markers.
# Let’s create a data frame of expression traits
datExpr1= data.frame(E1, E2)
# the following command is needed for assigning default parameter to NEO.
pm=neo.get.param()
RobustnessOutput1=NEOrobustness(datSNP=datSNP, datExpr=datExpr1, TopGreedy=1:5, TopForward=1:5, no.log.quiet = TRUE, pm=pm)
LEOscoreOCAvector=data.frame(RobustnessOutput1[,c(1,21,11:20)])$LEO.NB.OCA
LEOscoreOCAvector=as.numeric(as.character(as.vector(LEOscoreOCAvector)))
LEOscoreCPAvector=data.frame(RobustnessOutput1[,c(1,21,11:20)])$LEO.NB.CPA
LEOscoreCPAvector=as.numeric(as.character(as.vector(LEOscoreCPAvector)))
range.y=range(c(LEOscoreOCAvector, LEOscoreCPAvector), na.rm=T)
plot(c(1:length(LEOscoreOCAvector)),LEOscoreOCAvector,type="l",ylab="LEO.score",xlab="Step, increasing SNPs",ylim=range.y,cex.lab=1.5)
abline(h=0.3, lwd=1.5, col="red")
abline(h=0.8, lwd=1.5, col="blue")
points(1:length(LEOscoreOCAvector),LEOscoreOCAvector,pch=21,bg="red",cex=1.5)
lines(1:length(LEOscoreCPAvector),LEOscoreCPAvector)
points(1:length(LEOscoreCPAvector),LEOscoreCPAvector,pch=22,cex=1.5,bg="yellow") # square
legend(length(LEOscoreOCAvector)-2,range.y[2]-.3, c("LEO.NB.OCA", "LEO.NB.CPA"), cex=1.2, pch=c(21,22,23),pt.bg=c("red","yellow","blue"))
title("Robustness Analysis",cex.main=1.5)
Caption: This robustness plot shows how the edge orienting scores depend on the number of automatically selected SNPs. At each step, the NEO uses both the greedy and forward selection model for finding SNPs. Remember that we anchored each trait to two true SNPs in the simulation model, which explains why the results don’t change after step 2.
The red and the blue horizontal lines correspond to the thresholds of 0.3 and 0.8 for the OCA and the CPA scores, respectively.
MORE ADVANCED TUTORIAL
In the following, we describe advanced commands for the NEO analysis.
First, we describe an alternative method for inputting traits and SNP data.
# Let’s create a data frame of expression traits
datExpr1= data.frame(E1, E2)
# We create a combined data frame that contains the traits and the SNP markers.
datCombined=data.frame(datExpr1, datSNP)
# we assign default parameters
pm=neo.get.param()
# To see the values of the default parameters, simply write
pm
# Many of the parameters are not important. Please look in the file neo.txt to learn more.
# The following is the title used in the network graph
# and the NEO spread sheets and log files"
# Please do not insert spaces into the run.title, otherwise you get an error message.
pm$run.title="MiniTutorial"
# This tells the software to output a log file
pm$no.log=FALSE
pm$quiet=FALSE
# let us look at the names of the variables in datCombined
names(datCombined)
"E1" "E2" "SNP1" "SNP2" "SNP3" "SNP4"
# Here tell NEO that we want to evaluate the edge E1 and E2
pm$A=1 # E1 corresponds to column 1 in datCombined
pm$B=2 # E2 corresponds to column 2 in datCombined
pm$top.N.snps.per.trait=3 # this is the maximum number of anchors for each trait.
# this activates the calculation of an edge orienting score, which is explained below.
pm$do.m1m2.average = TRUE
pm$excel.path= "\"C:\\Program Files\\Microsoft Office\\Office11\\EXCEL.EXE\""
NEOoutput1=neo(datCombined,pm=pm);
After running these commands an Excel file is available (and it may open automatically if the Excel Path is set correctly (see the red line above))
The second to last line of the output in the R console tells you where the NEO spreadsheet file is located.
Here it is in
[1] "See file '"C:\\NEOminiTutorial\\neo.logs.Thu_Mar_06_11.35.35_2008.\\neo.logs.Thu_Mar_06_11.35.35_2008..csv"
and that directory for records of the run."
Note that calling the NEO function produced the following graph
The blue lines between SNPs and trait indicate when a SNP to trait correlation exceeds the threshold pm$cor.dep.th (here=.1 as can be seen from the first line of the graph title.).
The graph indicates that SNP1 and SNP2 are causal anchors of trait E1.
Analogously, SNP3 and SNP4 are causal anchors of trait E2.
As an aside, we mention that the computation of the LEO.NB scores does not allow that two traits are anchored to the same SNP (consistency heuristic).
The reported LEO score above the blue edge E1-E2 is by default the LEO.NB.OCA score (which is controlled by the NEO parameter pm$eo.type). The top of the file neo.txt documents all the possible parameter choices and their defaults.
Since in this case 6.9 is greater than pm$edge.th (here 1) an arrow is drawn.
The threshold of 1 requires that the causal model is at least 10 fold more likely than the next best alternative model.
An alternative method for visualizing a causal network is implemented in our function
DirectedNetworkPlot, see the NEO tutorial "NEOTutorialMultiEdgeSimulation.doc".
The contents of the variables of the NEOoutput file can be looked at by writing
names(NEOoutput1)
[1] "call" "my.title"
[3] "excel.file" "M"
[5] "super" "dir.only"
[7] "af" "qlin"
[9] "cx" "pcor"
[11] "covx" "numsnps"
[13] "lcd.edge" "lcd.drop"
[15] "coor" "pm"
[17] "non.snp.cols" "snpcols"
[19] "zeo" "nzeo"
[21] "ZEO" "stats"
[23] "datCombined" "twostep.log"
[25] "final.eo" "zeo.log"
[27] "eo.all.log" "eo.max.log"
[29] "eo.for.log" "losem.eo.all.log"
[31] "losem.eo.max.log" "losem.eo.for.log"
[33] "my.snp" "min.k.abs.pcor"
[35] "min.k" "min.k.abs.pcor.ignore.supernode"
[37] "min.k.nosup" "collider.zz"
[39] "top10" "bot10"
[41] "pre.save.fname" "post.save.fname"
[43] "final.coor"
#Comment: The output is very comprehensive and many columns can be ignored.
# Detailed explanations can be found in Jason Aten’s dissertation.
# Let us simply output the following important columns
NEOoutput1$stats[,c("edge", "LEO.NB.OCA", "LEO.NB.CPA", "LEO.NB.MAX", "M1M2.AVG")]
edge LEO.NB.OCA LEO.NB.CPA LEO.NB.MAX M1M2.AVG
1 E1 -> E2 6.94 4.78 -0.0858 1.90
2 E2 -> E1 -8.56 -5.88 0.0858 -4.47
3 SNP1 -> E1 NA NA NA NA
4 SNP2 -> E1 NA NA NA NA
5 SNP3 -> E1 NA NA NA NA
6 SNP4 -> E1 NA NA NA NA
7 SNP1 -> E2 NA NA NA NA
8 SNP2 -> E2 NA NA NA NA
9 SNP3 -> E2 NA NA NA NA
10 SNP4 -> E2 NA NA NA NA
Comment: edge orienting scores are only computed between traits (not between edges involving a SNP. The LEO.NB scores OCA and CPA are decribed in our article Aten et al (2008). In the following we briefly describe the alternative (and often inferior) edge orienting scores LEO.NB.MAX and M1M2.AVG.
Explanation of LEO.NB.MAX
First, each trait is automatically anchored to multiple SNPs using the greedy and forward stepwise procedure.
Second, SNPs may be swapped between traits of an edge according to the consistence heuristic.
Third, pick the strongest SNP from MA into A and the strongest SNP from MB into B.
Here strength is measured by the absolute value of the Pearson correlation.
Fourth, the LEO.NB.OCA score is computed based on the best SNP for each trait.
Explanation of M1M2.AVG
First, each trait is automatically anchored to multiple SNPs using the greedy and forward stepwise procedure.
Second, SNPs may be swapped between traits of an edge according to the consistence heuristic.
Third, the edge scoring evaluates pairs of SNPs where the first SNP comes from MA and the second from MB.For each SNP pair, the LEO.NB.OCA score is computed.
Fourth, the M1M2.AVG edge orienting score is defined as the average LEO.NB.OCA score across all pairs of MA and MB markers.
If an alternative aggregation method is desired, e.g. median instead of average, one can access the pairwise LEO.NB scores in the log file in the NEO spreadsheet file under the column LINK.to.LOG.M1M2.AVG.
Here is an excerpt from this log file.
M1M2.Average LOG for A( E1 ) -> B( E2 ) with M1M2.AVG score of 1.9
M.A markers were: SNP1
M.A markers were: SNP2
M.B markers were: SNP3
M.B markers were: SNP4
m1m2.labels.AB = c("A(E1).to.B(E2).with.MA(SNP1).and.MB(SNP3)","A(E1).to.B(E2).with.MA(SNP1).and.MB(SNP4)","A(E1).to.B(E2).with.MA(SNP2).and.MB(SNP3)","A(E1).to.B(E2).with.MA(SNP2).and.MB(SNP4)")
m1m2.scores.AB = c(-0.0858,2.2,0.812,4.68)
m1m2.labels.BA = c("A(E2).to.B(E1).with.MA(SNP3).and.MB(SNP1)","A(E2).to.B(E1).with.MA(SNP4).and.MB(SNP1)","A(E2).to.B(E1).with.MA(SNP3).and.MB(SNP2)","A(E2).to.B(E1).with.MA(SNP4).and.MB(SNP2)")
m1m2.scores.BA = c(0.0858,-8.1,-0.812,-9.05)
The individual pairwise LEO.NB scores can be found in m1m2.scores.AB
#Copy and paste the following from the log file into R
m1m2.scores.AB = c(-0.0858,2.2,0.812,4.68)
# Then you can run
mean(m1m2.scores.AB)
mean(m1m2.scores.AB)
[1] 1.90155
#Comment: This average score is reported as M1M2.AVG in the A->B row of
NEOoutput1$stats[NEOoutput1$stats$edge=="E1 -> E2", "M1M2.AVG"]
[1] 1.9
THE END
1