Tutorial

Differential Network Analysis

Tova Fuller, Steve Horvath

Correspondence: ,

Abstract

Here we illustrate differential network analysis by comparing the connectivity and module structure of two networks based on the liver expression data of lean and heavy mice. This unbiased method for comparing two phenotypically distinct subgroups of mouse samples serves as a method for understanding the underlying differential gene co-expression network topology giving rise to altered biological pathways.

This work is in press:

Tova Fuller, Anatole Ghazalpour, Jason Aten, Thomas A. Drake, Aldons J. Lusis, Steve Horvath (2007) Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome, in press.

The data are described in:

Anatole Ghazalpour, Sudheer Doss, Bin Zang, Susanna Wang,Eric E. Schadt, Thomas A. Drake, Aldons J. Lusis, Steve Horvath (2006) Integrating Genetics and Network Analysis to Characterize Genes Related to Mouse Weight. PloS Genetics

We provide the statistical code used for generating the weighted gene co-expression network results. Thus, the reader be able to reproduce all of our findings. This document also serves as a tutorial to differential weighted gene co-expression network analysis. Some familiarity with the R software is desirable but the document is fairly self-contained. This document and data files can be found at the following webpage:

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis

More material on weighted network analysis can be found here:

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

Method Description:

The data are described in the PLoS article cited above [1]. Please also refer to the citations above and below for more information regarding weighted gene co-expression network analysis (WGCNA).

Here we attempt to show that networks may be constructed from two phenotypically different subgroups of samples from a prior WGCNA experiment on mice. Here we identify 30 mice at both extremes of the weight spectrum in the BxH data and construct weighted gene co-expression networks from each.

Network Construction:

In co-expression networks, network nodes correspond to genes and connection strengths are determined by the pairwise correlations between expression profiles. In contrast to unweighted networks, weighted networks use soft thresholding of the Pearson correlation matrix for determining the connection strengths between two genes. Soft thresholding of the Pearson correlation preserves the gene co-expression information and leads to weighted co-expression networks that are highly robust with respect to the construction method [2].

The network construction algorithm is described in detail elsewhere [2]. Briefly, a gene co-expression similarity measure (absolute value of Pearson’s product moment correlation) was used to relate every pairwise gene-gene relationship. An adjacency matrix was then constructed using a `soft’ power adjacency function aij = Power(sij, b) º |sij| b where sij is the co-expression similarity, and aij represents the resulting adjacency that measures the connection strengths. The power b is chosen using the scale free topology criterion proposed in Zhang and Horvath (2005). Briefly, the power was chosen such the resulting network exhibited approximate scale free topology and a high mean number of connections. The scale free topology criterion led us to choose a power of b = 6 based on the preliminary network built from the 8000 most varying genes. However, since we are using a weighted network as opposed to an unweighted network, the biological findings are highly robust with respect to the choice of this power [2].

Topological Overlap Matrix and Gene Modules

The adjacency matrix was then used to define a network distance measure or more precisely a measure of node dissimilarity based on the topological overlap matrix [2]. Specifically the topological overlap matrix is given by

where denotes the number of nodes to which both i and j are connected, and u indexes the nodes of the network. The topological overlap matrix (TOM) is given by Ω=[ωij]. ωij is a number between 0 and 1 and is symmetric (i.e, ωij= ωji). The rationale for considering this similarity measure is that nodes that are part of highly integrated modules are expected to have high topological overlap with their neighbors.

Network Module Identification.

Gene "modules" are groups of nodes that have high topological overlap. Module identification was based on the topological overlap matrix Ω=[ωij] defined above. To use it in hierarchical clustering, it was turned into a dissimilarity measure by subtracting it from one (i.e, the topological overlap based dissimilarity measure is defined by ). Based on the dissimilarity matrix we can use hierarchical clustering to discriminate one module from another. We used a dynamic cut-tree algorithm for automatically and precisely identifying modules in hierarchical clustering dendrogram (the details of the algorithm could be found at http://www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut).

The algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination and the process is iterated until the number of clusters becomes stable. No claim is made that our module construction method is optimal. A comparison of different module construction methods is beyond the scope of this paper.

Network Comparison Measures:

For the ith gene, we denote the whole network connectivity by k1(i) and k2(i) in networks 1 and 2, respectively. To facilitate the comparison between the connectivity measures of each network, we divide each gene connectivity by the maximum network connectivity, i.e. and . We utilize the following measure of differential connectivity as DiffK(i) = K1(i) – K2(i), but other measures of differential connectivity could also be considered. To measure differential gene expression between the lean and heavy mice, we use the absolute value of the Student t-test statistic.

Sector plots and permutation test:

Plotting DiffK versus the t-test statistic value for each gene gives a visual demonstration of how difference in connectivity relates to a more traditional t-statistic describing difference in expression between the two networks.

To determine whether membership in each of these sectors was significant, I permute high or low weight status among our 60 mice in Networks 1 and 2, and reconstruct each network accordingly. I obtained values DiffKperm and ttestperm as shown above, taking into account permuted high or low weight status.

This permutation process was repeated 1000 times (no.perm = 1000), each time noting the amount of genes that could be found in each of the m = 8 sectors as . Similarly, the sector counts for unpermuted, observed data was noted as . Then, I found a value Nm for each sector representing the number of iterations for which . The p-value for each sector was found as .

Functional Analysis

Notable genes were analyzed for pathway enrichment using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) [3] (http://david.niaid.nih.gov/david/ease.htm).

1. Ghazalpour, A., et al., Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet, 2006. 2(8): p. e130.

2. Zhang, B. and S. Horvath, A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol, 2005. 4: p. Article17.

3. Dennis, G., Jr., et al., DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol, 2003. 4(5): p. P3.

Statistical References

To cite this tutorial or the statistical methods please use

1. Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17

For the generalized topological overlap matrix as applied to unweighted networks see

2. Yip A, Horvath S (2006) Generalized Topological Overlap Matrix and its Applications in Gene Co-expression Networks. Proceedings Volume. Biocomp Conference 2006, Las Vegas. Technical report at http://www.genetics.ucla.edu/labs/horvath/GTOM/.

For some additional theoretical insights consider

3. Horvath S, Dong J, Yip A (2006) The Relationship between Intramodular Connectivity and Gene Significance. Proceedings Volume. Biocomp Conference 2006, Las Vegas. Technical report at http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/

4. Horvath, Dong, Yip (2006) Using Module Eigengenes to Understand Connectivity and Other Network Concepts in Co-expression Networks. Submitted.

# Absolutely no warranty on the code. Please contact TF or SH with suggestions.

# Downloading the R software

# 1) Go to http://www.R-project.org, download R and install it on your computer

# After installing R, you need to install several additional R library packages:

# For example to install Hmisc, open R,

# go to menu "Packages\Install package(s) from CRAN",

# then choose Hmisc. R will automatically install the package.

# When asked "Delete downloaded files (y/N)? ", answer "y".

# Do the same for some of the other libraries mentioned below. But note that

# several libraries are already present in the software so there is no need to re-install them.

# To get this tutorial and data files, go to the following webpage

# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis

# Download the zip file containing:

# 1) R function file: "NetworkFunctions.txt", which contains several R functions

# needed for Network Analysis.

# Unzip all the files into the same directory,

## The user should copy and paste the following script into the R session.

## Text after "#" is a comment and is automatically ignored by R.

# read in the R libraries

library(MASS) # standard, no need to install

library(class) # standard, no need to install

library(cluster)

library(sma) # install it for the function plot.mat

library(impute)# install it for imputing missing value

library(scatterplot3d)

# Note: alter the following file paths to point towards where your files are.

source("/Users/TovaFuller/Documents/HorvathLab2006/NetworkFunctions/NetworkFunctions.txt")

setwd("/Users/TovaFuller/Documents/HorvathLab2007/MouseProject2.0/DiffNetworkAnalysis/")

# The following 3421 probe set were arrived at using the following steps

#1) reduce to the 8000 most varying, 2) 3600 most connected, 3) focus on unique genes

dat0=read.table("cnew_liver_bxh_f2female_8000mvgenes_p3600_UNIQUE_tommodules.xls",header=T)

names(dat0)

# this contains information on the genes

datSummary=dat0[,c(1:8,144:150)]

# the following data frame contains the gene expression data: columns are genes, rows are

# arrays (samples)

datExpr <- t(dat0[,9:143])

no.samples <- dim(datExpr)[[1]]

dimnames(datExpr)[[2]]=datSummary[,1]

dim(datExpr)

# We read in the clinical data

datClinicalTraits=read.csv("BXH_ClinicalTraits_361mice_forNewBXH.csv",header=T)

#Now we order the mice so that trait file and expression file agree

restrictMice=is.element(datClinicalTraits$MiceID,dimnames(datExpr)[[1]])

table(restrictMice)

datClinicalTraits=datClinicalTraits[restrictMice,]

orderMiceTraits=order(datClinicalTraits$MiceID)

orderMiceExpr=order(dimnames(datExpr)[[1]])

datClinicalTraits =datClinicalTraits[orderMiceTraits,]

datExpr =datExpr[orderMiceExpr,]

# From the following table, we verify that all 135 mice are in order

table(datClinicalTraits$MiceID==dimnames(datExpr)[[1]])

rm(dat0);collect_garbage()

BodyWeight=as.numeric(datClinicalTraits$WeightG)

# Now we find the 30 heaviest and leanest mice. Network 1 will refer to modules defined by

# the 30 leanest mice, and Network 2 will refer to modules defined by the 30 heaviest mic.

rest1=rank(BodyWeight,ties="first")<=30

rest2=rank(-BodyWeight,ties="first")<=30

# We check to make sure there are 30 of each, with no overlap between the groups

table(rest1)

table(rest2)

table(rest1==T & rest2==T)

# FALSE

# 135

# We separate the expressions for the fat and lean groups:

datExpr1=datExpr[rest1,]

datExpr2=datExpr[rest2,]

# Now we find the whole network connectivity measures for each:

k1=SoftConnectivity(datExpr1,6)

k2=SoftConnectivity(datExpr2,6)

# We would like to normalize the connectivity measures. We do this by dividing by the

# maximum values.

K1=k1/max(k1)

K2=k2/max(k2)

# Now we find the difference between the two connectivity values.

DiffK=K1-K2 # Note that we did not take the absolute value here.

# Negative values of this difference imply that normalized Lean connectivity (k2) is

# greater than K1, fat normalized connectivity.

poolRest=as.logical(rest1+rest2)

factorLevels=rep(NA,length(poolRest))

factorLevels[rest1]="lowWeight" # trait 1 is low weight

factorLevels[rest2]="highWeight" # trait 2 is high weight

ttest=rep(NA, dim(datExpr)[[2]])

# Let's determine the t-statistic.

for (i in 1:dim(datExpr)[[2]]){

ttest[i]=t.test(datExpr[poolRest,i]~factorLevels[poolRest])$statistic

}

# Permuted Status

# Let's create a control DiffK with permuted fat/lean status. We choose from the same

# mice that were considered in the fat and lean groups. Note results from this following

# analysis will be different each time because we are sampling randomly.

temp1=sample(which(poolRest==T),30,replace=F)

Rest1=rep(F,length(poolRest))

Rest1[temp1]=T

table(Rest1)

Rest2=as.logical(poolRest-Rest1)

table(Rest2)

table(Rest1==T & Rest2==T)

# FALSE

# 135 good!

factorLevels2=rep(NA,length(poolRest))

factorLevels2[Rest1]="permLowWeight"

factorLevels2[Rest2]="permHighWeight"

datExpr1.2=datExpr[Rest1,]

datExpr2.2=datExpr[Rest2,]

# Now we find the whole network connectivity measures for each:

k1.2=SoftConnectivity(datExpr1.2,6)

k2.2=SoftConnectivity(datExpr2.2,6)

# We would like to normalize the permuted connectivity measures. We do this by

# dividing by the maximum values.

K1.2=k1.2/max(k1.2)

K2.2=k2.2/max(k2.2)

# Now we find the difference between the two connectivity values.

DiffK.2=K1.2-K2.2 # Note that we did not take the absolute value here.

ttest2=rep(NA, dim(datExpr)[[2]])

# Let's determine the t-statistic.

for (i in 1:dim(datExpr)[[2]]){

ttest2[i]=t.test(datExpr[poolRest,i]~factorLevels2[poolRest])$statistic

}

# Plotting DiffK versus ttest in Unpermuted and Permuted

par(mfrow=c(1,2))

plot(DiffK,ttest, main=paste("Unpermuted: cor=", signif(cor(DiffK,ttest, use="pairwise.complete.obs") ,3)),xlim=range(DiffK),ylim=range(ttest)) # colorGroup1 ??

abline(h=1.96)

abline(h=-1.96)

abline(v=.4)

abline(v=-.4)

plot(DiffK.2,ttest2, main=paste("Permuted: cor=", signif(cor(DiffK.2,ttest2, use="pairwise.complete.obs") ,3)), xlim=range(DiffK),ylim=range(ttest))

# colorGroup1 ??

abline(h=1.96)

abline(h=-1.96)

abline(v=.4)

abline(v=-.4)

# We repeat this permutation 1000 times, each time

# counting the amount of "dots" in the quadrants

# shown at left.

# Now we wish to obtain a p-value for the DiffK

# when there is permutation versus when there is

# not.

# Here are the sector counts for unpermuted:

sector1obs = (ttest>1.96) & (DiffK<(-0.4))

n1obs=sum(sector1obs)

sector2obs = (ttest>1.96) &(DiffK>(-0.4)) & (DiffK<0.4)

n2obs=sum(sector2obs)

sector3obs = (ttest>1.96) & (DiffK>0.4)

n3obs=sum(sector3obs)

sector4obs = (ttest>(-1.96)) & (ttest<1.96) & (DiffK>0.4)

n4obs=sum(sector4obs)

sector5obs = (ttest<(-1.96)) & (DiffK>0.4)

n5obs=sum(sector5obs)

sector6obs = (ttest<(-1.96)) &(DiffK>(-0.4)) & (DiffK<0.4)

n6obs=sum(sector6obs)

sector7obs = (ttest<(-1.96)) & (DiffK<(-0.4))

n7obs=sum(sector7obs)

sector8obs = (ttest>(-1.96)) & (ttest<1.96) & (DiffK<(-0.4))

n8obs=sum(sector8obs)

# We will find a vector of these same counts for each permutation, but the niobs value

# will always stay the same (scalar)

no.perms=1000

# We started with 100, and then redid the experiment with no.perms=1000

n1perm=rep(NA,no.perms)

n2perm=rep(NA,no.perms)

n3perm=rep(NA,no.perms)

n4perm=rep(NA,no.perms)

n5perm=rep(NA,no.perms)

n6perm=rep(NA,no.perms)

n7perm=rep(NA,no.perms)

n8perm=rep(NA,no.perms)

# We turn now to obtaining sector counts for each permutation

for (i in c(1:no.perms)) {

temp1=sample(which(poolRest==T),30,replace=F)

Rest1=rep(F,length(poolRest))

Rest1[temp1]=T

Rest2=as.logical(poolRest-Rest1)

factorLevels2=rep(NA,length(poolRest))

factorLevels2[Rest1]="fakefat"

factorLevels2[Rest2]="fakelean"

datExprFat.2=datExpr[Rest1,]

datExprLean.2=datExpr[Rest2,]

k1.2=SoftConnectivity(datExprFat.2,6)

k2.2=SoftConnectivity(datExprLean.2,6)

K1.2=k1.2/max(k1.2)

K2.2=k2.2/max(k2.2)

DiffK.2=K1.2-K2.2 # Note that we did not take the absolute value here.

ttest2=rep(NA, dim(datExpr)[[2]])

# Let's determine the t-statistic.

for (j in 1:dim(datExpr)[[2]]){

ttest2[j]=t.test(datExpr[poolRest,j]~factorLevels2[poolRest])$statistic

}

sector1perm = (ttest2>1.96) & (DiffK.2<(-0.4))

n1perm[i]=sum(sector1perm)

sector2perm = (ttest2>1.96) &(DiffK.2>(-0.4)) & (DiffK.2<0.4)

n2perm[i]=sum(sector2perm)

sector3perm = (ttest2>1.96) & (DiffK.2>0.4)

n3perm[i]=sum(sector3perm)

sector4perm = (ttest2>(-1.96)) & (ttest2<1.96) & (DiffK.2>0.4)

n4perm[i]=sum(sector4perm)

sector5perm = (ttest2<(-1.96)) & (DiffK.2>0.4)

n5perm[i]=sum(sector5perm)

sector6perm = (ttest2<(-1.96)) &(DiffK.2>(-0.4)) & (DiffK.2<0.4)

n6perm[i]=sum(sector6perm)

sector7perm = (ttest2<(-1.96)) & (DiffK.2<(-0.4))

n7perm[i]=sum(sector7perm)

sector8perm = (ttest2>(-1.96)) & (ttest2<1.96) & (DiffK.2<(-0.4))

n8perm[i]=sum(sector8perm)

}

logicalSum1=sum(n1obs<=n1perm)

logicalSum2=sum(n2obs<=n2perm)

logicalSum3=sum(n3obs<=n3perm)

logicalSum4=sum(n4obs<=n4perm)

logicalSum5=sum(n5obs<=n5perm)

logicalSum6=sum(n6obs<=n6perm)

logicalSum7=sum(n7obs<=n7perm)

logicalSum8=sum(n8obs<=n8perm)

pval1=(logicalSum1+1)/(no.perms+1)

pval1

pval2=(logicalSum2+1)/(no.perms+1)

pval2

pval3=(logicalSum3+1)/(no.perms+1)

pval3

pval4=(logicalSum4+1)/(no.perms+1)

pval4

pval5=(logicalSum5+1)/(no.perms+1)

pval5

pval6=(logicalSum6+1)/(no.perms+1)

pval6

pval7=(logicalSum7+1)/(no.perms+1)

pval7

pval8=(logicalSum8+1)/(no.perms+1)

pval8

# For no.perms=100:

# pval1

# [1] 0.5940594

# pval2

# [1] 0.00990099

# pval3

# [1]0.00990099

# pval4

# [1] 0.930693

# pval5

# [1] 0.00990099

# pval6

# [1] 0.00990099

# pval7

# [1] 0.4950495

# pval8

# [1] 0.8118812

# For no.perms=1000

pval1

[1] 0.4645355

pval2

[1] 0.000999001

pval3

[1] 0.000999001

pval4

[1] 0.947053

pval5

[1] 0.00999001

pval6

[1] 0.000999001

pval7

[1] 0.3346653

pval8

[1] 0.7972028

# Because we are dividing by no.perms, the p-values are 10-fold different for 100 or 1000

# iterations.

# The result we achieve makes sense; regions with significantly high or low ttest values

# (p < 0.05) tend to be significant. However, since there are few data points that have a

# DiffK value < -0.4, the upper and lower right hand corner sectors do not prove

# significant. This could be because DiffK is defined as "normalized low weight

# connectivity" minus "normalized high weight connectivity" and there are few genes that in

# fact are more connected in the fat than in the lean that also have extreme ttest scores.

# Module Identification

beta1=6

# Note: we use a beta1 = 6 as this was the value used in our previous analysis

# Finding hierGTOM in our Network 1 (low weight).

datExpr1=data.frame(datExpr1)

AdjMat1=abs(cor(datExpr1,use="p"))^beta1

collect_garbage()

dissGTOM1=TOMdist1(AdjMat1)

collect_garbage()