R SOFTWARE TUTORIAL for

General Topological Overlap Measure for Unweighted Networks.

Applications to a Yeast Gene Co-Expression Network

Andy M. Yip
Department of Mathematics
National University of Singapore
2, Science Drive 2
Singapore 117543, Singapore
Email: / Steve Horvath
Departments of Human Genetics and Biostatistics
University of California Gonda Research Center
695 Charles E. Young Drive South, Box 708822
Los Angeles, CA 90095-7088, USA
Email:

This R software tutorial is supporting material for the following paper

·  Andy M. Yip, Steve Horvath (2006) The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks. UCLA Technical Report. http://www.genetics.ucla.edu/labs/horvath/GTOM/

An important use of DNA microarray data is to annotate genes by clustering them on the basis of their gene expression profiles across several microarrays. Because the transcriptional response of cells to changing conditions involves the coordinated co-expression of genes encoding interacting proteins, studying co-expression patterns can provide insights into the underlying cellular processes. It is standard

to use the (Pearson) correlation coefficient as a co-expression measure.

Here we show how to use the generalized topological overlap matrix (GTOM) for screening for essential genes. Specifically, we compare the following dissimilarity measures for quantifying how dissimilar 2 gene expression profiles are.

  1. distCor1 = 1 - |rij|
  2. distCor6 = 1 - (rij6)
  3. distTom0 = 1 - (aij) - Ii=j
  4. distTom1 = 1 - topological overlap of 1-step neighbors
  5. distTom2 = 1 - topological overlap of 2-step neighbors
  6. distTom3 = 1 - topological overlap of 3-step neighbors

where

rij = Pearson correlation between genes i and j

aij = Dichotomized rij with a given threshold:

Measure 1: Since the absolute value of the Pearson correlation is widely used to assess how similar 2 expression profiles are, distCor1 (measure 1) is the standard way of assessing dissimililarity of expression profiles.

Measure 2: distCor6 is quite similar except that the correlation is raised to a power. Raising the correlation to a power >1 emphasizes high correlations at the expense of low correlations.

Measure 3: This measure is based on the adjacency matrix of the unweighted gene co-expression network. Since it is based on the dichotomized correlation matrix, this ignores the continuous nature of the co-expression information.

Measure 4: This is the topological overlap matrix measure used by Ravasz et al (2000) and further investigated in Zhang and Horvath (2006).

Measures 4-6 represent the generalized topological overlap dissimililarity measures proposed in Yip and Horvath (2006).

Microarray Data Set: Yeast cell cycle dataset that records gene expression levels during different stages of cell cycles in yeasts Spellman-Sherlock-et al 2000. For computational reasons, network analysis was limited to 4000 genes.

Constructing a network from gene-expression data

Largely unexplored in most high-dimensional data analyses are the relationships between gene-expression levels, which exhibit varying degrees of correlation. Genes with expression levels that are highly correlated are biologically interesting, since they may be regulated by a common regulatory mechanisms or participate in similar biological processes. To construct a network from microarray gene-expression data, we begin by calculating the Pearson correlations for all pairs of genes in the network. In principle, one could specify that the connection strength between 2 genes equals the absolute value of the correlation coefficient. Because microarray data can be noisy and the number of samples is o ften small, we and others have found it useful to emphasize strong correlations and punish weak correlations. A natural way of doing this is to

define the connection strength between 2 genes by thresholding the absolute value of the correlation coefficient, e.g. below we pick a threshold tau=0.7. This results in an unweighted network where each node represents a gene and an edge between two nodes is present if their absolute correlation coefficient exceeds a threshold tau=0.7. Our methods work for any threshold. A discussion for how this threshold is chosen is beyond the scope of the article. Here, we obtained the threshold tau$by using the scale-free criterion

proposed by Zhang and Horvath (2005).As an aside, we mention that one can also raise the absolute value of the Pearson correlation to a power ß >=1 which results in a weighted network (Zhang and Horvath 2005).

By adding up these connection strengths for each gene, we produce a single number (called connectivity, or k) that describes how strongly that gene is connected to all other genes in the network.

The next step in network construction is to identify groups of genes with similar patterns of connection strengths by searching for genes with high "topological overlap" (Ravasz et al., 2002; Zhang and Horvath, 2005). A pair of genes is said to have high topological overlap if they are both strongly connected to the same group of genes. The use of topological overlap thus serves as a filter to exclude spurious or isolated connections during network construction. After calculating the topological overlap for all pairs of genes in the network, this information is used in conjunction with a hierarchical clustering algorithm to identify groups, or modules, of densely interconnected genes.

More material on weighted network analysis can be found here

http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

In an unweighted network, the topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to, see [Ravasz et al 2002, Yip and Horvath 2006].

Definition of Topological Overlap for an Unweighted Network

As cited in our article, the topological overlap matrix comes from the manuscript by Ravasz et al 2002. They report our form of the topological overlapmatrix in the methods supplement of their paper (there is a typo in the main paper!). Topological overlap of two nodes (genes) reflects their relative interconnectivity. For a network represented by an adjacency matrix is given by

(1)

where, denotes the number of nodes to which both i and j are connected, and is the number of connections of a node, with and . Since , we find that .

Thus, . Since , we find that ωij is a number between 0 and 1. There are 2 reasons for adding to the denominator in the topological overlap matrix: 1) in this form, the denominator can never be 0 and 2) for an unweighted network, one can show that ωij=1 only if the node with fewer links satisfies two conditions: (a) all of its neighbors are also neighbors of the other node, i.e. it is connected to all of the neighbors of the other node and (b) it is linked to the other node. In contrast, ωij=0 if i and j are unlinked and the two nodes don't have common neighbors. Further, the topological overlap matrix is symmetric, i.e, ωij= ωji. and its diagonal elements are set to 0 (i.e. ωii=0). The rationale for considering this similarity measure is that nodes that are part of highly integrated modules are expected to have high topological overlap with their neighbors.

Identifying Gene Modules

Authors differ on how they define modules. Intuitively speaking, we assume that modules are groups of genes whose expression profiles are highly correlated across the samples. Modules are groups of nodes that have high topological overlap. Module identification is based on the topological overlap matrix Ω=[ωij] defined as (1). To use it in hierarchical clustering, it is turned into a dissimilarity measure by using the standard approach of subtracting it from one (i.e, the topological overlap based dissimilarity measure is defined by ).

To group genes with coherent expression profiles into modules, we use average linkage hierarchical clustering coupled with the TOM-based dissimilarity. In this article, gene modules correspond to branches of the hierarchical clustering tree (dendrogram). The simplest (not necessarily best) method is to choose a height cutoff to cut branches off the tree. The resulting branches correspond to gene modules, i.e. sets of highly co-expressed genes.

References

To cite this document please use

·  Andy M. Yip, Steve Horvath (2006) The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks. UCLA Technical Report. http://www.genetics.ucla.edu/labs/horvath/GTOM/

The weighted and unweighted gene co-expression network analysis method is described in

·  Bin Zhang and Steve Horvath (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17

For a more mathematical description of gene co-expression networks consider

·  S Horvath, J Dong, A Yip (2006) Connectivity, Module-Conformity, and Significance: Understanding Gene Co-Expression Network Methods. http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/

·  Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., and Barabási, A. L. (2002). Hierarchical organization of modularity in metabolic networks. Science 297, 1551-1555.

Getting the R software and data

Downloading the R software Go to http://www.R-project.org, download R and install it on your computer. After installing R, you need to install several additional R library packages: For example to install Hmisc, open R, go to menu "Packages\Install package(s) from CRAN", then choose Hmisc. R will automatically install the package. Do the same for some of the other libraries mentioned below. But note that several libraries are already present in the software so there is no need to re-install them.

Download the comman delimited text file YEASTCombinedCellCycle4000.csv from our webpage

http://www.genetics.ucla.edu/labs/horvath/GTOM/

Download the R function file: "NetworkFunctions.txt", which contains several R functions needed for Weighted Gene Co-Expression Network Analysis. http://www.genetics.ucla.edu/labs/horvath/GeneralFramework/NetworkFunctions.txt In that file you can also find a short description of the functions. DISCLAIMER: Absolutely no warranty on the code. Please contact SH with suggestions.

Unzip all the files into the same directory (e.g. we put it into “C:\Documents and Settings\shorvath\My Documents\ADAG\AndyYip\TutorialGTOMscreening”

# STARTING THE R session:

# Open the R software by double clicking the corresponding icon

#To interact with the R software copy and paste the commands into the R console.

#Text after "#" is a comment and is automatically ignored by R.

# Set the working directory of the R session by using the following command.

setwd(“C:/Documents and Settings/shorvath/My Documents/ADAG/AndyYip/TutorialGTOMscreening”)

# Note that we use / instead of \ in the path.

# read in the R libraries.

library(sna) # this is needed for closeness

#library(MASS)

#library(class)

library(cluster)

#library(sma) # different from sna! this is needed for plot.mat below

library(impute) # needed for imputing missing value before principal component analysis

#library(splines) # for the spline predictor to estimate the number of clusters

library(Hmisc) # probably you won’t need this

#Memory

# check the maximum memory that can be allocated

memory.size(TRUE)/1024

# increase the available memory

memory.limit(size=2048)

# read in the custom network functions

source("C:/Documents and Settings/shorvath/My Documents/RFunctions/NetworkFunctions.txt")


# Load the expression data

dat0 = read.csv(“YEASTCombinedCellCycle4000.csv”, header=T, row.names=1)

# This file contains information on every gene

datSummary=dat0[,1:31]

# This function contains the expression data: rows are microarray samples, columns are genes

datExpr = t(dat0[,32:75])

no.samples = dim(datExpr)[[1]]

dim(datExpr)

rm(dat0);collect_garbage()

#To choose a power beta for computing the connection strengths, we make use of the

# the Scale-free Topology Criterion (Zhang and Horvath 2005). We focus on the scale free

# topology model fitting index (denoted as scale.law.R.2) that quantifies how well

# a network satisfies a scale-free topology.

# The slope of the regression corresponds to the value gamma for the scale free distribution.

# To construct an unweighted network (hard thresholding),

# we consider the following vector of potential thresholds.

thresholds1= c(seq(.1,.5, by=.1), seq(.55,.9, by=.05) )

# To choose a cut-off value, we propose to use the Scale-free Topology Criterion (Zhang and

# Horvath 2005). Here the focus is on the linear regression model fitting index

# (denoted below by scale.law.R.2) that quantify the extent of how well a network

# satisfies a scale-free topology.

# The function PickHardThreshold can help one to estimate the cut-off value

# when using hard thresholding with the step adjacency function.

# The first column lists the threshold ("cut"),

# the second column lists the corresponding p-value based on the Fisher transform.

# The third column reports the resulting scale free topology fitting index R^2.

# The fourth column reports the slope of the fitting line.

# The fifth column reports the fitting index for the truncated exponential scale free model.

# Usually we ignore it.

# The remaining columns list the mean, median and maximum connectivity.

# To pick a hard threshold (cut) with the scale free topology criterion:

# aim for high scale free R^2 (column 3), high connectivity (col 6)

# and negative slope (around -1, col 4).

RdichotTable=PickHardThreshold(datExpr, cutvector= thresholds1)[[2]]

Cut p.value scale.law.R.2 slope. truncated.R.2 mean.k. median.k. max.k.

1 0.10 5.13e-01 0.5520 8.060 0.824 2820.000 2850 3350

2 0.20 1.88e-01 0.0617 1.620 0.705 1780.000 1820 2720

3 0.30 4.53e-02 -0.0977 -0.423 0.810 1000.000 1010 2090

4 0.40 6.48e-03 0.2740 -1.440 0.902 486.000 463 1440

5 0.50 4.70e-04 0.5850 -1.900 0.944 199.000 160 898

6 0.55 9.09e-05 0.6910 -1.860 0.962 117.000 83 663

7 0.60 1.32e-05 0.7560 -1.790 0.958 65.500 37 466

8 0.65 1.35e-06 0.8760 -1.620 0.988 34.400 14 314

9 0.70 8.73e-08 0.9060 -1.520 0.993 16.600 4 196

10 0.75 3.03e-09 0.8410 -1.560 0.968 7.250 1 123

11 0.80 4.31e-11 0.8980 -1.360 0.982 2.680 0 64

12 0.85 1.51e-13 0.9470 -1.230 0.961 0.797 0 32

13 0.90 0.00e+00 0.9420 -1.250 0.964 0.146 0 17

# Comment: Note that for a cut-off value (tau)=0.10 the scaling law R^2 equals 0.55, which seems #to be pretty high. However, the slope is positive, i.e. this would predict there are more genes with #high connectivity than there are genes with low connectivity. This is unbiological!

# This is the reason why we look at the signed R^2 value:

#Signed R^2=-sign(RdichotTable[,4])*RdichotTable[,3]

#Let’s plot the scale free topology model fitting index (R^2) versus the cut-off tau. However, the #R^2 values of those cut-offs that lead to a negative slope have been pre-multiplied by -1.