SUPPLEMENTARY INFORMATION

Table of Contents

1. Introduction

2. Supplementary information aboutsamples from different databases

2.1. Study design

2.2. Brain collections

2.2.1. Stanley Medical Research Institute (SMRI): Bipolar disorder (BD), Schizophrenia (SCZ) and control (CTRL) samples

2.2.1.1. Prefrontal cortex (PFC)

2.2.1.2. Parietal cortex (PCX)

2.2.2.3. Cerebellum (CB)

2.2.2. Victorian Brain Bank Network (VBBN): SCZ and CTRL samples

3. SMRI PCX and CB sample preparation

3.1.RNA extraction

3.2. Array processing

4. Gene expression data quality assessment

4.1. Probe level assessment

4.1.1. Affymetrix Gene Array filtering

4.1.2. Affymetrix U133 Array filtering

4.2. Sample level assessment

4.3. Batch effect adjustment

4.4. Summary

5. Gene expression data analyses

5.1. Identification of differential expression genes between SCZ and CTRL samples in PCX-SMRI

5.2. Weighted gene co-expression network analysis (WGCNA)

5.2.1. Network construction

5.2.2. Module detection

5.3. Characteristics of modules

5.4. Module preservation statistics

5.5. Pathway and functional analyses

6. Genotyping data from different consortiums

6.1. Genetic Association Information Network (GAIN)-SCZ

6.2. GAIN-BD

6.2. Translational Genomics Research Institute(TGen)-BD

7. Imputation of GWAS data

8. Integration analysis of expression and GWAS data

9. Supplementary references

10. Supplementary figures

Supplementary Fig. 1|Cross-tabulation of results displays overlaps between Oldham modules and PCX-CTRL modules.

Supplementary Fig. 2 | Topological overlap matrix(TOM) plot to detect modulesin PCX

Supplementary Fig. 3 | Composite preservation statistics of SMRI PCX modules in SMRI PFC SCZ and CTRL samples

Supplementary Fig. 4 | Composite preservation statistics of SMRIPCX modules in SMRICB SCZ and CTRL samples

Supplementary Fig. 5 | Composite preservation statistics of SMRIPCX modules in VBBNPFC SCZ and CTRL samples

Supplementary Fig. 6 | Composite preservation statistics of SMRIPCX modules in SMRIPCXBD and CTRL samples

Supplementary Fig. 7 | Composite preservation statistics of SMRIPCX modules in SMRICBBD and CTRL samples

Supplementary Fig. 8 | Manhattan Plot for M1A gene set enrichmentin GAIN-BD genetic association signals

Supplementary Fig. 9 | Manhattan Plot for M1A gene set enrichmentin GAIN-SCZ genetic association signals

Supplementary Fig. 10 | Manhattan Plot for M1A gene set enrichmentin TGen-BD genetic association signals

11. Supplementary tables

Supplementary Table 1 |Demography of SMRI PCX samples

Supplementary Table 2 |Demography of SMRI CB samples

Supplementary Table 3 |Demography of SMRI PFC samples

Supplementary Table 4 |Demography of VBBN PFC samples

Supplementary Table 5 | M1A gene list and intramodularconnectivity

Supplementary Table 6 | M3A gene list and intramodularconnectivity

Supplementary Table 7 | M1A gene list and top GO functions

Supplementary Table 8 | M3A gene list and top GO functions

1. Introduction

This file contains detailed information about sample sourcesand preparation (sections 2, 3), data preprocessing and analysis (sections 4-8), and supplementary figures and tables. The samplesources include the online resources from whichwe downloaded data and the data produced in our lab. The data preprocessing and analysis includesquality control assessments and data analysis. In sections 5- 8, we provide detailed information ongene expression network analysis, module preservation analysis, pathway analysis, and genetic signals enrichment test. Additional tables andfigures arepresented atthe end.

2. Supplementary information of samples from different databases

2.1. Study design

Parietal cortex tissues from Stanley Medical Research Institute (SMRI) SCZ and CTRL samples were used as preliminary data in our analysis49. To validate our findings, we tested them in gene expression data sets from three brain banks, three brain regions and two psychotic diseases.

2.2. Brain collections

2.2.1. Stanley Medical Research Institute (SMRI): BD, SCZ and CTRL samples

Brains came from the SMRI’s Neuropathology Consortium and Array collections, and included 50 SCZ samples, 50 BD samplesand 50 CTRL samples. The detailed information about age, sex, race, postmortem interval, pH and side of brain is provided in the demographics table (Supplementary Table 1).

2.2.1.1. Parietal cortex (PCX) and Cerebellum (CB)

Our lab obtained the PCX and CB tissues from SMRI. The RNA was prepared in our lab, then sent to the Yale core facilityfor the microarray experiments.The details of the sample preparation procedure arelisted in section 3.

2.1.1.2. Prefrontal cortex (PFC)

PFC gene expression data was downloaded from the Stanley Medical Research Institute’s online genomics database ( The Study ID is five and created by Seth E. Dobrin.Brain region is frontalBA 46 and array type is Affymetrix hug133p.

2.2.2. Victorian Brain Bank Network (VBBN): SCZ and CTRL samples

VBBN expression data was downloaded from the Gene Expression Omnibus (GEO) database. The data came from postmortem brain tissue (BA46)of 30 schizophrenic patients and 29 age- and sex-matched CTRLs ( )50.

3. PCX and CB sample preparation

3.1. RNA extraction

RNA was extracted from the cerebellarcortex of 132 samples, and parietal cortex of 146 samples using the RNeasy Mini kit (Qiagen, Valencia, CA). The concentration and A260/A280 ratio were measured on the NanoDrop spectrophotometer. The 28S:18S rRNA ratio and RNA Integrity Number (RIN) were measured using an RNA LabChip kit on the Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Only RNA samples with a RIN > 7 were used for the expression profiling.

3.2. Array processing

Affymetrix Human Gene 1.0 ST Array was used for whole genome transcriptome profiling at the NIH Neuroscience Microarray Consortium facility at Yale University.

4. Gene expression data quality assessment

4.1. Probe level assessment

4.1.1. Filtering of Affymetrix gene array data

Single nucleotide polymorphisms (SNPs)inprobe regions can affect probe hybridization efficiency51.We created the list of probes containing SNPs, and removed those probes before data preprocessing:39,529 out of 805,481 probes were eliminated from the analysis. We provided customized library files ( for the Robust Multichip Average (RMA) preprocessing steps: background correction, quantile normalization and gene level summarization52. Afterwards, for convenience of comparison,only genes with Entrez IDs were kept.

4.1.2. Filtering of Affymetrix U133 Array data

To reduce noise, we filtered out probe sets using theMAS 5.0 algorithm53; probe sets which were called as present in more than 80% of the samples were retained. As forthe PCX and CB data, probe sets without detailed annotation were also removed.

4.2. Sample level assessment

For all data sets, we removed non-Europeans and outliers detected by Affymetrix Expression Console and randomly chose one sample if the data set included replicates54.

4.3. Batch effect adjustment

ComBat, an efficient batch effect removal approach, was used to remove batch effectsfrom these data sets55,56.

4.4. Summary

The data collected in our lab from the SMRI PCX and CB tissues ultimately included CB data from 39 schizophrenia patients, 36 bipolar disorder patients and 44 normal samples, as well as PCX data from 45 schizophrenia patients, 42 bipolar disorder patients and 46 normal samples in parietal cortex. For both these data sets, 19,984 genes were retained.

Downloaded SMRI PFC data came from 30 schizophrenia patients, 25 bipolar disorder patients and 29 normal samples; for this data set, 14,988 probe sets were retained for further analysis.

The PFC data set from VBBN included 30 schizophrenia patients and 29 normal samples; 14,988 probe sets were retained for further analysis. The PFC data from CCHPC included 28 schizophrenia patients and 23 normal samples; 14,988 probe sets were retained for further analysis.

5. Gene expression data analyses

5.1. Identification of genes differentially-expressed between SCZ and CTRL samples in PCX-SMRI

Multiple linear regression was used for each transcript, with age, sex and post-mortem interval as covariates, to remove any effects from these potentially confounding factors. Differential gene expression analysis between SCZ and CTRL samples was run after this correction.

5.2. Weighted gene co-expression network analysis (WGCNA)

We identified genes with similar expression patternsusing weighted gene co-expression network analysis (WGCNA)57,58.

5.2.1. Network construction

The absolute values of Pearson correlation coefficients were calculated for all possible pairwise genes. This correlation matrix, S= [], was weighted into an adjacency matrix A= [] by power function, i.e. =power (,β). The parameter of power, β, was chosen to ensure the adjacency matrix had an approximate scale-free topology.

5.2.2. Module detection

Gene module detection begun with transforming the adjacency matrix into a topological overlap matrix (TOM) Ω= [], with 0<<1 implying 0<<1. Next, TOM-based dissimilarity between all possible pairwise genes was defined by. The Dynamic Tree cut algorithm was used to detect network modules. WGCNA and Dynamic Tree cut algorithm were implemented in R59.

5.3. Characteristics of modules

Association test based on singular value decomposition (SVD) on each module

We ran singular value decomposition (SVD) on each module’s expression matrix, and used the resulting eigengene to characterize the entire module in the subsequent analysis. After multiple linear regression on the eigengenes to remove the effects of sex, age, pH and PMI on SMRI and VBBN samples, the corrected module values were used to test the disease association using Pearson’s correlation test60.

5.4. Module preservation statistics

We utilized cross-tabulation and module preservation statistics to assess the module preservation in different expression data sets61. The module preservation test has two advantages over the traditional cross-tabulation test. First,it considers not only the number of overlapping genes, but also the density and connectivity patterns of modules defined in the reference data set. Second, network-based preservation statistics do not require modules to be identified in the test set for comparison with the reference data set;this reduces the variation introduced to the analysisby various parameter settings used to build the network.

Zsummaryis the measurement statistic used to summarize evidence that a module is preserved more significantly than a random sample of all network genes. Langfelder et al. proposed the thresholds at follows:Zsummary2 implies no evidence for module preservation, 2Zsummary10 implies weak to moderate evidence, andZsummary10 implies strong evidence for module preservation62. Whenmodule size variedacross data sets, we reported the median rank statistic on our module preservation test, which is useful for comparing relative preservation among modules because it does not depend on module size.

Before the preservation test, we converted the probe-level measurements into gene-level measurements to make the data from different platform comparable. For each gene, only the probe with the highest coefficient of variation was kept for further analysis. Overall, 8497 genes were retained on our preservation calculation.

5.6. Pathway and functional analyses

The genes in the identified modules were analyzed through DAVID (DAVID, whichprovided existing knowledge about those genes, on their functionality, potential relevance to neuropsychiatric diseases, and neuronal functions. Also, we identified specific functional category, including canonical pathways, functional categories, protein domain and Gene Ontology (GO) terms, that may be enriched for these genes. M3A contains 106 genes; we used the whole gene list to run the test. M1A includes 490 genes;due to its size, we selected the 200 genes most highly correlated with the module’s eigengene to run the test.

6. Genotyping data from different consortia

6.1. SCZ GWAS data (GAIN)

SCZ genome-wide association data was downloaded from dbGaP ( The study weblink is (Genetic Association Information Network, GAIN)64. The consortium has collected 4591 cases and controls (1217 European-American cases, 1442 European-American controls, 953 African-American cases, 979 African-American controls). Whole genome genotyping was done with the Affymetrix Genome-wide Human SNPArray 6.0.

6.2. BD GWAS data (GAIN)

Genome-wide association study of BD data was downloaded from dbGaP ( The study weblink is (Genetic Association Information Network, GAIN)65. The consortium has collected 3261 cases and controls (1079 European-American cases, 1081 European-American controls, 415 African-American cases, 686 African-American controls). Whole genome genotyping was done with the Affymetrix Genome-wide Human SNPArray 6.0.

6.3. BDGWAS data (TGen)

Genome-wide association study of BD datawas downloaded from theTranslational GenomicsResearch Institute (TGen). This study includes 1,190 newly genotyped BD cases from the Bipolar Genome Study (BiGS) and 401 controls. Sample genotypingwas conducted using Affymetrix GeneChip Mapping 5.0K Array.

7. Imputation of GWAS data

BRLMM-p (Affymetrix) was used as the genotype-calling algorithm.SNPs with call rates less than 99% were excluded from the analysis. SNPs showing departurefrom Hardy-Weinberg equilibrium (HWE) were filtered out, as well (p < 0.001). Of the remaining SNPs,only SNPs showing minor allele frequency (MAF) of at least 10% were carried forward forfurther analysis.

We used MaCH v1.0for SNP imputation, to increase the density ofinterrogated SNPs66,67. Overall, 2,593,107 SNPs in GAIN-SCZ, 3,281,319 SNPs in GAIN-BD, and 2,543,887 SNPs in TGen-BD were included after imputation.

8. Integration analysis of gene expression and GWAS data

We utilized the following procedures to run the enrichment test20. First the max −log(P-value) of a SNP located at 20kb upstream or downstream of a gene was assigned to represent the gene, then M1A and M3A gene set enrichment scores (ES) were calculated based the gene’s rank. SNP label level permutationwas used to generate a distribution of theES,and then the distribution was normalized68. FDR q values were calculated if multiple gene sets were included in enrichment test.

We tested the difference of gene length distribution between genes in the module M1A, which was enriched with neuronal differentiation functions, and other genes in the genome. The result wassignificant (p=5.47e-27). To test whether gene length bias existed in our genetic signals enrichment test, we applieda permutation procedure to randomly-selected genes with similar gene length distribution for verification. First, we randomly selected 490 genes (the number of genes in the M1A module) not included in M1A, and compared those genes’ length distribution to M1A’s length distribution. If the mean of randomly-selected genes’length is longer than the mean length of M1A genes, we ran the GWAS enrichment test for each selected gene set and calculated the enrichment P value. Second,we repeated the above steps B times to obtain the null statistic Pb as the background distribution P values. Finally, we computed the permutation p-value for M1A GWAS enrichment as:

where B equals 1000 and PM1A is theM1A GWAS enrichment P value.

If the p value we calculated is less than 0.05, it means no bias was introduced by gene length, or at least the effect of bias was not significant. Our results show that permutation p values are 0.089, 0.047, and 0.034 for GAIN-SCZ, GAIN-BD and TGen-BD GWAS signal enrichment tests, respectively. This indicatesthat the significant enrichments we detected are not primarily a product of gene length bias.

9. Supplementary References

49.Torrey, E.F., Webster, M., Knable, M., Johnston, N. & Yolken, R.H. The Stanley Foundation brain collection and Neuropathology Consortium. Schizophrenia Research 44, 151-155 (2000).

50.Narayan, S. et al. Molecular profiles of schizophrenia in the CNS at different stages of illness. Brain Research 1239, 235-248 (2008).

51.Benovoy, D., Kwan, T. & Majewski, J. Effect of polymorphisms within probe-target sequences on olignonucleotide microarray experiments. Nucleic Acids Res 36, 4417-23 (2008).

52.Irizarry, R.A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264 (2003).

53.Li, C. & Wong, W.H. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences of the United States of America 98, 31-36 (2001).

54.Affymetrix. Affymetrix Expression Console Software website. Vol. 2011.

55.Johnson, W.E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118-127 (2007).

56.Chen, C. et al. Removing Batch Effects in Analysis of Expression Microarray Data: An Evaluation of Six Batch Adjustment Methods. PLoS One 6(2011).

57.Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4, Article17 (2005).

58.Horvath, S. & Langfelder, P. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9(2008).

59.Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24, 719-720 (2008).

60.Langfelder, P. & Horvath, S. Eigengene networks for studying the relationships between co-expression modules. Bmc Systems Biology 1(2007).

61.Langfelder, P., Luo, R., Oldham, M.C. & Horvath, S. Is My Network Module Preserved and Reproducible? PLoS Computational Biology 7(2011).

62.Langfelder, P. et al. A systems genetic analysis of high density lipoprotein metabolism and network preservation across mouse models. Biochim Biophys Acta.

63.Huang, D.W., Sherman, B.T. & Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols 4, 44-57 (2009).

64.Stefansson, K. et al. Common variants conferring risk of schizophrenia. Nature 460, 744-747 (2009).

65.Dick. Genomewide linkage analyses of bipolar disorder: A new sample of 250 pedigrees from the national institute of mental health genetics initiative (vol 73, pg 107, 2003). American Journal of Human Genetics 73, 979-979 (2003).

66.Li, Y., Willer, C., Sanna, S. & Abecasis, G. Genotype Imputation. Annual Review of Genomics and Human Genetics 10, 387-406 (2009).

67.Li, Y., Willer, C.J., Ding, J., Scheet, P. & Abecasis, G.R. MaCH: Using Sequence and Genotype Data to Estimate Haplotypes and Unobserved Genotypes. Genetic Epidemiology 34, 816-834 (2010).

68.Wang, J., Zhang, K.L., Cui, S.J., Chang, S.H. & Zhang, L.Y. i-GSEA4GWAS: a web server for identification of pathways/gene sets associated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Nucleic Acids Research 38, W90-W95 (2010).

10. Supplementary Figures