Supplementary Material for:
Genetic Neuropathology of Obsessive Psychiatric Syndromes
Andrew E. Jaffe, Ph.D.1*, Amy Deep-Soboslay, M.Ed.1*,Ran Tao, Ph.D.1, David T. Hauptman, B.A.2, Walter H. Kaye, M.D.3, Victoria Arango, Ph.D.4, Daniel R. Weinberger, M.D.1,2,5, Thomas M. Hyde, M.D. Ph.D1,2,5, Joel E. Kleinman, M.D. Ph.D.1,2,5
Affiliations:
- Lieber Institute for Brain Development, Baltimore, MD, USA
- Section on Neuropathology, Clinical Brain Disorders Branch, NIMH, NIH, Bethesda, MD, USA
- University of California, San Diego Eating Disorder Treatment and Research Program, San Diego, CA, USA
- Department of Psychiatry, Columbia University, New York, NY, USA
- Departments of Psychiatry, Neurology, Neuroscience and the Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
Supplementary Statistical Methods
eQTL analysis within control cohort
Within the non-psychiatric control cohort (N=268), we tested for a linear association between the genotype at each of the candidates SNPs and the expression levels of all 30,176expression probesunder an additive genetic model, to identify what are typically referred to as expression quantitative trait loci (eQTL). Surrogate variable analysis (SVA) was used to estimate and control for potential batch effects in the gene expression data(1, 2), and the resulting surrogate variables were included, along with age (using a flexible spline model across life), race, and sex, in every SNP-expression regression model:
for gene expression measurement y for subject i, expression probe j, and SNP k. The ‘age’ term was a 2nd degree basis spline to better capture non-linear changes in gene expression across the lifespan, especially as the magnitude of change decreases across the lifespan (3). The term reflects the association between genotype and gene expression, and large absolute values represent eQTLs. We performed this analysis in two steps: among the 91 SNPs identified in the literature, 1) including only the exact SNPs, by rs number, that were observed or imputed in our data and 2) additionally including the nearest SNPs (within 10kb, to ensure high linkage disequilibrium) to the observed clinical SNP from our imputed dataset. Note that the false discovery rate calculations take into account these additional SNPs.
Statistical analyses were performed using the MatrixEQTL package in R(4). The p-values for the additive genetic effect were converted to q-values using the Benjamini-Hochberg procedure to control for multiple testing via the false discovery rate (FDR). This gene expression data is already publicly available in the Gene Expression Omnibus (GEO) under accession GSE30272(5).
Differential Expression Analysis and eQTL analysis within case-control cohort
Within the case-control cohort (N=133), the microarray data were normalized with the lumi package in R (6), and consisted of background correction, variance stabilizing transform, and quantile normalization.We additionally annotated the probes for technical features like mapping non-uniquely and containing SNPs to ensure differentially expressed genes were not driven by technical artifacts (7). We defined probes as measuring “expressed” genes if their intensities were at least 5% greater than background levels – dropping probes below this background cutoff left 12,969 probes (across 10,651 genes) for statistical analyses. Raw and processed data are publicly available on the Gene Expression Omnibus at accession GSExxxxxxx [to be deposited upon paper acceptance].
While tissue quality, measured indirectly through RNA integrity number (RIN), was high for postmortem tissue experiments (all samples with RIN > 5, 96% of samples with RINs > 6, and 91% of samples with RINs > 7), its levels were differential by diagnosis (Supplementary Figure 3A) and strongly associated with gene expression measurements (Supplementary Figure 3B). Four of the first ten resulting principal components (PCs) of the expression data were associated with RIN via linear regression at p<0.05, including PC1 (p=1.04x10-19). These global associations manifested on the probe level - RIN levels were associated with the expression levels of 8,534 expressed probes (65.8%; at p < 0.05) under the statistical model:
where is observed gene expression for subject i at probe j, is the numerical RIN value for subject i, and is the univariate association between RIN and expression. Because RINs were differential by diagnosis (Table 1; p=6.41x10-4 and Supplementary Figure 3A), we assessed the potential of confounding by RIN in our differential expression analysis for diagnosis by fitting the following model:
where is categorical disease status (ED and OCD) and represents the unadjusted log2 fold change in expression associated with each diagnosis (compared to controls). Supplementary Figures 3C and 3D shows the Pearson correlation between the standardized regression coefficients (Z-scores) from Equation 1 for RIN () versus diagnosis (each column of , e.g. the OCD and ED effects respectively)from Equation 2 across all, demonstrating strong global confounding effect of RIN (OCD: ρ=-0.728, p=0 and ED: ρ=-0.643, p=0). Notably, we identify 1,094 and 604 genes differentially expressed for OCD and ED respectively at FDR < 5% by Equation 2 (univariate model).
Naïve adjustment of RIN in probe-level linear models (e.g. including its levels as a covariate) was insufficient to completely remove the effects of RIN in the data, which is an approach commonly used by biologists:
where the notation is as above, plus is case-control status for subject i and is the log fold change difference in expression between cases and controls, naively adjusted for RIN. Here is the association between RIN and expression, adjusting for diagnosis. The correlation between the t-statistics generated by (model 1) and (model 3) for both OCD and EDwas reduced but still very significant (OCD: ρ=-0.264, p=1x10-205 and ED: ρ=-0.139, p=2.99x10-57). Proper statistical adjustment should result in non-significant association between univariate RIN and RIN-adjusted case-control status. Naïve adjustment may therefore result in many genes being falsely reported as statistically significantly differentially expressed, but we do identify only 100 and 11 genes differentially expressed by OCD and ED respectively at FDR < 5% by Equation 3, which is far fewer than the unadjusted approach. Notably, 7/11 genes for ED were on the sex chromosomes suggesting potential confounding by sex (given its imbalance in diagnosis – see Table 1).
Other commonly employed “batch” correction approaches, particularly those that use factor-based methods like surrogate variable analysis (SVA) actually worsened this confounding, due to the strong statistical association between RIN and diagnosis, as the first step in many of these algorithms involves “protecting” the biological effect of interest (here, diagnosis) which, due to confounding, retains most of the RIN effect (see Supplementary Figure 3), under the following model:
where captured the effect of the estimated surrogate variables (which themselves were strongly associated RIN levels), and the number of SVs was estimated from the data (here, 16 SVs). The correlations between resulting t-statistics from (model 1) and (model 3) were stronger than naïve adjustment (OCD: ρ= -0.436, p=0and ED: ρ= -0.457, p=0), emphasizing this worsened confounding. Additionally, a similar number of probes are called differentially expressed between cases and controls compared to the unadjusted approach– 681 and 757probes were called differentially expressed at FDR < 5% for OCD and ED, respectively, due to how the biological variability is partitioned during these factor-based adjustment approaches.
We therefore removed the statistical association between RIN and diagnosis through randomly sampling a subset of control subjects (n=30) to roughly reflect the distribution of RIN in the diagnostic groups (by weighting the sampling probability by the RIN values themselves), reducing the association between RIN and diagnosis from p=6.4x10-4 to p=0.45. We then performed two different case-control differential expression analyses, first identifying genes associated with ED cases and then OCD cases compared to control subjects, and then genes associated with either ED or OCD diagnoses using linear regression, adjusting for sex and estimated surrogate variables (2, 8). Log2 fold changes, representing the regression coefficients, generated moderated t-statistics for each gene, and corresponding p-values were converted to q-values to control for the false discovery rate (FDR). Genes significant at q < 0.05 were considered genome-wide significant in each analysis. Gene set enrichment analyses were performed with the Wilcox sign rank test, which tests whether the distribution of test statistics (e.g. the s or resulting t-statistics) are shifted for genes belonging to predefined gene sets, available from Kortenhorst et al (9).
Lastly, we modeled the interaction between diagnosis and genetic variation on gene expression using the previously described 92 eating disorder candidate SNPs, again adjusting for the same surrogate variables as above, along with age and sex:
for gene expression measurement y for subject i, expression probe j, and SNP k. The term of interest is which represents the how diagnosis modifies the relationship between genetic variation and gene expression. The p-values for this regression term were converted to q-values to control for the FDR, and SNP-gene pairs with FDR < 20% were considered marginally significant.
References
1.Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28(6):882-3. Epub 2012/01/20.
2.Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS genetics. 2007;3(9):1724-35. Epub 2007/10/03.
3.Colantuoni C, Lipska BK, Ye T, Hyde TM, Tao R, Leek JT, et al. Temporal dynamics and genetic control of transcription in the human prefrontal cortex. Nature. 2011;478(7370):519-23. Epub 2011/10/28.
4.Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012;28(10):1353-8.
5.Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30(1):207-10.
6.Du P, Kibbe WA, Lin SM. lumi: a pipeline for processing Illumina microarray. Bioinformatics. 2008;24(13):1547-8.
7.Barbosa-Morais NL, Dunning MJ, Samarajiwa SA, Darot JF, Ritchie ME, Lynch AG, et al. A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data. Nucleic Acids Res. 2010;38(3):e17. Epub 2009/11/20.
8.Smyth GK. Limma: linear models for microarray data. In: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W, editors. Bioinformatics and Computational Biology Solutions using R and Bioconductor: Springer, New York; 2005. p. 397-420.
9.Kortenhorst MS, Wissing MD, Rodriguez R, Kachhap SK, Jans JJ, Van der Groep P, et al. Analysis of the genomic response of human prostate cancer cells to histone deacetylase inhibitors. Epigenetics : official journal of the DNA Methylation Society. 2013;8(9). Epub 2013/07/25.
1