Supplement: A Genome-Wide Association Study of Anorexia Nervosa
Boraska…Bulik
Table of Contents
Supplemental Methods
QC of directly typed variants in discovery dataset
Imputation
Figure S1: Distribution of imputation accuracy scores
Figure S2: Concordance of masked genotypes
Genome-wide association analysis
GWAS meta-analysis QC
Ancestry informative markers
Gene-based association test
Network analysis
AN subtype analyses
Expression studies
Supplemental Results
Meta-analysis results of the discovery dataset
Figure S3: QQ plot
Figure S4: Manhattan plot
Gene-based association test results
Network analysis results
Figure S5: Network analysis
Supplemental Discussion
Discussion of Genetic Loci
Supplemental References
Supplemental Tables
Supplemental Methods
QC of directly typed variants in discovery dataset
The following sample-level QCcriteria were applied: a) all SNPs were converted to forward strand; b) sample call rate exclusion threshold of 99%; c) cross-checking of reported sex; d) heterozygosity exclusion threshold of ±3 standard deviations (SD) from the mean; e) multidimensional scaling analysis (MDS) was performedusing genome-wide set of directly genotyped common SNPs, plots were visually inspected and an exclusion lists with samples of non-European ancestry were generated. MDS plots with cases and controls matched by ancestry were also generated to inspect their compatibility and ancestry matching. All samples not matching case and control ancestry within each of 15 discovery dataset were excluded; f) identity by state (IBS)/ identity by descent (IBD) statistics were calculated to identify duplicated and related samples (exclusion threshold >0.05). All samples failing sample-level QC were removed within each case and control dataset separately prior to performing SNP-level QC.
The SNP-level QC step was also carried out in each case and control dataset separately under the followingcriteria: a) SNP call rate threshold of 99%; b) deviation from Hardy-Weinberg Equilibrium (HWE) in controls (exact HWE p<1x10-4). All SNPs that failed individual SNP-level QC steps were removed. Low frequency and rare SNPs (<0.05) were excluded from cases and controls separately. Differential missingness between cases and controls was investigated within each stratum and SNPs that showed deviation visualized by quantile-quantile (QQ) plots were excluded. All subsequent association analyses were performed using merged case-control post-QC datasets. All QC steps were performed using PLINK.1
Imputation
Two additional SNP filtering steps were performed prior to imputation. First, SNPs with ambiguous allele coding (A/T or C/G) were removed as the strand of these variants can be difficult to resolve. Second, variants not present in the HapMap3 release 2 reference panel were removed as they can not inform the imputation process. Remaining variants were then strand-matched to the reference data and imputation was carried out using Impute2. Post imputation QC consisted of the exclusion of variants with MAF < 0.05 or info score < 0.4 within each dataset. The effect of each of these filtering steps is shown in Table S2.
We assessed the adequacy of imputation in multiple ways. We show the distribution of imputation accuracy metrics (as captured by the INFO score) in each dataset and across a range of minor allele frequencies (Figure S1). As expected, the info scores are consistently high (>0.8).
Figure S1: Distribution of imputation accuracy scores. Figure S2: Concordance of masked genotypes
We carried out an analysis of genotype concordance by masking directly genotyped variants and imputing them as though they had not been directly assayed. We present the results of this analysis (Figure S2). As expected, genotype concordance remains very high across the full range of examined minor allele frequencies and is consistently high (>0.8) in all studied populations.
Genome-wide association analysis
Eleven association analyses were conducted per stratum: one with no principal components (PC) adjustment and the others adjusted for up to 10 PCs. PCs were calculated using EigenSoft.2A list of the most informative 54 independent ancestry informative markers was created based on Tian et al.,3 to distinguish North-South European and Northern European substructure, and also based on the Wellcome Trust Case Control Consortium population stratification SNP list.4 Differences in allele frequencies of these 54 SNPs were examined between cases and controls in each stratum in each of the 11 association analyses as an additional test of ancestry matching between cases and controls (i.e., in addition to the MDS analysis already performed under standard QC). A Bonferroni corrected p-value of 0.0009 was considered significant evidence of population stratification. There were no SNPs reaching Bonferroni corrected p-value threshold in the analyses without adjustment for all cohorts except for the French dataset. For the French dataset, no ancestry marker showed association in the analysis adjusted for the first PC. We also constructed QQ plots for all 11 association analyses for all 15 discovery datasets and visually inspected them. There were no qualitative differences between QQ plots for 11 analyses within each case-control group, i.e.,there were no inflations in test statistics in the analyses without adjustment, in comparison to the ones adjusted for up to 10 principal components (PCs). To cross check the adequacy of our approach to population stratification, we computed PCs on 41,838 independent (r2<0.2) SNPs directly genotyped in all cases and controls and performed individual association analyses adjusting for the first 3 PCs. We then performed meta-analysis across 15 datasets and found equivalent results (GC=1.01, Pmin=1.66x10-12 at rs4957798 when adjusting for the first 3 PCs) to the ones we report under Results (GC= 1.024, Pmin=1.67E-12 at rs4957798, Supplementary Table S5).
GWAS meta-analysis QC
We examined all SNPs with p<10-4 and sample size greater than 1,000 to check for false positives due to genotyping error or other artefacts. We investigated cluster plots and exact HWE p<1.0x10-4 in cases or controls separately. Cluster plots were examined in studies with available intensity data using Evoker.5Poorly genotyped/called SNPs were excluded within each dataset and meta-analyses using clean datasets were repeated. Results were graphically represented using QQ and Manhattan plots. The combined results presented here are not corrected for overall GC.
Ancestry informative markers
Replication samples in this study came from 15 populations of which 12 were of European ancestry and de novo genotyped for the prioritized set of SNPs. Replication cases and controls were carefully matched by country and collection center during collection phase.To additionally check for ancestry matching between cases and controls within each of the 12 replication datasets we included 27 ancestry-informative markers (AIMs) for genotyping. These markers were selected on the basis of our discovery dataset that contained samples from 15 populations of European ancestry.We performed principal components analysis (PCA) on these samples and found that the top two PCs aligned with perpendicular geographical axes (rho=0.90 for PC1 vs. latitude, rho=0.59 for PC2 vs. longitude). We identified markers that were contributed highly to these PCs, and found that even as few as 27 markers were sufficient to predict the ancestry of the samples well. For more information, please see reference.6
Gene-based association test
We used VEGAS7to conduct a gene-based association test using all variants in the GWAS. Briefly, VEGAS assigns SNPs to genes and uses all individual variant test statistics within a gene to calculate a gene-specific test statistic. VEGAS uses a simulation-based approach to account for gene length and LD. The Bonferroni-corrected p-value threshold is p<2.8x10-6.
Network analysis
We examined the genomic context within 0.5 Mb either direction of each of the 20 most statistically significant SNPs using Ensembl (version 70),8and identified the two directly adjacent genes in each case. Of these 20 SNPs, we found two pairs of SNPs with the same adjacent genes. We found 3 SNPs with only one gene within the stated boundaries, and one SNP that lay close to three genes. Altogether this produced 34 candidate genes (Supplementary Table S4). In order to visualize the connections between this gene list, we created a network representing these 34 genes. Each node represents a gene. Genes adjacent to the same SNP are represented by nodes of the same color. Edges within the network represent connections between the genes. We obtained a number of different connections:
- Co-citation in the same study. These data were obtained using the citations for each gene on Ensembl (release version 70)8and the NHGRI database.9
- Implication in the same psychiatric disorder. These data were obtained as in (1)
- Implication in the same metabolic disorder. Data were obtained as in (1)
- Implication in body shape phenotypes such as height, weight, etc. Data obtained as in (1).
- Protein-protein interactions. These were obtained using the String-db database.10
In order to compare the connectedness of this network to one obtained by chance, we simulated one million random sets of 34 genes and repeated the analysis. We then calculated the network connectivity of each node in the network using two clustering coefficients, L and w, both for the observed network and for the simulated networks. W is a measure of local weighted connectivity, as described by Kalna et al.11The second coefficient, Lj, is a measure of global connectivity. This is simply the average distance from node j to all other nodes. In the case where a node has no neighbors, the node is not included in the calculation.
In order to test whether the 34 candidate genes are significantly more closely connected than would be expected by chance, we created random sets of 34 genes, and compared the clustering coefficients in each case. As it was not computationally feasible to investigate every gene in the simulated datasets using Ensembl and String-db, we restricted this analysis to the NHGRI database.
AN subtype analyses
AN restricting subtype included individuals who maintained low body weight solely through caloric restriction and increased energy expenditure. AN binge/purge subtype included individuals who also reported binge eating and/or purging via vomiting or laxatives.
Expression studies
RNAseq experiments sequence fragments of cDNA, which are then mapped to individual gene transcripts. This provides a fragment count, usually expressed in Fragments per Kilobase of exon Per Million fragments (FPKM) values, which correlates with relative gene expression. We analyzed RNAseq data from wholebrain tissue samples from 12 different mouse strains. Analysis was performed as follows:
- Fastq files were downloaded from the European Nucleotide Archive.12 ENA run IDs and strain information for each dataset are presented below.
Mouse strains used, along with the relevant ENA accession IDs.
Mouse Strain / ENA IDA/J / ERP000038
AKR/J / ERP000037
BALB/cJ / ERP000039
CBA/J / ERP000043
DBA/2J / ERP000044
NOD/ShiLtJ / ERP000046
NZO/HILtJ / ERP000047
129P2/OlaHsd / ERP000034
SPRET/EiJ / ERP000049
129S5SvEvBrd / ERP000035
WSB/EiJ / ERP000050
C57BL/6NJ; 2 samples / ERP000041
- The tophat13 and bowtie14 software packages were used to find the read alignments for each fastq file.
- The samtools15and HTseq16 software packages were then used to count the number of alignments per gene.
- Count values were then normalized to account for the depth of sequencing and the transcript length, using the following formula:
Transcript lengths were obtained using the Ensembl GTF annotation (Version 63).8
The top 20 SNPs by statistical significance were used to identify candidate genes. We examined the genomic context within 0.5 Mb either direction of each SNP using Ensembl (version 70),8and identified the two directly adjacent genes in each case. Of these 20 SNPs, we found two pairs of SNPs with the same adjacent genes. We found 3 SNPs with only one gene within the stated boundaries, and one SNP that lay close to three genes. Altogether this produced 34 human candidate genes. 32 mouse orthologues of these genes were identified using Ensembl (version 70).8Significant SNPs are shown with adjacent genes and mouse orthologues in Supplementary Table S4.
Supplemental Results
Meta-analysis results of the discovery dataset
Figure S3: QQ plot
Figure S4: Manhattan plot
Gene-based association test results
There were no gene signals surpassing genome-wide significance levels (at p<2.8x10-6) for VEGAS. Two adjacent genes were associated with p<10-4: TMEM37 (P=4.0x10-5) and SCTR (P=4.7x10-5). The most significant variant in both genes was rs12479137. Its proxy rs2254122 (r2=1) was followed up in independent cohorts and resulted in a replication P=0.870.
Network analysis results
We formulated a network using data on a number of interactions between the genes adjacent to the most significantly associated SNPs, (Supplementary Figure S5), and found that the genes form two distinct clusters, with a number of outliers. Many are closely related in this network, implying that they may be functionally related. Several of the genes are associated with body-shape or metabolic disorders (Supplementary FigureS5.b-d). Further, we find that these genes are significantly more closely related than would be expected by chance. We found a significantly shorter average pathway between genes (P=0.005) and significantly tighter clusters (P=0.03) when using the AN genes compared to one million randomly generated sets of 34 genes.
Figure S5: Network analysis
A Network view of 34 genes. Nodes represent 34 genes adjacent to the top 20 statistically significant SNPs. Edges represent connections between the nodes. A) A network showing all identified connections between the nodes. B) A network showing genes implicated in the same psychiatric disorders. C) A network showing genes implicated in the same metabolic disorders. D) A network showing genes implicated in body shape phenotypes.
Supplemental Discussion
Discussion of Genetic Loci
Here, we briefly describe functions of genetic loci identified through our study.The majority of reported genes are expressed in the brain and have already been associated with a neurodegenerative or neurodevelopmental disorder.
SOX2OT is a non-coding RNA that is expressed in several regions of the human brain.17By applying the Logic Mining method to a gene expression dataset from transgenic mice, Sox2ot gene was identified as potential biomarker for the early and late stage of Alzheimer's disease neurodegeneration.18SOX2OT was also suggested to have a role in the etiology of myopia.19
PPP3CA gene (also known as calcineurin) is also expressed in several regions of human brain (USCS site - GNF Expression Atlas 2 Data from U133A and GNF1H Chips). This gene has been implicated in the etiology of Alzheimer's disease. It is a serine-threonine phosphatase that dephosphorylates tau, as tau hyperphosphorylation is one of hallmarks of Alzheimer's disease.20Moreover, altered expression of novel alternatively spliced PPP3CA isoforms were found in brain regions of postmortem Alzheimer's disease patients.21 The PPP3CA gene is differentially expressed between patients with frontotemporal degeneration and controls.22Genetic variants in PPP3CA have been associated with human endurance exercise phenotype traits.21
CSGALNACT1 (also known as ChGn) plays a role in chondroitin sulfate biosynthesis in cartilage23, 24 but is also ubiquitously expressed in various other tissues.25 Chondroitin sulfate proteoglycans are likely to be regulatory molecules in the process of axonal degeneration/regenerationis in the peripheral nervous system and it was shown that CSGALNACT1 missense mutations are associated with pathogenetic mechanisms in peripheral neuropathies.26
SYN2 (synapsin II), a member of the synapsin family, encodes for neuronal phosphoprotein found in synaptic vesicles that can modulate synaptic transmission and plasticity. 27 Because of its role in synaptogenesis and the modulation of neurotransmitter release, this gene was suggested to be involved in several neurological and neuropsychiatric diseases including epilepsy28,29 and schizophrenia.30,31 In addition, SynII knock-out (KO) mice showed autism-related behavioral abnormalities.32
SPATA13 (also known as ASEF2) has a role in cell migration and it was shown that it is required for aberrant migration of colorectal tumor cells.33SPATA13 was shown to be part of the pathway that regulates microtubule and actin dynamics involved in brain lamination.34
NCAM2 (also known as OCAM) belongs to the immunoglobulin superfamily, is primarily expressed in the brain and is important for the organization of axons and dendrites in the olfactory system.35NCAM2 was suggested to carry a genetic risk for development of Down syndrome35and Alzheimer’s disease (AD).36In addition, the polymorphism found to be associated with changes in cerebrospinal fluid levels lies near NCAM2 gene.37A case-report of a man with pervasive developmental disorder bearing deletion of 19 genes, including NCAM2, points to the possible role of this gene in autism and other neurobehavioral disorders.38Also, some prostate and breast cancer cell lines showed differential expression of NCAM2 in comparison with healthy cells.39,40
PRKAR2B (also known as RII-BETA) encodes for a regulatory subunit of protein kinase A (a central component of the cAMP pathway which is important for a variety of cellular functions) and it is abundantly expressed in adipose tissue and brain.41,42 Studies in KO mice suggested an important role of this gene in regulating energy balance and adiposity.43PRKAR2BKO mice have markedly diminished white adipose tissue, elevated metabolic rate and body temperature resulting in lean phenotype43and are resistant to diet-induced obesity, insulin resistance and dyslipidemia.41Moreover, deletion of PRKAR2B decreases body weight and increases energy expenditure in the obese, leptin-deficient mice.44PRKAR2B was also found to be down-regulated in obese subjects when compared to lean ones.45In addition, PRKAR2B regulates voluntary consumption of alcohol46and PRKAR2B KO mice drink significantly more ethanol and are resistant to ethanol-induced sedation.47In brain, mRNA for calcium-calmodulin-dependent protein kinase IIbeta is elevated in the frontal cortex of cases with schizophrenia.48PRKAR2B has also been shown to mediate adverse effects of antipsychotic medication49and cataleptic behaviour induced by haloperidol50This gene has been associated with several other disorders. For example, susceptibility locus for knee osteoarthritis lies in a linkage disequilibrium block that contains PRKAR2B gene.51Also, reduced expression levels of this gene were observed in osteoarthritis cartilage compared with control cartilage.52Nuclear PRKAR2B levels are found to be abnormally high in systemic lupus erythematosus T cells.53,54PRKAR2B was also suggested to influence homocysteine levels.55