Appendix e-1

Introduction:

Recently, genome-wide association studies (GWAS) are used to screen for susceptibility loci for common, complex diseases. Many LOAD GWAS published to date utilize a multi-stage approach, where the most significant variants from the initial stage are genotyped in additional series in the follow-up stages. In the first stage of these studies, only APOE and/or its tagging SNPs associate with AD risk with genome-wide significance after Bonferroni correctione1-6. Additional promising candidate LOAD genes with variants that showed replicable association in the follow-up stages have been reported from these LOAD GWAS, though none have yet achieved the strength of association for APOE.

In our two-stage LOAD GWAS (14), we analyzed 313,504 SNPs in 844 LOAD cases and 1,255 controls and evaluated the top 25 SNPs in four additional series (1,547 cases and 1,209 controls). This led to the identification of a SNP (rs5984894) on chromosome Xq21.3 in PCDH11X that encodes protocadherin 11, X-linked, which associate with AD risk with global p values of 5.7 x 10-5 in stage 1, 4.8 x 10-6 in stage 2 and 3.9 x 10-12 in the combined series. Through fine mapping and haplotype analysis we determined that rs5984894 resides in a haplotype block that is entirely within PCDH11X.

In a study of differentially expressed genes from human lymphoblastoid cell lines, 31% were found to have statistically significant heritabilitye7.Quantitative trait loci (QTLs) were identified for 33 of these genes, providing further support for the use of expression levels as endophenotypes in genetic studies. Three GWAS utilizing expression levels from human lymphoblastoid cell lines as quantitative phenotypes led to the identification of associations between SNPs and mRNA expressionlevelsthat were significant at the genomic levele8-10. A recent GWAS of gene expression levels in neuropathologically normal brain tissue from 193 subjects estimated that 58% of the transcriptome had cortical expression e11. Of these transcripts, 21% had expression levels correlating with their genotypes at empiric p values ≤ 0.05. These studies suggest a substantial genetic influence on human gene expression levels.

To pursue the possibility that risk for LOAD may be influenced by SNPs that alter gene expression, the messenger RNA (mRNA) levels of 12 LOAD candidate genes were assessedin the cerebella of 200 LOAD subjects from our autopsy-confirmed series (AUT) who were genotyped in our recent LOAD GWAS. Most of the mRNAs that were measured encode proteins that are likely to be involved in neurodegeneration through their involvement in synaptic function e12, apoptosis e13, inflammation e14-16, the cholinergic pathway e17, or the pathway that produces the amyloid ß protein e18-24. In addition, many of these genes have one or several variants that previously showed significant association with LOAD in at least one genetic association study e12-14, e17, e18,e20, e21, e23-26.

Methods:

LOAD Genome Wide-Association Study (GWAS):

LOAD Case-Control Series for GWAS:

For our GWAS, we genotyped LOAD cases and controls collected at three different sites. Two of our series were composed of subjects clinically diagnosed and recruited at either the Mayo Clinic in Jacksonville, FL (JS_60-80 series) or at the Mayo Clinic in Rochester, MN (RS_60-80 series), and the third was an autopsy-based series (AUT_60-80).

All of the cases and controls in these two series were assessed by a Mayo Clinic neurologist. Control subjects had a Clinical Dementia Rating (CDR) score of 0, and the cases had a diagnosis of probable or possible AD according to NINCDS-ADRDA criteria e27. A third series (AUT_60-80 series) was obtained from the Brain Bank at Mayo Clinic Jacksonville, and consisted of autopsy-confirmed AD cases and of control subjects who lacked AD pathology. All subjects in the AUT_60-80 series had neuropathologic evaluation by Dr. Dennis Dickson, who also maintains the Brain Bank. All AD subjects from the AUT_60-80 series analyzed in the GWAS had definite diagnosis of AD according to the NINCDS-ADRDA criteria e27 and had Braak scores of 4.0 or higher. All control subjects had Braak scores of 2.5 or lower, but many had brain pathology unrelated to AD and pathological diagnoses that included vascular dementia, frontotemporal dementia, dementia with Lewy bodies, multi-system atrophy, amyotrophic lateral sclerosis, and progressive supranuclear palsy.

We genotyped 970 LOAD cases and 1495 controls for the GWAS study (JS_60-80: 381 AD, 350 control; RS_60-80: 291 AD, 787 control, AUT_60-80: 298 AD, 358 control). After stringent quality control (see “GWAS Quality Control” section below), we analyzed 844 LOAD cases and 1255 controls for association with LOAD (JS_60-80: 353 AD, 331 control; RS_60-80: 245 AD, 701 control, AUT_60-80 246 AD and 223 control). The ages, diagnosis method and percent female in the LOAD cases and controls for each series are shown in Table e-1.

Additional LOAD Case-Control Series for rs7910977 AD Association Studies:

The most significant IDE cis-SNP, rs7910977 was genotyped in a total of 7 independent series: The 3 LOAD GWAS series described in section “LOAD Case-Control Series for GWAS” above and 4 additional LOAD case-control series.

The number of subjects from these 4 series with rs7910977 genotypes are as follows: Three of theseadditional series were collected at Mayo Clinic Jacksonville, FL (JS_80+: 222 AD, 243 controls), Rochester, MN (RS_80+: 257 AD, 597 controls) and an autopsy series (AUT_80+: 280 ADs, 101 controls). The subjects in these 3 series from Mayo Clinic had age at diagnosis/entry above 80 years. In addition, samples obtained through the National Cell Repository for Alzheimer’s Disease (NCRAD: 687 AD, 203 control) were analyzed.

One AD case from each of the 715 late-onset NCRAD families was analyzed. NCRAD AD cases were selected based on strength of diagnosis (autopsy-confirmed: 228 > probable: 324 > possible: 55 > family report: 108); the case with the earliest age at diagnosis was taken when several cases had equally strong diagnoses. The 209 NCRAD controls that we employed are unrelated Caucasian subjects from the United States with a Clinical Dementia Rating of 0, specifically collected for inclusion in case-control series.

The number of subjects, methods of diagnosis, age at diagnosis, and percent female subjects for all series are depicted in the Table e-1. As seen in this table all AD subjects were diagnosed using the NINCDS/ADRDA criteria e27.

DNA Preparation

Blood samples were collected in 10 ml EDTA tubes from subjects in the JS and RS series, and genomic DNA was isolated from whole blood using an AutoGen instrument (AutoGen, Inc, Holliston, MA). Genomic DNA from the cerebellum of subjects in the AUT series was obtained by Wizard® Genomic DNA Purification Kit (Promega Corp., Madison, WI). DNA from the RS and AUT series was scarce; so samples from these two series were subjected to whole genome amplification using the Illustra GenomiPhi V2 DNA Amplification Kit (GE Healthcare Bio-Sciences Corp., Piscataway, NJ). To attenuate random amplification errors, we performed four 5 ul reactions for each sample which were then pooled, rather than a single 20ul reaction, as we have found that pooling four 5 ul reactions gives better genotype clusters and fewer miscalls than a single 20 ul reaction. Each 5ul reaction contained 5-15 ng of genomic DNA as template, according to the quality of the genomic DNA. To evaluate the quality of each amplified DNA sample, one SNP (rs2830072) was genotyped for the original genomic DNA and for the amplified DNA using a TaqMan® SNP Genotyping Assay (Applied Biosystems, Foster City, CA). Amplified DNA sampleswere included in the series only if they had genotype calls for rs2830072 that fell withinwell definedgenotype clusters and were concordant withtheir respective genomic DNA genotypes.

LOAD Genome Wide-Association Study (GWAS):

GWAS Quality Control

First, 240 subjects (9.7% of the total GWAS series) with genotyping call rates of <90% were eliminated from further analysis. Genotype clusters were re-generated using Illumina’s Beadstudio 2.0 software after this initial elimination. In the AUT series, only those LOAD subjects with Braak scores of 4.0 or above were analyzed for LOAD association, leading to elimination of 87 AUT subjects (3.5%) with Braak scores of 3.0 or 3.5. All controls from the AUT series had Braak scores of 2.5 or lower. Using the sex check option provided by PLINKe28, we eliminated 21 additional samples (0.9%) with a mismatch between the recorded sex and the sex deduced by evaluating the heterozygosity of SNPs on the X chromosome. Using filters available in PLINK, we then eliminated all SNPs with genotyping call rates <90%, minor allele frequencies < 0.01, and/or Hardy-Weinberg p values < 0.01. We also checked for any sample duplicates using the –genome option in PLINK. These quality control measures left 2098 subjects (85.1%) and 564 cis-SNPs (96.9%) in the 12 LOAD candidate genes that we had selected for the gene expression studies (Tablese-2 and e-7). Association between LOAD and the 564 cis-SNPs in the 12 candidate genes was tested in these subjects.

Association Study Between mRNA Expression Levels of 12 AD Candidate Genes and their cis-SNPs Extracted from the LOAD GWAS:

Autopsy Series for mRNA Expression Analyses:

After the gene expression measurements, the samples that failed in ≥2 of the triplicate expression level measurements were excluded from further analysis as discussed in the “Measurement of mRNA Expression Levels” section below. Zero to 11 subjects from each of the 12 expression studies were thus excluded resulting in an average successful expression level measurement yield of 96%. Samples were also subject to exclusion from analyses based on genotype call rates <90% as described in the “GWAS Quality Control” section above. This led to the exclusion of 17 out of the 200 AUT subjects who had successful genotyping call rates <90% in the GWAS. Table e-8 depicts the number of subjects that were analyzed after exclusion of those with failed expression levels and low genotype rates in the GWAS. Three additional subjects were excluded based on sex discrepancies as determined by the sex check algorithm implemented in PLINK as discussed above in the “GWAS Quality Control” section.

Cerebellar RNA Extraction

Total RNA was extracted from cerebellar samples using the AB(Applied Biosystems) RNA Chemistry for Tissue Samples and an AB PRISM™ 6100 Nucleic Acid PrepStation according to the manufacturer’s instructions. The quantity and quality of the RNA samples were determined by the Agilent 2100 Bioanalyzer using the Agilent RNA 6000 Nano Chip. The High-Capacity cDNA Archive kit from ABwas used to reverse transcribe cDNA from mRNA.

Measurement of mRNA Expression Levels

TaqMan® Arrays e29, e30 were used for the gene expression studies. This assay comes in a 384 well-format, where the wells of the micro fluidic cards are pre-loaded with the assays, composed of gene-specific primers and probes for 15 genes and for 8 cDNA samples. The design of the micro fluidic cards, consisting of miniature reaction chambers each connecting to the sample-loading ports, allows for the addition of the same cDNA to multiple assays in a single step. This, along with pre-loaded assays minimizes liquid handling and sample additions, yielding low variability. Each sample was analyzed in triplicate and 18S ribosomal RNA (18S rRNA) was used as the internal control gene. Each of the reaction replicates had 3-4 ng starting cDNA amount. Following thermal cycling, the plates were read on an AB Prism 7900HT sequence detection system. The SDS 2.2 software package from Applied Biosystems was used to obtain the raw data.

The variable CT within the raw data file indicates the PCR cycle number at which the amount of amplified gene target reaches a fixed threshold. The variable ΔCT denotes the difference between the averaged CT values for the target gene replicates and that for the reference gene replicates (i.e. 18S rRNA). In our study, all average CT values for the target genes were below 35, a level considered as detectable and expressed e31.

The ΔCT values calculated fromeach sample for each of the transcripts were used as quantitative phenotypes to determine associations between SNPs and gene expression. Some samples had one or more replicate measurements that failed to amplify and were obvious outliers. These outliers were eliminated. All samples included for analysis had at least two replicates with a SEM (standard error of the mean) ≤ 0.35. The coefficients of variation for the replicates for each of the expression measurements were <0.045 (range: 0.0001-0.042).

The fold difference in the expression levels for the minor allele was estimated from the regression coefficient (ß) for the additive model that was fit to the ΔCT data, which were expressed on the log2 scale, while also adjusting for APOE4 dosage, age at diagnosis/entry and sex. Therefore, we summarized the associations, using the term 2(-ß) to estimate the fold difference in expression due to a single copy of the minor allele, and the term 2[-(ß±2SEM)] to estimate the 95% confidence interval (CI) for the fold difference. We further summarized the expression differences graphically by preparing box plots showing the fold expression difference between any sample in the series and the mean expression level of all of the samples in the most common genotype category (i.e. major homozygotes). This was done by computing the mean of all ΔCT values for the samples in the major homozygotes category and subtracting it from each of the ΔCT values for all samples in all genotype categories (i.e. major homozygotes, heterozygotes and minor homozygotes) to obtain ΔΔCT values. Given that these ΔΔCTvalues are the log2 fold changes between two measurements, the relative expression levels between any sample and the mean of the major homozygotes is given by 2-ΔΔCT = 2^(-(ΔCT_Any Sample– mean(ΔCT_Major Homozygotes))). These 2-ΔΔCT values are plotted in the box plots.

Statistical Analysis:

Analyses of Association betweenmRNA Expression Levels and 12 LOAD Candidate Gene cis-SNPs:

A total of 564 SNPs were tested, however a total of 619 associations between expression levels and SNPs were evaluated. This is because 55 LRRTM3 SNPs also reside within the 7th intron of VR22e32. Thus, 2 different expression endophenotypes (LRRTM3 and VR22 mRNA expression levels) were tested for these 55 SNPs, yielding a total of 619 (564+55) different tests.

To account for multiple testing, we applied a study-wide Bonferroni correction for the 619 tests performed in the whole study using a corrected p-value threshold of < 0.05. We also obtained q-values e33, which are directly related to the false discovery rate e34 and measure the probability that p-values reflect false positive results. We utilized the q-value package in R (v. 2.7.1) to estimate these q-values.

Subsequent to our primary analyses that were based on ordinal genotypic effects, we performed secondary analyses using dominant and recessive genotypic models based on the minor allele in order to further assess the form of the possible effect of the SNP on expression levels. These secondary analyses of specific genetic models were performed in linear regression models that also accounted for APOE4 dosage, age at diagnosis/entry, and sex as previously discussed.

Evaluation and Correction for Population Stratification:

Given that population stratification can lead to spurious genetic associations, we incorporated analyses to detect and, if present correct for, population stratification for the associations between IDE expression levels and SNPs. We used the principal components methodology implemented in EIGENSTRAT e35. We used the default settings in EIGENSTRAT to generate the top ten axes of variation, while identifying and correcting for outliers whose ancestry was at least 6 standard deviations from the mean on any of the top ten inferred axes, using the GWAS genotype data for our combined case-control series including JS_60-80, RS_60-80 and AUT_60-80. Correction for population stratification was pursued in our quantitative linear regression analyses of IDE expression levels. This correction was achieved by including EIGENSTRAT’s top ten axes of variation as covariates in the linear regression models, where each IDEcis-SNP was evaluated in an additive model based on the minor allele frequency and where APOE ε4 dosage, age at diagnosis/entry and sex were also included as covariates. We performed analyses excluding the subjects detected to be outliers as per the software’s recommendations e35.

Evaluation of Linkage Disequilibrium (LD):

Haploview 3.1 was used to calculate and visualize the extent of LD between all the IDE SNPs e36.The solid spine of LD method was used to identify the haplotype blocks. The results of LD using D’ e37are depicted for all 24 IDEcis-SNPs from the GWAS as well as the conserved region SNP rs6583817, which is in complete LD with rs7910977, the SNP showing the most significant association with expression levels (Figure e-1).

In silico search for transcription factor binding sites:

IDE cis-SNP rs7910977 is in complete LD with rs6583817, which unlike rs7910977, resides in an evolutionarily conserved region of IDEand therefore represents a more likely candidate as a functional variant. Putative transcription factor binding sites were compared for 149 base sequence flanking rs6583817 containing either the major or minor allele of this SNP, using the MatInspector search engine made available by Genomatix (

Results:

Measurement of Cerebellar Transcripts and Identification of GWAS cis-SNPs in 12 LOAD Candidate Genes: Table e-7shows descriptive statistics for the ΔCT values of 12 candidate gene transcripts measured in cerebellar samples obtained at autopsy from the brains of 200 LOAD cases. The values shown are for subjects with RNA and DNA samples that passed quality control criteria (see Methods). Table e-6 describes the number of SNPs genotyped in our recent LOAD GWAS that are located within and flanking the 12 LOAD candidate genes. A total of 564 cis-SNPs were assessed for association with their respective gene expression levels. We defined cis-SNPs as those within the gene or flanking it on either side within 100 kb. The rs numbers, chromosomal, base-pair and genic positions of these 564 SNPs are depicted in Table e-6. For each gene, the first and/or main publication with evidence of genetic association with LOAD or functional role in neurodegeneration is also shown in Table e-2.

Association of cis-SNPs with the Levels of 12 LOAD Candidate Gene Transcripts:

Table e-8 shows results for the 619 cis-SNP/candidate gene transcript associations that were analyzed. Results are sorted from the smallest additive p-value for association to the largest. Q-values derived from the additive p-values as well as study-wide Bonferroni corrected additive p-values are also shown. These results are also shown graphically in Figure e-2, where the additive cis-SNP p values for each transcript are plotted vs. chromosome position and shown in relation to gene location.