Keywords:Colorectal Cancer, GWAS, Pathway Analysis, Transcriptome, Metabolic Pathways

Pathway analysis of genome-wide association study and transcriptome data highlights new biological pathwaysin colorectal cancer[AC1]

Abstract

Colorectal cancer (CRC) is a common malignancy that meets the definition of a complex disease. Genome-wide association study (GWAS) has identified several loci of weak predictive value in CRC, however these do not fully explain the occurrence risk.Recently, gene set analysis hasallowed enhanced interpretation of GWAS data in CRC, identifying a number of metabolicpathways as important fordisease pathogenesis. Whether there are other important pathways involved in CRC, however,remains unclear.We present a systems analysis of KEGG pathways in CRC using (1) a human CRC GWAS dataset and (2) a human whole transcriptome CRC case-control expression dataset.Analysis of the GWAS dataset revealed significantly enriched KEGG pathways related to metabolism, immune system and diseases, cellular processes, environmental information processing, genetic information processing, and neurodegenerative diseases. Altered gene expression was confirmed in these pathwaysusing the transcriptome dataset.Taken together, these findings not only confirm previous work in this area, but also highlight new biological pathways whose deregulation is critical for CRC. These results contribute to our understanding of disease-causing mechanisms and will prove useful for future genetic and functional studies in CRC.

Keywords:colorectal cancer, GWAS, pathway analysis, transcriptome, metabolic pathways

Introduction

Colorectal cancer (CRC), also called colon cancer or large bowel cancer, is the third most common form of cancer and the second leading cause of cancer-related death in the Western world with a lifetime risk in the United Statesof approximately 7%. CRC can be considered a complex disease, with a combination of genetic variants and environmental factors contributing to the illness as a whole(1). Relevant genetic variants have not been completely defined, however, genome-wide association study (GWAS) has recently revealed several novel CRC susceptibility loci( newlyidentified variants exert only very small risk effects and cannot fully explain the underlying CRC genetic risk. A large proportion of the heritability of CRC is therefore yet to be explained.

To overcome one of the well-documented limitations of GWAS(2-3),namely its tendency to find large numbers of relatively weak predictors, investigators have developed a means of analyzingGWAS datathataggregates information over sets of related genes, such as genes in common pathways, to identify gene sets that are enriched for variants associated with disease. This “gene set” analysis strategy has yielded important new insights into the genetic mechanisms of many complex diseases(4), such as Alzheimer’s disease(5,6),rheumatoid arthritis (7,8), Crohn’s disease, celiac disease, type 1 diabetes, multiple sclerosis (9–12), schizophrenia (13,14), bipolar disorder (13), and cancers of the bladder (15), breast (16) and lung (17).

In an attempt to further enhance the power of gene set analysis insituations in which there are a large number of predictors compared with sample size, Chen et al. developed an algorithm termed ‘gene set ridge regression in association studies (GRASS).When the GRASS algorithm was validated on CRC GWAS data, the top two enriched pathways were nicotinate and nicotinamide metabolism and transforming growth factor beta (TGF-beta) signaling (Kyoto Encyclopedia of Genes and Genomes identifiers hsa00760 and hsa04350, respectively) (18). Several other pathways related to metabolism were also found to be enriched including glycosphingolipid biosynthesis – lacto and neolacto series (hsa00601), beta-Alanine metabolism (hsa00410), phenylalanine metabolism (hsa00360) and inositol phosphate metabolism (hsa00562) (18). In addition to GRASS, analysis of the CRC GWAS data using the existing gene set method proposed by Wang et al. (3)revealeda further 15 significant pathways, of which 14 were related to metabolism (18).

To date, therefore, evidence from pathway analysis of CRC GWAS data has predominantly implicated the involvement of altered metabolic pathways in CRCoccurrence.To better investigate whether there are other important pathways involved in CRC, we conducted a systems analysis ofKEGG pathways, using (1) a CRC GWAS datasetand (2) a human whole genome CRC case-control gene expression dataset.[AR3]

Materialsandmethods

CRCGWASdataset

The data were obtained from the Colorectal Tumour Gene Identification (CoRGI) consortium (19). This study included 922 cases and 927 controls. The CRC cases had at least one first-degree relative affected by CRC and one or more of the following phenotypes: CRC at age 75 or below; any colorectal adenoma at age 45 or below; three or more colorectal adenomas at age 75 or below; or a large or aggressive adenoma at age 75 or below. Controls were spouses or partners unaffected by cancer and without a personal family history of colorectal neoplasia. All cases and controls were of European ancestry and from the UK.A total of 555,352 SNPs were genotyped using the Illumina Hap550 BeadChip Array[AG4]. On analysis, satisfactory data was obtained from 550,163 SNPs (99.1%), with mean individual sample call rates of 99.7 and 99.8% in cases and controls, respectively. Of the SNPs satisfactorily genotyped, 2516 were monomorphic, leaving 547,647 SNPs for which genotype data were informative. After quality control, we used the summary genotype data from autosome 1-22, which included534,050 SNPs.The Chi-square test (2×2 allele) was used to investigate the association between each polymorphism and CRC. All the chi-square tests were conducted using the R software environment (

Human CRC whole genome expression dataset

TheexpressiondatasetwasoriginallyanalyzedbyLascorzetal.(20). They collected 23 CRC gene expression profiling studies and performed a meta-analysis. In the 23 independent GEP datasets,1897 different gene identifiers were reported to be differentially expressed(p 0.05). The authors describe 1475 uniquely mapped genes including603 up-regulated and 794 down-regulated genes from the original study. We used IDconverter software[AG6]to convert the gene symbols to Entrez gene IDs (21). In the end, we selected575 up-regulated genes and 763 down-regulated genes with unique Entrez gene IDs for our subsequent analysis.[AR7]

Gene-based GWAS data analysis

ProxyGeneLD software[AG8]was used to perform a gene-based test(22). This software is designed toflexibly consider the complex linkage disequilibrium (LD) patterns of the human genome and correct for the inflation of significance caused by gene length. The program uses the LD structures in the HapMap genotyping data (CEU samples of HapMap phase II, release 22). If a group of markers are in high LD in HapMap (r20.8), they are tied to a “proxy cluster”and taken as a single signal. Next, each marker in the AD GWAS with statistically significant evidence of association is evaluated to determine (a) whether it belongs to any proxy cluster and (b) whether the marker itself or any marker in the cluster is located in a genetic region. If a marker or cluster overlaps with a region extending across a gene, it is assigned as a signal showing the possible association of that gene. Finally, a p-value was assigned for each gene(22).Genes with p0.05 were considered to be significant.

Gene set (pathway) analysis

The WebGestalt toolkit( used to perform a pathway analysis (23). For a given KEGG pathway, a hypergeometric test was used to detect an overrepresentation of the CRC-related genes among all the genes in the pathway (23). The p-value of observing more than KCRC-related genes in the pathway was calculated by

where N is the total number of genes that are of interest, S is the number of all CRC-related genes, m is the number of genes in the pathway, and K is the number of CRC-related genes in the pathway. The Benjamini & Hochberg (BH) method was used to correct for multiple testing. Pathways with adjustedp0.05 was considered to be significant. To reduce the multiple-testing issue, and to avoid testing overly narrow or broad pathways, we selected pathways that contained at least 20 and at most 300 genes for subsequent analysis.

Results

Pathway analysis of the CRCGWASdataset

Using ProxyGeneLD, we identified 981 CRC genes with p0.05.Pathway analysis was conducted of all 981significant genes. We identified 34 significant KEGG pathways (adjustedp0.05)thateach included at least fiveCRC genes identified in the gene-based method (Table 1).According to the KEGG classifications, the functions of these pathways included metabolism (n=8), cellular processes (n=5), environmental information processing (n=5), genetic information processing (n=5), neurodegenerative diseases (n=3) and the immune system (n=2). The eightmetabolic pathways in whichgenes showed significant enrichmentincludedoxidative phosphorylation, glycerophospholipid metabolism, fructose and mannose metabolism, purine metabolism, amino sugar and nucleotide sugar metabolism, arginine and proline metabolism, arachidonic acid metabolism and retinol metabolism. Detailed information is described in supplementary Table 1.

Table 1

Pathway analysis of the human CRC whole genome expression dataset

For the expression dataset, to achieve internal comparison, we analyzed the significantly up-regulated and down-regulated genes (CRC cases vs controls) separately. Using the up-regulated genes, we identified 92 significant KEGG pathways (adjustedp0.05)that each included at least five CRC genes. These pathways were mainly related to immune system and diseases (n=18), environmental information processing (n=12), infectious diseases (n=11), cellular processes (n=10), cancers (n=9), metabolism (n=5) and cardiovascular diseases (n=4). Importantly, we observed an enrichment of significantly up-regulated genes in 25 of 34 significant KEGG pathways identified by analysis of the GWAS data(Table 2). Theseresults are provided in detailinsupplementary Table 2.

Using the down-regulated genes, we identified 105 significant KEGG pathways (adjustedp0.05)that included at least five CRC genes. These pathways were mainly related to metabolism (n=15), immune system and diseases (n=13), cancers (n=13), cellular processes (n=12), environmental information processing (n=11), genetic information processing (n=10) infectious diseases (n=9), neurodegenerative diseases (n=5), and cardiovascular diseases (n=4).As with the over-expressed genes, we observed an enrichment of significantly down-regulated genes in the majority of the significant KEGG pathways identified by the GWAS analysis (27 of 34; Table 3).Detailed results are provided in supplementary Table 3.

Table 2

Discussion

CRC is a complex disease that is likely to caused by a combination of genetic and environmental factors. Recently, Chen et al. applied two pathway analysis methodsto CRC GWAS data and identified several potentially importantmetabolic pathways(18). To investigate whether deregulation of other pathways is criticalfor the development of CRC, we conducted a systems analysis using GWAS and whole transcriptome data.

Our approach identified numerous hitherto unrecognized cellular pathways important for CRC occurrence, andalsoconfirmed the substantial involvement of metabolic pathways in CRC that was first reported by Chen et al. (18).Evidence from recent epidemiologic studies also supportsan association with metabolic processes, showing a link between metabolic syndrome(MS) and the risk for CRC. Kim et al. analyzed 2531 subjects including 731 CRC cases and 1800 controlsand found that the prevalence for MS was 17% in patients with colorectal adenomabut only 11% in the control group(24). In keeping with the latter findings, a similar studyscreened 1771 CRC patients and 4667 controlsand observed that metabolic risk factors such as high waist circumference, blood pressure, and serum triglyceride levels were associated with an increased risk of CRC (25). In the latter cohort MS was also associated with an increased risk of adenoma (OR = 1.44, 95% CI = 1.23–1.70)(25). MS increased the risk of right colon adenomas (OR = 1.50, 95% CI = 1.22–1.85), left colon adenomas (OR = 1.36, 95% CI = 1.05–1.76), and adenomas in multiple anatomical locations (OR = 1.59, 95% CI = 1.19–2.12)(25).

Recently, Esposito et al. conducted a systematic review and meta-analysis to assess the association between MS and the risk of different cancers(26). They analyzed 116 datasets from 43 articles, encompassing 38,940 cases of cancer. The presence of MS was associated with increased incidence of CRCin men (relative risk 1.25 (RR), p< 0.001) andin women (relative risk 1.34, p= 0.006) (26). Aleksandrovaet al. performed a nested case-control study using 1093 CRC cases and 1093 controls(27). The results showed that among individual components of MS, abdominal obesity (RR = 1.51; 95% CI: 1.16–1.96) and abnormal glucose metabolism (RR=2.05; 95% CI 1.57–2.68) were associated with CRC(27).

Teo et al.recently assessed the association between genetic risk factors of MS or related conditions and clinical outcome in stage II CRC patients(28). They analyzed the expression levels of several genes related to MS and associated alterations in two equivalent but independent sets of stage II CRC patients. The results showed that a gene expression profile constituted by genes previously related to MS was significantly associated with clinical outcome of stage II CRC patients.

Despite the abundant evidence from epidemiological studies of an association between MS and CRC, such reports provide little information onthe precise metabolic mechanisms that affect CRC risk. Our results therefore provide clues that may help to explain the link between these twoconditions and provide rational targets for potential therapeutic intervention.

In addition to the metabolic pathways identified, we highlighted the involvement of pathways related to cellular processes, environmental information processing, genetic information processing, neurodegenerative diseases and the immune system in the development of CRC. Indeed, theRNA transport pathway, one of the KEGG-defined genetic information processing pathways,was the most significant pathway from analysis of CRC GWAS data (Table 1). In keeping with this finding wealso observed an enrichment of significantly down-regulated genes in this pathway (Table 2). RNA transport from the nucleus to the cytoplasm is fundamental for gene expression. The different RNA species that are produced in the nucleus are exported through the nuclear pore complexes via mobile export receptors. However, general mRNA export is mechanistically different. Export of transcripts can be modulated in response to cellular signaling or stress(29). mRNA export has been found to be consistently dysregulated in primary material from many different forms of cancer. Aberrant expression of export factors can alter the export of specific transcripts encoding proteins involved in proliferation, survival, and oncogenesis. Thus, like transcription and translation, mRNA export may also play a critical role in cancer genesis and maintenance(29), and strategies to target different aspects of this pathway are now undergoing early-stage clinical trials. [AR10]

A variety of software tools are available for gene set (pathway) analysis of GWASdata(30). Some, such as SNP ratio test (31), GenGen (3), GRASS (18), and PLINK set-test (32), accept raw genotype datasets as input data. Others includingProxyGeneLD (22), ALIGATOR, i-GSEA4GWAS, and GESBAP(30)are used to perform initial calculation of the SNP p-valuesprior to subsequent gene-set analysis. In the present studywe selected ProxyGeneLD for the initial gene-based test because we did not have access to raw CRC genotype data.This software is capable ofadjusting for gene length and LD patterns in the human genome, reducingthe sources of bias and increasing the reliability of the pathway analysis.

In this study, we used the pathways from the KEGG database, but not the GO database based on the following considerations. First, the KEGG database is manually compiled on the basis of biological evidence and does not have a hierarchical structure (33-34), whereas the GO database is based mainly on computer predictions as well as human annotation and has a hierarchical structure. GO analysis typically assumes that each functional category is independent, and less than 1% of the GO annotations have been confirmed experimentally (33-34). Second, Chen et al. performed a pathway analysis of CRC GWAS using KEGG pathways (18). To compare our findings with that from Chen et al., we limited our pathways from KEGG.

For the expression dataset, there are two strategies for enrichment analysis of pathways: the analysis of all differentially expressed genes together or the analysis of up- and down-regulated genes separately (35). Recently, Hong, et al. examined the rationales of these enrichment analysis strategies using gene expression profiles from five types of tumors(35). They concluded that the separate analysis of up- and down-regulated genes could identify more pathways that are genuinely pertinent to phenotypic difference than analyzing all of the differentially expressed genes together (35). Based on the latter findings, we analyzed the significantly up-regulated (CRC case vs. controls) and down-regulated genes (CRC case vs. controls) separately in pathway analysis.

Despite these interesting results, we recognize some limitations in our study.First, our study suffered from the usual drawback of gene set enrichment analysis resulting from use of the hypergeometric distribution to assess enrichment in pathways. This assumes that signals (i.e., pathways) are independent, which is unlikely to be the case(33-34, 36). Second, multiple-testing corrections may not be sufficient to account for all biases. Ideally the results from the CRC GWAS should be adjusted using a permutation test. However, the original SNP genotype data for each individual were not originally available to us. Our intention in future is to obtain this data, and to perform additional pathway analysis using SNP ratio test (31), GenGen (3), GRASS (18), and PLINK set-test (32), which can be used to analyze the SNP genotype data, and to conduct a permutation test. Additionally, as with all findings obtained from GWAS data, further analyses are required from independent cohorts toconfirm our findings.

In summary, we not only provide strong evidence ofthe involvement of specific metabolic pathways in the development of CRC, but also highlight the involvement of pathways related to cellular processes, environmental information processing, genetic information processing, neurodegenerative diseases and the immune system. We believe that our results may advance the understanding of CRC mechanisms and will be highlyuseful for future genetic and clinical studies in CRC.

Table 1.KEGG pathways with P<0.05 by pathway analysis of CRC GWAS

Pathway ID / Pathway Name / KEGG classification / C / O / E / R / rawP / adjP
hsa03013 / RNA transport / Genetic Information Processing / 151 / 18 / 3.42 / 5.26 / 1.14E-08 / 3.13E-07
hsa04145 / Phagosome / Cellular Processes / 153 / 16 / 3.47 / 4.62 / 4.52E-07 / 8.29E-06
hsa05016 / Huntington's disease / Neurodegenerative diseases / 183 / 15 / 4.15 / 3.62 / 2.05E-05 / 3.00E-04
hsa00190 / Oxidative phosphorylation / Metabolism / 132 / 12 / 2.99 / 4.01 / 4.96E-05 / 5.00E-04
hsa05012 / Parkinson's disease / Neurodegenerative diseases / 130 / 12 / 2.95 / 4.07 / 4.27E-05 / 5.00E-04
hsa04350 / TGF-beta signaling pathway / Environmental Information Processing / 84 / 9 / 1.9 / 4.73 / 1.00E-04 / 7.00E-04
hsa04141 / Protein processing in endoplasmic reticulum / Genetic Information Processing / 165 / 13 / 3.74 / 3.48 / 1.00E-04 / 7.00E-04
hsa04144 / Endocytosis / Cellular Processes / 201 / 14 / 4.55 / 3.07 / 2.00E-04 / 1.00E-03
hsa04360 / Axon guidance / Development / 129 / 11 / 2.92 / 3.76 / 2.00E-04 / 1.00E-03
hsa03040 / Spliceosome / Genetic Information Processing / 127 / 11 / 2.88 / 3.82 / 2.00E-04 / 1.00E-03
hsa00564 / Glycerophospholipid metabolism / Metabolism / 80 / 8 / 1.81 / 4.41 / 5.00E-04 / 2.10E-03
hsa04810 / Regulation of actin cytoskeleton / Cellular Processes / 213 / 13 / 4.83 / 2.69 / 1.20E-03 / 4.40E-03
hsa00051 / Fructose and mannose metabolism / Metabolism / 36 / 5 / 0.82 / 6.13 / 1.20E-03 / 4.40E-03
hsa05110 / Vibrio cholerae infection / Infectious diseases: Bacterial / 54 / 6 / 1.22 / 4.9 / 1.40E-03 / 4.80E-03
hsa03008 / Ribosome biogenesis in eukaryotes / Genetic Information Processing / 80 / 7 / 1.81 / 3.86 / 2.30E-03 / 7.40E-03
hsa04080 / Neuroactive ligand-receptor interaction / Environmental Information Processing / 272 / 14 / 6.16 / 2.27 / 4.00E-03 / 1.16E-02
hsa00230 / Purine metabolism / Metabolism / 162 / 10 / 3.67 / 2.72 / 4.00E-03 / 1.16E-02
hsa00520 / Amino sugar and nucleotide sugar metabolism / Metabolism / 48 / 5 / 1.09 / 4.6 / 4.50E-03 / 1.24E-02
hsa05010 / Alzheimer's disease / Neurodegenerative diseases / 167 / 10 / 3.78 / 2.64 / 5.00E-03 / 1.31E-02
hsa00330 / Arginine and proline metabolism / Metabolism / 54 / 5 / 1.22 / 4.09 / 7.50E-03 / 1.88E-02
hsa04260 / Cardiac muscle contraction / Circulatory system / 77 / 6 / 1.74 / 3.44 / 8.10E-03 / 1.94E-02
hsa04010 / MAPK signaling pathway / Environmental Information Processing / 268 / 13 / 6.07 / 2.14 / 8.70E-03 / 1.99E-02
hsa00590 / Arachidonic acid metabolism / Metabolism / 59 / 5 / 1.34 / 3.74 / 1.08E-02 / 2.38E-02
hsa03015 / mRNA surveillance pathway / Genetic Information Processing / 83 / 6 / 1.88 / 3.19 / 1.15E-02 / 2.43E-02
hsa05215 / Prostate cancer / Cancers: Specific types / 89 / 6 / 2.02 / 2.98 / 1.58E-02 / 3.04E-02
hsa04540 / Gap junction / Cellular Processes / 90 / 6 / 2.04 / 2.94 / 1.66E-02 / 3.04E-02
hsa00830 / Retinol metabolism / Metabolism / 64 / 5 / 1.45 / 3.45 / 1.50E-02 / 3.04E-02
hsa04510 / Focal adhesion / Cellular Processes / 200 / 10 / 4.53 / 2.21 / 1.64E-02 / 3.04E-02
hsa05211 / Renal cell carcinoma / Cancers: Specific types / 70 / 5 / 1.59 / 3.15 / 2.13E-02 / 3.66E-02
hsa03320 / PPAR signaling pathway / Endocrine system / 70 / 5 / 1.59 / 3.15 / 2.13E-02 / 3.66E-02
hsa04630 / Jak-STAT signaling pathway / Environmental Information Processing / 155 / 8 / 3.51 / 2.28 / 2.54E-02 / 4.23E-02
hsa04612 / Antigen processing and presentation / Immune system / 76 / 5 / 1.72 / 2.9 / 2.92E-02 / 4.46E-02
hsa04620 / Toll-like receptor signaling pathway / Immune system / 102 / 6 / 2.31 / 2.6 / 2.87E-02 / 4.46E-02
hsa04370 / VEGF signaling pathway / Environmental Information Processing / 76 / 5 / 1.72 / 2.9 / 2.92E-02 / 4.46E-02

C, number of reference genes in the category; O, number of genes in the gene set and also in the category; E, expected number in the category; R, ratio of enrichment, rawP, p value from hypergeometric test; adjP, p value adjusted by the multiple test adjustment.