Supplementary Information (Schabath et al.)

Study population and cancer registry data

The protocol for this study was approved by the University of South Florida Institutional Review Board. The study included 442 lung cancer patients diagnosed with adenocarcinoma who i) had available primary fresh frozen tissue, ii) had gene expression microarray data, and iii) consented to Moffitt Cancer Center’s Total Cancer Care (TCC™) program (1). The patients in this study were recruited and consented between April 2006 and August 2010. TCC™ is a multi-institutional observational study of cancer patients that prospectively collects patient data and tissue samples for research purposes. There are no inclusion or exclusion criteria for providing consent: every patient is eligible, and patients are followed for life. Patients for this analysis provided informed consent to the TCC™ protocol either at Moffitt or at one of eighteen TCC™ consortium/affiliate institutions. The Cancer Registries at Moffitt and the consortium/affiliate sites abstracted self-reported and clinical data from patient medical records, including demographics, diagnosis, and stage. Some patients had missing data, as noted in Supplementary Table 1. Patient follow-up for vital status was performed annually. Pathologic TNM staging was used when available, and clinical stage was used if pathologic staging was missing. Smoking status was categorized as self-reported ever smoker (current or former smoker) or never smoker.

DNA extraction, Sanger sequencing, and mutational analysis

DNA was extracted from fresh frozen macrodissected tumor tissues using Qiagen DNeasy Blood & Tissue Kits (Valencia, CA), quantified using a NanoDrop instrument (Wilmington, DE), and assessed for integrity by electrophoresis. Sanger sequencing was performed (Functional Biosciences, Inc., Madison, WI) on tumor DNA from 442 patients using standard cycle sequencing. The following targeted exons were sequenced using previously published primers (2): exons 1 and 2 of KRAS, exons 18 to 21 of EGFR, exons 1 to 9 of LKB1/STK11, and exons 2 to 11 of TP53. All sequencing traces were examined manually by two independent readers, and mutant peak heights and identities were determined using Mutation Surveyor (SoftGenetics, State College, PA). Mutations were called if the mutant peaks were at least 20% of the wildtype (WT) signal (to allow for tumor heterogeneity) and were present on both the forward and reverse reads. In addition, to ensure that germline variations were screened out, the following steps were performed:

1) The data were filtered against the 1000 Genomes Project to remove common inherited (germline) variants. The data were also filtered against the Exome Sequencing Project (ESP) dataset to remove additional variants observed in these 6503 samples.

2) Next, we examined our mutation calls against calls downloaded from TCGA (LUAD v2.1.5.0 MAF file). Variants observed in TCGA were retained as mutations (28 of the 68 total STK11; 73 of the 111 total TP53).

3) Variants not observed in TCGA as mutations, but which encoded frame-shifted/truncated proteins or splice site mutations (matching the mutational profile of STK11 or TP53), were also retained as mutations (38 of the 68 total STK11; 28 of the 111 total TP53).

4) Next, the Sanger sequencing traces were examined for unknown variants, specifically non-truncating variants that were not in TCGA. Variants with peak heights equal to wildtype were classified as "germline" and considered wildtype. Those with unequal peak heights were retained as mutations; these made up a relatively small share of the total observed mutations (2 of the 68 total STK11; 10 of the 111 total TP53).
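The calling rule and the four filtering steps above can be read as a decision cascade. The following is a minimal sketch of that reading, not the authors' pipeline: the `Variant` record, its field names, the function `classify_variant`, and especially the numeric tolerance used to decide that two peak heights are "equal" are hypothetical, since the text does not say how equality was judged.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record for one candidate variant (all field names are assumptions)."""
    in_1000g_or_esp: bool       # seen in 1000 Genomes or ESP (step 1)
    in_tcga_maf: bool           # reported as a mutation in the TCGA MAF file (step 2)
    truncating_or_splice: bool  # frameshift, truncation, or splice-site change (step 3)
    mutant_to_wt_ratio: float   # mutant peak height divided by wildtype peak height
    on_both_reads: bool         # present on forward and reverse reads


MIN_RATIO = 0.20        # from the text: mutant peak at least 20% of the WT signal
EQUAL_TOLERANCE = 0.10  # ASSUMPTION: the text gives no numeric definition of "equal"


def classify_variant(v: Variant) -> str:
    # Base calling criteria: peak height threshold and confirmation on both reads.
    if v.mutant_to_wt_ratio < MIN_RATIO or not v.on_both_reads:
        return "not called"
    # Step 1: variants in population databases are treated as germline.
    if v.in_1000g_or_esp:
        return "germline"
    # Step 2: variants also seen in TCGA are retained as mutations.
    if v.in_tcga_maf:
        return "mutation (TCGA)"
    # Step 3: truncating or splice-site variants are retained as mutations.
    if v.truncating_or_splice:
        return "mutation (truncating/splice)"
    # Step 4: remaining variants are judged by peak height relative to wildtype.
    if abs(v.mutant_to_wt_ratio - 1.0) <= EQUAL_TOLERANCE:
        return "germline"
    return "mutation (unequal peaks)"
```

Under this reading, the three retained categories account for all reported mutations: 28 + 38 + 2 = 68 for STK11 and 73 + 28 + 10 = 111 for TP53.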

Affymetrix gene expression microarray data

As part of the TCC™ protocol, lung tumors were profiled using a custom Affymetrix GeneChip that measures the expression of ~60,000 transcripts. Tissues were processed and assessed for RNA quality according to the TCC™ protocol prior to expression analysis. Microarray gene expression data from 442 patients were used in the present study. CEL files were normalized against their median sample using IRON (3). An RNA-quality-related batch effect was corrected using a Partial Least Squares (PLS) model (4).

KRAS mutation associated signature

Expression data were normalized and batch-corrected as previously described (3, 5). To generate the signature, we tested each probeset individually for a difference between KRAS mutant and KRAS wildtype tumors. A t-test was used to generate p values and differences in mean values for the two groups. Holm’s method was applied to adjust the p values for multiple testing. Probesets were considered significant if the adjusted p value was less than 0.05 and the difference in log2 expression between means was at least 0.585 (a fold-change between means of at least 1.5). The goal was to identify probesets differing between wildtype and KRAS mutant tumors, and subsequently validate this potential signature in other datasets (see below). The signature gene list is provided in Supplementary Table 2.
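The selection rule can be illustrated with a short standard-library sketch. It assumes the per-probeset t-test p values and log2 mean differences have already been computed; the function names are hypothetical, and applying the 0.585 threshold to the absolute difference (so that both up- and down-regulated probesets qualify) is an assumption about how the criterion was used.

```python
import math


def holm_adjust(pvalues):
    """Holm step-down adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value (k = rank + 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def select_probesets(names, pvalues, log2_diffs, alpha=0.05, min_fold=1.5):
    """Keep probesets with Holm-adjusted p < alpha and |log2 difference| >= log2(min_fold)."""
    min_log2 = math.log2(min_fold)  # log2(1.5) is approximately 0.585
    adjusted = holm_adjust(pvalues)
    return [
        name
        for name, p_adj, diff in zip(names, adjusted, log2_diffs)
        if p_adj < alpha and abs(diff) >= min_log2
    ]
```

For example, `select_probesets(["a", "b", "c"], [0.01, 0.04, 0.03], [1.0, 2.0, 0.3])` keeps only `"a"`: the Holm-adjusted p values are 0.03, 0.06, and 0.06, so `"b"` fails the significance criterion despite its large difference.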

TCGA gene expression and mutation data

The results shown in this study are based upon data generated by the TCGA Research Network. Normalized RNASeqV2 RSEM files were downloaded from TCGA for both the lung adenocarcinoma and lung squamous cell carcinoma projects, then merged into a single larger cohort. Additional normalization was performed using IRON (3) v2.1.5 (iron_generic --rnaseq) to correct for minor differences in dynamic range between samples. A batch effect was identified due to a likely process change in early February 2011. Two batches were assigned, using a date_of_creation cutoff of 2014-02-01. Gene expression was then de-batched using ComBat (5), with histology and gender as covariates. Mutation calls were downloaded from TCGA (LUAD v2.1.5.0 MAF file) and merged into the gene expression metadata, with synonymous mutations treated as WT. After discarding recurrences, the final TCGA adenocarcinoma cohort used in the analyses for this study consisted of 483 samples.
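Two of the bookkeeping steps in this paragraph, batch assignment by creation date and treating synonymous calls as WT, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the cutoff is a parameter whose default is the value quoted above, placing a sample created exactly on the cutoff date in the second batch is an assumption, and synonymous calls are assumed to carry the MAF `Variant_Classification` value `Silent`.

```python
from datetime import date


def assign_batch(date_of_creation, cutoff=date(2014, 2, 1)):
    """Return 1 for samples created before the cutoff, 2 otherwise.

    ASSUMPTION: a sample created exactly on the cutoff date goes to batch 2.
    """
    return 1 if date_of_creation < cutoff else 2


def mutation_status(variant_classifications):
    """Return 'MUT' if any call for the gene is non-synonymous, else 'WT'.

    `variant_classifications` is the list of MAF Variant_Classification values
    for one sample and one gene; an empty list means no call was reported.
    """
    return "MUT" if any(c != "Silent" for c in variant_classifications) else "WT"
```

A sample whose only KRAS call is `Silent` is therefore labeled `WT`, the same as a sample with no KRAS call at all.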

Statistical analyses

All statistical analyses in our 442-patient cohort and the TCGA datasets were performed using the R Project for Statistical Computing, version 3.1.1. Based on the distribution of the covariates of interest, parametric statistics, including the two-sample t-test and ANOVA, were used to analyze the data. The chi-square test using the exact method with Monte Carlo estimation was used to test for differences in study population characteristics by mutational status. Survival analyses were performed to determine whether mutational status was associated with overall survival (OS) using Kaplan-Meier survival curves and the log-rank test. OS was right-censored at five years and was calculated from the date of surgery until the date of last follow-up or death.
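The five-year right-censoring and the Kaplan-Meier estimate can be sketched in a few lines of standard-library Python. The analyses themselves were run in R; this sketch only illustrates the definitions. The function names are hypothetical, times are assumed to be in years since surgery, and counting a death that falls exactly on the five-year mark as an event is an assumption. The log-rank test is not shown.

```python
def censor_at(times, events, horizon=5.0):
    """Right-censor follow-up at `horizon` (same unit as `times`, here years).

    A death after the horizon becomes a censored observation at the horizon.
    """
    c_times = [min(t, horizon) for t in times]
    c_events = [bool(e) and t <= horizon for t, e in zip(times, events)]
    return c_times, c_events


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time."""
    curve = []
    survival = 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

With follow-up times `[1, 2, 3, 6, 7]` and death indicators `[1, 0, 1, 1, 0]`, censoring at five years turns the death at year 6 into a censored observation at year 5, and the estimated survival drops only at years 1 and 3.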

The malignancy risk (MR) (6) and NF-κB activity signatures have also been previously described (7). Principal component analysis was used to evaluate the activity of the different gene signatures as previously described (7). In this analysis, the first principal component (PC1) was used to represent the overall expression level for each gene signature. A two-sample t-test was used to test the difference in each signature score (PC1) between wildtype and mutant tumors for each gene (KRAS, EGFR, STK11, and TP53). One-way ANOVA was performed to test for group differences among the combinations of mutations in two genes (e.g., KRAS and STK11). Tukey's honest significant difference method was used to adjust p values for pairwise comparisons. Gene lists of the MR and NF-κB signatures are shown in Supplementary Table 3.
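As an illustration of the PC1 signature score, the sketch below projects each sample onto the first principal component of a samples-by-genes matrix using power iteration. It is not the published procedure from reference 7: the function name is hypothetical, genes are centered but not scaled (an assumption), and because the sign of a principal component is arbitrary, the sketch orients PC1 so that the gene loadings sum to a positive value (also an assumption).

```python
import math


def pc1_scores(matrix, iterations=200):
    """Score each sample (row) on the first principal component of the genes (columns)."""
    n, p = len(matrix), len(matrix[0])
    # Center each gene on its mean across samples.
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]
    # Deterministic, non-uniform start vector for the power iteration.
    start = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(w * w for w in start))
    v = [w / norm for w in start]
    for _ in range(iterations):
        # One step of v <- X^T (X v), followed by normalization.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0.0:  # no variance in the data
            break
        v = [w / norm for w in new_v]
    # Resolve the arbitrary sign so that loadings sum to a positive value.
    if sum(v) < 0:
        v = [-w for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]
```

When all genes in a signature rise and fall together, the PC1 score orders samples by their overall expression of the signature, which is the sense in which PC1 represents the signature's overall expression level.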

Code availability

Code for the statistical analyses performed will be provided upon request.

References

1. Fenstermacher DA, Wenham RM, Rollison DE, Dalton WS. Implementing personalized medicine in a cancer center. Cancer J. 2011;17(6):528-36. Epub 2011/12/14.

2. Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, Leary RJ, et al. The genomic landscapes of human breast and colorectal cancers. Science. 2007;318(5853):1108-13. Epub 2007/10/13.

3. Welsh EA, Eschrich SA, Berglund AE, Fenstermacher DA. Iterative rank-order normalization of gene expression microarray data. BMC Bioinformatics. 2013;14:153. Epub 2013/05/08.

4. Wold S, Ruhe A, Wold H, Dunn WJ III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing. 1984;5(3):735-43.

5. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118-27. Epub 2006/04/25.

6. Chen DT, Hsu YL, Fulp WJ, Coppola D, Haura EB, Yeatman TJ, et al. Prognostic and predictive value of a malignancy-risk gene signature in early-stage non-small cell lung cancer. Journal of the National Cancer Institute. 2011;103(24):1859-70. Epub 2011/12/14.

7. Hopewell EL, Zhao W, Fulp WJ, Bronk CC, Lopez AS, Massengill M, et al. Lung tumor NF-kappaB signaling promotes T cell-mediated immune surveillance. J Clin Invest. 2013;123(6):2509-22. Epub 2013/05/03.