Supplementary Methods

Supplementary Methods

General information

All statistical analysis and data processing steps described below were performed using R (www.r-project.org/). EGFR and KRAS mutation status as well as smoking status for samples in included genomic or mRNA data sets were identified from original publications, or by cross-referencing to other studies using the same samples, e.g. Ding et al. (1) for Weir et al. (2). Explicit information on patient ethnicity or specific EGFR and KRAS mutation type was not available for the majority of included microarray studies and consequently not included in analyses. However, included studies were performed in both western and Asian countries. Specifically, never-smoking status was inferred based on either a present never-smoking annotation, or by an annotation of zero patient pack-years. All included tumors represented primary disease based on information from original studies. All genomic data sets were matched to the hg18/NCBI36 genome build. All genomic profiles were analyzed in an unmatched fashion and sample uniqueness was assured as described (3). Matching of genomic regions to genes was performed using the hg18 gene annotation object from GISTIC 2.0 (4). The TCGA genomic and mRNA expression validation cohorts represent unpublished, but freely available data sets (publication embargo lifted) (5). However, due to the evolving status of the unpublished data for these samples annotations and called mutations could be different when finally reported (due to e.g. mutation validation).

This document outlines:

1. Generation of copy number and B allele estimates for genomic cohorts.

2. Copy number estimates for TCGA and CLCGP cases.

3. Copy number segmentation and post-processing of segmented data for genomic data sets.

4. Quality assessment of samples from genomic data sets for inclusion in the study.

5. Merging of genomic data sets into a final cohort.

6. Definition of recurrent amplifications in the genomic cohort.

7. Statistical testing between subgroups of tumors – genomic regions.

8. Supervised classification of smoking status using genomic regions

9. Gene expression analyses.

10. Mutation analysis in TCGA samples.

Estimation of copy number and allele estimates for Affymetrix SNP data

Estimates of copy number (CN) and B Allele Frequency (BAF) were made in an unmatched fashion, without usage of any matched normal tissue (as such tissues were only available for parts of Weir et al.). Specifically, for the different Affymetrix data sets CN and BAF estimates were generated using CRMAv2 (6) and ACNE (7) in combination as outlined in the ACNE vignette (http://www.aroma-project.org/) and as further described in (3, 8). For Affymetrix SNP6 arrays processed by CRMAv2 and ACNE (not TCGA and CLCGP-cohorts) only SNP probes were used in further analyses, corresponding to approximately 931000 probes. The reason for exclusion of copy number probes present on the 6.0 chip is that no BAF estimates are calculated for these probes by ACNE.

Copy number estimates for the TCGA and CLCGP cohort.

490 lung adenocarcinoma genomic profiles were downloaded from TCGA (5) (accessed September 11, 2013) as level 3 data (segmented data). 413 samples passed quality control (see below) and could be matched to complimentary mutation and or RNAseq data (see below) The no_cnv profiles mapped to hg18 were selected, re-centralized as described below (post-merging centralization), and then post-processed similarly as the other cohorts prior to integration into the final 1398-sample cohort (see below).

All genomic profiles from CLCGP (9) were obtained as pre-segmented data from the author’s website. 359 adenocarcinomas passed quality control and had available patient smoking status. Genomic profiles were re-centralized as described below (post-merging centralization), and then post-processed similarly as the other cohorts prior to integration into the final 1398-sample cohort (see below).

GLAD copy number segmentation and post processing of segmented data

Prior to copy number segmentation, genomic profiles were centralized similarly as described (10). GLAD segmentation was performed using the Bioconductor GLAD (11) package for R (www.bioconductor.org) as described (3). Segmented profiles were post-processed as described (3).

Quality assessment of genomic profiles

Genomic profiles (copy number profiles plus B allele frequency plots when possible, obtained from (8)) were visually inspected for all cases, and cases with poor quality / inconsistent BAF and / or CN estimates were excluded. In addition, cases with flat copy number profiles were also excluded. Exclusion of cases with flat copy number profiles was based on that we did not find larger regions with apparently somatic copy neutral allelic imbalance in cases with flat genomic profiles analyzed by SNP arrays. Consequently it appears difficult to conclude whether these cases in fact harbor tumor cells or not.

Moreover, the GSE34140 and GSE28572 data sets reports inclusion of micro- / macrodissected tissues to some extent (specific sample information is not available), while information for other data sets is missing.

Merging of partitioned genomic profiles from different data sets

Merging of partitioned genomic tumor profiles from different data sets into a 3000bp virtual probe set was performed as described (3).

Post merging recentralization

After merging all data sets to the common 3000bp probe set a final recentralization of copy number profiles were performed similarly as described (3, 10).

Identification of recurrent amplifications in mGISTIC regions

Recurrent amplifications were investigated in the 59 mGISTIC gain regions from Planck et al. (8) for each sample. For mGISTIC regions or individual genes the mean partitioned CN were computed for probes matching region/gene boundaries. A CN threshold of log2ratio >0.8 was used to identify samples with amplification for a specific region or gene similar to Weir et al. (2).

Statistical testing between subgroups of tumors – CNAs

Fisher's exact test or the Chi-square test was used to identify genomic regions associated with different subgroups of tumors. For regions of gain and loss separately, samples were classified as having copy gain or not and copy loss or not, respectively, based on log 2 ratio thresholds of ±0.12. Two-by-two (e.g. never-smokers versus smokers) or two-by-three (e.g. never-/current-/former-smokers) contingency tables tested gained versus no gain and loss versus no loss between subgroups.

Supervised classification of patient smoking status – CNAs

Supervised classification was performed between never-smokers and smokers (former or current), never-smokers and current-smokers, and never-smokers and former-smokers based on smoking-related CNAs from the genome-wide analysis of 200 KBp segments, and smoking-related genomic signatures from Massion et al. (12), Thu et al. (13), Weir et al. (2), and Broet et al. (14) using the caret R package. In addition, we evaluated the 90 recurrent mGISTIC regions reported by Planck et al. (8), representing focal CNAs recurrent in lung adenocarcinoma derived without respect to smoking. For the Massion et al. signature the original 60 BAC probes were mapped to the UCSC hg18 genome build, with 41 probes being present in the database. For these probes, we extended the region covered by each probe by ±1MBp, and matched these extended regions to the sequential 200 KBp segments. For the Thu et al. signature we used the 21 regions reported in that study (regions R1-R21). We extended each region by ±1MBp, and matched these extended regions to the sequential 200 KBp segments. For Weir et al. we used reported regions with borderline significance on 10q, 16p, 15q, 7p, and 7q. We extended each region by ±1MBp, and matched these extended regions to the sequential 200 KBp segments. For Broet et al. we used the 16 regions listed in Table 2 in that study. We extended each region by ±1MBp, and matched these extended regions to the sequential 200 KBp segments. For Planck et al. we used the 90 mGISTIC regions and extended each region by ±1MBp, and matched these extended regions to the sequential 200 KBp segments.

The 1398-sample set was divided into a 50/50% balanced training and test set considering the individual cohort, smoking status, and EGFR mutation status of patients. Pam, linear support vector machine, and linear discriminant analysis models with parameter tuning based on 4-fold cross validation and ROC area as performance criteria were used to derive classifiers in the training set. Each model was reiterated up to 10 times with different samplings of test and training sets to determine an average performance and a standard deviation (using balanced accuracy as measurement of performance). The data matrix for classification consisted of the 12698 sequential 200Kbp segments with ternary values (-1 for copy number loss, 1 for copy number gain, and 0 for copy number neutral status) for copy number status for each sample. For never-smokers versus smokers (current and former), never-smokers/current-smokers, and never-smokers/former-smokers we evaluated each list of genomic regions using three different classifiers: Partitioning around mediods (PAM), linear support vector machine (svm) and linear discriminant analysis (lda).

Gene expression analyses

Microarray cohorts

A total of 8 public Affymetrix gene expression microarray cohorts were analyzed as described in Table 1 (Chitale et al. (15) was divided into two cohorts due to different Affymetrix platforms). The selection of only a single array type (Affymetrix) facilitates multicohort analyses, and the creation of classifiers, as probe sets are shared between the U133A and U133 2 plus platforms. Affymetrix cohorts were normalized individually using GCRMA (16). Next, only adenocarcinoma tumor samples were selected, and normalized intensities were log2 transformed and all cohorts were mean-centered for each probe across all tumors. Gene expression differences are thus interpreted as differences in relative expression between tumors. Prior to analyses we removed probe sets not annotated with a gene symbol. For genes with multiple probe sets, the probe with the highest standard deviation was kept when matching to gene symbols or gene signatures not including Affymetrix probe set identifiers.

Processing of TCGA RNA-seq expression data

Level 3 RNA-seq V2 data was obtained for 435 adenocarcinomas (423 with smoking status) from TCGA (5), accessed September 11, 2013. Summarized gene expression values were used (“*.rsem.genes.normalized_results” files). An offset of 1 was added to all counts, followed by log2 transformation and mean-centering creating relative expression estimates between samples similar to the microarray cohorts. Mutations were summarized from the MAF file of somatic mutations accessed at the same time point. Mutations are to be regarded as preliminary as they have not been validated or reported in a finalized manner. CN-FGA values were obtained from the analysis of the complementary SNP6 data as described above.

Consensus clustering

Consensus clustering was performed using the ConsensusClusterPlus R package (17) on a cohort basis for the six Affymetrix discovery cohorts and the TCGA RNAseq data (see Figure 1B). Prior to clustering, expression data was mean-centered across all samples in each data set. For microarray cohorts probe sets with log2ratio standard deviation < 0.3, <0.5, or <1 were removed (each data set was clustered three times). For the RNAseq data genes with log2ratio standard deviation <1, or <1.5 were removed (this cohort was clustered two times). Parameters for the consensus clustering were 1000 repetitions, pItem and pFeature 0.7, Pearson correlation and complete linkage. Only the k=2,3,4 group solutions were evaluated further.

Calculation of expression metagenes

A proliferation metagene based on the CIN70 signature (18) was calculated as the mean expression of all matching genes for each sample in a specific cohort. A metagene for terminal respiratory unit (TRU-) type tumors was constructed based on 293 differentially expressed genes from Takeuchi et al. (19). A metagene for bronchial / alveolar type tumors was constructed based on 78 differentially expressed genes from Shibata et al. (20). A metagene for the distal airway stem cell (DASC)-like subtype was constructed based on 249 differentially expressed genes from Cheung et al. (21). For both the TRU, DASC and bronchial/alveolar metagene the direction of expression for reported genes were taken into account. Higher scores indicate higher proliferation for CIN70, a more DASC-like subtype (DASC), a more TRU-type like subtype (TRU), and a more alveolar-like subtype (alveolar/bronchial).

Classification according to adenocarcinoma molecular subtypes

Molecular subtype classification was performed using centroids reported by Wilkerson et al. (22). Samples were assigned to the centroid with the highest Pearson correlation. The number of genes in the centroid matching to the different data sets varied due to differences in the content of the different microarray / RNAseq platforms.

Analysis of irreversibly altered genes due to smoking in normal airway epithelium

43 genes reported to irreversible altered by smoking were collected from three studies (23-25) (see below). Affymetrix probe sets were identified from original publications or online resources and used in the analysis.

ACSL5 / GPX2 / PIR
AHRR / HLF / PLA1A
CCDC33 / HN1 / PLA2G10
CCND2 / ITM2A / SEC14L3
CEACAM5 / MAOB / SERPINB13
CEACAM6 / ME1 / SERPIND1
CX3CL1 / MMP10 / SLC6A16
CYFIP2 / MT1F / SRPX2
DPYSL3 / MT1X / SULF1
FAM107A / MUC5B / TLE1
FAM114A1 / NQO1 / TM4SF1
FASN / PDGFC / TMCC1
GANC / PECI / TNS1
GMDS / PI4K2A / TNS3
UPK1B

Supervised classification of patient smoking status – gene expression

Supervised classification was performed for never-smokers versus smokers, or never-smokers versus current-smokers across six and four cohorts, respectively, using the Caret R-package. For the never-smoker versus current-smoker analysis former-smokers were removed from cohorts prior to analysis. Prior to analysis we removed patients without smoking history, probe sets without unique gene annotation, followed by mean-centering across remaining samples for each cohort. Only probe sets present on both U133A and U133 2plus were used in further analyses. For each cohort we trained classifiers based on different feature selection criteria and classifier models (pam, linear support vector machine, random forest, and generalized boosted regression) using parameter tuning based on 4-fold cross validation and ROC area as performance criteria (see below). Each tuned classifier was next evaluated in remaining independent cohorts for never-smokers versus smokers (n=5 test cohorts), and never-smokers versus current-smokers (n=3 test cohorts), respectively, with assessment of the sensitivity and specificity in the classification.

For never-smokers versus smokers we evaluated the following 34 classifier combinations:

PAM for probe sets with log2ratio std >0 / Random forest for probe sets with log2ratio std >0 / PAM for genes matching Woodruff et al. (26) smoking-associated signature
PAM for probe sets with log2ratio std >0.3 / Random forest for probe sets with log2ratio std >0.3 / PAM for irreversibly altered genes
PAM for probe sets with log2ratio std >0.5 / Random forest for probe sets with log2ratio std >0.5 / Linear support vector machine for genes matching Bosse et al. (24) smoking-associated signature
PAM for probe sets with log2ratio std >1 / Random forest for probe sets with log2ratio std >1 / Linear support vector machine for genes matching Boelens et al. (27) smoking-associated signature
PAM for probe sets differentially expressed between groups with p<0.01 (t-test). / Generalized boosted regression for probe sets with log2ratio std >0 / Linear support vector machine for genes matching Spira et al. (23) smoking-associated signature
PAM for probe sets differentially expressed between groups with p<0.001 (t-test) / Generalized boosted regression for probe sets with log2ratio std >0.3 / Linear support vector machine for genes matching Woenckhaus et al. (28) smoking-associated signature (normal tissue signature)
Linear support vector machine for probe sets with log2ratio std >0 / Generalized boosted regression for probe sets with log2ratio std >0.5 / Linear support vector machine for genes matching Woodruff et al. (26) smoking-associated signature
Linear support vector machine for probe sets with log2ratio std >0.3 / Generalized boosted regression for probe sets with log2ratio std >1 / Linear support vector machine for irreversibly altered genes
Linear support vector machine for probe sets with log2ratio std >0.5 / PAM for genes matching Bosse et al. (24) / Random forest for irreversibly altered genes
Linear support vector machine for probe sets with log2ratio std >1 / PAM for genes matching Boelens et al. (27) / Generalized boosted regression for irreversibly altered genes
Linear support vector machine for probe sets differentially expressed between groups with p<0.01 (t-test). / PAM for genes matching Spira et al. (23)
Linear support vector machine for probe sets differentially expressed between groups with p<0.001 (t-test) / PAM for genes matching Woenckhaus et al. (28) smoking-associated signature (normal tissue signature)

For never-smokers versus current-smokers we evaluated the following 29 classifier combinations: