Supplement - Materials and Methods
Patient samples. Tumor tissue was collected at the time of resection from 227 patients with primary lung adenocarcinomas operated on at MSKCC between 1996 and 2006 (except for four patients operated on between 1990 and 1995), under protocols approved by the MSKCC Institutional Review Board. The tissues were snap frozen in liquid nitrogen and stored at -80°C. All cases were diagnosed as lung adenocarcinomas by MSKCC surgical pathologists subspecializing in thoracic pathology. By current WHO criteria, over 90% of cases would be classified as mixed subtype, based on combinations of areas of papillary, solid, acinar, and bronchioloalveolar growth patterns. The first 91 samples were also included in a separate multicenter study (Shedden et al. 2008). Sample selection was based on the availability of materials for analysis, except for approximately 10 cases added on the basis of their known EGFR-mutant status (in order to improve the statistical power of comparisons involving EGFR-mutant samples). Prior to extraction of nucleic acids, the tumor content (approximately >70% tumor nuclei) of the frozen specimen was confirmed by frozen section and the histopathologic diagnosis was verified by a pathologist. Despite this frozen section histologic review, we noted that the 28 samples with few or no CNAs by aCGH showed a significantly lower proportion of samples with mutations detected by direct sequencing (no TP53 mutations, 1 LKB1 mutation, 2 KRAS mutations) than the remaining 199 cases (3/28 vs 111/199; p<0.0001). These 28 cases were excluded from further analyses based on presumptive excess stromal contamination, leaving a final study set of 199 samples.
Clinical outcome data were available on all 199 patients (median follow-up: 28 months). For patients alive at last follow-up, the median follow-up was 30 months. Basic clinical, pathologic and mutation data are summarized in Supplementary Table 1. Unless otherwise stated, all P-values are for two-sided Fisher exact tests. Clinical outcomes, including overall survival of patients harboring DUSP4 deletion and combinations of genotypes including EGFR mutation, were plotted by the Kaplan-Meier method and excluded a single patient lacking outcome information. Statistical analyses are as stated in the text and were performed in R 2.3.1. The full annotation table for the 199 patients is provided online at http://cbio.mskcc.org/Public/lung_array_data/.
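As an illustration of the two-sided Fisher exact test used throughout, the comparison of mutation frequencies in the excluded and retained samples (3/28 vs 111/199) can be recomputed with a short script. This is a minimal sketch in Python rather than the R code used for the study; the function name and the convention of summing all tables no more probable than the observed one are assumptions made for the example.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of observing k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(total, 1.0)

# Rows: samples with few or no CNAs (n=28) vs remaining cases (n=199).
# Columns: mutation detected vs no mutation detected.
p_value = fisher_exact_two_sided(3, 25, 111, 88)
print(f"two-sided Fisher exact p = {p_value:.2e}")
```

The resulting value can be compared against the reported bound of p<0.0001.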
Mutation Screening. EGFR exon 19 deletions and exon 21 L858R mutations, which together comprise at least 90% of EGFR mutations, were detected using mutation-specific PCR-based assays, described in detail elsewhere (Pan et al. 2005). Direct sequencing of EGFR exons 18-21 was also performed in select cases. KRAS mutation status was determined by direct sequencing. Major known recurrent point mutations in EGFR, KRAS, BRAF, and PIK3CA were also screened for by Sequenom mass-spectrometry-based genotyping assays listed in Methods Table 1, and described in greater technical detail in the next section. Sequenom assays are less suitable for indels. HER2 exon 20 insertions, which make up the majority of recurrent mutations in this gene, were detected by size analysis of fluorescently labeled PCR products, analogous to the method used to screen for EGFR exon 19 deletions (Pan et al. 2005). Two sets of HER2 exon 20 primers were used. The first set was: forward: GCTGTGGTTTGTGATGGTTG, reverse: GGTGCATACCTTGGCAATCT. A second set of primers was designed outside the first for confirmation of mutations: forward: TCCAGGCTGGTACTTTGAGC, reverse: AATGGGAAGCACCCATGTAG. For the TP53, LKB1, and PTEN genes, direct sequencing of all exons was carried out at Agencourt Biosciences (Boston, MA). This was followed by automated screening for SNPs/mutations and insertions/deletions using Polyphred software. All the SNPs/mutations and indels were visually confirmed on individual traces and cases were annotated accordingly. A few amplicons in exon 6 of LKB1 showed high background peaks, yielding numerous positive calls by the Polyphred software. A new set of primers was designed and all the cases with ‘positive’ calls were re-sequenced. None of these cases showed mutations. These LKB1 exon 6 primers were as follows: forward: GTATCACCCAGGGCCTGAC, reverse: GTCTGGCCGGTAACAGGA. Finally, Sequenom assays were designed to confirm the absence of the following most highly recurrent LKB1 mutations: nt109C/T_Q37* in exon 1 and nt508C/T_Q170* in exon 5.
Additionally, we processed TP53 traces from Agencourt contract sequencing using a separate in-house pipeline (developed by J.M. and C.S.). Briefly, reads are assembled against the genomic reference sequence, and three mutation detection packages are run on the resulting assemblies (polyscan, polyphred, bic-scan). The mutation calls are parsed into a common format, annotated, and filtered to reduce the number of false positives.
Methods Table 1. Sequenom genotyping assays.
1. BRAF_EXON11_1397G-T_G466V
2. BRAF_EXON11_1406G-C_G469A
3. BRAF_EXON15_1789C-G_L597V
4. BRAF_EXON15_1790T-G_L597R
5. BRAF_EXON15_1799T-A_V600E
6. EGFR_EXON18_2155G-A-T_G719S-C
7. EGFR_EXON18_2156G-C_G719A
8. EGFR_EXON18_2159C-T_S720F
9. EGFR_EXON19_2281G-T_D761Y
10. EGFR_EXON20_2369C-T_T790M
11. EGFR_EXON21_2573T-G_L858R
12. EGFR_EXON21_2582T-G_L861Q
13. KRAS_EXON2_34G-A-C-T_G12S-R-C
14. KRAS_EXON2_35G-A-C-T_G12D-A-V
15. KRAS_EXON2_37G-C-T-A_G13R-C-S
16. KRAS_EXON2_38G-A_G13D
17. KRAS_EXON3_181C-A_Q61K
18. KRAS_EXON3_182A-G-T-C_Q61R-L-P
19. KRAS_EXON3_183A-C_Q61H
20. PIK3CA_EXON20_3140A-G-T_H1047R-L
21. PIK3CA_EXON9_1624G-A_E542K
22. PIK3CA_EXON9_1633G-A_E545K
23. PIK3CA_EXON9_1636C-A_Q546K
Sequenom-based mutation screens. The Sequenom MassARRAY system is based on matrix-assisted laser desorption/ionisation-time of flight mass spectrometry (MALDI-TOF MS). In these assays, the mutant and germline alleles for a given point mutation produce single-allele base extension reaction products of different masses that are then resolved by MALDI-TOF MS. Both the amplification and extension primers were designed using Sequenom Assay Designer v3.1 software. The amplification primers were designed with a 10mer tag sequence to increase their mass so that they fall outside the range of detection of the MALDI-TOF mass spectrometer. Results were generated using the SpectroTYPER v3.4 software (Sequenom). All the positive cases were confirmed by visually reviewing the spectra. For the PCR amplification, a total of 15 ng of genomic DNA (in 1 μl) was amplified in a 5 μl reaction mixture containing 0.1 μl (0.5 U) HotStar Taq enzyme (Qiagen, Valencia, CA), 0.625 μl of 10x HotStar buffer, 0.325 μl of 25 mM (total) MgCl2, 0.25 μl of 10 mM (each) deoxynucleotide triphosphate, 1 μl of 100 nM of each forward and reverse primer and 1.7 μl of water. The PCR step was initiated with a 95°C soak for 15 min, followed by 45 cycles, consisting of 95°C for 20 s, 56°C for 30 s, 72°C for 60 s, and a final extension of 3 min at 72°C. After PCR, the remaining unincorporated dNTPs were dephosphorylated by adding 2 μl of the SAP cocktail, containing 1.33 μl of water, 0.17 μl of reaction buffer (Sequenom) and 0.5 μl of SAP (Sequenom). The 384-well plate was then sealed and placed in a thermal cycler with the following conditions: 37°C for 40 min, 85°C for 5 min and then held at 4°C indefinitely. After the SAP treatment, a 2 μl cocktail, consisting of 0.755 μl water, 0.2 μl iPlex 10x buffer (Sequenom), 0.2 μl iPlex terminator mix (Sequenom), 0.804 μl of 7 μM/14 μM (depending on the low vs. high mass primers) extension primer mixture and 0.041 μl iPlex enzyme (Sequenom) was added.
After the iPlex cocktail addition, the plate was again sealed and placed in a thermal cycler with the following program: 94°C for 2 min followed by 40 cycles of 94°C for 5 s, [5 cycles (52°C for 5 s, 80°C for 5 s) and 72°C for 5 s]. The reaction mixture was then desalted by adding 16 μl of a water and 6 mg cationic resin mixture, SpectroCLEAN (Sequenom). The plate was then sealed and placed in a rotating shaker for 20 min to desalt the iPlex solution. Completed genotyping reactions were spotted in nanoliter volumes onto a matrix arrayed silicon chip with 384 elements (Sequenom SpectroCHIP) using the MassARRAY Nanodispenser. SpectroCHIPs were analysed using the Bruker Autoflex MALDI-TOF mass spectrometer and the spectra were processed using the SpectroTYPER v3.4 software (Sequenom).
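The component volumes given for the PCR, SAP and iPlex mixtures can be checked against the stated totals (5 μl, 2 μl and 2 μl). The following Python sketch does only that arithmetic; the dictionary keys are shorthand labels chosen for the example, not reagent names from a vendor protocol.

```python
import math

# Component volumes in microlitres, transcribed from the protocol text.
pcr_mix = {"genomic DNA": 1.0, "HotStar Taq": 0.1, "10x HotStar buffer": 0.625,
           "MgCl2": 0.325, "dNTPs": 0.25, "primer mix": 1.0, "water": 1.7}
sap_mix = {"water": 1.33, "reaction buffer": 0.17, "SAP": 0.5}
iplex_mix = {"water": 0.755, "10x buffer": 0.2, "terminator mix": 0.2,
             "extension primer mix": 0.804, "iPlex enzyme": 0.041}

def total_volume(mix):
    """Sum the component volumes of one reaction mixture."""
    return sum(mix.values())

# Stated totals for each mixture, in microlitres.
stated = {"PCR": (pcr_mix, 5.0), "SAP": (sap_mix, 2.0), "iPlex": (iplex_mix, 2.0)}
for name, (mix, expected) in stated.items():
    ok = math.isclose(total_volume(mix), expected, abs_tol=1e-9)
    print(f"{name}: {total_volume(mix):.3f} ul (stated {expected} ul) consistent={ok}")

# Volume per well after all three additions, before desalting.
well_volume = sum(total_volume(mix) for mix, _ in stated.values())
```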
aCGH profiling. Genomic DNA was extracted from primary tumors using standard techniques. DNA was then digested and labeled by random priming using Bioprime reagents (Invitrogen) and Cy3 or Cy5-dUTP. Labeled DNA was hybridized to Agilent 44K CGH arrays. This array consists of >43,000 coding and non-coding sequences allotted to assembly map positions (NCBI, Build 35) covering approximately 41,000 genes with a median interval of 37.6 kb between unique probes. Normal genomic DNA (Promega, Madison, WI) was used as a reference for 163 samples and matched normal DNA for 64 samples. After washing, the slides were scanned with an Agilent scanner and images quantified using Feature Extraction 8.5 (Agilent). Fluorescence ratios of the scanned images were calculated and the raw aCGH profiles were processed to identify statistically significant transitions in copy number using the Circular Binary Segmentation algorithm (Olshen et al. 2004). Each profile was centered so that a log2 ratio of zero is assigned to the predominant copy number, determined by the mode of the distribution of the mean log2 ratio for each segment, weighted by the number of probes per segment. After mode-centering, gains and losses for a subset of analyses were defined as segment mean log2 ratios of > 0.2 or < -0.2, and amplifications and deletions as >1 or < -1, respectively. Additionally, sample-specific thresholds for alterations were computed for all other analyses. Each is a log2 ratio threshold based on the residual between probe-level data and the segment mean to which those probes belong. This measure scales with the relative noise, or variance, of each array. Let rij be the original log2 ratio for marker i in sample j, sij its segmented value, and x a multiplier with a value of 1.5 or 2 for moderate and stringent analyses, respectively. The threshold tj is set such that:
Segments, and by extension markers i, are amplified or deleted if they exceed ±tj. As a quality control measure, aCGH profiles with minimal or no copy number alterations were removed from certain data analyses, based on presumptive low tumor cell content. The GCnorm version of the Agilent data for the complete n=199 sample set is provided online at http://cbio.mskcc.org/Public/lung_array_data/.
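The two thresholding schemes can be sketched as follows. The fixed cut-offs (±0.2 and ±1) are taken from the text. The exact definition of the sample-specific threshold is not given here, so the second function is only an assumption: it takes the per-array noise to be the standard deviation of the residuals rij − sij and multiplies it by x. Function names are hypothetical.

```python
from statistics import pstdev

def call_fixed(segment_mean, low=0.2, high=1.0):
    """Classify a mode-centered segment mean using the fixed cut-offs."""
    if segment_mean > high:
        return "amplification"
    if segment_mean > low:
        return "gain"
    if segment_mean < -high:
        return "deletion"
    if segment_mean < -low:
        return "loss"
    return "neutral"

def sample_threshold(probe_log2, segment_log2, x=1.5):
    """Hypothetical sample-specific threshold t_j for one array.

    ASSUMPTION: noise is summarised as the standard deviation of the
    residuals between probe-level values and their segment means; the
    original formula may use a different noise estimate.
    """
    residuals = [r - s for r, s in zip(probe_log2, segment_log2)]
    return x * pstdev(residuals)

# Toy array: four probes on one flat segment with +/-0.1 noise.
t_j = sample_threshold([0.1, -0.1, 0.1, -0.1], [0.0, 0.0, 0.0, 0.0], x=2)
print(call_fixed(0.3), call_fixed(-1.2), round(t_j, 3))
```

Under this assumed form, noisier arrays receive wider thresholds, which matches the stated intent that the measure scales with per-array noise.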
Non-negative matrix factorization (NMF). The algorithm used is a modification of NMF that converts aCGH data to non-negative values before NMF classification (Brunet et al. 2004; Maher et al. 2006). The aCGH dataset was segmented and dimension-reduced by eliminating redundant probes, defined as two or more probes showing identical segmented values in all samples. The reduced aCGH data were converted to non-negative values by assigning two dimensions to each of the regions: a “gain” dimension for log2 ratios greater than zero and a “loss” dimension for the absolute value of log2 ratios less than zero. The resultant non-negative matrix was subjected to NMF using the software package published by Brunet et al. (2004), run in MATLAB (Mathworks, Inc.). Each sample was assigned to a cluster based on its nearest NMF component. For each factor level, NMF was repeated 1000 times to build a consensus matrix, which was used to determine the stability of clustering. This unsupervised method was applied to samples that individually demonstrated distinct copy number alterations.
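The preprocessing that precedes NMF (removal of redundant probes, then splitting each region into gain and loss dimensions) can be illustrated with a small Python sketch. It is not the MATLAB code used for the study; the function names are hypothetical, the matrix is laid out as probes (rows) by samples (columns), and stacking the loss rows beneath the gain rows is an arbitrary choice made for the example.

```python
def drop_redundant_probes(matrix):
    """Keep one probe (row) per distinct vector of segmented values across samples."""
    seen, kept = set(), []
    for row in matrix:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(list(row))
    return kept

def to_non_negative(matrix):
    """Split each region into a 'gain' row and a 'loss' row, both non-negative."""
    gains = [[v if v > 0 else 0.0 for v in row] for row in matrix]
    losses = [[-v if v < 0 else 0.0 for v in row] for row in matrix]
    return gains + losses

# Toy segmented data: three probes, two samples; the first two probes are redundant.
segmented = [[0.5, -0.3],
             [0.5, -0.3],
             [0.0, 1.2]]
reduced = drop_redundant_probes(segmented)
non_negative = to_non_negative(reduced)
print(non_negative)
```

The output matrix has twice as many rows as the reduced input and contains no negative entries, which is the form NMF requires.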
Algorithm for Definition of MCRs. Minimal common regions (MCRs) are defined by application of an automated algorithm to segmented data according to the following rules, as previously described (Aguirre et al. 2004; Tonon et al. 2005; Yao et al. 2006; Carrasco et al. 2006; Bardeesy et al. 2006):
i. Segments with values >0.4 or <−0.4 are identified as altered.
ii. If two or more similarly altered segments are adjacent in a single profile or separated by <500 kb, the entire region spanned by the segments is considered to be an altered span.
iii. Altered segments or spans <20Mb are retained as “informative spans” for defining discrete locus boundaries. Longer regions are not discarded, but are not included in defining locus boundaries.
iv. Informative spans are compared across samples to identify overlapping amplified or deleted regions; each is called an “overlap group”.
v. Overlap groups are divided into separate loci at boundaries where the recurrence rate falls <25% of the peak recurrence for the whole group. Recurrence is calculated by counting the number of samples with alteration at high threshold.
vi. MCRs are defined as contiguous spans within a locus, having at least 75% of the peak recurrence. If there are more than three MCRs in a locus, the whole region is reported as a single complex MCR.
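Rules i to iii above can be sketched for a single chromosome of a single profile as follows. This is an illustrative Python simplification, not the published implementation: segments are assumed to be (start, end, value) tuples in base pairs sorted by start position, the function name is hypothetical, and the gap between two altered segments is measured from the end of the current span to the start of the next altered segment.

```python
def informative_spans(segments, cutoff=0.4, max_gap=500_000, max_len=20_000_000):
    """Return (start, end, direction) spans following rules i-iii.

    segments: list of (start, end, value) sorted by start, for one
    chromosome in one profile (an assumed input layout).
    """
    spans = []
    for start, end, value in segments:
        # Rule i: identify altered segments.
        if value > cutoff:
            direction = "gain"
        elif value < -cutoff:
            direction = "loss"
        else:
            continue
        # Rule ii: merge similarly altered segments that are adjacent
        # or separated by less than max_gap.
        if spans and spans[-1][2] == direction and start - spans[-1][1] < max_gap:
            spans[-1] = (spans[-1][0], end, direction)
        else:
            spans.append((start, end, direction))
    # Rule iii: only spans shorter than max_len define locus boundaries.
    return [s for s in spans if s[1] - s[0] < max_len]

profile = [(0, 1_000_000, 0.6), (1_000_000, 1_200_000, 0.1),
           (1_200_000, 2_000_000, 0.5), (5_000_000, 30_000_000, -0.8),
           (40_000_000, 41_000_000, -0.5)]
spans = informative_spans(profile)
print(spans)
```

In the toy profile, the two gains separated by a 200 kb unaltered segment are merged into one span, and the 25 Mb loss is left out of the informative spans because it exceeds 20 Mb.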
Finally, MCRs were screened against CNVs from the Database of Genomic Variants (DGV) (hosted at the UCSC Genome Browser) and from the list provided by McCarroll et al. (2008). This confirmed that none of the novel CNAs listed in Supplementary Table 2 represents a polymorphic CNV in the large sample sets available in these databases.
Expression profiling. mRNA was extracted from all 199 primary tumors using standard techniques. RNA quality was adequate for microarray hybridization in 193 of 199 tumors. All microarray hybridization and scanning steps were performed in the MSKCC Genomics Core Laboratory. Briefly, total RNA was converted to double-stranded cDNA with oligo d(T) primers and reverse transcriptase before in vitro transcription with biotinylated UTP and CTP. The resulting biotinylated cRNA was then fragmented and hybridized for 16 hours at 45°C to the HG-U133A (first 91 samples, up to 2001) or HG-U133A 2.0 (108 samples from 2002 to 2006) Affymetrix oligonucleotide arrays. The latter array contains the same probe sets as the former but with smaller feature sizes. The robust multichip average (RMA) method was used to estimate expression of probe sets (Irizarry et al. 2003). For unsupervised clustering, we applied a two-dimensional hierarchical clustering algorithm to all probe sets, with the Pearson correlation coefficient as the measure of similarity and average linkage as the method to join clusters. For supervised analyses, differentially expressed genes were identified either by two-sample t-tests or by Wilcoxon tests stratified by chipset. P-values were adjusted for multiple comparisons using the false discovery rate (FDR) method (Benjamini and Hochberg 1995). The threshold for significance was set to control the expected FDR at values <5%, as indicated. For analyses incorporating copy number status, we excluded cases with flat or borderline flat (copy-neutral) aCGH profiles based on their presumed low tumor content. For analyses not involving copy number, such as mutation-specific signatures, n=193. Because of subtle differences in signal between the Affymetrix U133A and U133A 2.0 chips, expression was mean-centered per array and standardized to unit variance to generate a Z-score.
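Two of the steps above, the per-array Z-score and the Benjamini-Hochberg FDR adjustment, can be illustrated in a few lines of Python. This is a sketch rather than the code used for the analyses; the function names are hypothetical, and the use of the sample standard deviation for the Z-score is an assumption.

```python
from statistics import mean, stdev

def zscore(values):
    """Mean-center one array's expression values and scale to unit variance."""
    m, s = mean(values), stdev(values)  # ASSUMPTION: sample standard deviation
    return [(v - m) / s for v in values]

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

z = zscore([1.0, 2.0, 3.0])
adj = bh_adjust([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in adj]
print(z, adj, significant)
```

Probe sets whose adjusted P-value falls below 0.05 would be retained at the 5% FDR level described above.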
The raw data (Affymetrix CEL files) for the complete n=193 expression microarray sample set are provided online at http://cbio.mskcc.org/Public/lung_array_data/.