
Supplemental Information

Contents

Supplemental Methods

Brief descriptions and rationales of the framework

Discussion on COPD data sets used in the current study

Supplemental Figures

Supplemental Tables

References

1. Supplemental Methods

1.1. Animal model

The animal model used in this study was the ADA-deficient mouse, an established model of rapidly progressing pulmonary disease [1]. Mice genetically deficient in ADA spontaneously develop features of chronic lung diseases [2, 3]. At birth, litters from ADA+/- × ADA+/- matings were screened for ADA enzymatic activity using zymogram analysis. The ADA-/- mice were injected with pegylated ADA (PEG-ADA) enzyme every 4 days from birth through day 25 to allow for normal lung development. On day 26, PEG-ADA injections were discontinued to allow adenosine levels to build up and pulmonary phenotypes to develop in the ADA-/- animals. Bronchial secretions and blood plasma were collected from the ADA-/- and ADA+/- animals (3 animals per group per time point) on postnatal days 26, 30, 34, 38, and 42. These data represent the responses to enzyme withdrawal in the transgenic mice and the relevant controls at each time point.

1.2. Mouse plasma and BALF sample collection and preparation

The mouse plasma samples were depleted of their seven most abundant proteins (serum albumin, IgG, fibrinogen, α1-antitrypsin, transferrin, haptoglobin and IgM) using a Seppro mouse IgY7 LC10 column (Genway Biotech, San Diego, CA) following the manufacturer's protocols. Depleted proteins were precipitated with 10% trichloroacetic acid (TCA) and subsequently denatured by the addition of urea to 8 M, thiourea to 2 M, and dithiothreitol (DTT) to 5 mM, followed by heating to 60 °C for 30 min. The samples were then diluted fourfold with 50 mM ammonium bicarbonate, and calcium chloride was added to 1 mM.

BALF samples were collected from the ADA+/- and ADA-/- mice as described previously [1]. To concentrate the proteins in the BALF samples prior to trypsin digestion, ice-cold TCA was added to the samples to a final concentration of 10%. All samples were incubated at 4 °C overnight and then centrifuged at 14,000 RCF for 5 min. The pellet was washed once with cold acetone and allowed to dry at room temperature for 5 min. The protein pellet was resuspended in 25 μL of denaturing buffer (100 mM ammonium bicarbonate, 8 M urea, 2 M thiourea, and 5 mM DTT) and heated to 60 °C for 30 min. Following denaturation, the samples (plasma and BALF) were diluted fourfold with 50 mM ammonium bicarbonate, pH 7.8, and calcium chloride was added to 1 mM. All samples were digested with methylated, sequencing-grade trypsin (Promega, Madison, WI) at a substrate-to-enzyme ratio of 50:1 (mass:mass) and incubated at 37 °C for 15 hours. Samples were cleaned up using a 1-mL SPE C18 column (Supelco, Bellefonte, PA). The peptides were eluted from each column with 1 mL of methanol and concentrated via SpeedVac. The samples were reconstituted to 1 μg/μL with 25 mM ammonium bicarbonate and frozen at -20 °C until analyzed. The minimum requirement for these experiments was 65 μL of mouse plasma or BALF yielding at least 200 μg of protein following immunoaffinity depletion of the most abundant plasma proteins.
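The dilution and digestion quantities in this subsection follow from simple arithmetic. The sketch below is only illustrative: the function names are hypothetical, and it assumes ideal mixing and ignores the small volume contributed by the calcium chloride addition.

```python
def dilute(concentration, fold):
    """Concentration of a solute after an n-fold dilution (ideal mixing assumed)."""
    return concentration / fold


def trypsin_mass_ug(protein_ug, substrate_to_enzyme=50.0):
    """Mass of trypsin (ug) needed for a given substrate-to-enzyme mass ratio."""
    return protein_ug / substrate_to_enzyme


# Denaturant concentrations stated in the text, before the fourfold dilution.
denaturing_buffer = {"urea_M": 8.0, "thiourea_M": 2.0, "DTT_mM": 5.0}
after_dilution = {name: dilute(c, 4) for name, c in denaturing_buffer.items()}

# Trypsin required for the stated minimum of 200 ug protein at 50:1 (mass:mass).
trypsin_for_minimum_sample = trypsin_mass_ug(200.0)
```

Under these assumptions, the fourfold dilution brings urea from 8 M to 2 M before trypsin is added, and a 200 μg sample requires 4 μg of trypsin.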

1.3. Human plasma samples

A subset of plasma samples originated from representative participants in a large cohort (n=467) from the Genetics of Addiction program (University of Utah Medical School). All subjects were recruited and samples were collected under institutional review board-approved protocols at the University of Utah. All applicable federal and state regulations were complied with, and informed consent was obtained from each subject before the study began. These protocols were reviewed by the Institutional Review Board of the Pacific Northwest National Laboratory before transfer and analysis of the samples. Selected plasma samples were from current smokers or never smokers with low body mass index (BMI) values (< 25). Never smokers were subjects who had smoked less than one cigarette in their lifetime. The two groups analyzed were pooled plasma from 7 low BMI never smokers and pooled plasma from 7 low BMI smokers with COPD. Additional details regarding study participants have been described previously [4].

1.4. Human plasma depletion and protein digestion

The individual human plasma samples in each group were pooled for protein digestion. The 12 high-abundance proteins in the plasma samples were first separated using a ProteomeLab™ 12.7 × 79.0-mm IgY12 LC10 affinity LC column (Beckman Coulter, Fullerton, CA) with a column capacity of 250 μL of plasma on an Agilent 1100 series HPLC system. The protein samples from IgY12 bound fractions were denatured and reduced in 50 mM NH4HCO3 buffer, pH 8.0, 8 M urea, 10 mM DTT for 1 h at 37 °C. The resulting protein mixture was diluted 6-fold with 50 mM NH4HCO3, pH 8.0, before sequencing-grade modified trypsin (Promega, Madison, WI) was added at a trypsin:protein ratio of 1:50 (w/w). The sample was incubated at 37 °C for 3 h. The tryptically digested sample was then loaded onto a 1-mL SPE C18 column (Supelco, Bellefonte, PA) and washed with 4 mL of 0.1% TFA, 5% acetonitrile. Peptides were eluted from the SPE column with 1 mL of 0.1% TFA, 80% acetonitrile and lyophilized. Final peptide concentration was determined by BCA protein assay (Pierce). Peptide samples were stored at -80 °C until further analysis.

1.5. Strong cation exchange (SCX) fractionation for the accurate mass and time (AMT) databases

An AMT database [5, 6] was established for each sample type using the corresponding pooled samples: two tryptic peptide pools from plasma samples of the ADA+/- and ADA-/- mice at the early (days 26, 30 and 34) and late (days 38 and 42) time points in disease progression, two from the mouse BALF samples at the early and late time points, one from the low BMI never smokers, and one from the low BMI smokers. The pooled peptide samples were fractionated by SCX-high performance liquid chromatography (HPLC) as described previously [5, 7]. Briefly, the peptides were resuspended in mobile phase A, and 900 µL were injected onto a Polysulfoethyl A column (200 x 2.1 mm, 5 µm, 300 Å; PolyLC, Inc., Columbia, MD) and separated using an Agilent 1100 HPLC system (Agilent, Palo Alto, CA). The autosampler and automated fraction collector were cooled to 4 °C using Peltier coolers. The mobile phases consisted of 10 mM ammonium formate, 25% acetonitrile, pH 3.0 (mobile phase A) and 500 mM ammonium formate, 25% acetonitrile, pH 6.8 (mobile phase B). Mobile phase A was maintained at 100% for the first 10 min; mobile phase B was then increased from 0 to 50% over the next 40 min and from 50 to 100% over the following 10 min, and was held at 100% for a final 10 min. A flow rate of 0.2 mL/min was maintained throughout the gradient. Spectra were obtained at 280 nm. A total of 24 - 26 fractions were collected, lyophilized and stored at -80 °C prior to the reversed-phase LC-tandem mass spectrometry (MS/MS) analyses.
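The SCX gradient program described above can be written as a piecewise function of time. This is a minimal sketch, assuming that each ramp is linear (the text gives only the endpoints) and that the two mobile phases mix ideally; the function names are hypothetical.

```python
# Breakpoints (time in min, % mobile phase B) taken from the gradient in the text.
GRADIENT = [(0.0, 0.0), (10.0, 0.0), (50.0, 50.0), (60.0, 100.0), (70.0, 100.0)]


def percent_b(t_min):
    """Percentage of mobile phase B at time t_min, assuming linear ramps."""
    if t_min < GRADIENT[0][0] or t_min > GRADIENT[-1][0]:
        raise ValueError("time is outside the 0-70 min program")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


def ammonium_formate_mM(t_min, a_mM=10.0, b_mM=500.0):
    """Ammonium formate concentration delivered to the column at time t_min."""
    frac_b = percent_b(t_min) / 100.0
    return a_mM * (1.0 - frac_b) + b_mM * frac_b


# Total solvent volume over the 70 min program at 0.2 mL/min.
total_volume_mL = GRADIENT[-1][0] * 0.2
```

Under these assumptions the program delivers 25% B at 30 min and 75% B at 55 min, and consumes about 14 mL of mobile phase per run.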

1.6. Reversed-phase capillary LC-MS analyses for AMT databases

Peptide samples obtained from the individual SCX fractions were analyzed using an automated, in-house-designed, high-resolution reversed-phase capillary LC system [8]. This LC system was interfaced to an LTQ ion trap mass spectrometer (Thermo Scientific, San Jose, CA) with electrospray ionization (ESI). The mass spectrometer was operated in a data-dependent MS/MS mode over a full m/z range (400–2000) and a series of seven smaller segmented m/z ranges (400–700, 700–900, 900–1100, 1100–1300, 1300–1500, 1500–1700, and 1700–2000) for each sample. For each cycle, the ten most abundant ions from each LC-MS scan were selected for MS/MS analysis using a collision energy of 35%.

1.7. Generation of peptide AMT tag databases

The resulting MS/MS measurements from the previous step were used to construct a peptide AMT database for each sample [6]. The raw data from the LC-MS/MS analyses were converted into .dta files using in-house software, DeconMSn (version v2.1.4.1), which accurately calculates the parent monoisotopic mass for each spectrum from the parent isotopic distribution using a modified THRASH algorithm [9]. For the mouse samples, the MS-Generating Function software [10] was used to search the MS/MS spectral data against the mouse UniProt FASTA file containing 16,383 proteins. Porcine trypsin was added to the database as an expected contaminant. For the human samples, the MS/MS data were searched against the human International Protein Index (IPI) database, containing a total of 75,419 protein entries, using X!Tandem 2 software with the reversed-sequence decoy database searching option (for assessing the false positive rate). To reduce protein mapping redundancy, all peptide sequences were subsequently mapped to protein entries in the Human UniProt database. No cleavage specificity was defined in the database searching. Peptide identifications in the AMT database were further refined by controlling the spectral FDR < 1% [5].
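The decoy-based control of the false discovery rate mentioned above can be illustrated with a generic target-decoy calculation. This is a sketch under stated assumptions, not the procedure of reference [5]: it assumes each spectrum match carries a single score where higher is better, and it estimates the FDR at a cutoff as the number of decoy matches divided by the number of target matches. The function name and input layout are hypothetical.

```python
def fdr_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is below max_fdr.

    psms: list of (score, is_decoy) pairs; higher score means a better match.
    The FDR at a cutoff is estimated as (#decoys) / (#targets) among all
    matches scoring at or above that cutoff. Returns None if no cutoff passes.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Record the cutoff whenever the running estimate satisfies the limit.
        if targets and decoys / targets < max_fdr:
            best_cutoff = score
    return best_cutoff
```

In practice the score would be derived from the search engine output (for example, a transformed expectation value), and the choice of estimator is a design decision; the simple decoy-to-target ratio is used here only because it is the most common form.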

1.8. Reversed-phase capillary LC-LTQ-Orbitrap analyses for individual samples

Peptide samples from plasma and BALF (from Supplemental Methods 1.2) obtained on postnatal days 26, 30, 34, 38 and 42 were individually analyzed using an LTQ-Orbitrap™ mass spectrometer (Thermo Scientific, San Jose, CA) coupled using an in-house-manufactured ESI interface. The reversed-phase capillary column was prepared by slurry packing 3-μm Jupiter C18 bonded particles (Phenomenex, Torrance, CA) into a 65-cm-long, 75-μm-inner diameter fused silica capillary (Polymicro Technologies, Phoenix, AZ). The mobile phases consisted of 0.1% formic acid in water (solvent A) and 0.1% formic acid in acetonitrile (solvent B). After loading 5 µg (1 μg/μL) of peptides onto the column, the mobile phase was held at 100% solvent A for 50 min. Exponential gradient elution was performed by increasing the mobile phase composition from 0 to 55% solvent B over 100 min. Orbitrap™ spectra were collected from 400-2000 m/z at a resolution of 100 k [11]. The ten most abundant ions from the MS analysis were selected for MS/MS analysis using a normalized collision energy setting of 35%. A dynamic exclusion of 1 min was used to avoid repetitive analysis of the same abundant precursor ion. All samples were analyzed in triplicate. The heated capillary temperature and spray voltage were maintained at 200 °C and 2.2 kV, respectively. For the human samples, slightly different mobile phases were used: mobile phase A consisted of 0.2% acetic acid and 0.05% TFA in water, and mobile phase B was 0.1% TFA in 90% acetonitrile. The final proportion of solvent B in the gradient was increased to 60% for the human samples. All other procedures were identical for the mouse and human samples.
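The data-dependent acquisition logic described above (ten most abundant ions per MS scan, with a 1 min dynamic exclusion) can be sketched as follows. This is a simplified illustration rather than the instrument's actual control software: the m/z matching tolerance `mz_tol` and the input layout are assumptions that do not appear in the text.

```python
def select_precursors(scans, top_n=10, exclusion_s=60.0, mz_tol=0.01):
    """Pick up to top_n most intense ions per scan, skipping recently chosen m/z.

    scans: list of (time_s, [(mz, intensity), ...]) in acquisition order.
    Returns a list of (time_s, [mz, ...]) with the ions selected for MS/MS.
    mz_tol (assumed value) is the window used to decide that two m/z values
    correspond to the same precursor.
    """
    excluded = []  # (mz, expiry_time_s) entries on the dynamic exclusion list
    selections = []
    for time_s, peaks in scans:
        # Drop exclusion entries whose window has elapsed.
        excluded = [(mz, exp) for mz, exp in excluded if exp > time_s]
        chosen = []
        for mz, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
            if len(chosen) == top_n:
                break
            if any(abs(mz - ex_mz) <= mz_tol for ex_mz, _ in excluded):
                continue
            chosen.append(mz)
            excluded.append((mz, time_s + exclusion_s))
        selections.append((time_s, chosen))
    return selections
```

With this logic an abundant precursor is fragmented once and then ignored for 60 s, which frees MS/MS cycles for less abundant co-eluting ions.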

1.9. LC-LTQ-Orbitrap data analysis

The Orbitrap spectra were analyzed using the AMT tag approach [12, 13]. Briefly, high-resolution LC-MS features were deconvoluted using Decon2LS (version 1.0.2, using default parameters) and aligned to the AMT tag database in VIPER (version 3.45, using default parameters) using the theoretical mass and observed normalized elution time of each peptide [14]. This approach to proteomics research is enabled by a number of published [14-18] and unpublished tools developed in-house that are freely available for download. The peptide alignment scores were filtered to control the FDR < 10% and to require a uniqueness probability score > 0.5. A minimum of two unique peptides was required per protein identification. The peak intensity values (i.e., abundances) were available for the final identified peptides.
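The filtering criteria stated above can be expressed as a short sketch. The record layout is hypothetical: it assumes that each peptide match already carries an FDR-like value and a uniqueness probability exported from the alignment step, which is a simplification of how VIPER reports its scores. Keeping the largest abundance for a repeated peptide is also an illustrative choice, not something stated in the text.

```python
from collections import defaultdict


def filter_proteins(matches, max_fdr=0.10, min_uniqueness=0.5, min_peptides=2):
    """Apply the peptide-level and protein-level filters described in the text.

    matches: iterable of dicts with the (hypothetical) keys
             'protein', 'peptide', 'fdr', 'uniqueness', 'abundance'.
    Returns {protein: {peptide: abundance}} for proteins that retain at least
    min_peptides unique peptides after peptide-level filtering.
    """
    by_protein = defaultdict(dict)
    for m in matches:
        if m["fdr"] < max_fdr and m["uniqueness"] > min_uniqueness:
            peptides = by_protein[m["protein"]]
            # Illustrative choice: keep the largest abundance for a repeated peptide.
            previous = peptides.get(m["peptide"], 0.0)
            peptides[m["peptide"]] = max(previous, m["abundance"])
    return {p: peps for p, peps in by_protein.items() if len(peps) >= min_peptides}
```

A protein supported by a single passing peptide is discarded even if that peptide is a confident match, which is the intent of the two-peptide rule.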

2. Brief descriptions and rationales of the framework

2.1. Data reduction

Disease marker identification often starts with a list of genes, proteins, or metabolites that are differentially expressed in diseased conditions relative to their controls [19, 20]. Although some rules of thumb are available for different types of experimental measurements, determining a list of differentially expressed features is frequently constrained by various biological and technical limitations [21, 22]. Some case-specific effort is therefore inevitable in the majority of disease marker studies in order to properly address these limitations, and the exact approaches implemented in the data reduction component are often decided by the scientists who designed or conducted the studies.

2.2. Distance-based hierarchical clustering

Clustering is a common approach in knowledge discovery. It aims to group data in such a way that patterns in the same group are more similar to each other than to those in other groups [23]. The list of features differentially expressed between conditions, determined in the previous step, is hierarchically clustered into several subsets based on a specified distance criterion. The distance criterion proposed here is based on dissimilarity and can be derived from feature expression profiles, from functional annotations of the features, or from a combination of both, which can be regarded as an integration of data-driven and knowledge-driven information. Our speculation is that this integrated distance may help group the features, such as genes or proteins, into several clusters that, ideally, contain mutually orthogonal information. If this is feasible, the individual subsets should contain less noise than the entire data set, and the marker candidates selected within individual subsets could be more robust than those extracted from the full data set.
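One possible form of such an integrated distance, followed by hierarchical clustering, is sketched below. The specific choices are assumptions made for illustration and are not taken from the study: correlation distance for expression profiles, Jaccard distance for annotation term sets, a linear weight between the two, and average linkage. All names are hypothetical.

```python
import math


def correlation_distance(x, y):
    """1 - Pearson correlation (range 0 to 2); 1.0 if a profile is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0
    return 1.0 - cov / (sx * sy)


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of annotation terms."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def integrated_distance(f, g, weight=0.5):
    """Weighted mix of data-driven and knowledge-driven dissimilarity (0 to 1)."""
    d_expr = correlation_distance(f["profile"], g["profile"]) / 2.0
    d_annot = jaccard_distance(f["terms"], g["terms"])
    return weight * d_expr + (1.0 - weight) * d_annot


def hierarchical_clusters(features, n_clusters, weight=0.5):
    """Average-linkage agglomerative clustering down to n_clusters groups."""
    names = list(features)
    dist = {(a, b): integrated_distance(features[a], features[b], weight)
            for a in names for b in names if a != b}
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Setting `weight` to 1 gives a purely data-driven clustering and setting it to 0 a purely annotation-driven one; intermediate values correspond to the integrated case discussed above.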

2.3. Expert knowledge-driven disease-model-related functional selection

In addition to the semi-automated clustering approach, we also include an expert-knowledge-driven, disease-model-related functional selection in the pipeline. This approach identifies the biological processes that contain significantly changed proteins and that can therefore potentially be important for the disease of interest. This selection may also serve as a means to validate the results from the distance-based clustering approach.

2.4. Bayesian integration and classification

Bayesian fusion analyses are implemented to capture and integrate the information derived from the individual subsets in order to determine which subsets provide the best performance. The performance of individual clusters or sets of clusters is measured numerically by CA, our defined evaluation metric. CA is a flexible measurement that can be used not only in studies with binary responses, such as diseased vs. healthy, but also in studies with multi-categorical variables, such as cases with more than two disease stages.
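The general idea of Bayesian fusion across clusters can be illustrated with a minimal sketch. It assumes, purely for illustration, that each cluster contributes a strictly positive likelihood for every class and that the clusters are conditionally independent given the class; the actual fusion scheme of the framework may differ. The `accuracy` helper is a simple stand-in for a performance score and is not the CA metric, which is not defined in this section.

```python
import math


def fuse_posteriors(prior, likelihoods_per_cluster):
    """Combine per-cluster class likelihoods into one posterior.

    prior: {class_label: P(class)}
    likelihoods_per_cluster: list of {class_label: P(cluster data | class)},
        one dict per cluster; all values must be > 0.
    Assumes clusters are conditionally independent given the class.
    """
    log_post = {c: math.log(p) for c, p in prior.items()}
    for lik in likelihoods_per_cluster:
        for c in log_post:
            log_post[c] += math.log(lik[c])
    top = max(log_post.values())  # subtract the maximum for numerical stability
    unnorm = {c: math.exp(v - top) for c, v in log_post.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


def accuracy(true_labels, predicted_labels):
    """Fraction of samples assigned to the correct class (binary or multi-class)."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)
```

Because the fusion works on a dictionary of class labels, the same code handles two classes (diseased vs. healthy) and more than two (several disease stages) without modification.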

2.5. Selection of biomarker candidates and validation

Marker candidates are extracted at the cluster and individual levels, and their validation on an independent human sample data set is highly desirable whenever possible. Note that validation can also be performed at the methodological level, i.e., by evaluating a specific approach for biomarker identification, in addition to a detailed assessment of a list of biomarker candidates.

3. Discussion on COPD data sets used in the current study

3.1. Biological significance of selected individual protein candidates

In the demonstration data set, it is striking that the four biomarker candidate proteins from the COPD-related functional selection convey as much information about the presence of COPD-like lung destruction as the longer lists of proteins identified solely by the clustering approach. However, none of the four candidates from the expert-driven functional selection is specific to lung function; instead, they reflect biological processes that are more indicative of the generalized tissue destruction seen in COPD. Interestingly, all of them have been reported to show statistical linkage with COPD and/or other lung diseases. Specifically, prothrombin (THRB) and complement C3 (CO3) would both naturally be increased during wound healing and inflammation. The former is cleaved during the clotting process to produce thrombin, which converts fibrinogen to fibrin [24], and the latter plays a central role in the activation of the complement system [25]. Vitamin D binding protein (VTDB) has been shown to influence respiratory function both by determining vitamin D bioavailability and by direct effects on innate cell function. An emerging hypothesis suggests that VTDB may have a direct role in the pathogenesis of COPD [26, 27] as well as an indirect role in macrophage activation in the airway as part of the innate immune response [28]. Evidence of an association between VTDB and COPD was reported by the Metcalf and Robbins groups in the early 1990s [29, 30]. The last marker candidate, adiponectin (ADIPO), is a unique adipokine with multiple salutary effects, such as antiapoptotic, anti-inflammatory, and anti-oxidative activities, in many organs and cells [31]. Recent studies have suggested, though inconclusively, that adiponectin plays a role in signaling activity in the lung and can be associated with inflammatory pulmonary diseases such as COPD and asthma. Novel cross talk between lung and adipose tissues is currently under investigation [32].

3.2. Biomarker feasibility

Granted, many distinctions exist between mice and humans in respiratory physiology and anatomy as well as in the innate and adaptive immune responses. However, the biological pathways shared between the two provide some justification for using the ADA-deficient mouse model to study human COPD.

3.3. Some interesting points associated with the demonstration data sets

The time course information available from the different types of sample materials provides several valuable insights not only into COPD but also into biomarker identification schemes in general.

First, the selection of appropriate specimens in which the biomarkers will be measured is an essential issue. Ideally, the selected sample materials should be: 1) easily accessed from patients, e.g., saliva, urine, plasma or serum, 2) reliably measured in routine clinical settings, and 3) able to provide accurate information that distinguishes the disease state in patients. BALF is an example of a specimen that is proximal to the disease site yet inconvenient to collect. Its location potentially enables it to contain more concentrated disease-related biomolecules and thus to provide more direct biological and pathological information. In contrast, in easily accessible sample materials that are distal to the disease site, such as plasma, the disease-related biological information carried by the biomolecules can be diluted during transport from the disease site and is also more likely to be modulated by morbidities other than the specific disease of interest. Although the first type of sample may provide more accurate biological information, the easy accessibility of the second type of sample, such as plasma, serum and urine, and the lower cost associated with its collection are also practical issues that cannot be overlooked in clinical applications.