
Probabilistic Models in Proteomics Research

Maury DePalo

RBIF-103: Probability and Statistics

Brandeis University

November 22, 2004

Preface

This paper is one outcome of a group project completed for the above course. Our group was composed of three members: Patrick Cody, Barry Coflan and myself. We three collaborated on the identification and development of an appropriate topic; on the basic research of the subject matter; and on the preparation and delivery of the class presentation. For both the research and the presentation, we each primarily concentrated on a single aspect of the overall topic, and we have each prepared and submitted an individual paper with the knowledge that our separate papers fit together as part of a larger whole that covers the overall topic more completely. This synergy is evident in the class presentation and in the associated materials.

This paper focuses on the development and performance of a probabilistic model being used to identify proteins in a complex mixture. At the start of this project, we expected that we would research and present different methods and different metrics used for protein identification. As our individual research proceeded, it became apparent that proteomics researchers who were moving beyond the individual search scores and metrics, and integrating these simpler scores into a more comprehensive probability-based model, were having very good success. In my own research, it became evident that a particular research group at the Institute for Systems Biology was having success with what I describe in this paper as a “two-level model”, one that considers both peptide assignments to MS spectra and the consequent peptide-based evidence for the underlying proteins [13][19]. Furthermore, this model was shown to perform better than the individual scores and metrics. Consequently, the primary focus of this paper is on summarizing the rationale for, and the successive development and refinement of, this two-level model by these researchers, and on its performance in estimating the probabilities of proteins being present in a complex mixture.

Introduction

The primary goals of proteomics research are to separate, identify, catalog and quantify the proteins and protein complexes present in a mixed sample, as representative of changes in the metabolic or signaling state of cells under a variety of experimental conditions. Researchers hope to characterize specific changes in protein levels as disease or diagnostic markers for a variety of complex diseases. Numerous studies have been performed to this end, using a variety of experimental and analytical techniques [9][11][12].

The process of preparing protein samples for measurement and analysis, and the subsequent interpretation of the data resulting from these studies, are complex and subject to substantial variability from a number of sources. Each variable introduces an additional dimension of uncertainty in the conclusions that can be drawn from such experiments. Researchers are using a variety of analytical techniques to reduce or eliminate the uncertainty inherent in these methods. This paper examines the application of a particular set of Bayesian-inspired probabilistic models through which one group of proteomics researchers is making notable progress toward a clearer understanding of the sensitivities and specificities underlying the effective measurement of protein expression patterns.

Measuring Mixed Protein Samples

Tandem mass spectrometry (MS/MS) is becoming the method of choice for determining the individual protein components within a complex mixture [1][10][17]. The proteins in a mixed sample are first digested using a proteolytic enzyme, such as trypsin, resulting in a set of shorter peptides. The peptides are subjected to reversed-phase chromatography or some other separation technique, and are run through one of a number of different types of mass spectrometer. The mass spectrometer ionizes and then fragments the peptides to produce characteristic spectra that can be used for identification purposes. The collected MS/MS spectra are usually then searched against a protein sequence database to find the best-matching peptide in the database. The matched peptides, i.e. those assigned to the generated spectra during searching, are then used to infer the set of proteins in the original sample.

The Process

Although conceptually straightforward, the process of obtaining the protein mixture involves a number of variables that can lead to significant variation in the results. First, the extraction process used to acquire the protein sample from the tissue or fluid under study must be reproduced precisely, using the exact sequence of centrifugation, fractionation, dissolution and extraction techniques. Once the sample is obtained, the proteolytic enzyme must be chosen carefully, since each enzyme attacks the proteins at specific amino acid junctures, with different efficiencies, leading to different collections of peptides depending upon the preponderance of those specific amino acids in the sample and the number of missed cleavages. The separation technology, such as gel electrophoresis or any of dozens of types of chromatography, must be performed consistently. Finally, the specific type of mass spectrometry equipment must be chosen and operated consistently. In this paper we are mostly concerned with MS/MS, in which selected peptides are further fragmented and computationally reconstructed to provide greater resolution of the exact sequence composition of the peptides. Each of these steps introduces the potential for variability in the resulting peptide population, in terms of composition, concentration, accuracy and various other factors.

The end result of the spectrometry stage is a set of MS/MS spectra that presumably correspond to some subset of the individual peptides that comprised the proteins in the original sample.

Searching the Protein Database

Once the spectra are obtained, each spectrum is searched against a reference database of proteins and their corresponding spectra and/or sequences. Most search algorithms begin by comparing each spectrum against those predicted for peptides from the reference database whose masses fall within an acceptable error tolerance of the precursor ion mass. Each spectrum is then assigned a peptide from the database, along with a score that reflects various aspects of the match between the spectrum and the identified peptide. The scores are often based on the number of common fragment ion masses between the spectrum and the peptide (often expressed as a correlation coefficient), but also reflect additional information pertaining to the match, and can be used to help discriminate between correct and incorrect peptide assignments [3][4][7][8][16][18][20][21].
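To make the flavor of such scoring concrete, the following is a minimal sketch, in Python, of a naive shared-peak count between an observed spectrum and a predicted one. It is not the scoring function of any particular search engine, and the tolerance value and fragment mass lists are illustrative assumptions only.

def shared_peak_count(observed_mz, predicted_mz, tolerance=0.5):
    """Count the predicted fragment masses that match some observed
    peak within a fixed m/z tolerance (a deliberately simple score)."""
    matches = 0
    for p in predicted_mz:
        if any(abs(p - o) <= tolerance for o in observed_mz):
            matches += 1
    return matches

# Illustrative use, with made-up fragment mass (m/z) lists:
observed = [175.1, 262.1, 333.2, 404.2, 532.3]
predicted = [175.1, 262.2, 333.2, 475.3, 532.3]
print(shared_peak_count(observed, predicted))  # 4 of 5 predicted peaks matched

Real search scores, such as the correlation-based scores described above, are considerably more sophisticated, but they build on this same notion of agreement between observed and predicted fragment masses.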

Improving on the Search Results

In an effort to substantiate or further increase the level of confidence associated with the search scores returned with the peptide assignments, researchers have applied various additional criteria to evaluate the search results. Properties of the assigned peptides, such as the number of tryptic termini or the charge state(s), are often used to filter the search results to try to improve their accuracy. Often the search results must be verified by an expert, but this is a time-consuming process that is generally not feasible for larger datasets. And these approaches do little to reduce the inherent variability in the process. Furthermore, the numbers of false negatives (correct identifications that are rejected) and false positives (incorrect identifications that are accepted) that result from the application of such filters are generally not well understood. This is complicated by the fact that different researchers use different filtering criteria, making it difficult to compare results across studies.

Applying Probabilistic Models

The bulk of this paper focuses on a pair of statistical models developed by proteomics researchers at the Institute for Systems Biology (ISB), described in detail in [13] and [19]. The inherent two-step process used to first decompose proteins into peptides, and then peptides into spectra, is reflected in a two-level model used to reconstruct the composition of the original protein mixture. The peptide-level model estimates the probabilities of the peptide assignments to the spectra. The protein-level model uses the peptide probabilities to estimate the probabilities of proteins in the mixture. Using these models in tandem has been shown to provide very good discrimination between correct and incorrect assignments, and leads to predictable sensitivity (true positive rates) and specificity (false positive error rates). We now examine these models individually in greater detail.

The Peptide-Level Model

The next several sections describe how the peptide-level model was derived and refined by successive execution against the available experimental data. Of particular note is the manner in which the model was repeatedly extended, successively introducing additional information known about potential target proteins, and the processes and technologies being used to identify them, into the model [13].

Inferring Peptides from Mass Spectra

The first step toward identifying the proteins in the sample is to identify the peptides represented by the individual MS/MS spectra observed from the sample. As described above, this consists of matching the individual observed spectra against a reference database of proteins and the spectra that correspond to the peptides that comprise each protein. The spectra and the peptides recorded in the database are either actual spectra and peptides observed in previous experiments with the corresponding protein, or spectra and peptides that are predicted for the protein on the basis of computationally anticipated proteolytic activity on that protein. Since the search-and-match process is not exact, a degree of uncertainty exists in any results returned by the database. Most searching and matching algorithms in use today return one or more scores to aid in assessing the accuracy of the matched peptides returned from the database [3][18]. For example, the SEQUEST search algorithm returns a number of individual scores (described further below), each representing an assessment by the matching software of the quality of the match between the experimental spectra and the reference spectra.

The challenge facing researchers is how to evaluate and interpret the scores returned by these search algorithms and databases, and how to use them systematically and consistently to reach a conclusion about the presence of the identified peptides in the original sample.

Researchers at ISB describe a peptide-level statistical model that estimates the accuracy of these peptide assignments to the observed spectra. The model uses a machine-learning algorithm to distinguish between correct and incorrect peptide assignments, and computes probabilities that the individual peptide assignments are correct using the various matching scores and other known characteristics about proteins in general and the individual proteins in the mixture.

Experimental Conditions and Datasets

To begin, the authors generated a number of individual datasets from various control samples of known, purified proteins at various concentrations. Each sample was run through the process summarized above, subjecting each sample to proteolytic cleavage by trypsin and subsequent decomposition by ESI (electrospray ionization)-MS/MS. They also generated a training dataset of peptide assignments of known validity by searching the spectra with SEQUEST against a peptide database appended with the sequences of the known control proteins. This allowed them to observe the behavior of the SEQUEST searching and matching algorithms against a known sample of peptides, in the context of a larger database of peptides known to be incorrect with respect to the sample proteins. This was done with both a Drosophila database and a human database. Each of the resulting spectrum matches was reviewed manually by an expert to determine whether it was correct. The result was a set of approximately 1600 peptide assignments of [M + 2H]2+ ions and 1000 peptide assignments of [M + 3H]3+ ions determined to be correct for each of the species databases.

Interpreting the Search Scores

The authors recognized that in order to be useful beyond their initial value, the individual scores returned by the matching algorithm needed to be combined in some manner. Using Bayes’ Law [2][14][15][22][23], they reasoned that the probability that a particular peptide assignment with a given set of search scores (x1, x2, … xS) is correct (+) could be computed as:

[Eq 1] p(+ | x1, x2, … xS) = p(x1, x2, … xS | +) p(+) / ( p(x1, x2, … xS | +) p(+) + p(x1, x2, … xS | –) p(–) )

where p(x1, x2, … xS | +) and p(x1, x2, … xS | –) represent the probabilities that the search scores (x1, x2, … xS) are found among correctly (+) and incorrectly (–) assigned peptides, respectively, and the prior probabilities p(+) and p(–) represent the overall proportion of correct and incorrect peptide assignments represented in the dataset, determined through the prior analysis using the control samples and searches. These latter values can be considered an indication of the quality of the dataset.
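As a worked illustration of [Eq 1] in the simplest case of a single score x, suppose (with purely hypothetical numbers) that 30% of the assignments in a dataset are correct, and that the observed score value is five times as likely among correct assignments as among incorrect ones. The following Python snippet carries out the calculation:

# Hypothetical values for illustration only; in the real model these
# quantities are estimated from the training data.
p_correct = 0.30               # prior p(+): proportion of correct assignments
p_incorrect = 1.0 - p_correct  # prior p(-)
p_x_given_correct = 0.05       # p(x | +): likelihood of the score if correct
p_x_given_incorrect = 0.01     # p(x | -): likelihood of the score if incorrect

# Bayes' Law, as in [Eq 1]:
posterior = (p_x_given_correct * p_correct) / (
    p_x_given_correct * p_correct
    + p_x_given_incorrect * p_incorrect
)
print(round(posterior, 3))  # 0.682: about a 68% probability of being correct

Note how the prior tempers the evidence: even though the score is five times more likely under a correct assignment, the posterior remains well below 5/6 because incorrect assignments dominate the dataset.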

Rather than attempting the complex process of computing these probabilities using a joint probability distribution for the several scores (x1, x2, … xS), the authors employed discriminant function analysis [5] to combine the individual search scores into a single discriminant score devised to separate the training data into two groups: correct and incorrect peptide assignments. The discriminant score, F, is a weighted combination of the database search scores, computed as:

[Eq 2] F(x1, x2, … xS) = c0 + Sum Of( ci xi )

where c0 is a constant determined through experimentation, and the weights, ci, are derived to maximize the ratio of between-class variation to within-class variation, in this case to maximize the distinction between the correct and incorrect peptide assignments. The function is derived using the training datasets with known peptide assignment validity. Once derived, the discriminant score can be substituted as a single combined value back into [Eq 1] in place of the original individual search scores, and the resulting probabilities computed as follows:

[Eq 3] p(+ | F) = p(F | +) p(+) / ( p(F | +) p(+) + p(F | –) p(–) )

where p(+ | F) is the probability that the peptide assignment with discriminant score, F, is correct, and p(F | +) and p(F | –) are the probabilities of observing F under the discriminant score distributions of correct and incorrect peptide assignments, respectively. The authors show that the resulting probabilities retain much of the discriminating power of the original combination of scores, but offer a simpler calculation than the joint distributions required in [Eq 1].

Using this discriminant scoring approach against a variety of search scores returned from the SEQUEST algorithm, four specific SEQUEST scores were found to contribute significantly to effective discrimination: 1) Xcorr, a cross-correlation measure based on the number of peaks of common mass between observed and predicted spectra; 2) Delta Cn, the relative difference between the first and second highest Xcorr score for all peptides queried from the database; 3) SpRank, a measure of how well the assigned peptide scored, relative to those of similar mass in the database; and 4) dM, the absolute value of the difference in mass between the precursor ion of the spectrum and the assigned peptide.
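To illustrate how such a discriminant function can be derived, the sketch below fits Fisher's linear discriminant to labeled training data using scikit-learn. The synthetic score values are placeholders standing in for the four SEQUEST scores named above, so this shows the general technique rather than the authors' exact procedure or parameters.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: one row per peptide assignment,
# with columns (Xcorr, Delta Cn, SpRank, dM); correct assignments are drawn
# with higher Xcorr/Delta Cn and better (lower) SpRank, as observed in practice.
correct = rng.normal([3.0, 0.20, 5.0, 0.5], [0.6, 0.05, 3.0, 0.3], size=(200, 4))
incorrect = rng.normal([1.5, 0.05, 50.0, 1.0], [0.6, 0.03, 30.0, 0.6], size=(800, 4))
X = np.vstack([correct, incorrect])
y = np.array([1] * 200 + [0] * 800)  # 1 = correct assignment, 0 = incorrect

# Fit the discriminant; lda.intercept_ and lda.coef_ play the roles of the
# constant c0 and the weights ci in [Eq 2].
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
F = X @ lda.coef_.ravel() + lda.intercept_[0]  # discriminant score F per assignment

Each assignment is thereby reduced to the single score F, whose distributions among correct and incorrect assignments drive the probability calculation in [Eq 3].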

They further discovered that transformation of some (raw) search scores significantly improved the discrimination power of this approach. For example, Xcorr shows a strong dependence on the length of the assigned peptides. This is because Xcorr reflects the number of matches identified between ion fragments in the observed and predicted spectra, leading to larger values for assignments of longer peptides with more fragment ions than for assignments of shorter peptides with fewer fragments. Consequently, assignments of shorter peptides can be difficult to classify as correct or incorrect, since even the correct assignments will often result in low Xcorr scores. They found that this length dependence could be reduced by transforming Xcorr to Xcorr’, which was computed as the ratio of the log of Xcorr to the log of the number of fragments predicted for the peptide, using a two-part function that included a threshold for the length of the peptide. It was found that beyond a certain length threshold, Xcorr was largely independent of peptide length, so this factor was used in the calculation [13].
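A sketch of such a transformation is shown below. The functional form follows the description above (the log of the raw score scaled by the log of the predicted fragment count, with a cap beyond which length no longer matters), but the specific cap value is an illustrative assumption, not the published parameter.

import math

def xcorr_prime(xcorr, n_fragments, max_fragments=50):
    """Length-corrected Xcorr: ln(Xcorr) scaled by the log of the number
    of predicted fragment ions, capped at a threshold beyond which Xcorr
    is roughly length-independent. The cap of 50 is a placeholder value."""
    n_eff = min(n_fragments, max_fragments)
    return math.log(xcorr) / math.log(n_eff)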

Using this analysis, the SEQUEST scores Xcorr’ and Delta Cn were found to contribute the most to the discrimination achieved by the function between the correct and incorrect peptide assignments. Reviewing these results against the training dataset, the authors observed excellent distinction between the correct and incorrect peptide assignments, with 84% of correct peptide assignments having F scores of 1.7 or greater, and 99% of incorrect assignments having F scores below that value.

Recognizing that [Eq 3] would be sensitive to the F score distributions, the authors computed the distributions of the F scores for the training datasets. By binning the F scores into discrete intervals of width 0.2, the distributions of the scores among the correct and incorrect assignments were determined. The probability that a correct peptide assignment has discriminant score, F, was found to fit a Gaussian distribution, with calculated mean, m, and standard deviation, s, as follows:

[Eq 4] p(F | +) = ( 1 / ( s Sqrt( 2 pi ) ) ) e^( –(F – m)^2 / (2 s^2) )

Furthermore, the probability that an incorrect peptide assignment has a discriminant score, F, was found to fit a gamma distribution, with parameter g set below the minimum F in the dataset, and parameters a and b computed from the population. The resulting distribution is computed as follows:

[Eq 5] p(F | –) = ( (F – g)^(a–1) e^( –(F – g) / b ) ) / ( b^a Gamma(a) )

These two expressions for p(F | +) and p(F | –) were then substituted back into [Eq 3], allowing accurate probabilities to be computed that the peptides assigned to the spectra in the training dataset are correct.
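Given labeled training data, the fitting and substitution just described can be sketched in Python with scipy. The synthetic scores and the simple fits below are stand-ins for the published training data and parameter estimates; the sketch illustrates the technique of [Eq 3] through [Eq 5] rather than reproducing the authors' fits.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic discriminant scores standing in for the labeled training data.
F_correct = rng.normal(loc=4.0, scale=1.2, size=1600)           # correct
F_incorrect = rng.gamma(shape=2.0, scale=0.8, size=8000) - 2.0  # incorrect

# [Eq 4]: fit a Gaussian to the correct-assignment scores.
m, s = F_correct.mean(), F_correct.std()

# [Eq 5]: fit a gamma distribution to the incorrect-assignment scores,
# with the shift g fixed just below the minimum observed score.
g = F_incorrect.min() - 0.1
a, _, b = stats.gamma.fit(F_incorrect, floc=g)

# Priors p(+) and p(-) from the proportions in the training data.
p_pos = len(F_correct) / (len(F_correct) + len(F_incorrect))
p_neg = 1.0 - p_pos

def p_correct_given_F(F):
    """[Eq 3]: posterior probability that an assignment with score F is correct."""
    like_pos = stats.norm.pdf(F, loc=m, scale=s)
    like_neg = stats.gamma.pdf(F, a, loc=g, scale=b)
    return like_pos * p_pos / (like_pos * p_pos + like_neg * p_neg)

print(p_correct_given_F(1.7))  # posterior at the F = 1.7 cutoff noted earlier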

Considering the Number of Tryptic Termini

If we know that the process used to generate the peptides includes a specific proteolytic enzyme, we can exploit our knowledge of the specific amino acids cleaved by that enzyme to further inform our database search and our probability calculations. The ISB researchers used their knowledge that trypsin cleaves proteins on the –COOH side of lysine or arginine to determine the number of tryptic termini (NTT) of each peptide assigned to a spectrum, and then used the NTT as additional information for assessing whether the assignment is correct. The value of NTT is 0, 1, or 2, indicating how many of the assigned peptide’s two termini are consistent with cleavage by the specific proteolytic enzyme used, trypsin. Similar consideration of alternative cleavage enzymes would also be valid. A sketch of this computation appears below.
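As a concrete illustration, NTT can be computed directly from the assigned peptide's position within the parent protein sequence. The sketch below assumes simple trypsin specificity (cleavage after lysine, K, or arginine, R) and counts the protein's own ends as valid termini; refinements such as suppressed cleavage before proline are deliberately ignored.

def num_tryptic_termini(protein_seq, start, end):
    """Number of termini of the peptide protein_seq[start:end] consistent
    with tryptic cleavage (after K or R) or with the protein's own ends.
    Returns 0, 1, or 2."""
    ntt = 0
    # N-terminus: the residue preceding the peptide must be K or R,
    # unless the peptide begins at the start of the protein.
    if start == 0 or protein_seq[start - 1] in "KR":
        ntt += 1
    # C-terminus: the peptide's last residue must be K or R,
    # unless the peptide runs to the end of the protein.
    if end == len(protein_seq) or protein_seq[end - 1] in "KR":
        ntt += 1
    return ntt

# Illustrative use with a made-up sequence:
seq = "MKWVTFISLLR"
print(num_tryptic_termini(seq, 2, 11))  # 2: preceded by K, ends in R
print(num_tryptic_termini(seq, 3, 10))  # 0: neither terminus is tryptic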