Modified Association Rule Mining Approach for the MHC-Peptide Binding Problem

A Modified Associated Rule Mining Approach for the MHC-Peptide Binding Problem 9

Modified Association Rule Mining Approach for the MHC-Peptide Binding Problem

Galip Gürkan Yardımcı, Alper Küçükural, Yücel Saygın, Uğur Sezerman

{yardimci, kucukural}@su.sabanciuniv.edu

{ysaygin, ugur}@sabanciuniv.edu

Faculty of Engineering and Natural Sciences,

Sabanci University, Turkey

Abstract. Computational approach to predict peptide binding to major histocompatibility complex (MHC) is crucial for vaccine design since these peptides can act as a T-Cell epitope to trigger immune response. There are two main branches for peptide prediction methods; structural and data mining approaches. These methods can be successfully used for prediction of T-Cell epitopes in cancer, allergy and infectious diseases. In this paper, association rule mining methods are implemented to generate rules of peptide selection by MHCs. To capture the binding characteristics, modified rule mining and data transformation methods are implemented in this paper. Peptides are known to bind to the same MHC show sequence variability, to capture this characteristic, we used a reduced amino acid alphabet by clustering amino acids according to their physico-chemical properties. Using the classification of amino acids and the OR-operator to combine the rules to reflect that different amino acid types and positions along the peptide may be responsible for binding are the innovations of the method presented. We can predict MHC Class-I binding with 75-97% coverage and 76-100% accuracy.
Keywords: Peptides, MHC Class-I, Association rule mining, reduced amino acid alphabet, data mining.

1 Introduction

Peptide binding prediction is a crucial step for vaccine design since it enables the understanding of the mechanism of the immune response to foreign bodies and how vaccines work. There are numerous experimental research results regarding this subject. These experiments take too much time and are costly since there are a vast number of peptides to be tried as a vaccine candidate even for a single MHC. Therefore, there is an urgent need for developing effective computational methods to solve the peptide binding problem to the MHC. The methods developed in finding peptide sequences specific for the target MHCs can be also used for developing therapeutical proteins as well for other types of receptors.

MHCs recognize antigens which are foreign macromolecules that cause an immune response in the body. There are two types of immune responses to the antigens: humoral and cellular immune response. Class II MHC molecules are involved in humoral immune response whereas Class I MHC molecules are involved in the cellular immune response which is the response after the antigen enters the cell [9]. In this paper we will focus on cellular response which involves recognition of antigenic fragments by Class I MHCs. After foreign bodies enter the cell they are cleaved into smaller pieces that are called peptides. These peptides are picked up by Class I MHCs and brought to the cell surface. There are on average three to four different type of Class I MHCs in the human cell, which all bind to different types of peptides including self and antigenic peptides. The T-Cells recognize the infected cell upon binding to the antigenic peptide-MHC complex, which triggers a cascade of events leading to the cellular immune response to foreign bodies. In both Class I and Class II pathways the most important molecule initiation of the recognition of infected cells is major histocompatibility complex (MHC). Knowing which peptides that are yielding from the cleavage of antigens will be picked up by the MHC molecule and understanding the mechanism of the binding of the peptides (sequence motifs) will be of great use in vaccine design. A peptide presented to a T-Cell together with a MHC molecule is called T-Cell epitope. If the cell is infected, it can be induced to apoptosis by T-Cell. In this paper, we investigate the Class I pathway for prediction of T-Cell epitopes.

Laboratory experiments can be used to determine which peptides bind to which kind of MHC molecules. The peptides that are known to bind to Class I MHCs have variable length but the majority of them have between 8 to 10 residues. Conducting laboratory experiments for all types of peptide binding combinations is not feasible since there are 208 to 2010 possible peptides using 20 amino acid alphabet, but only a few are selected by the MHC[12]. We combine structural and data mining based methods for prediction of T-Cell epitopes. Association rule mining techniques are used for finding correlations between positions of the bound peptides and determining the binding motifs for each type of MHC. These rules will be useful for understanding the mechanism of peptide binding.

2 Background and Related Work

There are two main approaches to the peptide prediction problem: profile based approaches and machine learning. Profile based approaches build profile scoring matrices from the alignment of the binding peptides. These methods control the peptide sequence for the availability of the preferred sequences at certain positions of the peptide as predicted by the scoring matrix. Up to now most successful methods are machine learning methods, like SVMHC[7].

Profile based methods, SYFPEITHI[11], Rankpep[13], and ProPred1 [15], only take into account the positive cases to derive the information therefore they do not have high specifity as compared to machine learning approaches where the non binder class information is also taken into account to distinguish the properties of binders. [4]

The second group of researchers used machine learning approaches such as Support Vector Machines and Artificial Neural Networks to find the correlations between the positions of the peptide to build a valid probabilistic model using both the binding and non binding peptides’ data [5],[6],[7]. Another method done by Milledge et. al. that was used for predicting peptides for HLA 0201 type of MHC has created sequence structural patterns by using association rules to reflect the MHC binding characteristics of peptides [10].

2.1 Association rule mining

The problem of finding association rules among items is formally defined by Agrawal et al. in [1], [2] as follows:

Let I = {i1, i2, ..., im} be a set of all items. Let T be a transaction consisting of a set of items such that T Í I. We call D a database of transactions. We say that a transaction T contains X, a set of some items in I, if XÍ T. An association rule is an implication of the form X Þ Y, where X Ì I, Y Ì I and X Ç Y = Æ. An item set X has support s if s% of the transactions contain X. We say that the rule X ÞY holds with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ÞY has support s if s% of transactions in D contain XÈY.

Association rule mining algorithms scan the database of transactions and calculate the support and confidence of the candidate rules to determine if they are significant or not. For that purpose, threshold values are used by the algorithms to prune the insignificant rules. A rule is significant if its support and confidence is higher than the user specified minimum support and minimum confidence threshold. In this way, algorithms do not retrieve all the association rules that could possibly be generated from a database, instead only a very small subset of rules which satisfies the threshold values are retrieved.

Support of an association rule mimics the coverage of that rule, and confidence of the rule specifies the accuracy. Both of these measures are important for determining the significance of a rule. Therefore we used a combined support confidence measure (CSC-Measure)[1]. The formula for the CSC-measure is obtained by taking the harmonic mean of the support and confidence measure, which is formulated below:

where s is the support and c is the confidence of the rule. CSC-Measure takes both the confidence and support of the rule into account, so rules which have high confidence values and which cover more transactions over the data set will be more valuable.

3 Association Rule Mining Methods for (Peptide-Binding) Prediction

Our data set D contain amino acid sequences of peptides which are known to bind to Class I MHC molecules [3]. In D, there are 198 transactions (peptides) known to bind to 4 different MHCs. We have worked with nine amino acid long sequences only since the majority of the known peptides were nine-mers. Each peptide is represented by an item-set of nine elements, based on it sequence. So in our case there are 180 different items since there are nine different positions and twenty different amino acids. Set I has 209 different item-sets, each set has nine elements for nine positions and each element can be one of the 20 different items. The position of each amino acid in the sequence is important so we have turned the sequences into item sets X of the form AP where A is the one letter code of each amino acid and p is the position of the amino acid in the sequence. The rules mined will be as follows {V1} Þ {G2}, meaning that the presence of a Valine in first position of the peptide sequence implies that there will be Glycine in the second position in the peptide sequence. For simplicity, we’ll omit the curly brackets in the following sections.

But MHC molecules are not very decisive when binding the peptides, it can accommodate different types of amino acids at the same position of the peptide. There are pockets at the binding site of MHC, some of these pockets have to be filled with certain types of amino acids for the binding requirements to be fulfilled [19]. Sometimes the second position of the peptide fills the appropriate pocket and sometimes the third position of the peptide occupies the same pocket. Therefore different amino acids and different positions of the peptide may have the same role in defining the peptides’ binding characteristics; association rules cannot catch this property well. So we have decided to change the rule structure to deal with this problem.

Our association rules have the form {V2} V {A2} V {L3} Þ {I9} meaning that the presence of a Valine or an Alanine in the second position or a Leucine in the third position of the peptide implies that the ninth position of the peptide sequence will be an Isoleucine. Such rules can capture the binding characteristics better. This rule structure with ORs ( V ) will also increase the CSC-Measure of the rules, resulting with more globally correct binding characteristic rules. The support and the confidence measures' definitions remain unchanged, the only difference is that the calculations are done taking the OR into account.

3.1 Candidate generation and rule mining

The candidate generation step is generally done by the apriori algorithm and its variations [2]. Since using OR as a rule increases the number of candidates so much that the apriori algorithm will not have a reasonable runtime. We first extracted rules with one amino acid on each side by the conventional rule mining algorithm. Then we have combined these rules with the OR operator, to yield rules which reflect the binding characteristic better. The confidence of a new combined rule will be between the values of the minimum and the maximum of confidences of the rules which were combined to yield the new rule. The support will obviously increase as the number of sequences which contain the amino acids on the left side are increasing because of the OR operator between them.

First we have mined the database for association rules of the form XiÞ Yj where X and Y are amino acids and i and j are their positions. Small confidence (50%) and support (20%) thresholds are used for two reasons. The first is that we expect these values to go up as we combine the rules with the OR operator so we want as many rules as possible. The second is that low support values imply that the number of sequences or the transactions which contain both of the combined amino acids will be small.

The combining process will be as follows, over the set of all one amino acid rules of the form Xi Þ Yj, we will combine the rules which have the same implication, then generate all the possible two amino acid combinations of these rules.

After we have the two amino acid rules, again we combine these rules to yield three amino acid rules. This time the process will be similar to the apriori algorithm. We combine k amino acid rules which share k-1 amino acids and which have the same implication. Combining these rules yield k+1 amino acid rules. The fact that we are using the OR operator guarantees that new rules’ support values will never decrease so we don’t have to check support values. The pruning criterion is CSC-Measure of the new rules. If CSC measure does not improve by at least 2% by addition of the new OR rule, the new rule is pruned.

3.2 Amino acid classification

Evolution allows for sequence variability; to capture this information, we have also classified the amino acids according to their physico-chemical properties as given in Table 1. Different classes of amino acids are obtained from a previous study by Sezerman et. al. which used an encoding decoding algorithm that classified amino acids based on similarity scoring matrices [14]. The classification scheme given in Table 1 yielded the best results for us. Using the classification table enabled us to distinguish the binding rules according to their physico-chemical properties e.g. HLA-A2 molecule prefers a peptide with a bulky hydrophobic residue at position two (Class F) and a small hydrophobic residue at position nine (Class A) for binding.

The classification step reduces the number of items and item-sets, reducing the number of rules but making the rules more compact. The number of possible item-sets reduces to 129 from 209 and number of items reduces to 108 from 180.