Consensus Algorithm
for Structure Prediction

The ROAWBetter Algorithm
and Bachalor Framework

Ross Bayer

Alex Chan

William Lu

Olga Russakovsky

Introduction

Protein structure prediction is an active area of research with many algorithms, none of which completely solves the problem. In recent years, consensus-based algorithms that combine the results from multiple algorithms have proven to perform best among all prediction algorithms. In this paper, we propose a novel approach for a consensus-based algorithm. Our approach is based on dividing the space of all proteins into areas where a particular algorithm performs best and using that algorithm to predict the structures of proteins in its area. The idea is best explained with an analogy.

Lazy Student Analogy

Suppose you are a lazy student preparing for a comprehensive exam. The exam consists of several areas. In the example mentioned below, the areas are Literature, History, Math, and Science. Unfortunately, the questions on the exam are not separated by sections nor are they labeled by the area they are from; however, each question has an equal probability of coming from one of the areas.

Since you are lazy, you don’t want to study all of these areas (not to mention that it may not be possible to study everything in the time before the exam). Fortunately, the exam is an open-collaboration exam, meaning you may freely discuss any of the questions with the other exam takers. In addition, you know how well each person did on a previous comprehensive exam, including his score in each area. A concrete example of this problem with four exam takers other than you is depicted in Figure 1. How would you collaborate to come up with the answers to the exam?

Figure 1: A specific example of the Lazy Student Analogy. In the example, the comprehensive exam has four areas: Literature, History, Math, and Science. The letter grades each exam taker received on a previous comprehensive exam are given.

The first solution would be to simply collaborate with the student who had the highest score on the last comprehensive exam. This makes sense: if the exam is a robust measure, that student should score equally well on the current exam, and as a result, you would score as well as he does. Unfortunately, this may not be the highest score possible. In the above example, this solution corresponds to an overall score of C, assuming the previous result is an indication of performance on the current exam.

This first solution has a few drawbacks. First, it doesn’t take advantage of the fact that there are many exam takers, not just the one student with the highest score on the last exam. Second, it doesn’t use the information given by the exam areas. On closer examination, we note that although several of the students did not score the highest on the comprehensive exam, each has areas in which he scored extremely well. If we could take advantage of this fact, it would be possible to score higher than the overall C above.

The second and better solution is just this: take advantage of the information about how well each exam taker did on each area of the exam. Although an exam taker may have a lower overall score, he may do well in a particular area, and if the exam can be broken down into finer chunks, this becomes evident. The better solution, then, is to use this specialized information: for each question, determine the area to which it belongs and collaborate with the exam taker who had the best score in that area, repeating this process for every question. In the above example, three of the students did well on Literature, History, and Math, respectively, even though their overall scores are lower than that of the highest-scoring student. Asking the per-area experts, we would get an A in Literature, an A in History, an A in Math, and a C in Science. The overall score is an A-/B+, which is much better than the overall score of C obtained with the first solution. This is the idea behind our consensus algorithm; a small sketch of the selection rule is shown below.
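To make the selection rule concrete, here is a minimal sketch in Python. The student names (S1 through S4) and the exact per-area grades are assumptions chosen only to be consistent with the example above.

# A minimal sketch of the two solutions, using hypothetical grades consistent
# with the example in the text: three students each excel in one area, while
# the best overall student (S4) is a straight-C student.

GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}

# area -> {student: grade}; these specific grades are assumed for illustration
grades = {
    "Literature": {"S1": "A", "S2": "D", "S3": "D", "S4": "C"},
    "History":    {"S1": "D", "S2": "A", "S3": "D", "S4": "C"},
    "Math":       {"S1": "D", "S2": "D", "S3": "A", "S4": "C"},
    "Science":    {"S1": "D", "S2": "D", "S3": "D", "S4": "C"},
}

# Solution 1: always collaborate with the best overall student (S4, all C's).
# Solution 2: for each question, collaborate with the best student in its area.
for area, area_grades in grades.items():
    expert = max(area_grades, key=lambda s: GRADE_POINTS[area_grades[s]])
    print(f"{area}: ask {expert}, expected grade {area_grades[expert]}")
# Solution 2 yields A, A, A, C instead of C, C, C, C -- roughly an A-/B+ overall.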

Although we are able to obtain a better score with the second solution, this is by no means an easy feat to accomplish. The difficulty lies in the fact that the questions are neither divided nor labeled by area. Therefore, how well we perform with this solution depends in no small part on how well we determine the area a question is from. In fact, if we determine the area incorrectly, the actual score may be worse than that of the first solution.

Parallels to Protein Structure Prediction

The parallels between the Lazy Student Analogy and protein structure prediction are as follows. The task of predicting the structure of a protein is essentially a comprehensive exam, since any algorithm must take into account numerous factors that influence the protein structure, such as the energies involved in folding, the function of the protein, and the amino acid sequence. Each algorithm that tries to predict protein structure is an exam taker. Unfortunately, protein structure prediction algorithms vary in performance: each algorithm has a specialty, such as threading to a particular profile, computing certain natural energies, or sequence homology and database search. A typical plot of performance over the space of all proteins is depicted in Figure 2.

Figure 2: A typical plot of performance over the space of all proteins for three algorithms.

Taking a cue from the Lazy Student Analogy, if we were able to break the space of all proteins into discrete areas corresponding to the places where a particular algorithm performs well, we could obtain a much better prediction than by using any one algorithm alone. But how do we divide the space of all proteins?

Statistics on the number of new folds discovered in recent years show that it is relatively small, which suggests that the number of folds may approach a finite limit asymptotically. If this is indeed the case, then it makes sense to divide the space of all proteins by protein fold, with each algorithm predicting proteins with particular folds. Using this theory and the assumption that the number of folds is essentially constant, we divided the space of all proteins using the families in SCOP. Again, just as in the analogy, this is not a trivial problem, as is evident from the fact that protein classification is itself hard in practice; however, the advantage of our approach is that it is able to combine solutions to both protein classification and protein structure prediction to obtain better predictions.

ROAWBetter Algorithm

Our project consisted of three basic parts: choosing the right algorithms to include in our consensus, determining which of those algorithms to run on which protein families, and deciding how to classify the unknown protein into the correct family. We will discuss each of these in detail below.

Servers

We spent quite a long time researching different algorithms and deciding which ones to use for our project. In the end, we chose algorithms based on several factors. The first one was their performance in CASP and LiveBench. Obviously, we wanted algorithms that did a good job of predicting the structure of most proteins, so that when we combined them we could achieve even more impressive results. For example, Robetta, which beat all other fully automated algorithms by a long shot in CASP, was an easy pick.

The second factor was participation in both LiveBench and CASP. We wanted to be able to both train and test our program, and so we wanted two different sets of data. We decided that it would be best to choose algorithms that participated in both of these competitions so that we could obtain an idea of how well our consensus algorithm would have performed. However, if we were to expand this project further, this would not be a necessary restriction.

The third factor was the ability to communicate with the servers. Some algorithms (such as ORFeus-2 and SUPERFAMILY) performed very well, yet their websites were perpetually unavailable, and so we were forced to consider other servers instead.

After eliminating all servers that did not participate in both CASP and LiveBench, as well as those that we could not communicate with, we were left with a relatively small group. In the end, we settled on four algorithms, namely Robetta, FFAS03, SAM-T02, and SAM-T99, mostly based on their performance. We considered things such as performance on “easy” versus “hard” models in LiveBench, and we tried to have algorithms with a variety of approaches to structure prediction. For example, Robetta is an ab initio method, while FFAS03 is based on a profile-profile comparison algorithm, and the SAM servers create multiple alignments, use the alignment to predict secondary structure, and then build an HMM used for searching the protein databank for similar proteins.

We used two versions of the SAM server because they both performed well, and we weren’t sure which one to choose. Ideally, however, we would probably not want to use two very similar servers because, chances are, they perform well on the same proteins. So to decrease the running time of our program, we would probably eliminate one of these servers from the final version. We would do so after extensive training to determine which of the servers performs better overall.

However, it is also a valid approach to leave both of them in, because having more servers means better results overall, even if the servers are similar. Even if they perform well on similar proteins, chances are one will not outperform the other on all proteins, and so there will still be some increase in performance of our consensus algorithm if we include both servers.

Unfortunately, due to time constraints, we were unable to do an extensive analysis of all the algorithms. In the future, it would be worthwhile to do more research into the nature of each algorithm and to analyze which proteins it performs best on and why. It would also be a good idea to try our program using a variety of servers and to see empirically which combination works best.

Training

Now that we have discussed which algorithms we decided to include in our project, we can say a few words about how we determined which algorithm to run when.

We did so by creating a databank using data from LiveBench 6 and 8 (these being the only LiveBench competitions in which all 4 of our algorithms were run). For each protein tested in the competition, we knew how well each of the 4 algorithms we chose performed when predicting its structure, and also which SCOP family (if any) the protein belongs to. So we ended up with some representative proteins for many families. Ideally, we would want many representative proteins for every family; unfortunately, we did not have access to that much data. In the end, this is what our databank looked like:

Figure 3: A schematic diagram of our databank. The bright blue proteins are those that we had data for. The families represent all families in SCOP – the yellow ones are those that contained at least one of the proteins from LiveBench, i.e. those families that we had some data for.

After reading in all the data, we considered all proteins belonging to a particular family and averaged the performance results for each algorithm on that family. Then given a new protein, we would classify it into the correct family as described below, and run the algorithm that had the highest average score for that family.

Note that some families (most, in fact) did not have any data, because only 256 proteins were tested in LiveBench 6 and 8. In this case, we used the default algorithm, Robetta, on any protein that was classified into one of those families. We chose Robetta because it was the best-performing algorithm out of the 4 we were using (this choice of a “default” algorithm was made automatically by averaging results over all proteins in the data source). A sketch of this training step is given below.
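As a rough illustration, here is a minimal sketch of the training step in Python. It assumes the LiveBench scores and SCOP family labels have already been read into plain Python data structures; the record format and the helper names (build_family_table, choose_server) are ours and purely illustrative.

# Minimal sketch of the training step: average each server's score per SCOP
# family and remember the best server for that family, plus a default server
# chosen by averaging over all training proteins.
from collections import defaultdict

SERVERS = ["Robetta", "FFAS03", "SAM-T02", "SAM-T99"]

def build_family_table(records):
    """records: iterable of (scop_family, {server: score}) per training protein.
    Returns (family -> best server, default server)."""
    family_scores = defaultdict(lambda: defaultdict(list))
    overall_scores = defaultdict(list)
    for family, scores in records:
        for server, score in scores.items():
            family_scores[family][server].append(score)
            overall_scores[server].append(score)

    def avg(xs):
        return sum(xs) / len(xs)

    # Best server per family, by average score on that family's proteins.
    best_per_family = {
        family: max(per_server, key=lambda s: avg(per_server[s]))
        for family, per_server in family_scores.items()
    }
    # Default server: best average over all training proteins (Robetta on our data).
    default = max(overall_scores, key=lambda s: avg(overall_scores[s]))
    return best_per_family, default

def choose_server(family, best_per_family, default):
    """Pick the server to run for a protein classified into `family`."""
    return best_per_family.get(family, default)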

Classifier

After deciding which algorithm to use on which family, the final step of our project was figuring out how to classify a new protein into the correct family.

A problem that we faced during classification is that classifying proteins into families is inherently a structure prediction problem. In SCOP, proteins are first classified into structural classes (such as “all-alpha” proteins) and then classified further into families based on increasingly intricate structural patterns. So in order to correctly classify a protein into its family, we need to figure out its structure; yet we want to use knowledge about the protein’s family to figure out its structure. This is a chicken-and-egg problem.

The solution we came up with is rather simple. Instead of trying to look at the potential structures of our unknown protein, we compare its sequence with proteins of known structure. Thus, given a new protein, we use BLAST to compare it to all proteins in the PDB (Protein Data Bank) to determine which ones it is most similar to. The BLAST search was performed against the PDB because it is a database of protein structures and thus provides an unbiased link between structure and sequence. We then used a weighted average over the SCOP families of the results to predict the family of the query protein; a sketch of this classifier is shown below. Of course, this is not a completely rigorous method of assigning a SCOP family, and the predicted family might be far from the actual family the protein belongs to. However, it seemed to work reasonably well for our purposes: it correctly classified 11 out of the 12 proteins in CASP5 that were in the SCOP database.
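A minimal sketch of the classifier follows. The helpers blast_against_pdb() and pdb_to_scop_family() are hypothetical placeholders for the actual BLAST search and the PDB-to-SCOP lookup, and the bit-score-weighted vote is one simple way to realize the weighted average described above.

# Minimal sketch of the family classifier: BLAST the query against the PDB,
# map each hit to its SCOP family, and take a score-weighted vote.
from collections import defaultdict

def predict_family(sequence, blast_against_pdb, pdb_to_scop_family):
    """Return the most likely SCOP family for `sequence`, or None if BLAST
    returns no usable hits (in which case we fall back to the default server)."""
    votes = defaultdict(float)
    for pdb_id, bit_score in blast_against_pdb(sequence):  # [(pdb_id, score), ...]
        family = pdb_to_scop_family(pdb_id)  # None if the hit is not in SCOP
        if family is not None:
            votes[family] += bit_score       # weight each family by hit score
    if not votes:
        return None
    return max(votes, key=votes.get)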

Ultimately, it would be a good idea to iterate our algorithm to check that the protein was indeed classified into the correct family. One way of doing this is to first predict the family using the method described above, then run the best algorithm for that family to predict the protein structure, and then use the predicted structure to obtain a more accurate prediction of the family the protein belongs to, repeating the whole process. This would be a good addition to our program in the future; for now, however, we believe the simpler approach works well enough. A sketch of this loop is given below.
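A rough sketch of this proposed loop is shown here; it is not part of the current program, and every helper function in it is a hypothetical placeholder for the pieces described above.

# Sketch of the proposed iterative refinement (not implemented in our program).
def iterative_predict(sequence, best_per_family, default, max_rounds,
                      predict_family_from_sequence,
                      predict_family_from_structure,
                      run_server):
    family = predict_family_from_sequence(sequence)        # BLAST-based guess
    structure = None
    for _ in range(max_rounds):
        server = best_per_family.get(family, default)       # best server for family
        structure = run_server(server, sequence)            # predict structure
        refined = predict_family_from_structure(structure)  # re-classify from structure
        if refined == family:                                # classification stabilized
            return structure
        family = refined
    return structure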

One final note is that sometimes this approach failed to identify a family for the protein at all, either because BLAST returned no results (the PDB is not all-encompassing) or because the results were not in SCOP. This was rather rare, and in this case we used the same fix as above for families without training data: we ran the default server (Robetta).

Results

Overall, we believe we obtained fairly good results, especially taking into account the fact that we had very limited training data. We tested our algorithm using data from CASP 5 and 6, based on a comparison of GDT (Global Distance Test) scores of the best models. GDT is the percentage of residues that fall within a certain distance cutoff, as determined by the evaluators from CASP.
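As a simplified illustration of this measure (the official GDT_TS score averages the same quantity over several distance cutoffs after superposition, a step omitted here):

# Simplified illustration of GDT as described above: the percentage of residues
# whose predicted position lies within a distance cutoff of the experimental one.
def gdt_fraction(distances, cutoff):
    """distances: per-residue distances (Angstroms) between predicted and
    experimental positions after superposition."""
    within = sum(1 for d in distances if d <= cutoff)
    return 100.0 * within / len(distances)

# e.g. gdt_fraction([0.5, 1.2, 3.8, 9.0], cutoff=4.0) -> 75.0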

Note that the actual rank in CASP is not calculated from GDT scores alone, which explains the discrepancies in the table below. However, GDT scores are used in the calculation, and thus there is a high correlation between the two. Since we did not have the benefit of human experts to judge our predicted structures, we had to evaluate using an automated measure, namely GDT scores.

Using this system of judging, we would have placed 7th in CASP 6, which included human-expert systems, and 1st in CAFASP 3, which means we beat all the fully automated servers. On CASP 6 proteins, our average GDT score was 47.21. Our average for CASP 5 was even higher, at 47.69. Figure 4 shows the results of other algorithms in CASP 6.

Algorithm              Rank   Average GDT score of best models
Ginalski*                 1   51.5099
KOLINSKI-BUJNICKI*        2   49.7810
BAKER*                    3   49.1034
Skolnick-Zhang*           4   49.0531
GeneSilico-Group*         5   47.2468
CHIMERA*                  6   45.5583
CBRC-3D*                  7   46.9423
SAM-T04-hand*             8   47.3330
Jones-UCL*                9   45.9990
FISCHER*                 10   45.8653
Sternberg*               11   44.6241
SBC*                     12   44.4561
SBC-Pmodeller5*          13   44.6399
BAKER-ROBETTA_04         14   46.5282
MCon*                    15   42.8894
CAFASP-Consensus*        16   42.4321
TOME*                    17   44.8038
ACE                      18   44.6786
BAKER-ROBETTA            19   46.4150
SAM-T02                  69   38.2556
SAM-T99                 102   28.9695
FFAS03                  106   36.2487

Figure 4: Table showing the top servers of CASP6, along with the 4 servers we used in our consensus. Algorithms denoted by a * are human-expert predictions. The 6 algorithms highlighted in blue are those that beat ROAWBetter. The 4 algorithms in green are those we used.