Supplementary Information for

“FRAGSION: ultra-fast protein fragment library generation by IOHMM sampling”

Debswapna Bhattacharya1, Badri Adhikari1, Jilong Li1 and Jianlin Cheng1,2,3,*

1Department of Computer Science, University of Missouri, Columbia, MO 65211, USA

2Informatics Institute, University of Missouri, Columbia, MO 65211, USA

3Bond Life Science Center, University of Missouri, Columbia, MO 65211, USA

*To whom correspondence should be addressed. Phone: (573)-882-7306. Fax: (573)-882-8318. E-mail: .

Supplementary Item / Title
Supplementary Methods
Supplementary Results
Supplementary Figure 1 / Architecture of Input-Output Hidden Markov Model (IOHMM).
Supplementary Figure 2 / Training and optimal model selection.
Supplementary Figure 3 / Density of TM-score and RMSD of the FRAGSION and ROSETTA models.
Supplementary Figure 4 / Target by target comparison between FRAGSION and ROSETTA in terms of precision.
Supplementary Figure 5 / Target by target comparison between FRAGSION and ROSETTA in terms of coverage.
Supplementary Figure 6 / Target by target comparison between FRAGSION and ROSETTA in terms of RMSD.
Supplementary Figure 7 / Target by target comparison between FRAGSION and ROSETTA in terms of computation time.
Supplementary Table 1 / Template Free Modeling (FM) targets for the CASP11 experiment.
Supplementary Table 2 / Mean and standard deviation of TM-score and RMSD of the FRAGSION and ROSETTA models.
Supplementary Table 3 / Highest TM-score and lowest RMSD for each target by FRAGSION and ROSETTA.
Supplementary Table 4 / TM-score and RMSD of the lowest energy model for each target by FRAGSION and ROSETTA.

Supplementary Methods

Description of IOHMM

FRAGSION is built on our recently proposed Input-Output Hidden Markov Model (IOHMM) (Bhattacharya and Cheng, 2015). The model captures sequential dependencies between the sequence space (input) and the structural space (output) of a protein through a Markov chain of hidden states. In each slice, as shown in Supplementary Fig. 1, an input node (A) captures the sequence space. It represents eight groups of residues with distinct structural behavior, derived from the twenty standard residue types on the basis of earlier analyses of high-resolution experimental structures (Karplus, 1996; Lovell, et al., 2003). These eight classes are: (1) glycines not preceding prolines, (2) prolines not preceding prolines, (3) β-branched amino acid residues (isoleucines and valines) not preceding prolines, (4) all amino acids except glycines, prolines, isoleucines, and valines not preceding prolines, (5) glycines preceding prolines, (6) prolines preceding prolines, (7) β-branched residues (isoleucines and valines) preceding prolines, and (8) all amino acids except glycines, prolines, isoleucines, and valines preceding prolines. Connections between the input nodes represent the transition probabilities between residues along the protein chain. Output (i.e., emission) nodes correspond to the structural space, modeled using secondary structure (S), dihedral angle pair (D: ϕ, ψ), and peptide bond conformation (P: ω). The secondary structure node (S) is a discrete node with three states (helix, strand, and coil). We model backbone torsion angle pairs (ϕ, ψ) using mixtures of bivariate von Mises distributions (Mardia, et al., 2007) and the ω dihedral angle of the peptide bond using mixtures of univariate von Mises distributions (Mardia and Jupp, 2009). Each output emission node can be flagged as observed or hidden at a specific sequence position.
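
The eight-class input encoding described above can be illustrated with a short sketch. The function names are hypothetical and are not taken from the FRAGSION code. The sketch also assumes that the last residue of a chain, which has no successor, is treated as not preceding a proline; the text does not state how chain termini are handled.

```python
from typing import List, Optional

# Beta-branched residues singled out in the text: isoleucine and valine.
BETA_BRANCHED = {"I", "V"}


def residue_class(residue: str, next_residue: Optional[str]) -> int:
    """Return the input class (1-8) of a residue, numbered as in the text.

    Hypothetical helper for illustration only. `next_residue` is None for
    the final residue, which is assumed here to be 'not preceding proline'.
    """
    if residue == "G":
        base = 1
    elif residue == "P":
        base = 2
    elif residue in BETA_BRANCHED:
        base = 3
    else:
        base = 4
    # Classes 5-8 mirror classes 1-4 for residues that precede a proline.
    return base + 4 if next_residue == "P" else base


def encode_sequence(sequence: str) -> List[int]:
    """Encode a one-letter amino acid sequence as a list of class labels."""
    seq = sequence.upper()
    return [
        residue_class(aa, seq[i + 1] if i + 1 < len(seq) else None)
        for i, aa in enumerate(seq)
    ]
```

For instance, in the sequence GPPVA the glycine precedes a proline (class 5), the first proline precedes a proline (class 6), and the second proline does not (class 2).
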
Given the input nodes I and the observed emission nodes O_obs, a sequence of hidden nodes H and of the emission nodes marked as hidden, O_hidden, is sampled from the conditional distribution P(H, O_hidden | O_obs, I) using the forward-backtrack algorithm (Cawley and Pachter, 2003). This enables us to deal with noise in the sequence-derived predicted secondary structure by flagging secondary structure as observed only at residue positions with highly confident predictions and leaving the rest as hidden. Furthermore, using a probabilistic model makes it possible to sample a virtually unlimited number of angle sequences accessible to proteins, each with an associated probability, for a given stretch of sequence.
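
The idea behind forward-backtrack sampling can be sketched for a plain discrete HMM. This is a simplified illustration, not the FRAGSION or Mocapy++ implementation: it omits the input node I and the continuous von Mises emissions, and all names (`forward_backtrack`, `init`, `trans`, `emit`) are assumptions made for this example. An observation of `None` plays the role of an emission flagged as hidden.

```python
import random


def forward_backtrack(init, trans, emit, observations, rng=None):
    """Sample one hidden-state path from P(H | observations) for a discrete HMM.

    init[i]      : P(H_1 = i)
    trans[i][j]  : P(H_{t+1} = j | H_t = i)
    emit[i][o]   : P(O_t = o | H_t = i)
    observations : list of symbols; None marks an emission flagged as hidden
    """
    rng = rng or random.Random()
    n = len(init)

    def emission(state, obs):
        # A hidden (unobserved) emission contributes no evidence.
        return 1.0 if obs is None else emit[state][obs]

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: alphas[t][i] is proportional to P(H_t = i, O_1..O_t).
    alphas = [normalise([init[i] * emission(i, observations[0]) for i in range(n)])]
    for obs in observations[1:]:
        prev = alphas[-1]
        alphas.append(normalise([
            emission(j, obs) * sum(prev[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]))

    # Backtrack: draw the last state from the final forward distribution,
    # then draw each earlier state conditioned on the state sampled after it.
    path = [0] * len(observations)
    path[-1] = rng.choices(range(n), weights=alphas[-1])[0]
    for t in range(len(observations) - 2, -1, -1):
        weights = [alphas[t][i] * trans[i][path[t + 1]] for i in range(n)]
        path[t] = rng.choices(range(n), weights=weights)[0]
    return path
```

Once a hidden-state path has been drawn, values for the emissions flagged as hidden would be sampled from the emission distributions of the corresponding hidden states.
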

Supplementary Figure 1. Architecture of Input-Output Hidden Markov Model (IOHMM). In each slice, an input node indicates the eight classes of residues in the amino acid sequence (A), and a Markov chain of hidden nodes (H) captures the sequential dependencies along the peptide chain, where each hidden node corresponds to three kinds of emission distributions: (1) three-state secondary structure labels (S): helix (H), strand (E), and coil (C), (2) backbone (ϕ, ψ) dihedral angle pairs, and (3) ω angles associated with peptide bonds.

Training Data

To train the IOHMM, we collected 1,740 non-redundant protein domains from the SABmark dataset, version 1.65 (Van Walle, et al., 2005). The eight residue classes and the three backbone dihedral angles were calculated directly from the training proteins, and three-state secondary structures (helix, strand, and coil) were assigned using DSSP (Kabsch and Sander, 1983). The training dataset contains 270,350 observations.
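
As an illustration of how a backbone dihedral angle can be obtained from atomic coordinates, the sketch below computes the signed torsion angle defined by four points. It is a generic geometric routine, not the authors' code, and the function name is hypothetical. With the usual backbone definitions, ϕ is the torsion of C(i-1), N(i), Cα(i), C(i); ψ is the torsion of N(i), Cα(i), C(i), N(i+1); and ω is the torsion of Cα(i), C(i), N(i+1), Cα(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees, in (-180, 180], for four 3D points."""
    u1, u2, u3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(u1, u2)  # normal of the plane through p0, p1, p2
    n2 = _cross(u2, u3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(u2, u2)) * _dot(u1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

A planar cis arrangement of the four points gives an angle of 0°, and a planar trans arrangement gives a magnitude of 180°, which is where ω lies for most peptide bonds.
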

Training and Optimal Model Selection

We trained the IOHMM using the Stochastic Expectation-Maximization (S-EM) algorithm (Nielsen, 2000), as implemented in the Mocapy++ software package (Paluszewski and Hamelryck, 2010). Choosing the optimal hidden node size is crucial for the model to succeed: with too few hidden states the model is too coarse, whereas too many lead to overfitting. We estimated the optimal hidden node size using the Akaike Information Criterion (AIC) (Burnham and Anderson, 2002), a widely used model selection criterion:

$$\mathrm{AIC} = -2 \ln L(\theta \mid d) + 2n$$

where L(θ|d) is the likelihood of the model given the data d, and n is the number of parameters. The optimal model is the one with the minimal AIC value. We computed AIC values for hidden node sizes ranging from 10 to 100 (with a step size of 5). For each hidden node size, we repeated the training four times with different starting conditions in order to avoid getting stuck in local optima. The AIC reached its minimum for a model with a hidden node size of 30, which has 7,812 parameters (Supplementary Fig. 2). We chose this model as the optimal one.
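
The selection step can be sketched as follows, using the standard definition of the criterion, AIC = -2 ln L(θ|d) + 2n. The function names and the candidate tuples are hypothetical; the numbers in the usage note are placeholders and not values from our training runs.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion from a log-likelihood and a parameter count."""
    return -2.0 * log_likelihood + 2.0 * n_params


def select_model(candidates):
    """Return the candidate with the lowest AIC.

    candidates: iterable of (hidden_node_size, log_likelihood, n_params)
    tuples, for example one tuple per trained model.
    """
    return min(candidates, key=lambda c: aic(c[1], c[2]))
```

With placeholder candidates `(10, -1000.0, 50)`, `(30, -900.0, 80)` and `(50, -890.0, 150)`, the AIC values are 2100, 1960 and 2080, so the model with hidden node size 30 would be selected: the small gain in likelihood at size 50 does not offset the penalty for the additional parameters.
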

Supplementary Figure 2. Training and optimal model selection. (a) AIC values versus hidden node size, with four models trained for each hidden node size. The curved line is a trend line constructed by fitting a sixth-degree polynomial to the data. The minimum AIC value corresponds to the optimal model (highlighted with a red circle). (b) Convergence of the log-likelihood of the completed data during training with respect to the number of S-EM iterations.

Test dataset

We tested the accuracy of the fragment library using 30 CASP11 FM targets. The sequences and the experimental PDB structures were downloaded from the CASP11 website. The domain definitions and the PDB accession codes were provided by the CASP assessors. A summary of the targets is provided in Supplementary Table 1.

Supplementary Table 1. Template Free Modeling (FM) targets for the CASP11 experiment.

# / Target / Domain / Residue Range / Residues in Domain / PDB
1 / T0761 / T0761-D1 / 62-149 / 88 / 4pw1
2 / T0761 / T0761-D2 / 150-178,202-285 / 113 / 4pw1
3 / T0763 / T0763-D1 / 31-160 / 130 / 4q0y
4 / T0767 / T0767-D2 / 133-312 / 180 / 4qpv
5 / T0771 / T0771-D1 / 27-76,91-191 / 151 / 4qe0
6 / T0777 / T0777-D1 / 18-362 / 345 / -
7 / T0781 / T0781-D1 / 41-240 / 199 / 4qan
8 / T0785 / T0785-D1 / 3-114 / 112 / 4d0v
9 / T0789 / T0789-D1 / 6-113,117-151 / 143 / 4w4i
10 / T0789 / T0789-D2 / 152-277 / 126 / 4w4i
11 / T0790 / T0790-D1 / 1-135 / 135 / 4l4w
12 / T0790 / T0790-D2 / 136-265 / 130 / 4l4w
13 / T0791 / T0791-D1 / 6-44,52-161 / 149 / 4kxr
14 / T0791 / T0791-D2 / 162-262,264-300 / 138 / 4kxr
15 / T0794 / T0794-D2 / 291-462 / 172 / 4cyf
16 / T0806 / T0806-D1 / 1-256 / 256 / -
17 / T0808 / T0808-D2 / 150-418 / 269 / 4qhw
18 / T0810 / T0810-D1 / 24-136 / 113 / -
19 / T0814 / T0814-D1 / 23-159 / 137 / 4r7f
20 / T0814 / T0814-D2 / 160-242,387-419 / 116 / 4r7f
21 / T0820 / T0820-D1 / 2-91 / 90 / -
22 / T0824 / T0824-D1 / 2-109 / 108 / -
23 / T0827 / T0827-D2 / 212-328,337-369 / 150 / -
24 / T0831 / T0831-D2 / 109-168,183-261,295-352 / 197 / 4qn1
25 / T0832 / T0832-D1 / 10-218 / 209 / 4rd8
26 / T0834 / T0834-D1 / 2-37,130-192 / 99 / 4r7q
27 / T0834 / T0834-D2 / 38-65,72-129 / 86 / 4r7q
28 / T0836 / T0836-D1 / 1-204 / 204 / -
29 / T0837 / T0837-D1 / 1-121 / 121 / -
30 / T0855 / T0855-D1 / 5-119 / 115 / 2mqd
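
The relationship between the "Residue Range" and "Residues in Domain" columns can be checked with a small helper. This is a hypothetical convenience function, not part of FRAGSION; it assumes ranges are written as comma-separated, inclusive `start-end` segments, as in the table above.

```python
def count_residues(residue_range: str) -> int:
    """Count the sequence positions covered by a range such as '150-178,202-285'."""
    total = 0
    for segment in residue_range.split(","):
        start, end = (int(x) for x in segment.strip().split("-"))
        total += end - start + 1  # both endpoints are included
    return total
```

For T0761-D2, `count_residues("150-178,202-285")` gives 29 + 84 = 113, matching the table. The function counts positions spanned by the range, which need not equal the tabulated count in every row; for example, the T0781-D1 range 41-240 spans 200 positions while the table lists 199 residues.
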

Fragment library generation using ROSETTA

We used the fragment picker application of ROSETTA 3.5 (Leaver-Fay, et al., 2011) with default parameter settings to generate the ROSETTA fragment libraries. For each target, we first predicted secondary structure using PSIPRED (Jones, 1999) and supplied it to ROSETTA (by setting ‘psipred_ss2’ to the appropriate file path).

The fragment picker command used was:

./rosetta-3.5/rosetta_source/rosetta_tools/fragment_tools/make_fragments.pl \
    -rundir DIR-TARGET \
    -id TARGET \
    -nopsipred \
    -nohoms \
    -psipredfile TARGET.ss2 \
    -frag_sizes 3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 \
    TARGET.fasta

Model Generation using ROSETTA

A locally installed ROSETTA 3.5 (Leaver-Fay, et al., 2011) was used to build three-dimensional models from the fragment files generated by ROSETTA and FRAGSION. For each target, we first predicted secondary structure using PSIPRED and then generated 100 models (by setting the ‘nstruct’ option to 100) with the ‘AbinitioRelax’ program, using otherwise default parameters. We supplied ROSETTA’s default input of 3-residue and 9-residue fragments (parameters ‘in:file:frag3’ and ‘in:file:frag9’) and ran a single thread of ‘AbinitioRelax’ for short targets and three parallel threads for long targets.

The ‘AbinitioRelax’ command used was:

./rosetta-3.5/rosetta_source/bin/AbinitioRelax.linuxgccrelease \
    -database DIR-ROSETTA-DB \
    -in:file:fasta ./TARGET.fasta \
    -in:file:frag3 ./TARGET.200.3mers \
    -in:file:frag9 ./TARGET.200.9mers \
    -psipred_ss2 TARGET.ss2 \
    -nstruct 100 \
    -abinitio:relax \
    -relax:fast \
    -abinitio::increase_cycles 10 \
    -abinitio::rg_reweight 0.5 \
    -abinitio::rsd_wt_helix 0.5 \
    -abinitio::rsd_wt_loop 0.5 \
    -use_filters true \
    -out:pdb

Supplementary Results

Assessment of the overall accuracy of predicted models

The protein models predicted by FRAGSION and ROSETTA were analyzed by domain, as done in the CASP experiments. Residues in the predicted models that are absent from the experimental structures were removed, and the models were superposed onto the experimental structures for the 30 CASP11 domains. TM-score and RMSD of the models were calculated with the TM-score program (Zhang and Skolnick, 2004). Supplementary Table 2 reports the mean and standard deviation of TM-score and RMSD of the FRAGSION and ROSETTA models. Supplementary Fig. 3 shows the density of TM-score and RMSD of the FRAGSION and ROSETTA models. The average TM-scores of the FRAGSION and ROSETTA models are 0.198 and 0.259, and the average RMSDs are 19.980 Å and 17.995 Å, respectively. The average TM-score of the FRAGSION models is ~23.55% lower than that of the ROSETTA models, and the average RMSD of the FRAGSION models is ~2 Å higher.
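
The two summary figures quoted above follow directly from the averages in Supplementary Table 2. The arithmetic is sketched below; the function and variable names are introduced here for illustration only.

```python
def relative_decrease(value, reference):
    """Percentage by which `value` falls below `reference`."""
    return (reference - value) / reference * 100.0


# Average TM-scores and RMSDs (in Å) reported in Supplementary Table 2.
fragsion_tm, rosetta_tm = 0.198, 0.259
fragsion_rmsd, rosetta_rmsd = 19.980, 17.995

tm_gap_percent = relative_decrease(fragsion_tm, rosetta_tm)  # about 23.55
rmsd_gap = fragsion_rmsd - rosetta_rmsd                      # about 1.985 Å
```

The TM-score difference is expressed relative to the ROSETTA average (0.061 / 0.259), while the RMSD difference is an absolute difference in Å.
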

Supplementary Table 2. Mean and standard deviation of TM-score and RMSD of the FRAGSION and ROSETTA models.

Target / FRAGSION TM-score Mean / FRAGSION TM-score STD / FRAGSION RMSD Mean (Å) / FRAGSION RMSD STD (Å) / ROSETTA TM-score Mean / ROSETTA TM-score STD / ROSETTA RMSD Mean (Å) / ROSETTA RMSD STD (Å)
T0761-D1 / 0.213 / 0.032 / 17.112 / 2.676 / 0.283 / 0.035 / 15.528 / 3.409
T0761-D2 / 0.213 / 0.016 / 19.165 / 2.834 / 0.224 / 0.021 / 18.040 / 2.883
T0763-D1 / 0.187 / 0.024 / 19.000 / 3.124 / 0.209 / 0.026 / 17.646 / 2.648
T0767-D2 / 0.179 / 0.022 / 22.799 / 2.441 / 0.224 / 0.036 / 20.711 / 2.473
T0771-D1 / 0.191 / 0.025 / 19.259 / 2.215 / 0.244 / 0.030 / 18.359 / 2.493
T0777-D1 / 0.196 / 0.023 / 23.650 / 1.946 / 0.227 / 0.033 / 22.283 / 2.950
T0781-D1 / 0.174 / 0.018 / 25.566 / 4.015 / 0.188 / 0.020 / 23.384 / 2.273
T0785-D1 / 0.187 / 0.020 / 16.001 / 1.613 / 0.217 / 0.023 / 15.258 / 1.766
T0789-D1 / 0.193 / 0.025 / 19.392 / 2.366 / 0.278 / 0.033 / 16.729 / 1.851
T0789-D2 / 0.195 / 0.028 / 19.098 / 2.451 / 0.289 / 0.034 / 15.792 / 2.242
T0790-D1 / 0.210 / 0.028 / 18.593 / 2.625 / 0.382 / 0.052 / 13.102 / 2.192
T0790-D2 / 0.198 / 0.025 / 18.506 / 2.424 / 0.290 / 0.055 / 15.661 / 2.230
T0791-D1 / 0.191 / 0.033 / 19.735 / 2.507 / 0.248 / 0.033 / 19.037 / 2.503
T0791-D2 / 0.188 / 0.025 / 20.787 / 2.758 / 0.243 / 0.028 / 17.853 / 2.482
T0794-D2 / 0.137 / 0.020 / 28.234 / 4.515 / 0.181 / 0.023 / 24.343 / 3.574
T0806-D1 / 0.180 / 0.020 / 22.561 / 1.765 / 0.214 / 0.030 / 20.518 / 2.016
T0808-D2 / 0.153 / 0.018 / 28.436 / 2.857 / 0.203 / 0.028 / 24.898 / 2.232
T0810-D1 / 0.161 / 0.019 / 18.920 / 2.705 / 0.267 / 0.037 / 16.562 / 2.651
T0814-D1 / 0.147 / 0.017 / 25.855 / 3.886 / 0.182 / 0.026 / 22.469 / 3.037
T0814-D2 / 0.158 / 0.022 / 25.206 / 4.493 / 0.194 / 0.028 / 23.645 / 4.106
T0820-D1 / 0.264 / 0.028 / 14.864 / 1.697 / 0.301 / 0.033 / 15.152 / 2.485
T0824-D1 / 0.206 / 0.023 / 14.547 / 1.177 / 0.259 / 0.021 / 14.306 / 1.146
T0827-D2 / 0.201 / 0.031 / 18.667 / 2.178 / 0.277 / 0.041 / 16.813 / 1.685
T0831-D2 / 0.214 / 0.023 / 26.869 / 5.949 / 0.243 / 0.026 / 23.496 / 3.312
T0832-D1 / 0.212 / 0.025 / 19.669 / 2.463 / 0.290 / 0.035 / 18.893 / 2.338
T0834-D1 / 0.207 / 0.023 / 17.356 / 2.391 / 0.291 / 0.050 / 16.969 / 2.574
T0834-D2 / 0.218 / 0.026 / 13.909 / 1.899 / 0.279 / 0.032 / 12.698 / 1.673
T0836-D1 / 0.247 / 0.035 / 17.226 / 2.660 / 0.283 / 0.034 / 16.403 / 2.769
T0837-D1 / 0.249 / 0.034 / 14.602 / 1.957 / 0.365 / 0.055 / 11.884 / 2.539
T0855-D1 / 0.267 / 0.037 / 13.806 / 1.741 / 0.379 / 0.067 / 11.413 / 2.634
Average / 0.198 / 0.025 / 19.980 / 2.678 / 0.259 / 0.034 / 17.995 / 2.505

Supplementary Figure 3. Density of TM-score and RMSD of the FRAGSION and ROSETTA models. The X-axis represents TM-score (a) and RMSD (b), and the Y-axis represents the density of models. The mean TM-scores for FRAGSION and ROSETTA are 0.198 and 0.259, with standard deviations of 0.025 and 0.034, respectively. The mean RMSDs for FRAGSION and ROSETTA are 19.980 Å and 17.995 Å, with standard deviations of 2.678 Å and 2.505 Å, respectively.

Evaluation of the best predictions

To investigate how the choice of fragment library affects the quality of the best prediction, we identified the highest TM-score and the lowest RMSD models generated by FRAGSION and ROSETTA by comparison with the corresponding experimental domains. Supplementary Table 3 reports the performance of FRAGSION and ROSETTA in terms of these best predictions. For four targets, ROSETTA achieved a TM-score higher than 0.5, indicating a correct overall fold, while FRAGSION’s TM-scores for those targets were lower. For targets T0837-D1 and T0855-D1, the best models produced by FRAGSION reach TM-scores of about 0.39. Over the entire dataset, ROSETTA outperformed FRAGSION in terms of TM-score. In terms of RMSD, FRAGSION outperformed ROSETTA for six targets. For example, for target T0836-D1, FRAGSION generated a model with an RMSD of 11.6 Å, whereas ROSETTA’s best prediction has an RMSD of 15.1 Å.

Supplementary Table 3. Highest TM-score and lowest RMSD for each target by FRAGSION and ROSETTA. Numbers in bold indicate that the best prediction by FRAGSION is better than ROSETTA.

Target / FRAGSION TM-score / FRAGSION RMSD (Å) / ROSETTA TM-score / ROSETTA RMSD (Å)
T0761-D1 / 0.2904 / 15.152 / 0.3552 / 9.799
T0761-D2 / 0.2749 / 16.455 / 0.2867 / 13.103
T0763-D1 / 0.2597 / 16.977 / 0.2768 / 16.065
T0767-D2 / 0.2475 / 17.802 / 0.3339 / 18.178
T0771-D1 / 0.2526 / 14.48 / 0.3812 / 12.214
T0777-D1 / 0.26 / 21.603 / 0.3321 / 18.986
T0781-D1 / 0.23 / 21.209 / 0.2467 / 22.501
T0785-D1 / 0.2457 / 11.299 / 0.2817 / 14.253
T0789-D1 / 0.2972 / 19.925 / 0.3602 / 13.554
T0789-D2 / 0.2636 / 18.517 / 0.3779 / 8.938
T0790-D1 / 0.2901 / 16.768 / 0.5716 / 9.502
T0790-D2 / 0.2664 / 13.574 / 0.5131 / 6.896
T0791-D1 / 0.3011 / 18.496 / 0.3304 / 16.432
T0791-D2 / 0.2602 / 19.145 / 0.3222 / 10.363
T0794-D2 / 0.2027 / 20.633 / 0.2494 / 18.704
T0806-D1 / 0.2559 / 17.219 / 0.3076 / 20.628
T0808-D2 / 0.2033 / 22.8 / 0.3015 / 20.354
T0810-D1 / 0.2105 / 21.442 / 0.384 / 9.568
T0814-D1 / 0.2059 / 24.283 / 0.2777 / 18.983
T0814-D2 / 0.2218 / 23.425 / 0.2877 / 20.553
T0820-D1 / 0.3386 / 12.592 / 0.4433 / 11.923
T0824-D1 / 0.2843 / 13.067 / 0.319 / 15.194
T0827-D2 / 0.3209 / 18.583 / 0.3712 / 14.908
T0831-D2 / 0.2777 / 25.916 / 0.3468 / 19.362
T0832-D1 / 0.2688 / 21.315 / 0.4285 / 18.178
T0834-D1 / 0.2802 / 18.811 / 0.4267 / 11.662
T0834-D2 / 0.312 / 12.322 / 0.3985 / 8.534
T0836-D1 / 0.344 / 11.621 / 0.3679 / 15.051
T0837-D1 / 0.3874 / 8.994 / 0.518 / 7.947
T0855-D1 / 0.3905 / 11.343 / 0.5477 / 5.3
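
The per-target counts quoted in the text for Supplementary Table 3 can be re-derived from the rows above. The sketch below embeds the table in its " / "-separated form and tallies the comparisons; the variable names are introduced here for illustration only.

```python
TABLE_3 = """
T0761-D1 / 0.2904 / 15.152 / 0.3552 / 9.799
T0761-D2 / 0.2749 / 16.455 / 0.2867 / 13.103
T0763-D1 / 0.2597 / 16.977 / 0.2768 / 16.065
T0767-D2 / 0.2475 / 17.802 / 0.3339 / 18.178
T0771-D1 / 0.2526 / 14.48 / 0.3812 / 12.214
T0777-D1 / 0.26 / 21.603 / 0.3321 / 18.986
T0781-D1 / 0.23 / 21.209 / 0.2467 / 22.501
T0785-D1 / 0.2457 / 11.299 / 0.2817 / 14.253
T0789-D1 / 0.2972 / 19.925 / 0.3602 / 13.554
T0789-D2 / 0.2636 / 18.517 / 0.3779 / 8.938
T0790-D1 / 0.2901 / 16.768 / 0.5716 / 9.502
T0790-D2 / 0.2664 / 13.574 / 0.5131 / 6.896
T0791-D1 / 0.3011 / 18.496 / 0.3304 / 16.432
T0791-D2 / 0.2602 / 19.145 / 0.3222 / 10.363
T0794-D2 / 0.2027 / 20.633 / 0.2494 / 18.704
T0806-D1 / 0.2559 / 17.219 / 0.3076 / 20.628
T0808-D2 / 0.2033 / 22.8 / 0.3015 / 20.354
T0810-D1 / 0.2105 / 21.442 / 0.384 / 9.568
T0814-D1 / 0.2059 / 24.283 / 0.2777 / 18.983
T0814-D2 / 0.2218 / 23.425 / 0.2877 / 20.553
T0820-D1 / 0.3386 / 12.592 / 0.4433 / 11.923
T0824-D1 / 0.2843 / 13.067 / 0.319 / 15.194
T0827-D2 / 0.3209 / 18.583 / 0.3712 / 14.908
T0831-D2 / 0.2777 / 25.916 / 0.3468 / 19.362
T0832-D1 / 0.2688 / 21.315 / 0.4285 / 18.178
T0834-D1 / 0.2802 / 18.811 / 0.4267 / 11.662
T0834-D2 / 0.312 / 12.322 / 0.3985 / 8.534
T0836-D1 / 0.344 / 11.621 / 0.3679 / 15.051
T0837-D1 / 0.3874 / 8.994 / 0.518 / 7.947
T0855-D1 / 0.3905 / 11.343 / 0.5477 / 5.3
"""

# Each row: target, FRAGSION TM-score, FRAGSION RMSD, ROSETTA TM-score, ROSETTA RMSD.
rows = []
for line in TABLE_3.strip().splitlines():
    target, *values = [field.strip() for field in line.split("/")]
    rows.append((target, *map(float, values)))

# Targets where the best FRAGSION model has the lower RMSD.
fragsion_rmsd_wins = [t for t, f_tm, f_rmsd, r_tm, r_rmsd in rows if f_rmsd < r_rmsd]
# Targets where the best FRAGSION model has the higher TM-score.
fragsion_tm_wins = [t for t, f_tm, f_rmsd, r_tm, r_rmsd in rows if f_tm > r_tm]
# Targets where the best ROSETTA model exceeds a TM-score of 0.5.
rosetta_correct_fold = [t for t, f_tm, f_rmsd, r_tm, r_rmsd in rows if r_tm > 0.5]
```

The six targets in `fragsion_rmsd_wins` are T0767-D2, T0781-D1, T0785-D1, T0806-D1, T0824-D1 and T0836-D1, and the four targets in `rosetta_correct_fold` are T0790-D1, T0790-D2, T0837-D1 and T0855-D1.
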

Assessment of the lowest-energy predictions

To further examine the effect of the fragment library on model quality, we analyzed the lowest-energy models generated by FRAGSION and ROSETTA, ranked using ROSETTA’s scoring function. This analysis is more realistic than the analysis based on the best models, particularly in blind structure prediction, where the experimental structure is not available for selecting a model. Supplementary Table 4 reports the performance of FRAGSION and ROSETTA in terms of the lowest-energy prediction. FRAGSION outperformed ROSETTA for eleven targets in terms of RMSD. For three targets, FRAGSION achieved a higher TM-score than ROSETTA.

Supplementary Table 4. TM-score and RMSD of the lowest-energy model for each target by FRAGSION and ROSETTA. Numbers in bold indicate that the lowest-energy model by FRAGSION is better than that by ROSETTA.

Target / FRAGSION TM-score / FRAGSION RMSD (Å) / ROSETTA TM-score / ROSETTA RMSD (Å)
T0761-D1 / 0.1961 / 15.741 / 0.2219 / 11.94
T0761-D2 / 0.2 / 17.757 / 0.2111 / 18.704
T0763-D1 / 0.2028 / 17.131 / 0.2198 / 16.226
T0767-D2 / 0.1526 / 22.611 / 0.1863 / 23.534
T0771-D1 / 0.2238 / 18.676 / 0.2405 / 18.23
T0777-D1 / 0.1833 / 24.693 / 0.2476 / 18.281
T0781-D1 / 0.1902 / 20.468 / 0.2055 / 22.139
T0785-D1 / 0.1928 / 14.399 / 0.1953 / 15.0
T0789-D1 / 0.1479 / 24.265 / 0.2458 / 17.642
T0789-D2 / 0.1912 / 18.436 / 0.3332 / 16.383
T0790-D1 / 0.1895 / 18.12 / 0.3467 / 15.766
T0790-D2 / 0.1921 / 19.35 / 0.2896 / 14.4
T0791-D1 / 0.188 / 20.723 / 0.2481 / 20.853
T0791-D2 / 0.1531 / 24.013 / 0.2392 / 24.116
T0794-D2 / 0.1723 / 37.499 / 0.1553 / 29.063
T0806-D1 / 0.2197 / 23.53 / 0.2695 / 16.616
T0808-D2 / 0.1563 / 30.448 / 0.207 / 23.654
T0810-D1 / 0.1444 / 21.869 / 0.293 / 12.353
T0814-D1 / 0.1445 / 27.472 / 0.1656 / 22.183
T0814-D2 / 0.1283 / 24.247 / 0.2098 / 19.712
T0820-D1 / 0.2943 / 15.31 / 0.2823 / 15.001
T0824-D1 / 0.2843 / 13.067 / 0.2451 / 13.974
T0827-D2 / 0.2103 / 21.881 / 0.2846 / 13.198
T0831-D2 / 0.22 / 21.711 / 0.2512 / 23.117
T0832-D1 / 0.2096 / 18.784 / 0.2772 / 21.337
T0834-D1 / 0.1917 / 16.025 / 0.2546 / 20.177
T0834-D2 / 0.2107 / 12.965 / 0.3086 / 14.781
T0836-D1 / 0.2469 / 18.624 / 0.3626 / 16.68
T0837-D1 / 0.2638 / 14.599 / 0.4043 / 11.813
T0855-D1 / 0.2568 / 14.803 / 0.3909 / 10.213

Supplementary Figure 4. Target by target comparison between FRAGSION and ROSETTA in terms of precision. Precision at various RMSD cutoffs for each target in the dataset generated by FRAGSION (red) and ROSETTA (blue).

Supplementary Figure 5. Target by target comparison between FRAGSION and ROSETTA in terms of coverage. Coverage at various RMSD cutoffs for each target in the dataset generated by FRAGSION (red) and ROSETTA (blue).

Supplementary Figure 6. Target by target comparison between FRAGSION and ROSETTA in terms of RMSD. RMSD at different fragment lengths for each target in the dataset generated by FRAGSION (red) and ROSETTA (blue).

Supplementary Figure 7. Target by target comparison between FRAGSION and ROSETTA in terms of computation time. Computation time at different fragment lengths for each target in the dataset generated by FRAGSION (red) and ROSETTA (blue).

References

Bhattacharya, D. and Cheng, J. (2015) De novo protein conformational sampling using a probabilistic graphical model, Scientific reports, 5.

Burnham, K.P. and Anderson, D.R. (2002) Model selection and multimodel inference: a practical information-theoretic approach. Springer Science & Business Media.

Cawley, S.L. and Pachter, L. (2003) HMM sampling and applications to gene finding and alternative splicing, Bioinformatics, 19, ii36-ii41.

Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices, Journal of molecular biology, 292, 195-202.

Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features, Biopolymers, 22, 2577-2637.

Karplus, P.A. (1996) Experimentally observed conformation‐dependent geometry and hidden strain in proteins, Protein Science, 5, 1406-1420.

Leaver-Fay, A., et al. (2011) ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules, Methods in enzymology, 487, 545.

Lovell, S.C., et al. (2003) Structure validation by Cα geometry: ϕ, ψ and Cβ deviation, Proteins: Structure, Function, and Bioinformatics, 50, 437-450.

Mardia, K.V. and Jupp, P.E. (2009) Directional statistics. John Wiley & Sons.

Mardia, K.V., Taylor, C.C. and Subramaniam, G.K. (2007) Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data, Biometrics, 63, 505-512.

Nielsen, S.F. (2000) The stochastic EM algorithm: estimation and asymptotic results, Bernoulli, 457-489.

Paluszewski, M. and Hamelryck, T. (2010) Mocapy++-A toolkit for inference and learning in dynamic Bayesian networks, BMC bioinformatics, 11, 126.

Van Walle, I., Lasters, I. and Wyns, L. (2005) SABmark—a benchmark for sequence alignment that covers the entire known fold space, Bioinformatics, 21, 1267-1268.

Zhang, Y. and Skolnick, J. (2004) Scoring function for automated assessment of protein structure template quality, Proteins: Structure, Function, and Bioinformatics, 57, 702-710.