Supplementary Information
Prototypes of elementary functional loops unravel evolutionary connections between protein functions
Alexander Goncearenco and Igor N. Berezovsky

APPENDIX A: ARCHAEAL PROTEOMES USED IN PROTOTYPE DERIVATION

The following 68 archaeal proteomes were used to derive sequence prototypes. Four proteomes (bold font), each representing a separate archaeal phylum, were used as the source of origins:


  1. Methanococcus maripaludis C7
  2. Candidatus Methanoregula boonei 6A8
  3. Caldivirga maquilingensis IC-167
  4. Methanococcus maripaludis C5
  5. Methanothermobacter thermautotrophicus str Delta H
  6. Staphylothermus marinus F1
  7. Haloarcula marismortui ATCC 43049
  8. Methanococcus aeolicus Nankai-3
  9. Nitrosopumilus maritimus SCM1
  10. Thermoproteus neutrophilus V24Sta
  11. Thermococcus kodakarensis KOD1
  12. Pyrobaculum islandicum DSM 4184
  13. Methanococcus vannielii SB
  14. Sulfolobus islandicus L S 2 15
  15. Sulfolobus islandicus Y N 15 51
  16. Halorhabdus utahensis DSM 12940
  17. Ignicoccus hospitalis KIN4
  18. Ignicoccus hospitalis KIN4/I
  19. Methanoculleus marisnigri JR1
  20. Sulfolobus solfataricus P2
  21. Methanospirillum hungatei JF-1
  22. Sulfolobus tokodaii str 7
  23. Methanosaeta thermophila PT
  24. Halobacterium sp NRC-1
  25. Methanosphaerula palustris E1-9c
  26. Archaeoglobus fulgidus DSM 4304
  27. Picrophilus torridus DSM 9790
  28. Aeropyrum pernix K1
  29. Thermoplasma acidophilum DSM 1728
  30. Desulfurococcus kamchatkensis 1221n
  31. Thermococcus sibiricus MM 739
  32. Methanosphaera stadtmanae DSM 3091
  33. Sulfolobus islandicus M 16 4
  34. Pyrobaculum arsenaticum DSM 13514
  35. Methanobrevibacter smithii ATCC 35061
  36. Methanosarcina acetivorans C2A
  37. Pyrococcus horikoshii OT3
  38. Sulfolobus islandicus M 16 27
  39. Halomicrobium mukohataei DSM 12286
  40. Natronomonas pharaonis DSM 2160
  41. Candidatus Korarchaeum cryptofilum OPF8
  42. uncultured methanogenic archaeon RC-I
  43. Pyrococcus furiosus DSM 3638
  44. Thermoplasma volcanium GSS1
  45. Halorubrum lacusprofundi ATCC 49239
  46. Halobacterium salinarum R1
  47. Methanosarcina mazei Go1
  48. Pyrobaculum aerophilum str IM2
  49. Methanopyrus kandleri AV19
  50. Methanosarcina barkeri str Fusaro
  51. Methanocorpusculum labreanum Z
  52. Sulfolobus islandicus Y G 57 14
  53. Pyrococcus abyssi GE5
  54. Methanococcus maripaludis C6
  55. Methanococcoides burtonii DSM 6242
  56. Methanococcus maripaludis S2
  57. Haloquadratum walsbyi DSM 16790
  58. Thermococcus gammatolerans EJ3
  59. Sulfolobus acidocaldarius DSM 639
  60. Metallosphaera sedula DSM 5348
  61. Thermofilum pendens Hrk 5
  62. Methanocaldococcus fervens AG86
  63. Hyperthermus butylicus DSM 5456
  64. Pyrobaculum calidifontis JCM 11548
  65. Nanoarchaeum equitans Kin4-M
  66. Sulfolobus islandicus M 14 25
  67. Thermococcus onnurineus NA1
  68. Methanocaldococcus jannaschii DSM 2661



APPENDIX B: SEQUENCE LOGOS OF THE PROTOTYPES
APPENDIX C: COMPUTATIONAL PROCEDURE


Figure S1: Workflow of the prototype derivation procedure.

1.1  Scoring function and the model

Given a proteomic amino acid composition $c$, the probability of observing any random sequence segment $s = s_1 \dots s_n$ of length $n$ is $P(s \mid R) = \prod_{i=1}^{n} c_{s_i}$.

$P(s \mid R)$ is the likelihood of the sequence $s$ given the set of random sequences $R$, representing a random or reshuffled proteome. All sequences in $R$ are assumed to be unrelated; therefore $R$ is used as background. Since a profile $M$ represents a set of related sequences, the log-likelihood ratio, $\log [P(s \mid M) / P(s \mid R)]$, determines whether a particular sequence $s$ is related to profile $M$, or whether it is random and belongs to $R$. The profile is a matrix of amino acid frequencies per position: $f_{ij}$, the frequency of amino acid $j$ at position $i$. Therefore,

$$\log \frac{P(s \mid M)}{P(s \mid R)} = \sum_{i=1}^{n} \log \frac{f_{i s_i}}{c_{s_i}}.$$ (1)

In order to score the matches between profile and sequences, a position-specific scoring matrix (PSSM) is constructed. For a profile with 50 positions and 20 amino acids, the PSSM is calculated as follows: $w_{ij} = \log (f'_{ij} / c_j)$, $i = 1..50$, $j = 1..20$, where $f'$ is the matrix of observed frequencies corrected with pseudo-counts (Altschul, et al., 2009). The latter are used in order to score unobserved amino acids by giving them negligible frequencies proportional to the proteomic composition $c$: $f'_{ij} = (\alpha f_{ij} + \beta c_j) / (\alpha + \beta)$, where α is the number of sequences in the profile and β is an empirically chosen pseudo-count coefficient. Consequently, $w_{ij}$ is always defined, because $f'_{ij} > 0$.

The overall score of sequence $s$ to profile $M$ is then:

$$S(s, M) = \sum_{i=1}^{n} w_{i s_i},$$ (2)

with an upper boundary: $S(s, M) \le S_{max}$, where $S_{max} = \sum_{i=1}^{n} \max_j w_{ij}$.
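The scoring scheme described above can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes the standard log-odds form for the PSSM, a pseudo-count correction that mixes observed frequencies with the background composition in proportion to α and β, and a sum of per-position weights as the score. The function names (`build_pssm`, `score`, `max_score`), the toy segments, the uniform background and the value of `beta` are all hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sequences, composition, beta=1.0):
    """Build a log-odds PSSM from equal-length segments.

    sequences   -- aligned segments that form the profile
    composition -- dict mapping amino acid -> background frequency c
    beta        -- pseudo-count coefficient (illustrative value, an assumption)
    """
    alpha = len(sequences)  # number of sequences in the profile
    n = len(sequences[0])
    pssm = []
    for i in range(n):
        column = [seq[i] for seq in sequences]
        weights = {}
        for a in AMINO_ACIDS:
            observed = column.count(a) / alpha
            # Assumed pseudo-count form: unobserved residues get a small
            # frequency proportional to the background composition.
            corrected = (alpha * observed + beta * composition[a]) / (alpha + beta)
            weights[a] = math.log(corrected / composition[a])
        pssm.append(weights)
    return pssm


def score(segment, pssm):
    """Sum the position-specific weights of the residues in the segment."""
    return sum(pssm[i][a] for i, a in enumerate(segment))


def max_score(pssm):
    """Upper boundary of the score: best weight at every position."""
    return sum(max(column.values()) for column in pssm)


if __name__ == "__main__":
    background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    toy_pssm = build_pssm(["ACD", "ACE", "ACD"], background)
    for segment in ("ACD", "ACE", "WWW"):
        print(segment, round(score(segment, toy_pssm), 3), "of", round(max_score(toy_pssm), 3))
```

Because every corrected frequency is strictly positive, the logarithm is defined for all residues, including those never observed in the profile.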


Figure S2: Convergence dynamics of the prototype. The plot shows how the number of matches changes as the prototype converges (red line). At iteration 4 the profile becomes more generic and collects many more matches (note the logarithmic scale). The profile is considered converged when the distance D is lower than epsilon (dotted line).

The procedure stops when the profile is no longer updated, which is determined by the distance D between the current profile and the previous profile:

, where i = 1..n is position in profile.

When D is smaller than a given threshold ε, the profiles are considered converged. Figure S2 shows how the profile converges over iterations. The number of sequences included in the profile increases over the iterations (red), thus increasing the sensitivity of the profile. At the same time, the profile converges and does not degenerate, which is manifested in the decreasing distance (black). At some point (iteration 4 in the figure), a more generic signature is obtained, resulting in a dramatic increase in the number of matches, which means that the profile now corresponds to a larger number of EFLs. Nevertheless, the more generic profile converges quickly.
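The iteration can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the published procedure. The distance D is taken here to be the sum over positions of the Euclidean distance between frequency columns, since the exact formula is not reproduced above. Matches are selected with a raw score cutoff rather than an E-value. The names `converge`, `profile_distance`, `cutoff` and `max_iter` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequencies(segments):
    """Per-position amino acid frequencies of equal-length segments."""
    n = len(segments[0])
    return [[sum(s[i] == a for s in segments) / len(segments) for a in AMINO_ACIDS]
            for i in range(n)]


def profile_distance(current, previous):
    # ASSUMPTION: D is the sum over positions i = 1..n of the Euclidean
    # distance between the frequency columns of the two profiles.
    return sum(math.dist(c, p) for c, p in zip(current, previous))


def converge(origin, database, composition, cutoff, beta=1.0, epsilon=1e-6, max_iter=50):
    """Iterate search and profile update until D < epsilon.

    Returns the final profile, its member segments, and a history of
    (number of matches, distance D) per iteration.
    """
    members = [origin]
    profile = frequencies(members)
    history = []
    for _ in range(max_iter):
        alpha = len(members)

        def segment_score(segment):
            total = 0.0
            for i, a in enumerate(segment):
                j = AMINO_ACIDS.index(a)
                corrected = (alpha * profile[i][j] + beta * composition[a]) / (alpha + beta)
                total += math.log(corrected / composition[a])
            return total

        # Keep every database segment that scores above the (assumed) cutoff.
        members = [seg for seg in database if segment_score(seg) >= cutoff] or [origin]
        updated = frequencies(members)
        d = profile_distance(updated, profile)
        history.append((len(members), d))
        profile = updated
        if d < epsilon:
            break
    return profile, members, history


if __name__ == "__main__":
    background = {a: 0.05 for a in AMINO_ACIDS}
    toy_database = ["ACDE", "ACDF", "ACDE", "WWWW", "KLMN"]
    _, found, trace = converge("ACDE", toy_database, background, cutoff=5.0)
    print(found)
    print(trace)
```

The history returned by `converge` holds the two quantities plotted in Figure S2: the number of matches and the distance per iteration.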



Figure S3: Hierarchical clustering of profiles. Converged profiles are compared, and the most similar profiles are merged iteratively. (A) Diagram showing how profiles are iteratively joined until a certain threshold of profile-profile distance is reached. (B) The difference between merged profiles increases with the iterations; two thresholds (dotted lines), 150 and 163, mark iterations where distances between joined profiles start to grow faster, ending with the undesired merging of completely dissimilar profiles. (C) Determining the threshold that prevents merging of dissimilar profiles. To find the stop point of the clustering procedure, we modeled the distribution of pairwise distances between random (unrelated) profiles by shuffling them. The distribution of pairwise distances between dissimilar reshuffled profiles is shown as a green curve. Over the iterations, the bulk of similar profile pairs is merged and the shape of the distribution gradually approaches the modeled random distribution. At that point clustering should not continue, and the procedure is terminated.

The pairwise distance matrix Q is obtained by comparing all converged profiles. To make clustering more efficient we apply a greedy heuristic in which all pairs of patterns with a distance below a certain threshold t are merged:

, where indicates how greedy the clustering is (we use ).

Over the iterations, as similar patterns are joined, the variance of Q decreases, which in turn decreases the greediness of the clustering.
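A greedy merging round of this kind might look like the Python sketch below. It rests on loudly stated assumptions. The threshold is taken as t = mean(Q) - k * sd(Q), chosen only because it ties the greediness to the spread of Q; the actual expression and the value of the greediness parameter are not reproduced above. Merging is done by plain averaging of frequencies, and `stop_distance` stands in for the cutoff derived from reshuffled profiles. All names are hypothetical.

```python
import math
import statistics
from itertools import combinations


def distance(p, q):
    # ASSUMPTION: sum over positions of Euclidean distances between columns.
    return sum(math.dist(a, b) for a, b in zip(p, q))


def merge(p, q):
    # ASSUMPTION: merged profile is the unweighted average of the two.
    return [[(x + y) / 2 for x, y in zip(a, b)] for a, b in zip(p, q)]


def greedy_round(profiles, k=1.0, stop_distance=float("inf")):
    """Merge, in one pass, all disjoint pairs closer than the threshold t."""
    pairs = sorted((distance(profiles[i], profiles[j]), i, j)
                   for i, j in combinations(range(len(profiles)), 2))
    if not pairs:
        return profiles, False
    values = [d for d, _, _ in pairs]
    # ASSUMED threshold: shrinks toward the mean as the spread of Q shrinks.
    t = min(statistics.mean(values) - k * statistics.pstdev(values), stop_distance)
    used, merged = set(), []
    for d, i, j in pairs:
        if d >= t:
            break
        if i in used or j in used:
            continue  # each profile takes part in at most one merge per round
        used.update((i, j))
        merged.append(merge(profiles[i], profiles[j]))
    rest = [p for idx, p in enumerate(profiles) if idx not in used]
    return merged + rest, bool(merged)


def cluster(profiles, k=1.0, stop_distance=float("inf"), max_rounds=100):
    """Repeat greedy rounds until no pair falls below the threshold."""
    for _ in range(max_rounds):
        profiles, changed = greedy_round(profiles, k, stop_distance)
        if not changed:
            break
    return profiles


if __name__ == "__main__":
    # Toy one-position profiles over a two-letter alphabet.
    toy = [[[1.0, 0.0]], [[0.9, 0.1]], [[0.0, 1.0]], [[0.1, 0.9]]]
    print(cluster(toy, k=1.0, stop_distance=0.5))
```

Merging several pairs per round avoids recomputing Q after every single merge, which is where the efficiency gain of the greedy variant comes from.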


1.2  Comparison with PSI-BLAST

Profile sequence analysis is a widely used technique implemented in PSSM-based tools such as PSI-BLAST. In general, all profile methods produce comparable results, but given the specific requirements of the task we expect a specially designed procedure to perform better. For example, if PSI-BLAST were applied to derive prototypes of EFLs, we would face the following problems: (a) neither the substitution matrix nor the gap cost model can be inferred for events of pre-domain evolution; (b) positions in the profiles are considered independent, so the presence of a signature is not reflected; (c) the significance of a match is estimated using the extreme value distribution (EVD), pre-fitted on simulated random alignments. Fitting the parameters, especially the scale parameter lambda of the EVD, is a known problem (Altschul, et al., 2001), and the fit depends on the sequence length and on the composition of the proteome. Therefore, searching a database with a different composition would require recalibration. For short loop-sized sequence segments the error of the EVD approximation becomes unacceptably high (Wolfsheimer, et al., 2007), which results in erroneous estimation of significance and incorrect E-values. Indeed, our preliminary calculations show that if PSI-BLAST is used to derive prototypes of EFLs, it starts to collect false positives at lower E-values (E-value (E) / number of families (F): 0.001/1; 0.01/2; 0.05/2; 0.1/2; 0.5/22); for a comparison at E-values starting from 1, see Table S1. Another pronounced difference is that PSI-BLAST assumes an equal contribution from each profile position to the score, whereas our procedure assumes the presence of a functional signature and penalizes mismatches at conserved positions, which is reflected in the overall score. Therefore, mismatches at functionally important positions are less probable with our method.

Table S1. Comparison of the diversity of matches obtained with PSI-BLAST and with our procedure

E-value / Diversity of SCOP families in PSI-BLAST / Diversity of SCOP families in our procedure
1 / 23 / 7
2 / 27 / 7
3 / 29 / 9
4 / 30 / 13
5 / 37 / 17
10 / 44 / 30
20 / 55 / 39
30 / 58 / 48
40 / 61 / 60

A sequence database of non-redundant domains from the 68 archaeal proteomes was used. PSI-BLAST and the procedure discussed above were used to converge the profiles against the sequence database starting from the same origin. Different E-values were applied in both PSI-BLAST and our procedure. The number of PSI-BLAST iterations was fixed at 10. Each converged profile was then analyzed against the ASTRAL/SCOP database (at 95% sequence redundancy), and the corresponding matches were compared. Numbers in the table represent how many different SCOP families were encountered among the matches of the prototype. It is important to note that the PSI-BLAST procedure starts to collect an unreasonably high number of families already at an E-value as low as 0.5. The logo of the corresponding profile is shown below.
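The diversity measure reported in the table amounts to counting distinct SCOP family classifications among the matched domains. A minimal Python sketch follows, assuming that each match is available as an ASTRAL-style FASTA header whose second whitespace-separated field is the SCOP classification string (class.fold.superfamily.family). The function name and the domain identifiers in the example are hypothetical.

```python
def family_diversity(matched_headers):
    """Count distinct SCOP family classifications among matched domains.

    matched_headers -- iterable of header lines such as
                       ">d0aaaa_ c.37.1.8 (A:) description"
    """
    families = set()
    for header in matched_headers:
        fields = header.lstrip(">").split()
        if len(fields) >= 2:
            # Second field is assumed to be class.fold.superfamily.family.
            families.add(fields[1])
    return len(families)


if __name__ == "__main__":
    # Hypothetical headers, for illustration only.
    example_matches = [
        ">d0aaaa_ c.37.1.8 (A:) example domain",
        ">d0bbbb_ c.37.1.8 (B:) example domain",
        ">d0cccc_ a.1.1.1 (A:) example domain",
    ]
    print(family_diversity(example_matches))
```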


References

Altschul, S.F., Bundschuh, R., Olsen, R. and Hwa, T. (2001) The estimation of statistical parameters for local alignment score distributions, Nucleic Acids Res, 29, 351-361.

Altschul, S.F., Gertz, E.M., Agarwala, R., Schaffer, A.A. and Yu, Y.K. (2009) PSI-BLAST pseudocounts and the minimum description length principle, Nucleic Acids Res, 37, 815-824.

Wolfsheimer, S., Burghardt, B. and Hartmann, A. (2007) Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail, Algorithms for Molecular Biology, 2, 9.
