Prototypes of elementary functional loops unravel evolutionary connections between protein functions
Alexander Goncearenco and Igor N. Berezovsky
APPENDIX A: ArchaEAL PROTEOMES USED IN PROTOTYPE DERIVATION
The following 68 archaeal proteomes were used to derive sequence prototypes. Four proteomes (bold font), each representing a separate archaeal phylum, were used as source of origins:
3
- Methanococcus maripaludis C7
- Candidatus Methanoregula boonei 6A8
- Caldivirga maquilingensis IC-167
- Methanococcus maripaludis C5
- Methanothermobacter thermautotrophicus str Delta H
- Staphylothermus marinus F1
- Haloarcula marismortui ATCC 43049
- Methanococcus aeolicus Nankai-3
- Nitrosopumilus maritimus SCM1
- Thermoproteus neutrophilus V24Sta
- Thermococcus kodakarensis KOD1
- Pyrobaculum islandicum DSM 4184
- Methanococcus vannielii SB
- Sulfolobus islandicus L S 2 15
- Sulfolobus islandicus Y N 15 51
- Halorhabdus utahensis DSM 12940
- Ignicoccus hospitalis KIN4
- Ignicoccus hospitalis KIN4/I
- Methanoculleus marisnigri JR1
- Sulfolobus solfataricus P2
- Methanospirillum hungatei JF-1
- Sulfolobus tokodaii str 7
- Methanosaeta thermophila PT
- Halobacterium sp NRC-1
- Methanosphaerula palustris E1-9c
- Archaeoglobus fulgidus DSM 4304
- Picrophilus torridus DSM 9790
- Aeropyrum pernix K1
- Thermoplasma acidophilum DSM 1728
- Desulfurococcus kamchatkensis 1221n
- Thermococcus sibiricus MM 739
- Methanosphaera stadtmanae DSM 3091
- Sulfolobus islandicus M 16 4
- Pyrobaculum arsenaticum DSM 13514
- Methanobrevibacter smithii ATCC 35061
- Methanosarcina acetivorans C2A
- Pyrococcus horikoshii OT3
- Sulfolobus islandicus M 16 27
- Halomicrobium mukohataei DSM 12286
- Natronomonas pharaonis DSM 2160
- Candidatus Korarchaeum cryptofilum OPF8
- uncultured methanogenic archaeon RC-I
- Pyrococcus furiosus DSM 3638
- Thermoplasma volcanium GSS1
- Halorubrum lacusprofundi ATCC 49239
- Halobacterium salinarum R1
- Methanosarcina mazei Go1
- Pyrobaculum aerophilum str IM2
- Methanopyrus kandleri AV19
- Methanosarcina barkeri str Fusaro
- Methanocorpusculum labreanum Z
- Sulfolobus islandicus Y G 57 14
- Pyrococcus abyssi GE5
- Methanococcus maripaludis C6
- Methanococcoides burtonii DSM 6242
- Methanococcus maripaludis S2
- Haloquadratum walsbyi DSM 16790
- Thermococcus gammatolerans EJ3
- Sulfolobus acidocaldarius DSM 639
- Metallosphaera sedula DSM 5348
- Thermofilum pendens Hrk 5
- Methanocaldococcus fervens AG86
- Hyperthermus butylicus DSM 5456
- Pyrobaculum calidifontis JCM 11548
- Nanoarchaeum equitans Kin4-M
- Sulfolobus islandicus M 14 25
- Thermococcus onnurineus NA1
- Methanocaldococcus jannaschii DSM 2661
3
APPENDIX B: SEQUENCE LOGOs of the PROTOTYPES
APPENDIX C: Computional procedure
Figure S1: Workflow of the prototype derivation procedure.
1.1 Scoring function and the model
Given a proteomic amino acid composition the probability to observe any random sequence segment of length n, will then be .
The is a likelihood of the random sequence given the set of random sequences , representing a random or reshuffled proteome. All sequences in are assumed to be unrelated, therefore is used as background. Since a profile represents a set of related sequences, the log-likelihood ratio, , determines whether a particular sequence is related to profile , or whether it is random and belongs to . The profile is a matrix of amino acid frequencies on position: . Therefore,
. (1)
In order to score the matches between profile and sequences, a position specific scoring matrix (PSSM) is constructed. For a profile with 50 positions and 20 amino acids, the PSSM matrix is calculated as follows: , , where is the matrix of observed frequencies corrected with pseudo-counts (Altschul, et al., 2009). The latter are used in order to score unobserved amino acids by giving them negligible frequencies proportional to proteomic composition c: , where α is the number of sequences in the profile and β is empirically chosen pseudo-count coefficient. Consequently, is always defined, because .
The overall score of sequence to profile is then:
, (2)
with an upper boundary: , where .
Figure S2: Convergence dynamics of the prototype. This plot shows how the number of matches changes as the prototype converges (red line). At the iteration 4 profile becomes more generic and collects many more matches (note the logarithmic scale). Profile is considered converged, when distance D is lower than epsilon (dotted line).
The procedure stops when the profile is no longer updated, which is determined by distance between the current profile and the previous profile :
, where i = 1..n is position in profile.
When is smaller than a given threshold ε, profiles are considered converged. Figure 4 shows how the profile converges over iterations. The number of sequences included in the profile increases over the iterations (red) thus increasing the sensitivity of the profile. At the same time, the profile converges and does not degenerate, which is manifested in the decreasing distance (black). At some point (iteration 4 in the figure), a more generic signature is obtained, resulting in a dramatic increase in the number of matches, which means that the profile now corresponds to a larger number of EFLs. Nevertheless, the more generic profile converges quickly.
3
Figure S3: Hierarchical clustering of profiles. Converged profiles are compared, and the most similar profiles are merged together iteratively: (A) Diagram showing how profiles are iteratively joined until a certain threshold of profile-profile distance is reached. (B) The difference between merged profiles increases with the iterations; two thresholds (dotted lines) 150 and 163 show iterations where distances between joined profiles start to grow faster, ending up with undesired merging of completely dissimilar ones. (C) Determining threshold to prevent merging of dissimilar profiles. In order to find the stop point of clustering procedure we modeled the distribution of pair-wise distances between random (unrelated) profiles by shuffling them. Distribution of pairwise distances between dissimilar reshuffled profiles is shown as green curve. With the iterations the bulk of similar profile pairs is merged and the distribution shape gradually approaches the modeled random distribution. Clustering should not continue further and the procedure is terminated. 
The pair-wise distance matrix Q is obtained by comparing all converged profiles. To make clustering more efficient we apply greedy heuristics where all pairs of patterns with a distance below a certain threshold t are merged:
, where indicates how greedy the clustering is (we use ).
With the iterations, as similar patterns are joined, the variance of Q decreases, at the same time decreasing the greediness of clustering.
3
1.2 Comparison with PSI-BLAST
Profile sequence analysis is a widely used technique implemented in PSSM based tools, such as PSI-BLAST. In general, all profile methods produce comparable results, but given the specific requirements of the task we expect that a specially designed procedure would perform better. For example, if PSI-BLAST is applied to derive prototypes of EFLs, we would face the following problems: (a) neither the substitution matrix, nor the gap cost model can be inferred for events of pre-domain evolution; (b) positions in the profiles are considered independent, therefore the presence of a signature is not reflected; (c) the significance of the match is estimated by using extreme value distribution (EVD), pre-fitted by simulated random alignments. Fit of parameters, especially lambda scale parameter of the EVD is a known problem (Altschul, et al., 2001), and it depends on sequence length and on the composition of the proteome. Therefore, search of the database with different composition would require recalibration. For short loop-sized sequence segments error of the EVD approximation becomes unacceptably high (Wolfsheimer, et al., 2007). It results in erroneous estimation of significance and incorrect E-values. And indeed, our preliminary calculations show that if we use PSI-BLAST to derive prototypes of EFLs, it would start to collect false positives at lower E-values (E-value (E) - 0.001 / number of families (F) - 1; E/F-0.01/2; E/F-0.05/2; E/F-0.1/2; E/F-0.5/22); for comparison at E-value starting from 1 see Table S1. The other difference that would be pronounced is that PSI-BLAST considers equal contribution from each profile position to the score, while our procedure assumes the presence of the functional signature and penalizes mismatches on conserved positions, which is reflected in the overall score. Therefore, mismatches on functionally important positions are less probable with our method.
Table S1.Comparison of diversity of PSI-BLAST matches with PSI-BLAST
E-value / Diversity of SCOP families in PSI-BLAST / Diversity of SCOP families in our procedure1 / 23 / 7
2 / 27 / 7
3 / 29 / 9
4 / 30 / 13
5 / 37 / 17
10 / 44 / 30
20 / 55 / 39
30 / 58 / 48
40 / 61 / 60
Sequence database of non-redundant domains from 68 archaeal proteomes was used. PSI-BLAST and the above discussed procedure were used to converge the profiles against the sequence database starting from the same origin. Different E-values were applied in both PSI-BLAST and our procedure. The number of iterations for PSI-BLAST was fixed 10 iteration. Each converged profile was then analyzed against ASTRAL/SCOP database (at 95% sequence redundancy) and the corresponding matches compared. Numbers in the table represent how many different SCOP families were encountered among the matches of the prototype. It is important to note that PSI-BLAST procedure starts to collect unreasonably high number of families already at as low as 0.5 E-value. The logo of the corresponding profile is shown below.
References
Altschul, S.F., Bundschuh, R., Olsen, R. and Hwa, T. (2001) The estimation of statistical parameters for local alignment score distributions, Nucleic Acids Res, 29, 351-361.
Altschul, S.F., Gertz, E.M., Agarwala, R., Schaffer, A.A. and Yu, Y.K. (2009) PSI-BLAST pseudocounts and the minimum description length principle, Nucleic Acids Res, 37, 815-824.
Wolfsheimer, S., Burghardt, B. and Hartmann, A. (2007) Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail, Algorithms for Molecular Biology, 2, 9.
3
