Supplementary Information
Prototypes of elementary functional loops unravel evolutionary connections between protein functions
Alexander Goncearenco and Igor N. Berezovsky

APPENDIX A: ARCHAEAL PROTEOMES USED IN PROTOTYPE DERIVATION

The following 68 archaeal proteomes were used to derive sequence prototypes. Four proteomes (bold font), each representing a separate archaeal phylum, were used as the source of origins:

Methanococcus maripaludis C7
Candidatus Methanoregula boonei 6A8
Caldivirga maquilingensis IC-167
Methanococcus maripaludis C5
Methanothermobacter thermautotrophicus str Delta H
Staphylothermus marinus F1
Haloarcula marismortui ATCC 43049
Methanococcus aeolicus Nankai-3
Nitrosopumilus maritimus SCM1
Thermoproteus neutrophilus V24Sta
Thermococcus kodakarensis KOD1
Pyrobaculum islandicum DSM 4184
Methanococcus vannielii SB
Sulfolobus islandicus L S 2 15
Sulfolobus islandicus Y N 15 51
Halorhabdus utahensis DSM 12940
Ignicoccus hospitalis KIN4
Ignicoccus hospitalis KIN4/I
Methanoculleus marisnigri JR1
Sulfolobus solfataricus P2
Methanospirillum hungatei JF-1
Sulfolobus tokodaii str 7
Methanosaeta thermophila PT
Halobacterium sp NRC-1
Methanosphaerula palustris E1-9c
Archaeoglobus fulgidus DSM 4304
Picrophilus torridus DSM 9790
Aeropyrum pernix K1
Thermoplasma acidophilum DSM 1728
Desulfurococcus kamchatkensis 1221n
Thermococcus sibiricus MM 739
Methanosphaera stadtmanae DSM 3091
Sulfolobus islandicus M 16 4
Pyrobaculum arsenaticum DSM 13514
Methanobrevibacter smithii ATCC 35061
Methanosarcina acetivorans C2A
Pyrococcus horikoshii OT3
Sulfolobus islandicus M 16 27
Halomicrobium mukohataei DSM 12286
Natronomonas pharaonis DSM 2160
Candidatus Korarchaeum cryptofilum OPF8
uncultured methanogenic archaeon RC-I
Pyrococcus furiosus DSM 3638
Thermoplasma volcanium GSS1
Halorubrum lacusprofundi ATCC 49239
Halobacterium salinarum R1
Methanosarcina mazei Go1
Pyrobaculum aerophilum str IM2
Methanopyrus kandleri AV19
Methanosarcina barkeri str Fusaro
Methanocorpusculum labreanum Z
Sulfolobus islandicus Y G 57 14
Pyrococcus abyssi GE5
Methanococcus maripaludis C6
Methanococcoides burtonii DSM 6242
Methanococcus maripaludis S2
Haloquadratum walsbyi DSM 16790
Thermococcus gammatolerans EJ3
Sulfolobus acidocaldarius DSM 639
Metallosphaera sedula DSM 5348
Thermofilum pendens Hrk 5
Methanocaldococcus fervens AG86
Hyperthermus butylicus DSM 5456
Pyrobaculum calidifontis JCM 11548
Nanoarchaeum equitans Kin4-M
Sulfolobus islandicus M 14 25
Thermococcus onnurineus NA1
Methanocaldococcus jannaschii DSM 2661

APPENDIX B: SEQUENCE LOGOS OF THE PROTOTYPES
APPENDIX C: COMPUTATIONAL PROCEDURE

Figure S1: Workflow of the prototype derivation procedure.

1.1 Scoring function and the model

Given a proteomic amino acid composition $c$, the probability of observing a random sequence segment $s = s_1 s_2 \ldots s_n$ of length $n$ is $P(s \mid R) = \prod_{i=1}^{n} c_{s_i}$.

$P(s \mid R)$ is the likelihood of the random sequence $s$ given the set of random sequences $R$, representing a random or reshuffled proteome. All sequences in $R$ are assumed to be unrelated; therefore, $R$ is used as the background. Since a profile represents a set of related sequences, the log-likelihood ratio, $\log \frac{P(s \mid M)}{P(s \mid R)}$, determines whether a particular sequence $s$ is related to profile $M$, or whether it is random and belongs to $R$. The profile $M$ is a matrix of frequencies of amino acid $a$ at position $i$: $f_{a,i}$. Therefore,

$$\log \frac{P(s \mid M)}{P(s \mid R)} = \sum_{i=1}^{n} \log \frac{f_{s_i,i}}{c_{s_i}}. \qquad (1)$$

In order to score the matches between a profile and sequences, a position-specific scoring matrix (PSSM) is constructed. For a profile with 50 positions and 20 amino acids, the PSSM is calculated as follows: $m_{a,i} = \log \frac{f'_{a,i}}{c_a}$, $a = 1..20$, $i = 1..50$, where $f'$ is the matrix of observed frequencies corrected with pseudo-counts (Altschul, et al., 2009). The pseudo-counts are used to score unobserved amino acids by giving them negligible frequencies proportional to the proteomic composition $c$: $f'_{a,i} = \frac{\alpha f_{a,i} + \beta c_a}{\alpha + \beta}$, where α is the number of sequences in the profile and β is an empirically chosen pseudo-count coefficient. Consequently, $m_{a,i}$ is always defined, because $f'_{a,i} > 0$.

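The overall score of sequence $s$ against profile $M$, with PSSM entries $m_{a,i}$, is then:

$$S(s, M) = \sum_{i=1}^{n} m_{s_i,i}, \qquad (2)$$

with an upper boundary: $S(s, M) \le S_{\max}$, where $S_{\max} = \sum_{i=1}^{n} \max_{a} m_{a,i}$.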
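The scoring scheme described above can be illustrated with a minimal Python sketch. It assumes that the pseudo-count correction takes the common form (α·f + β·c) / (α + β) and that natural logarithms are used; the function names, the toy segments, the uniform background composition and the value of `beta` are all hypothetical choices made only for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(segments, composition, beta=1.0):
    """Build a log-odds PSSM from equal-length segments.

    segments: sequences currently included in the profile (alpha = len(segments)).
    composition: dict amino acid -> background frequency c_a (all > 0).
    beta: pseudo-count coefficient (hypothetical value, chosen empirically).
    """
    alpha = len(segments)
    length = len(segments[0])
    pssm = []
    for i in range(length):
        column = [seg[i] for seg in segments]
        scores = {}
        for a in AMINO_ACIDS:
            observed = column.count(a) / alpha
            # Assumed correction: mix observed frequency with background.
            corrected = (alpha * observed + beta * composition[a]) / (alpha + beta)
            scores[a] = math.log(corrected / composition[a])
        pssm.append(scores)
    return pssm


def score(segment, pssm):
    """Sum of position-specific log-odds scores for one segment."""
    return sum(column[a] for a, column in zip(segment, pssm))


def max_score(pssm):
    """Upper boundary: best-scoring amino acid at every position."""
    return sum(max(column.values()) for column in pssm)


# Toy example with a uniform background composition (an assumption).
uniform = {a: 1 / 20 for a in AMINO_ACIDS}
toy_pssm = build_pssm(["GKSGK", "GKTGK", "GKSAK"], uniform, beta=1.0)
consensus_score = score("GKSGK", toy_pssm)
unrelated_score = score("WWWWW", toy_pssm)
```

In this toy profile the consensus segment reaches the upper boundary, while a segment made only of unobserved residues receives a finite negative score, because the pseudo-counts keep every corrected frequency above zero.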

Figure S2: Convergence dynamics of the prototype. The plot shows how the number of matches changes as the prototype converges (red line). At iteration 4 the profile becomes more generic and collects many more matches (note the logarithmic scale). The profile is considered converged when the distance D is lower than epsilon (dotted line).

The procedure stops when the profile is no longer updated, which is determined by the distance D between the current profile and the previous profile:

, where i = 1..n is position in profile.

When D is smaller than a given threshold ε, the profiles are considered converged. Figure S2 shows how the profile converges over the iterations. The number of sequences included in the profile increases over the iterations (red), thus increasing the sensitivity of the profile. At the same time, the profile converges and does not degenerate, which is manifested in the decreasing distance (black). At some point (iteration 4 in the figure), a more generic signature is obtained, resulting in a dramatic increase in the number of matches, which means that the profile now corresponds to a larger number of EFLs. Nevertheless, the more generic profile converges quickly.
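The stopping rule can be sketched as a simple loop. The distance used below, a sum over positions of the Euclidean distance between frequency columns, is an assumption made for illustration only and is not necessarily the metric used in the procedure; the `update` callback, which stands in for one round of database search and profile rebuilding, and the toy profiles are likewise hypothetical.

```python
import math


def profile_distance(current, previous):
    """Assumed metric: per-position Euclidean distance, summed over positions."""
    total = 0.0
    for cur_col, prev_col in zip(current, previous):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(cur_col, prev_col)))
    return total


def iterate_until_converged(profile, update, epsilon=1e-3, max_iterations=50):
    """Repeat profile updates until the distance D drops below epsilon."""
    history = []
    for _ in range(max_iterations):
        new_profile = update(profile)
        d = profile_distance(new_profile, profile)
        history.append(d)
        profile = new_profile
        if d < epsilon:
            break
    return profile, history


# Toy update: each round moves the profile halfway towards a fixed target,
# standing in for "search the database, then rebuild the profile".
target = [[0.5, 0.5], [0.9, 0.1]]
start = [[1.0, 0.0], [0.5, 0.5]]


def toy_update(profile):
    return [[(x + t) / 2 for x, t in zip(col, tcol)]
            for col, tcol in zip(profile, target)]


final_profile, history = iterate_until_converged(start, toy_update)
```

Recording the distance at every iteration makes it possible to plot a convergence curve of the kind shown in Figure S2.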

Figure S3: Hierarchical clustering of profiles. Converged profiles are compared, and the most similar profiles are merged iteratively. (A) Diagram showing how profiles are iteratively joined until a certain threshold of profile-profile distance is reached. (B) The distance between merged profiles increases with the iterations; two thresholds (dotted lines), 150 and 163, mark the iterations at which distances between joined profiles start to grow faster, ending with the undesired merging of completely dissimilar profiles. (C) Determining the threshold that prevents merging of dissimilar profiles. To find the stop point of the clustering procedure, we modeled the distribution of pairwise distances between random (unrelated) profiles by shuffling them. The distribution of pairwise distances between dissimilar reshuffled profiles is shown as the green curve. Over the iterations, the bulk of similar profile pairs is merged and the shape of the distribution gradually approaches the modeled random distribution. At that point clustering should not continue, and the procedure is terminated.

The pairwise distance matrix Q is obtained by comparing all converged profiles. To make clustering more efficient, we apply a greedy heuristic in which all pairs of patterns with a distance below a certain threshold t are merged:

, where indicates how greedy the clustering is (we use ).

As similar patterns are joined over the iterations, the variance of Q decreases, which in turn decreases the greediness of the clustering.
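One round of such greedy merging might look like the sketch below. The threshold form t = min(Q) + g·σ(Q) is a hypothetical choice, picked only because it makes the clustering less greedy as the spread of Q shrinks; the distance metric, the weighting of merged profiles by the number of sequences they contain, and all names are assumptions as well.

```python
import math
from itertools import combinations
from statistics import pstdev


def distance(p, q):
    """Assumed metric: per-position Euclidean distance, summed over positions."""
    return sum(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
               for a, b in zip(p, q))


def merge(p, q, wp, wq):
    """Weighted average of two profiles; weights are sequence counts (assumed)."""
    total = wp + wq
    merged = [[(wp * x + wq * y) / total for x, y in zip(a, b)]
              for a, b in zip(p, q)]
    return merged, total


def greedy_round(profiles, weights, greediness=0.5):
    """Merge every disjoint pair whose distance is below the threshold t."""
    pairs = sorted((distance(profiles[i], profiles[j]), i, j)
                   for i, j in combinations(range(len(profiles)), 2))
    dists = [d for d, _, _ in pairs]
    # Hypothetical threshold: tightens as the spread of distances shrinks.
    threshold = min(dists) + greediness * pstdev(dists)
    used, out_p, out_w, largest = set(), [], [], 0.0
    for d, i, j in pairs:
        if d > threshold:
            break
        if i in used or j in used:
            continue  # each profile takes part in one merge per round
        merged, w = merge(profiles[i], profiles[j], weights[i], weights[j])
        out_p.append(merged)
        out_w.append(w)
        used.update((i, j))
        largest = d
    for k in range(len(profiles)):
        if k not in used:
            out_p.append(profiles[k])
            out_w.append(weights[k])
    return out_p, out_w, largest


# Toy example: two pairs of near-identical single-position profiles.
toy = [[[1.0, 0.0]], [[0.9, 0.1]], [[0.0, 1.0]], [[0.1, 0.9]]]
new_profiles, new_weights, largest_merged = greedy_round(toy, [1, 1, 1, 1])
```

Rounds would be repeated, recomputing Q each time, until the largest merged distance reaches a stop value such as the one derived from the reshuffled profiles in Figure S3.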

1.2 Comparison with PSI-BLAST

Profile-based sequence analysis is a widely used technique implemented in PSSM-based tools such as PSI-BLAST. In general, all profile methods produce comparable results, but given the specific requirements of the task we expect a specially designed procedure to perform better. If PSI-BLAST were applied to derive prototypes of EFLs, we would face the following problems: (a) neither the substitution matrix nor the gap cost model can be inferred for events of pre-domain evolution; (b) positions in the profiles are considered independent, so the presence of a signature is not reflected; (c) the significance of a match is estimated using the extreme value distribution (EVD), pre-fitted on simulated random alignments. Fitting the parameters of the EVD, especially the scale parameter lambda, is a known problem (Altschul, et al., 2001), and the fit depends on sequence length and on the composition of the proteome. Therefore, searching a database with a different composition would require recalibration. For short, loop-sized sequence segments the error of the EVD approximation becomes unacceptably high (Wolfsheimer, et al., 2007), which results in erroneous estimates of significance and incorrect E-values. Indeed, our preliminary calculations show that PSI-BLAST, if used to derive prototypes of EFLs, would start to collect false positives already at low E-values (E-value / number of families: 0.001 / 1; 0.01 / 2; 0.05 / 2; 0.1 / 2; 0.5 / 22); for a comparison at E-values starting from 1, see Table S1. Another pronounced difference is that PSI-BLAST assumes an equal contribution of each profile position to the score, whereas our procedure assumes the presence of a functional signature and penalizes mismatches at conserved positions, which is reflected in the overall score. Therefore, mismatches at functionally important positions are less probable with our method.

Table S1. Comparison of the diversity of matches obtained with PSI-BLAST and with our procedure

E-value / Diversity of SCOP families in PSI-BLAST / Diversity of SCOP families in our procedure
1 / 23 / 7
2 / 27 / 7
3 / 29 / 9
4 / 30 / 13
5 / 37 / 17
10 / 44 / 30
20 / 55 / 39
30 / 58 / 48
40 / 61 / 60

A sequence database of non-redundant domains from 68 archaeal proteomes was used. PSI-BLAST and the procedure discussed above were used to converge the profiles against the sequence database, starting from the same origin. Different E-values were applied in both PSI-BLAST and our procedure. The number of PSI-BLAST iterations was fixed at 10. Each converged profile was then analyzed against the ASTRAL/SCOP database (at 95% sequence redundancy), and the corresponding matches were compared. The numbers in the table show how many different SCOP families were encountered among the matches of the prototype. It is important to note that the PSI-BLAST procedure starts to collect an unreasonably high number of families at an E-value as low as 0.5. The logo of the corresponding profile is shown below.

References

Altschul, S.F., Bundschuh, R., Olsen, R. and Hwa, T. (2001) The estimation of statistical parameters for local alignment score distributions, Nucleic Acids Res, 29, 351-361.

Altschul, S.F., Gertz, E.M., Agarwala, R., Schaffer, A.A. and Yu, Y.K. (2009) PSI-BLAST pseudocounts and the minimum description length principle, Nucleic Acids Res, 37, 815-824.

Wolfsheimer, S., Burghardt, B. and Hartmann, A. (2007) Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail, Algorithms for Molecular Biology, 2, 9.