Helix Coil Transitions — Part 2
April 19, 2004

Discussion of a Representative Helix-Coil Prediction Algorithm and its Statistical Mechanics Foundation

Edward Freedman

Introduction

The prediction algorithm described in detail in this, Part 2, is the Helix-Coil Transition Theory (HCTT) of Misra & Wong. It is derived from the Munoz & Serrano approach but incorporates considerable modifications. Results of trying the Munoz & Serrano algorithm, as implemented in the program AGADIR, are described in Part 3 of this paper. The method described here relies on statistical mechanics models and, in particular, utilizes training data sets; thus it can be considered a knowledge-based approach.

To describe the prediction algorithm, it is first necessary to develop some key concepts in statistical mechanics.

Gibbs Free Energy

When a protein spontaneously transitions from its unfolded (denatured) conformation to its folded (native) conformation, it is moving from a less ordered state to a more highly ordered state. The reader may recall from thermodynamics that the universe (and closed systems within it) is said to move towards a state of maximum disorder (i.e. entropy increases). The attainment of a more ordered state by a subsystem (i.e. a decrease in entropy) can be accomplished by the expenditure of work by another subsystem interacting with it. However, proteins in aqueous solutions will fold spontaneously without an external agent, provided the temperature and pH of the environment are within a certain range. This occurs because the transition is also from a higher energy state to a lower one; that is, energy is released by the transition (it is exothermic). The more ordered folded state is said to be energetically favorable.

Note: Folding will occur spontaneously only at temperatures below T = ∆H/∆S, where ∆H is the change in enthalpy and ∆S is the change in entropy of the transition. Heating the protein (i.e. adding energy to the system) can cause it to unfold (denature).

A quantitative measure of a transition being energetically favorable (or unfavorable) is the change in the Gibbs Free Energy due to the transition, ∆G. In absolute terms,

G = H - TS

where H is the enthalpy (or, if pressure and volume are constant, simply the energy), T is the temperature, and S is the entropy of the system. We decline to define or expand on these concepts here. We are only interested in the change given by ∆G, which we will show later is calculated from experimentally determined data.

Let us assume for now that we are given ∆G for a transition of interest, in particular the transition from a random-coil state to a helical state. We are now interested in the probability of being in one state or the other.

Boltzmann Distribution / Probabilities

Consider a system consisting of a number of indistinguishable particles N that can be in various energy states such that n0 are in energy state E0, n1 in state E1, etc. and that state E0 is the lowest energy state available, the ground state. We can write {n0, n1, n2, ...} to describe this configuration. In addition, we postulate equal a priori probabilities, i.e. all states of a given energy level can occur with equal probability. It is easy to see that there is only one way to have all the particles in the lowest energy configuration {N, 0, 0, ...} and there are N ways to arrange the next higher energy configuration {N-1, 1, 0, 0, ...}. Since all states are equally likely, we would most often find the second configuration, its aggregate probability being much higher (by a factor of N). Thus its characteristics and behavior will predominate. If we examine other configurations of particles in energy states, certainly some will be even more likely, by a large margin, than the two we just considered. In general, the number of possible arrangements W for a given configuration {n0, n1, n2, ...} is:

W = N! / (n0! n1! n2! ...)
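As a quick numerical check of this counting expression, the short Python sketch below (our own illustration; the function name is an assumption) evaluates W for the two configurations discussed above:

```python
from math import factorial

def arrangements(ns):
    """Number of arrangements W = N! / (n0! n1! n2! ...) for a
    configuration {n0, n1, n2, ...} of N indistinguishable particles."""
    w = factorial(sum(ns))
    for n in ns:
        w //= factorial(n)
    return w

# With N = 10 particles: the ground configuration {N, 0, 0} admits a
# single arrangement, while promoting one particle gives N arrangements.
print(arrangements([10, 0, 0]))  # 1
print(arrangements([9, 1, 0]))   # 10
```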

Given two constraints on this system, namely that the total number of particles ∑ ni = N and the total energy ∑ Ei ni = E are both fixed, it is particularly interesting to find the configuration with the greatest number of possible arrangements, as this will dominate all the other configurations. It is possible to derive this from the previous expression for W through a series of manipulations (which we omit here for simplicity). The result, describing this most likely configuration of ni as a function of Ei, is:

ni = N exp(-βEi) / ∑ exp(-βEj)

Or as a probability density function, this is the Maxwell-Boltzmann Distribution:

Pi = ni / N = exp(-βEi) / ∑ exp(-βEj)

This gives the most probable energy distribution for the system.

In this model, β is constant for a given system, β = 1/KT, where K is Boltzmann's constant and T is the absolute temperature in kelvins.
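To make the distribution concrete, here is a small Python sketch (our own illustration, with hypothetical energies expressed in units of KT) that evaluates Pi for a few levels:

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    """P_i = exp(-E_i / kT) / sum_j exp(-E_j / kT) for each level."""
    weights = [math.exp(-e / kT) for e in energies]
    z = sum(weights)  # the normalization term (the partition function)
    return [w / z for w in weights]

# Three hypothetical energy levels, in units of kT:
probs = boltzmann_probabilities([0.0, 1.0, 2.0])
# The ground state E0 = 0 carries the largest single-state probability.
```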

Partition Function

The requirement that the sum of individual probabilities ∑ Pi = 1 is satisfied via the normalization term in the denominator. This is known as the Partition Function:

Z = ∑ exp(-Ej / KT)

The partition function is thus a measure of the total number of states accessible to the system.

Note: If we allow for "degeneracy", that is, gi states at the same energy level Ei, the partition function becomes:

Z = ∑ gi exp(-Ei / KT)

Because it aggregates the number of states a system can occupy, the partition function is related (through ln Z) to the entropy of the system; the greater the number of possible indistinguishable states, the higher the entropy.

There are several alternative forms of the partition function, depending on the definition of the system and subsystems it applies to.

There are a couple of points worth noting here. For a given system, the denominator is constant; the numerator varies with the negative exponent of Ei. So the probability is greatest for the lowest value of Ei (since the energy level does not go below the ground state). This is consistent with what was stated regarding Gibbs Free Energy: that the lower-energy, more ordered folded state is energetically favored. Yet at the beginning of the discussion of the Boltzmann Distribution, we said that configurations where higher energy levels were occupied offered more arrangements and thus were more likely. This is apparently in conflict with the result.

Such is not the case, for the following reason. The system, in our case a protein molecule, is not an isolated closed system. It is in thermal contact with a large heat reservoir, notably the aqueous solution surrounding it. Most importantly, it has far fewer degrees of freedom than the heat reservoir. When the protein moves to a lower energy state, some of its energy is transferred to the reservoir. The number of states accessible to the larger reservoir is a quickly increasing function of its energy level. Thus the reservoir gains many more accessible states than the protein loses; energy is conserved, and entropy increases.

Statistical Weights

We need one more conceptual piece to begin the discussion of the prediction algorithm itself. Let's consider a hypothetical protein consisting of four amino acids, or residues as they are often called, and assume that each can be either in a helical state h or a coiled state c. We define a partition function for it:

q = ∑ gi exp(-ei / KT)

Ignoring the finer details of vibrational and atomic energy levels, we consider only the energy associated with the helix or coil states. There are 2⁴ = 16 possible conformations of these states. There is only one unique state cccc that contains no helices, four states that contain one helix (e.g. hccc, chcc, etc.), six states that contain two helices, four states that contain three helices, and one state that contains four helices. We define qi to be the collective contribution from a state with i helices and further assume that qi does not vary depending on which residue is in the helical state. We can then write the partition function as a sum:

q = q0 + 4q1 + 6q2 + 4q3 + q4 = q0 (1 + 4k1 + 6k2 + 4k3 + k4) defining ki = qi/q0

Note that we can consider the ki parameters as effective equilibrium constants (EEC) describing the change from the cccc state to one of the states with i helices. This property will be utilized later in creating the training database for the prediction algorithm.

Since we are only interested in the partition function for obtaining probabilities, i.e. the proportion of states, we can drop q0 from the expression. For example, the fraction of proteins containing 2 helices is given by:

θ2 = 6q2 / q = q0(6k2) / q0(1 + 4k1 + 6k2 + 4k3 + k4) = 6k2 / (1 + 4k1 + 6k2 + 4k3 + k4)

Note that q0 describes the contribution of the coil state to the partition function. Equivalently, we can assign q0 a value of 1 and use it as the reference state when predicting helical states. Our new partition function then becomes:

q = (1 + 4k1 + 6k2 + 4k3 + k4) = ∑ gi exp(-ei / KT)

Thus the free energy change to move from the reference state to a state of i helices is given by:

∆G = - KT ln ki

If we assume the energy change to be independent of which residue changes state, we can define s :

s = k1 = exp(-∆G / KT)

s² = k2 = exp(-2∆G / KT), etc.

Where s is called the statistical weight of a residue in the helical state. Likewise q0 = 1 is the statistical weight of the reference state. We can then write:

q = 1 + 4s + 6s² + 4s³ + s⁴ = (1 + s)⁴

We note that this is a binomial expansion of a single statistical weight s assigned to the helical state, relative to a reference weight 1 assigned to the coil state.
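The four-residue example can be verified numerically. The sketch below (our own illustration, using a hypothetical weight s = 0.8) checks the binomial identity and evaluates the fraction θ2 of chains with exactly two helical residues:

```python
def q_explicit(s):
    """Partition function q = 1 + 4s + 6s^2 + 4s^3 + s^4 for the
    four-residue chain, with the coil state as reference weight 1."""
    return 1 + 4*s + 6*s**2 + 4*s**3 + s**4

def fraction_two_helical(s):
    """theta_2: fraction of chains with exactly two helical residues."""
    return 6 * s**2 / q_explicit(s)

s = 0.8  # hypothetical statistical weight; s < 1 disfavors the helix
assert abs(q_explicit(s) - (1 + s)**4) < 1e-12  # the binomial identity
```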

Helical Propensity

We now have all the pieces we need to discuss the prediction algorithm. The specific goal of the algorithm is to calculate for each residue in the protein, a figure of merit indicating how likely it is to be part of a helix. In the Helix-Coil Transition Theory (HCTT) of Misra & Wong, this is simply the probability that the residue is in the helical state. However, in the case of the AGADIR program of Munoz & Serrano, this is not strictly a probability value, as the number can be greater than 1.0.

Some researchers, beginning with Chou & Fasman, define a helical propensity for each residue. Given the frequency with which a given type of amino acid occurs in helices within a set of proteins:

fa = na / n

Where na is the number of amino acid residues of the specified type that occur in helices and n is the total number of residues of that type in the set of proteins, we can define the propensity as:

Pa = fa / ⟨fa⟩

Where ⟨fa⟩ is the average value of fa over all 20 types of amino acid residues. It is easy to see that this propensity can exceed unity.
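A small numerical illustration of the Chou & Fasman style propensity, using entirely made-up counts for three residue types:

```python
# Hypothetical counts: residues of each type found in helices, and the
# total number of residues of that type in the protein set.
helix_counts = {"ALA": 80, "GLY": 20, "LEU": 60}
total_counts = {"ALA": 100, "GLY": 100, "LEU": 100}

# f_a = n_a / n for each type, then P_a = f_a / <f_a>.
f = {aa: helix_counts[aa] / total_counts[aa] for aa in helix_counts}
f_avg = sum(f.values()) / len(f)
propensity = {aa: fa / f_avg for aa, fa in f.items()}
# Helix formers (here ALA) come out above unity; breakers below it.
```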

HCTT Algorithm of Misra & Wong

This algorithm has two main phases, an analysis phase which acquires training data from a large set of proteins of known structure (experimentally determined), and a prediction phase which utilizes the stored training data to calculate the helical probabilities for new proteins. It is convenient to describe the prediction phase first, even though it is preceded by the analysis phase.

Prediction Phase

The desired result, the probability that a residue is in the helical state, is calculated as

Presj = ∑ Pconfi × Istateij (summed over all 2ᴺ configurations i)

where Presj is the probability that the j-th residue is in the helical state, Istateij is the state of the j-th residue in configuration i (equal to 0 if residue j is in the coil state and 1 if it is in the helical state), and Pconfi is the Boltzmann probability of finding configuration i. This sum of the probabilities of all the ways the j-th residue can be in the helical state is quite reasonable. The difficulty comes from calculating Pconfi for all 2ᴺ states. To keep this from becoming computationally intractable, N is limited to the length of a short segment of the protein of interest. That segment advances along the full length of the protein as a sliding window. For now, let us just consider the calculation of Pconfi for a segment of length 12.

Each of the 2¹² = 4096 configurations is considered. For each, the Boltzmann probability is given by

Pconfi = exp(-∆Gi / KT) / ∑ exp(-∆Gj / KT)

Thus for each configuration, the Boltzmann term in the numerator (a statistical weight) must be evaluated, where ∆Gi is given as

∆Gi = ∑ ∆Gint + ∑ ∆Gpair

∆Gint is the free energy change when isolated helical residues are formed in configuration i, and ∆Gpair is the energy contribution of residue-pair interactions within a helical segment and of the interactions between the helical residues and the non-helical residues comprising the N and C capping motifs. This algorithm defines 24 possible pairwise interactions: 4 intra-helical, and 10 each for the N and C caps. Note that not all 24 possible interactions will apply in every configuration; they depend on the constraints of a contiguous intra-helix and on the correct capping motif. If an interaction does apply, the corresponding energy is found in a 20 by 20 matrix whose rows and columns are selected by the amino acids of the interacting pair.

The denominator is the partition function, common to all 4096 Pconfi, and is the sum of these statistical weights for all 4096 configurations.
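The prediction step can be sketched in Python. This is our own simplified illustration, not the authors' implementation: it keeps only intrinsic per-residue terms (with hypothetical ∆G values) and omits the pairwise and capping contributions, but it shows the enumeration of all 2ᴺ configurations, the Boltzmann weighting, and the per-residue sum:

```python
import math
from itertools import product

def helical_probabilities(dG_int, kT=0.6):
    """For each residue j, sum the Boltzmann probabilities of every
    helix/coil configuration in which residue j is helical.
    dG_int: per-residue free-energy changes for forming an isolated
    helical residue (hypothetical values, kcal/mol)."""
    n = len(dG_int)
    configs = list(product((0, 1), repeat=n))  # 0 = coil, 1 = helix
    # Statistical weight exp(-dG_i / kT) for each configuration i.
    weights = [math.exp(-sum(g for g, h in zip(dG_int, st) if h) / kT)
               for st in configs]
    z = sum(weights)  # the partition function over all 2**n configurations
    return [sum(w for w, st in zip(weights, configs) if st[j]) / z
            for j in range(n)]

# A four-residue window with hypothetical intrinsic free-energy changes:
p = helical_probabilities([-0.5, 0.2, 0.2, 0.8])
# The residue with the most favorable (most negative) dG receives the
# highest helical probability.
```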

Analysis Phase

Both the 20 values for ∆Gint, specific to each type of isolated amino acid, and the 20 by 20 matrices of values for pairwise interactions ∆Gpair, are obtained as training data during an analysis phase. A database of 351 proteins was analyzed, selected with constraints on crystal-structure resolution and sequence identity so as to yield a good-quality sample. Pairwise interactions between amino acid types i and j were assumed to be proportional to ln kij, where kij is an "effective equilibrium constant" (EEC, as described in the section on statistical weights) indicating how much the interaction shifts the helix-coil equilibrium to favor the helical state. We omit the details of calculating and normalizing the kij values and then using them to derive the stabilization energies proportional to the logarithm of the EECs; instead we refer the reader to the paper by Misra & Wong. Suffice it to say that these were used to populate all 24 matrices for the various types of interactions without manual adjustment of parameters or human judgment as to which interactions might be significant, instead letting the training database dictate the appropriate weights.

Conclusion

The statistical mechanics model and, in particular, the knowledge-based approach of the Misra & Wong algorithm yields very good results. On proteins that were favorable, i.e. where tertiary interactions did not affect secondary structural preferences, >90% prediction accuracy was obtained. The knowledge-based approach also allows for insight into physically significant helix stabilizing/destabilizing factors, unclouded by manual adjustment of parameters. Further work should be able to incorporate more interactions and improve the results.

References

This work summarizes, synthesizes, and extrapolates texts, journal papers, and online reference materials that describe the theory, derivations, algorithms, and results of Helix-Coil transition prediction. The references include:

Books

Cantor, C. R., Schimmel, P. R., Biophysical Chemistry Part III: The Behavior of Biological Macromolecules, W. H. Freeman, 1980

Reif, F., Statistical Physics, McGraw-Hill 1965

Voet, D., Voet, J., Biochemistry, 2nd ed., Wiley, 1995

Journals

Misra, G.P. and Wong, C.F. Predicting helical segments in proteins by a helix-coil transition theory with parameters derived from a structural database of proteins. Proteins 28:344-359, 1997.

Munoz, V., Serrano, L. Elucidating the folding problem of helical peptides using empirical parameters. Nat. Struct. Biol. 1:399-409, 1994.

Munoz, V., Serrano, L. Elucidating the folding problem of helical peptides using empirical parameters. II. Helix macrodipole effects and rational modification of the helical content of natural peptides. J. Mol. Biol. 245:275-296, 1995.

Schellman, J. A., Qian, H., Helix-coil theories: a comparative study for finite length polypeptides, J. Phys. Chem, 96, 3987-3994, 1992

Zimm, B. H., Bragg, J. K., Theory of the phase transition between helix and random coil in polypeptide chains. J. Chem. Phys. 31:526-535, 1959
