Rorick-Wagner Supplementary Material
V. Appendix
1. Derivations of Model Equations
Defining the Markov Process:
We consider a discrete stochastic process where the random variable, b, is the length of a domain at a time-step t. The transition matrix M consists of the probabilities Oh,g of a functional domain transitioning from any length b=g to any other length b=h in a single time-step. These Oh,g probabilities that constitute M can be calculated from the assumptions of our model. The minimum length for b is m and the maximum length is R-- the length of the peptide. Transitions to lengths above and below these maximum and minimum values, respectively, are set to zero.
Each of the elements of M that is immediately off the diagonal (i.e., that represent the probabilities of plus-one or minus-one length transitions) is a product of two independent probabilities. The first is the probability Uh,g of a mutation occurring that causes a specific length transition from length g to length h. The second is the probability fh,gof fixation for this particular mutation with its associated length transition. These probabilities are specified by our model.
Let us first consider the probability of transitioning to a length h that is one residue smaller or larger than the starting length g:
if h=g-1 or g+1, Oh,g= Uh,gfh,g. (S1)
For transitions of step sizes larger than one, we make simplifications that approximate the actual transition probabilities. First we assume that mutations that increase domain size by more than one amino acid have effectively zero probability. Mutations that increase the domain size by more than one require either that two or more beneficial mutations simultaneously occur, or that a single mutation connectspreviously existing contiguous set of one or more alphabet amino acids to the functional domain. In real biological systems the first scenario is rare (occurring at a rate equal to or less than the square of the mutation rate), so in this model and in simulations we so in this model we only allow a single mutation per time-step. In this model we also ignore the relatively small probability of the second scenario occurring (even though it may occur in our simulations). In this analysis, we effectively we treat mutations that make connections between multiple contiguous series of alphabet amino acids as if they only increase the length by one. Thus, the probability of a mutation occurring that increases the length by more than one is simplified to zero, such that
if hg+1, Oh,g= 0*fh,g, (S2)
which means the transition probability is zero.
Another simplification is that we treat all mutations which decrease the length by more than one as if they have a zero probability f of fixation, such that
ifhg-1, Oh,g=Uh,g0. (S3)
This is because these mutations are generally very deleterious, and therefore, infrequently fixed. In any given time-step we thus assume a zero probability for transitions that increase or decrease the domain length by more than one amino acid.
The probability that no mutation is fixed during a given time-step can be determined by subtracting, from a total probability of one, the probability of increasing the length by one and the probability of decreasing the length by one:
Og,g=1- Ug-1,gfg-1,g - Ug+1,gfg+1,g. (S4)
Thus, we only need to calculate Uh,g and fh,g for h=g+1 and h=g-1, since Oh,g for hg+1 and hg-1 has been simplified to zero, and we can infer Og,g (equation S4).
Mutation Probability Calculations:
In order to calculate the probability Uh,g of a mutation occurring that causes a specific length transition, we first have to calculate the number Yg of viable mutations that are possible from a starting length g, and then calculate the number Zh,g of mutations that cause that specific length transition. The probability Uh,g is then the ratio Zh,g/Yg, times the per-site mutation rate, .
The only mutations that are unviable are those that decrease the size of the domain to less than the minimum length m. So, when the functional domain is equal to or larger than 2m, all possible mutations are viable—to any of the z-1 other amino acid types in the whole alphabet, at any of the R sites, such that
if g≥2m, Yg= R(z-1). (S5)
When the functional domain is smaller than 2m, some mutations at sites near the center of the domain are not viable because they will break the domain into two pieces that are both smaller than m. Thus, there are 2(g-m) sites within the functional domain of length g which can undergo non-neutral (i.e., domain length changing), viable mutations. There are z-a non-alphabet amino acid types, so there are 2(g-m)(z-a) non-neutral, viable mutations and (a-1)g neutral mutations within this region. Similarly, because any mutation can occur outside the functional domain (and there are z-1 non-wildtype amino acid types), there are (R-g)(z-1)viable mutations outside the functional domain. In summary,
if 2m g ≥ m, Yg=(R-g)(z-1) + (a-1)g + (2g-2m)(z-a). (S6)
We then calculate the number Zh,g of equally likely viable mutations that cause a change in length from g to h (where h=g+1 or g-1).
Transitions that reduce the length by one amino acid occur by a mutation to a non-alphabet amino acid (there are z-a of these) on one end of the domain or the other (i.e., at either of two different sites), such that
if h=g-1, Zh,g=2(z-a). (S7)
The simulations are set up such that functional domains do not come in contact with the edges of the sequence. Thus, for the purposes of calculating transition probabilities, we imagine a functional domain that is perfectly centered such that it always has two flanking sites at either end of the domain until it is R-1 residues long. These flanking residues can mutate to an alphabet amino acid, causing an increase in length. Thus,
if gR-1, Zg+1,g=2a. (S8)
If the domain is only one less than the maximum length R, there is only one flanking site, thus,
ZR,R-1=a. (S9)
Lastly, we do not need to know the probability of a mutation that maintains the same domain length (Ug,g) because we can simply infer Og,g from equation S4 (though see Appendix 4 for a description of how to calculate this directly).
Because in our model all viable mutations are equally likely, to finally calculate the probability Uh,g of a mutation that causes the length transition from g to h, we simply divide the number Zh,g of mutations that cause the specific length transition by the total number Yg of viable mutations that are possible at that time-step, and then multiply by the mutation rate, , such that
Fixation Probability Calculations:
The probability fh,g of fixation, as a function the starting and ending domain lengths (g and h, respectively), is determined by the forces of drift and selection in accordance with equations 4-6, and the definitions of function performance (p), protein fitness (w),and the selection coefficient(s) (equations 1-3). By substituting the expression for performance p (equation 1) into the expression for fitness w (equation 2), we obtain an expression for w as a function of b. Substituting this expression for w into equation 3 provides us with an expression for the selection coefficient s as a function of b. We can then use this expression for s to determine Nes, and through this, the fixation probability (equations 4-6). Ultimately, this process provides the fixation probability for a given mutation as a function of the domain lengths before and after the mutational event (i.e., g and h, respectively) (equations S11-S13):
Equilibrium Length Calculations:
Because the fixation probabilities are dependent in a complex way on the fitness function, we cannot easily derive the equilibrium length as a function of the parameters Ne, c, m, R, z and a. However, with the above equations we can numerically solve for the equilibrium length for any given set of parameters. The stationary distribution nassociated with the transition matrix M is the eigenvector associated with the eigenvalue of 1 for M, that is, n=Mn. The mean of the stationary distribution represents the expected equilibrium length, that is,
(S14)
where nb are the elements in the stationary distribution n, such that n={n1, n2, n3,…nR}.
Rate of random walk calculations:
Our expectation for a domain’s movement is zero because there is no bias for movement in one direction or the other, and expectations for movement in the positive and negative directions cancel out. Nevertheless, we do expect movement to occur over time in the form of a symmetric random walk. We can solve for the expected absolute value of the distance traveled to achieve a non-zero and non-constant function of time. Because we are ignoring the relatively low probabilities of increasing or decreasing domain length by any step size greater than one, we can solve for the expected absolute value of the distance x traveled from a starting point as a function of t time-steps (Rice 2004):
, (S15)
where and are the probabilities of, respectively, increasing or decreasing domain length by one amino acid once the equilibrium length has been reached.
Probability of touch-event calculations:
We can also calculate the probability v of two functional domains at equilibrium length q, that start at a certain distance y between their closest ends, experiencing a touch-event by t time-steps. Because two domains undergoing a random walk are equally likely to diverge as opposed to converge over time, the probability of a touch-event is never more than 50%. The probability of a touch-event can be calculated based on an understanding of the probability distribution of a functional domain’s position x over time (which we index as the position of the central amino acid of the domain). The probability distribution of x is assumed to be normal because it is the sum of many independent events, with an expectation E[x(t)]=0 for any time t because steps in the positive and negative directions are equally likely. To find the variance of this distribution we first find the variance of x after a single time-step using the definition of variance for discrete random variables:
(S16)
where N is the number of different possible outcomes xw that are each associated with a probability Pw, and E[x] is the expected outcome. There are only three possible outcomes for a single step: a move in the positive direction (x(t=1) = 1), a move in the negative direction (x(t=1) = -1), or no move at all (x(t=1) = 0). We know the probabilities of each of these positions after one time-step is , and , respectively. Our expectation for the domain’s position after one time-step is zero, so the variance of x after one time-step is
(S17)
We can now calculate the variance of the position x after multiple time-steps because the position at time t is just the sum of many single steps, and the variance of a sum of elements is the sum of the variances of each of the elements. Thus,
. (S18)
With the probability distribution of the position of a given domain, we can now calculate the probability distribution that describes the distance between the closest ends of two functional domains.
Let K(t) be the distance between the central amino acids of two functional domains 1 and 2, at equilibrium length , starting at a distance y between their ends, as the functional domains change position over time, such that
, (S19)
where x1(t) and x2(t) are the changes in position (or the absolute position, if both were relative to starting positions of zero) of functional domains 1 and 2, respectively. Our expectation for K(t) is simply the sum of the expectations for each of its parts:
. (S20)
Similarly, because the variance of a sum or difference is the sum of the variances of each of the parts:
. (S21)
Lastly, knowing the expectation and the variance for K(t), and the fact that it must be normally distributed (because it is a sum of many independent changes in position), we can use the cumulative distribution function to find the probability that two functional domains, at equilibrium length and with a starting distance y between their ends, have touched (i.e., that K(t)≤) at any given time t:
. (S22)
2. Simulation details
For simulation results shown in Figures 4 and 7 and Table I, and in all other simulations unless otherwise noted, the parameters and initial conditions are as follows. The protein sequence consists of R=100 residues and the effective population size is Ne=100. At all time-steps a sequence performs two primary functions and, if any, a single secondary function (i.e., there are F=2 or 3 functions). Environmental change events occur at intervals of r=100 time-steps.
All functional domains have a minimum length of m=5 amino acids. There are two primary functions (Functions 1 and 2), each of which is defined by a four-amino acid alphabet (consisting of four numbers, since each amino acid type is identified by a unique number). These two alphabets intersect at a single amino acid type (i.e., amino acid type 4), and there is a single secondary function with an eight-amino acid alphabet.
Function 1 alphabet: {1,2,3,4}
Function 2 alphabet: {4,5,6,7}
Function 3 alphabet (initial): {8,9,10,11,12,13,14,15}
The length of the primary functional domains at time-step 0 is equal to the equilibrium length determined from previous simulations. When a secondary functional domain is present, the observed primary domain equilibrium length is 19 amino acids. Thus, in simulations with secondary functions, the primary functional domains are initiated at a length of 19 amino acids. In the absence of secondary functional domains, the observed equilibrium length of the primary functional domains is 23, so simulations without secondary functions are initiated with primary functional domains of length 23. In simulations with and without secondary functions, the two primary functional domains start with 5 amino acids between their two closest ends. The simulations begin at time-step 0, and are ended at time-step 400000, 1000000, or 2000000.
The particular parameters and initial conditions for the above simulations were chosen for the sake of comparability between simulations with and without secondary functional domains. The same qualitative pattern (encroachment on primary domains by secondary domains, and the evolution of overlap between primary domains) is also observed in simulations with the following alternative initial conditions and parameter choices: (1) primary functional domains initially located much farther apart, (2) a secondary domain initially located between the primary domains, (3) primary and/or secondary domains that are not initially at equilibrium length, (4) several primary and secondary functions present, (5) no enhanced competition between primary and secondary functions (i.e., primary and secondary function alphabets overlap with each other), and (6) equation 1 replaced with a rational function of form
, (S23)
where k and l are parameters that allow a good fit to equation 1.
Simulations with results in Figures 2 & 3 each have only a single primary functional domain and no secondary functional domains. The primary functional domains were initiated at the minimum functional domain length m=5, and were run for 1000000 time-steps. Mean domain lengths and random walk rate measures were taken using the last 800000 time-steps of each simulation to allow plenty of time for domains to reach equilibrium length.
The simulations with results reported in Figure 9 have several primary functions (F=4) with partially intersecting alphabets evolved under crowded conditions within a sequence (without any secondary functions). Function 1 has alphabet {1,2,3,4}, function 2 has alphabet {1,5,6,7}, function 3 has alphabet {1,8,9,10}, and function 4 has alphabet {1,11,12,13}. Ne =10000, m=10. Simulations were initiated with domains of length m.
The simulation in Figure 10 was performed with only a single secondary function (F=1) with alphabet {1,2,3,4}. In this simulation, loss of function mutations were allowed (so there is occasionally no viable functional domain within this protein). Ne =10000, m=4, r=1000. Simulations were initiated with a domain of length m.
The simulations reported in Figure 5 have four functional domains at any given time: two primary and two secondary. The alphabets for the primary functional domains are {1,2,3,4} and {5,6,7,8}, and the secondary functional domains are {9,10,11,12} and {13,14,15,16} at the beginning of the simulation. The coexisting secondary functional alphabets were never allowed to overlap (i.e., share any amino acid types). Parameters were set as follows: tmax=100000, c=.2, Ne=100, r=100, m=4, and the simulations were initiated with functional domains of length m (or longer, if by random chance the inserted domains were adjacent to amino acids within the same alphabet).
Simulations with results shown in Figures 4, 5, 6A, 6B, 6C, 7, 9, 10 and Table I were performed with a slight technical variation from the analytical model described in the text: “mutations” that do not alter amino acid composition (e.g., a “4” randomly reassigned to be a “4”) still count as mutational time-steps. Compared to the analytical model or the simulations that do not count such transitions as mutations (Figures 2 and 3), this has the effect of slightly lowering the mutation rate from 1 to 19/20.
3. Sticking frequency
The sticking frequency is defined as the probability of evolving overlap given a touch-event between two primary functional domains. If two functional domains touch, under the assumption that they do not affect one another, we would expect there to be a 50% chance of them continuing to converge versus moving apart again. If, on the other hand, the two functional domains do affect one another by drawing each other in, we would expect the sticking frequency to be above 50%. In order to determine whether two functions that have randomly touched are indeed “stuck” (converging), we have to establish how fast we expect a domain to move after an initial touch-event. We do not want to count a touch-event as a sticking-event until there has been enough time for the two functional domains to move apart again. Likewise, we do not want to miss a real stick-event by assessing the stick event after too long a time interval, since any two functions will eventually diverge given enough time—even if they are non-randomly stuck to one another for a long time. To establish the appropriate time interval after a touch-event at which to assess sticking frequency, we first determine the expected amount of time required for the central amino acid of a functional domain (at equilibrium length and undergoing a random walk) to have a .95 probability of moving by at least one amino acid position. This is a little below 50,000 time-steps, so we round up and use 50,000 time-steps after the touch-event as the threshold to determine whether two functional domains are stuck or not.
4. Calculating Ug,g directly
Although we can infer Ug,g, we can also calculate it directly in order to verify that the probabilities for all possible mutations, from a given starting length g, add up to one. The number Zg,g of equally likely viable mutations that cause no change in length depends on how long the domain is. When the functional domain is length g=R, all residues are part of the domain, so only alphabet amino acid mutations maintain the domain’s length. There are a-1 alphabet amino acids, thus,