1

Online Supporting Information B: Generation of

the Complexity Measure Factors

A protein sequence is formed by 20 native amino acids whose single character codes are: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. It is very difficult to find its characteristic pattern particularly when the sequence is very long. To cope with this situation, we resort to the images derived from the amino acid sequence by means of the space-time evolution of cellular automaton [1], as briefed below.

Suppose a protein P consists of N amino acids; i.e.

(1)

where R1 represents the first residue of the protein, R2 the second residue, and so forth. To transform a protein sequence from a character code to a numerical one, we adopted the code-converting relation as given in Table 1, which can better reflects the chemical and physical properties of an amino acid, as well as its structure and degeneracy, as elaborated in Xiao et al. [2]. If each of the constituent amino acids in the protein P is coded in a binary code according to Table 1, the protein sequence will be transformed to a serial of 5N digital elements, where each of the elements is either 0 or 1. For example, the sequence “PLQHRS…” is accordingly transformed to “000010001100100001010011001001…”. Each of these elements can be treated as a pixel with “0” for “white” and “1” for “black”, then by following the space-time evolution procedures as described in Xiao et al. [2], the protein P would correspond to a cellular automaton image.

Cellular automata [1] are discrete dynamical systems whose behavior is completely specified in terms of a local relation. A cellular automaton can be thought of as a stylized universe consisting of a regular grid of cells, each of which is in one of a finite number of possible states, updated synchronously in discrete time steps according to a local, identical interaction rule[1]. The concept of cellular automata has attracted a great deal of interest in recent years because many extremely complex patterns can be evolved by just repeatedly applying some very simple rules. This is particularly useful for emulating complicated physical, social, and biological systems.

In this study, the practical approach to generate the cellular automaton image for a given protein sequence can be described as follows.

According to Table 1, the residue chain of Eq.1 is initially converted to a sequence with digits (or grids); i.e,,

(2)

where as defined in [3]. Suppose the time for each updated step is consecutively expressed by , we have

(3)

where

(4)

with the spatially periodic boundary conditions; i.e.,

and (5)

The rule of Eq.4 is actually the space-time evolution rule for cellular automation models [1]. In the cellular automaton image for a protein sequence, itsconstituent residues are coupled with each other as an entity. While producing its cellular automaton image, the individual cell statecorresponding to a certain amino acid residue is colligated withthe residues both prior and behind it. Because of this, the cellular automaton image can help reveal some implicit sequence features that are usually difficult to be visualized. It was found that, of the 256 evolving rules [1], the space-time evolution rule of Eq.4 is the best in this regard. In other words, by means of Eq.4, we can convert the one-dimensional protein sequence to a two-dimensional matrix or a graph, from which some important features can be effectively extracted for incorporation into of Eq.5.

In this study, the grid at is filled with white color if and black if . Accordingly, each g-string in Eq.3 corresponds to a narrow ribbon mixed with white and black colors. Scanning these ribbons successively on to a screen or sheet will generate a 2D (2-dimensional) black-and-white image. The image thus evolved is called the cellular automaton image.

A protein sequence is actually a symbolic sequence for which the complexity measure factor can be used to reflect its sequence feature or pattern [4]. Among the known measures of complexity, the Lempel-Ziv (LZ) complexity reflects the order that is retained in the sequence, and hence was adopted in this study. Below, let us first introduce some basic definition about LZ complexity.

(6)

The LZ complexity of a sequence can be measured by the minimal number of steps required for its synthesis in a certain process. For each step only two operations were allowed in the process: either generating an additional symbol which ensures the uniqueness of each component or copying the longest fragment from the part of a synthesized sequence. Its substring is expressed by

(7)

The complexity measure factor,, of a non-empty sequence synthesized according to the following procedure is defined by the minimal number of steps

(8)

Let us assume that has been reconstructed by the program up to the digit , and has been newly inserted. The string up to will be denoted by , where the dot denotes that is newly inserted in order to check whether the rest of the string can be reconstructed by a simple copying. First, suppose and see whether is reproducible from. If the answer is “no”, then we insert into the sequence followed by a dot. Thus, it could not be obtained by the copying operation. If the answer is “yes”, then no new symbol is needed and we can go on to proceed with and repeat the same procedure. The LZ complexity is the number of dots (plus one if the string is not terminated by a dot). For example, for the stringS=0001101001000101, the LZ schema of synthesis generates the following components and the corresponding complexity:

(9)

implying that the complexity measure factor for the stringS=0001101001000101 is 6.

We can derive M complexity factors if the image has M rows. These complexity factors can allbe used to serve as the pseudo amino acid components. However, it was observed that the highest success rates were resulted if the first 10 complexity factors were used.

1

Table 1 Binary coding for 20 different amino acids

Type / Code
Character / P / L / Q / H / R / S / F / Y / W / C
Decimal / 1 / 3 / 4 / 5 / 6 / 9 / 11 / 12 / 14 / 15
Binary / 00001 / 00011 / 00100 / 00101 / 00110 / 01001 / 01011 / 01100 / 01110 / 01111
Character / T / I / M / K / N / A / V / D / E / G
Decimal / 16 / 18 / 19 / 20 / 21 / 25 / 26 / 28 / 29 / 30
Binary / 10000 / 10010 / 10011 / 10100 / 10101 / 11001 / 11010 / 11100 / 11101 / 11110

References

1.Wolfram S (1984) Cellular automation as models of complexity. Nature 311: 419-424.

2.Xiao X, Shao S, Ding Y, Huang Z, Huang Y, Chou KC (2005) Using complexity measure factor to predict protein subcellular location. Amino Acids 28: 57-61. doi: 10.1007/s00726-004-0148-7

3.Chou KC, Shen HB (2009) Review: recent advances in developing web-servers for predicting protein attributes. Natural Science 2: 63-92 (openly accessible at doi: 10.4236/ns.2009.12011

4.Xiao X, Shao SH, Huang ZD, Chou KC (2006) Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor. J Comput Chem 27: 478-482. doi: 10.1002/jcc.20354