Calculation of compositional bias in CS samples

Occurrences of an n-mer of length within a sequence of length can be treated as a Bernoulli trial if, for each position along the sequence from to we consider a success as the occurrence of the n-mer if it starts at the current position.

Although the equation, , is often used to calculate the total number of trials for occurrences of an n-mer, we propose an adjustment (Equation 2) for the reasons outlined below. One of the requirements of a Bernoulli trial is that each trial is independent and results in either a success or a failure. Furthermore, the probability for a success must be defined such that the range is [0, 1]. In the case where we can perform an independent trial at each position along , and the probability for success has a range of [0, 1]. However, if and the n-mer contains no terminal repeats (e.g. ) then can occur at most times, where “” is the integer function. Thus, the only way that we can allow the probability to have the range of [0, 1] is to define the total number of independent trials in as . If we let represent the number of successes, and let represent the number of failures, then . In the case where contains a terminal repeat with length (e.g. if then , or if then , or if then ) then it can be shown that . Therefore, we define the general form for the number of failures to be where (Equation 2). When we are examining a population of sequences we define (Equation 3).

In order to determine the bias of the n-mer in the CS population we define the null hypothesis to be and the alternative to be . We let and . We can define a confidence interval for the binomial distribution (equation 4) using given a sample size equal to , where is the z-value that contains an area of in the tails of the normal distribution (Agresti and Coull 1998; Wilson 1927). This allows us to establish whether the value lays within the confidence interval for a sample that is the size of assuming .

(Equation 4)

In addition we can use the Gaussian approximation to the binomial distribution to calculate a Z-score for the number of counts observed in the CS. It is widely accepted that if the number of trials is sufficiently large (i.e. if both and ) then the binomial distribution can be approximated using the normal distribution. If the null hypothesis is true then the maximum likelihood estimator is , and the Z-score for n-mer can be defined according to Equation 5.

(Equation 5)

This Z-score (without adjustment of ) is essentially the same as that used by Fairbrother et. al. to determine n-mers enriched in ESEs (Fairbrother et al. 2002). The adjusted value for all possible occurrences, , produces essentially the same as ; however, since is generally smaller than the confidence intervals calculated according to Equation 4 are also smaller. Thus, the use of produces a more conservative estimation of significance than use of .

Calculation of strength of association between conserved n-mer occurrences and alternative splicing

In order to assess the association between the occurrences of a conserved n-mer and alternative splicing we defined splice-junctions as alternative based upon the data contained in the UCSC ExonWalk database (http://hgdownload.cse.ucsc.edu/goldenPath/hg17/database/). A splice-junction was annotated as alternative if it was annotated in the ExonWalk database as either an alternative splice-junction or was adjacent to a skipped-exon event. We label an alternative event as , and the presence of an n-mer in a CS as , then the number of instances of an n-mer in a CS located in an alternatively spliced-junction is . Similarly, the number of instances of an n-mer located in a CS associated with a non-alternative splice-junction is . The maximum number of times that the n-mer could have occurred in a sequence the length of the CS, , was determined using Equation 2 (see above), and from this the actual counts of the n-mer were subtracted to give associations. The G-test of association was used to determine the significance of the association where , and and represent the observed and expected types of associations. To correct for continuity, the William’s correction was applied to the corresponding G-values to yield (Equation 6)(Sokal and Rohlf 1995).

(Equation 6)

In order to determine if a CSM was composed of n-mers representing a biased distribution of associations the corresponding P-values for were derived using the Chi-distribution and transformed to produce the test statistic using Equation 7. This transforms the P-values to produce a symmetrical distribution where positive values indicate an association with alternative splicing and negative values indicate an association with constitutive splicing. Student’s T-test was used to determine if there was a significant (a=0.01) difference in the means between the values for the CSM and the distribution of for all n-mers from the sample.

(Equation 7)

Graph clustering by common substrings

In this method, the input graph is constructed such that each n-mer is represented as a vertex, and each vertex is connected by an edge to all other vertices sharing a common substring. In order to provide an edge weight we used an enrichment Z-score based upon the Gaussian approximation to the binomial distribution (see Materials and Methods). Edges are assigned weights according to the greatest Z-score of either adjacent vertex. The input graph is then clustered using the Markov graph clustering algorithm MCL (Dongen 2000; Enright et al. 2002)(see next section for details). In this manner, n-mers containing similar substrings and having high Z-scores tend to cluster with other compositionally similar n-mers. Recently, an independently devised graph-based motif finding algorithm, MotifCut, was published (Fratkin et al. 2006). GCCS differs from MotifCut in several significant ways. The primary difference between these methods is the emphasis that is placed upon inserting edges versus pruning edges. While MotifCut uses statistically determined parameters for inserting an edge between vertices, GCCS simply requires a common substring. However, unlike MotifCut, GCCS uses edge-weighting and MCL clustering to prune edges. We have not performed a comprehensive comparison of the performance of these two methods.

Previous studies designed to identify cis-splicing elements have used hierarchical clustering based upon alignment scores between n-mers (Fairbrother et al. 2002). We feel the GCCS method offers several advantages over hierarchical clustering of short n-mers. One disadvantage with hierarchical clustering based upon similarity is that it does not work well for clustering populations of mixed length n-mers. As discussed earlier, we felt that allowing for variable length n-mers would be advantageous when the subject sequences might contain binding sites for various proteins each of which might have a different sized binding site. Since the GCCS method only requires that an n-mer posses a substring in common with other n-mers, the performance is fairly independent of n-mer length. Another advantage of graph based clustering is that weights can be applied to the vertices and/or nodes. This allows external criteria that are independent of the sequence composition (e.g. the statistical score for the n-mer) to influence clustering. A final advantage of the GCCS method is that clustering tends to act as a final filter since only n-mers that share a common substring with other n-mers will remain in the final clusters. As discussed above, real motifs are expected to generate multiple n-mers with common substrings in different phases. By requiring that an n-mer share a substring with other n-mers the GCCS method takes advantage of this phenomenon and effectively reduces the n-mers that occur in the population by chance. Although positional degeneracy can be incorporated into the GCCS method we chose not to allow mismatches between substrings. It is difficult to build a general degeneracy model that fits all RNA binding proteins. Some proteins display strong affinities for very specific sites while others appear to be more promiscuous and have similar affinities for a range of compositionally related sites. Since we wanted to detect all conserved RNA elements as individual motifs we chose not to allow degeneracy. The effect of this choice is that the motifs we identify may over-estimate the number of physiologically distinct sites since a given protein may equally recognize more than one motif. However, we hoped that by allowing only exact matches we would identify the most highly conserved portions of cis-signals.

The final step in motif construction involves aligning the clustered n-mers according to their greatest common substring (GCS), followed by the creation of a positional weight matrix (PWM) in which each aligned sequence was represented proportionally to its actual occurrence in the CS sequences. The PWM was used to create Seqlogos where character height is proportional to the fractional contribution of each base (see below for details). We refer to the resulting motifs as conserved sequence motifs or CSMs.

Details of GCCS clustering

In the GCCS method compositionally similar n-mers are clustered using a graph clustering method. The input (master) graph is constructed such that each n-mer is represented as a vertex, and each vertex is connected by an edge to all other vertices sharing a common substring (defined below). Edges are assigned weights proportional to the greatest Z-score of either of the adjacent vertices. A custom program (source code available upon request) was written to use the Markov graph clustering program mcl (Dongen 2000). The MCL ‘scheme’ was set to 4, and the MCL ‘inflation’ was set to 3. All other MCL values were the mcl defaults. Six rounds of clustering were performed. The starting minimal common substring length was set to bases in length where is the length of the longest n-mer in the input population. After each round, clusters containing less than 6 vertices were rejected and put back into the unclustered population. On occasion the graph clustering failed to partition clusters of compositionally similar n-mers. To rectify this problem, the greatest common substring (GCS) of each cluster was determined using a suffix-trie. The GCS was defined to be the longest substring shared by all members of the cluster. If the GCS was less than 3 bases in length the cluster was rejected and the members were returned to the unclustered population. For each subsequent round of clustering the minimal common substring length was decreased for the following clustering round if no clusters were found in the previous round. The shortest minimal common substring length allowed was 4.

Sequence logos (similar to Seqlogos (Schneider and Stephens 1990)), used to represent the conserved sequence motifs (CSMs), were generated from each cluster using custom software. For each cluster the n-mers were aligned according to their GCS, and sequence logos were constructed such that the height of each character is proportional to the fractional contribution of that character in the clustered population normalized to reflect the unique contribution that the n-mer could make given the counts of the n-mer in the CS population. The fractional contribution was calculated as follows. Let represent the number of n-mers in the clustered population. Let represent a position in the aligned sequence space and , where is the total length of the aligned sequence. Let represent the current character in the n-mer . If represents the count for the n-mer in the CS population then the unique contribution, , of n-mer is the number of occurrences of that could not have arisen from any longer n-mer in the cluster that also contains . was calculated according to Equation 7, where if n-mer is a substring of n-mer and, else .

(Equation 7)

Then the fractional contribution is calculated, according to Equation 8, for each character , where “-” represents an end-gap, and represents the identity function where if, for n-mer , , else .

(Equation 8)

References.

Agresti, A. and B.A. Coull. 1998. Approximate is Better than Exact for Interval Estimation of Binomial Proportions. The American Statistician 52: 119-126.

Dongen, S.v. 2000. Graph Clustering by Flow Simulation. Thesis: University of Utrecht, Utrecht, NL.

Enright, A.J., S. Van Dongen, and C.A. Ouzounis. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30: 1575-1584.

Fairbrother, W.G., R.F. Yeh, P.A. Sharp, and C.B. Burge. 2002. Predictive identification of exonic splicing enhancers in human genes. Science 297: 1007-1013.

Fratkin, E., B.T. Naughton, D.L. Brutlag, and S. Batzoglou. 2006. MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinformatics 22: e150-157.

Schneider, T.D. and R.M. Stephens. 1990. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18: 6097-6100.

Sokal, R.R. and F.J. Rohlf. 1995. Biometry: the principles and practice of statistics in biological research. W. H. Freeman and company, New York.

Wilson, E.B. 1927. Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association 22: 209-212.