Supplementary material

Biased codon usage near intron-exon junctions: selection on splicing enhancers, splice site recognition or something else?

Jean-Vincent Chamary and Laurence D. Hurst

Department of Biology and Biochemistry, University of Bath, Bath, UK, BA2 7AY

Corresponding author: Hurst, L.D.

Dataset of complete coding sequences

We started with the same version of the human exon-intron database employed by Eskesen et al. [1], which contains 47 908 exons from 7150 entries. However, many sequences are incomplete: they do not start with ATG and/or do not finish with a stop codon. These were eliminated. Sequences with non-terminal in-frame stop codons were also excluded. Moreover, many genes are duplicates of some variety: different but similar genes, different versions of the same gene or, in some cases, identical copies of the same gene. We therefore performed an all-against-all BLAST (using E = 0.001), eliminating all but one member of each duplicate cluster. We arbitrarily retained the longest entry or, when several long coding sequences were present, selected one at random. For the remaining coding sequences, we cross-referenced the protein identifiers with NCBI (http://www.ncbi.nlm.nih.gov), which revealed five non-human genes. Before ignoring first and last exons, this left 18 414 exons from 2033 genes. Our final dataset consisted of 14 407 exons in 1802 genes.
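The completeness filter described above can be sketched as follows. This is a minimal illustration, not the script used for the study: the function name `is_complete_cds` is hypothetical, the stop codons are those of the standard genetic code, and the requirement that the length be a multiple of three is our assumption. The BLAST clustering step is not shown.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(seq: str) -> bool:
    """Return True if seq starts with ATG, ends with a stop codon and
    has no in-frame stop codon before the terminal one.

    Assumption (not stated in the text): the length must be a multiple
    of three, so that the final three nucleotides are in frame.
    """
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Reject sequences with a non-terminal in-frame stop codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])
```

For example, `is_complete_cds("ATGAAATAA")` is true, whereas a sequence with an internal TAA in frame is rejected.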

Proportional usage of synonymous codon pairs as a function of distance from junctions

All exons were trimmed so that the first and last codons were whole (i.e. removing 0–2 nucleotides from both ends). Each exon was divided in two, the first half being considered the 5’-end, the second the 3’-end. Under this protocol no given codon can be counted more than once. Running towards the interior of an exon, the distance from the intron-exon junction is the number of whole codons between a given codon and the junction pertinent to the half-exon. The first whole codon at each end was hence the 0th codon.
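The trimming and indexing protocol might be implemented along the following lines. This is a sketch under stated assumptions: `n_leading` is a hypothetical parameter giving the number of nucleotides (0–2) to remove from the 5’-end so that the first codon is whole, and, for exons with an odd number of whole codons, the middle codon is assigned to the 3’ half (the text does not say which half receives it).

```python
def half_exon_codons(exon: str, n_leading: int):
    """Split an exon into 5' and 3' half-exons of whole codons.

    Returns two lists of (distance, codon) pairs. Distance is counted
    in whole codons from the junction of the relevant half, so the
    first whole codon at each end has distance 0.

    n_leading (hypothetical parameter): nucleotides trimmed from the
    5' end to reach the first whole codon (0, 1 or 2).
    """
    if n_leading not in (0, 1, 2):
        raise ValueError("n_leading must be 0, 1 or 2")
    usable = len(exon) - n_leading
    # Drop the 0-2 trailing nucleotides that do not form a whole codon.
    end = n_leading + (usable // 3) * 3
    codons = [exon[i:i + 3] for i in range(n_leading, end, 3)]
    half = len(codons) // 2  # assumption: odd middle codon goes to the 3' half
    five_prime = list(enumerate(codons[:half]))
    # Distances in the 3' half run from the 3' junction towards the interior.
    three_prime = list(enumerate(reversed(codons[half:])))
    return five_prime, three_prime
```

Because the two halves partition the list of whole codons, each codon receives exactly one distance, in line with the statement that no codon can be counted more than once.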

For each synonymous codon pair of interest, we counted the number of occurrences of each codon at each position relative to the intron-exon junction. We estimated biases by considering the proportional use of the first codon within the pair, which is given by codon1/(codon1 + codon2), where codon1 and codon2 are the counts of the two codons at that position. The 0th codon was excluded due to known preferences at junctions (e.g. because AAG is often the last codon at the 3’ end of an exon [2], AAA will be comparatively rare).
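As an illustration of the proportion codon1/(codon1 + codon2), the sketch below tallies a synonymous pair by distance and skips the 0th codon. The function name and the input format, an iterable of (distance, codon) pairs pooled over half-exons, are assumptions made for the example. Distances at which neither codon occurs are omitted.

```python
from collections import Counter, defaultdict


def proportional_usage(indexed_codons, codon1, codon2):
    """Map each distance to (proportion of codon1, sample size).

    indexed_codons: iterable of (distance, codon) pairs.
    The proportion is count(codon1) / (count(codon1) + count(codon2)).
    Distance 0 is excluded, as in the text.
    """
    counts = defaultdict(Counter)
    for distance, codon in indexed_codons:
        if distance == 0:
            continue  # the 0th codon is excluded
        if codon in (codon1, codon2):
            counts[distance][codon] += 1
    usage = {}
    for distance in sorted(counts):
        n = counts[distance][codon1] + counts[distance][codon2]
        usage[distance] = (counts[distance][codon1] / n, n)
    return usage
```

The sample size returned alongside each proportion is the number of codons of the pair at that distance, which is the quantity used as a weight in the regressions of Figure 1.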

Proportional representation of codons in candidate exonic splicing enhancer hexamers

We counted the number of occurrences of each of the 64 possible codons in a list of strong candidate hexamers that define human ESEs, separated into clusters with high sequence similarity (kindly provided by Will Fairbrother [3]). From these clusters we determined whether a given hexamer enhanced splicing at the 5’ and/or 3’-ends of exons. We then split each hexamer in the non-redundant lists into four codons (two in the frame starting at position 1 and one each in the frames starting at positions 2 and 3), which yielded 95 hexamers with 5’ activity (a 380-codon set) and 177 hexamers with 3’ activity (a 708-codon set). A list of 238 hexamers that is not split by 5’ or 3’ activity can be obtained from the RESCUE-ESE web server (http://genes.mit.edu/burgelab/rescue-ese).
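The decomposition of hexamers into codons can be sketched as below. The function names are hypothetical and the hexamer in the usage note is purely illustrative, not taken from the lists described above. Reading a hexamer in the frames starting at positions 1, 2 and 3 gives the same four codons as sliding a three-nucleotide window over positions 1–3, 2–4, 3–5 and 4–6, which is how the sketch extracts them.

```python
from collections import Counter


def hexamer_codons(hexamer: str):
    """Return the four codons contained in a hexamer (windows 1-3,
    2-4, 3-5 and 4-6)."""
    if len(hexamer) != 6:
        raise ValueError("expected a sequence of exactly six nucleotides")
    return [hexamer[i:i + 3] for i in range(4)]


def codon_proportions(hexamers):
    """Proportional representation of each codon in the pooled codon
    set derived from a list of hexamers (four codons per hexamer)."""
    counts = Counter()
    for hexamer in hexamers:
        counts.update(hexamer_codons(hexamer.upper()))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```

Each hexamer contributes four codons, so a list of 95 hexamers gives a 380-codon set and a list of 177 gives a 708-codon set. For instance, `hexamer_codons("GAAGAA")` yields GAA, AAG, AGA and GAA.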

References

1 Eskesen, S.T. et al. (2004) Natural selection affects frequencies of AG and GT dinucleotides at the 5' and 3' ends of exons. Genetics 167, 543–550

2 Sun, H. and Chasin, L.A. (2000) Multiple splicing defects in an intronic false exon. Mol. Cell. Biol. 20, 6414–6425

3 Fairbrother, W.G. et al. (2002) Predictive identification of exonic splicing enhancers in human genes. Science 297, 1007–1013


Figure 1. The proportional usage of one codon within a synonymous pair as a function of distance from intron-exon junctions. For each of the following synonymous codon pairs, distance from the 5’-end of the exon (in the 5’ to 3’ direction) is plotted on the left-hand side, whereas distance from the 3’-end of the exon (in the 3’ to 5’ direction) is plotted on the right-hand side. The plotted lines are the best-fit regression lines, weighted by the number of codons of the given pair in the sample at the given distance. For a statistical summary of the results, see Table 1 in the main text.
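A weighted regression of the kind described in the legend could be computed as in the sketch below. It is a generic weighted least-squares fit of a straight line, written here for illustration only; the software actually used to fit the plotted lines is not stated, and the function name is hypothetical. Here x is the distance from the junction, y the proportional usage and w the number of codons of the pair at that distance.

```python
def weighted_linear_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x.

    Minimises sum(w_i * (y_i - intercept - slope * x_i) ** 2).
    Returns (slope, intercept).
    """
    if not (len(x) == len(y) == len(w)) or len(x) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Weighting by sample size gives less influence to distances far from the junction, where fewer half-exons are long enough to contribute codons.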