Online supplementary information for Zhang et al., 2003

Table of contents:

1. Construction of Real Exon and Pseudo Exon Databases

2. Positive pentamers populating exon flanks

3. Mutual information

4. Base combinations

1. Construction of Real Exon and Pseudo Exon Databases

Exon/intron databases were constructed from GenBank release 130 using the Exon Intron Database (EID) tools provided by Saxonov et al. (Saxonov, S., Daizadeh, I., Fedorov, A., and Gilbert, W. 2000. Nucleic Acids Res. 28: 185-190). Human genes were extracted, and any genes that had >75% homology at the protein level were purged using BLASTP in a pairwise manner. Computationally predicted genes and genes subject to alternative splicing were also purged, based on related annotation words. Exons that were bordered by introns of at least 200 nt were extracted from these genes to constitute a real exon dataset of 5,753 exons. Splice sites of internal exons were used to generate position-specific scoring matrices for acceptor sites and donor sites, respectively.

Introns that were longer than 400 nt were then used to search for pseudo exons. Matrix scores were calculated essentially as described by Shapiro and Senapathy (1987). The searching program scanned the intron sequence and recorded the locations of any 9-mers whose donor site matrix scores were greater than 78 and any 15-mers whose acceptor site matrix scores were greater than 75. These thresholds were chosen so as to generate two sets of pseudo splice sites with an information content equal to that of the two sets of real sites. Each threshold also corresponds to approximately the lowest quartile of the corresponding splice site score distribution. These subsequences were treated as pseudo donor sites or pseudo acceptor sites. The program then looked for intronic sequences flanked by a pseudo acceptor site on the upstream side and a pseudo donor site on the downstream side. To define a pseudo exon, these sites had to be between 50 and 250 nt apart and at least 100 nt from the closest real exon. Intronic sequences that met all of these requirements were included in the pseudo exon database, which numbered 9,246.
The same processes were applied to repeat masked (Smit 2002) sequences to generate a subset of 4,912 pseudo exons free of highly repeated sequences.
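
The scan-and-pair procedure described above can be sketched as follows. This is a minimal illustration and not the authors' program: the scoring function is a simple rescaling of summed matrix frequencies to a 0-100 range, which may differ in detail from the published method, and the function names, the offsets that locate the exon boundary within each site window (14 and 3), and the reading of "at least 100 nt from the closest real exon" as a margin from either end of the intron are all assumptions.

```python
from typing import Dict, List, Sequence, Tuple

# One {base: frequency} column per position of the splice site window.
Matrix = Sequence[Dict[str, float]]


def matrix_score(window: str, pssm: Matrix) -> float:
    """Summed column frequencies, rescaled so the worst window is 0 and the best is 100."""
    total = sum(col.get(base, 0.0) for base, col in zip(window, pssm))
    lo = sum(min(col.values()) for col in pssm)
    hi = sum(max(col.values()) for col in pssm)
    if hi == lo:  # degenerate matrix: every window scores the same
        return 0.0
    return 100.0 * (total - lo) / (hi - lo)


def find_sites(seq: str, pssm: Matrix, threshold: float) -> List[int]:
    """Start positions of all windows whose score exceeds the threshold."""
    width = len(pssm)
    return [i for i in range(len(seq) - width + 1)
            if matrix_score(seq[i:i + width], pssm) > threshold]


def find_pseudo_exons(intron: str, acceptor_pssm: Matrix, donor_pssm: Matrix,
                      acc_threshold: float = 75.0, don_threshold: float = 78.0,
                      min_len: int = 50, max_len: int = 250, margin: int = 100,
                      acc_exon_offset: int = 14,   # assumed: 14 intronic bases in the 15-mer
                      don_exon_offset: int = 3     # assumed: 3 exonic bases in the 9-mer
                      ) -> List[Tuple[int, int]]:
    """Return (start, end) half-open intervals of candidate pseudo exons in an intron."""
    acceptors = find_sites(intron, acceptor_pssm, acc_threshold)
    donors = find_sites(intron, donor_pssm, don_threshold)
    hits = []
    for a in acceptors:
        start = a + acc_exon_offset          # first base of the pseudo exon
        for d in donors:
            end = d + don_exon_offset        # one past the last base of the pseudo exon
            if (min_len <= end - start <= max_len
                    and start >= margin
                    and len(intron) - end >= margin):
                hits.append((start, end))
    return hits
```

Running the same function on repeat-masked intron sequence would give the repeat-free subset, provided masked bases are scored as non-matching.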

In addition to the sequences from the EID, we generated a non-redundant final test set of real exons from 1,853 additional full-length human genes. mRNA sequences in Locus Link (ftp://ftp.ncbi.nih.gov/refseq/LocusLink/ homol_seq_pairs) were extracted and aligned to the human genome by Spidey ( We carefully examined the intron-exon structure of each hit using very stringent criteria: 1) the alignment of the mRNA had to be greater than 95%; 2) the identity in the alignment had to be at least 99% (allowing for the error rate of sequencing); and 3) AG and GT had to be present at the boundaries between the predicted exons and introns. We then removed exons and introns that were already present in the set generated from the EID, leaving us with approximately 15,000 real exons. About 24,000 pseudo exons were also generated from these genes as described above. These exons and pseudo exons were not used for SVM, but served as a final test set for the analysis of some feature statistics (see below).

Construction of Class II and Class III Pseudo Exon Databases

Class II pseudo exons have a trimer content that mimics the codons of real exons, and Class III pseudo exons have a hexamer content that mimics the dicodons of real exons. Single codon and dicodon frequency tables were generated from the sequences with complete cds in the EID. A codon content score was calculated for an exon candidate sequence for all 3 reading frames by, where l is the length of the cds; j is the index of 3-mers; p is the phase (0, 1, or 2); i is the position of the 3-mer and fcod j is the frequency of the jth 3-mer as a codon. Among the 3 phases, the one with the highest score was used as the score of the sequence. Both real exons and pseudo exons were evaluated in this way. The median score of real exons was set as the threshold for choosing Class II pseudo exons. Class III pseudo exons were similarly classified using dicodon frequencies and the lower quartile of real exons as the threshold. The additional requirements on exon bodies made the population of Class II and Class III pseudo exons shorter than the initial pseudo exons. To eliminate this influence, subsets of our real exons were chosen to have average lengths similar to those of these pseudo exon sets. We used five-fold cross-validation as a test of our results in this case: the data set was randomly divided into five folds, and in each of five experiments an SVM was trained on four folds and tested on the held-out fold.
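
The best-phase scoring and the median threshold can be illustrated with a short sketch. The exact scoring formula is not reproduced in this text, so the form used here (the mean log frequency of the in-frame 3-mers) is an assumption chosen only to show the mechanics. The function names, the `floor` value for unseen codons, and the direction of the threshold test (pseudo exons scoring at or above the real-exon median are kept) are likewise hypothetical.

```python
import math
import statistics
from typing import Dict, List


def codon_content_score(seq: str, codon_freq: Dict[str, float],
                        floor: float = 1e-6) -> float:
    """Best score over the three phases.

    ASSUMED form: mean log codon frequency of the in-frame 3-mers.
    `floor` stands in for codons missing from the frequency table.
    """
    best = -math.inf
    for phase in range(3):
        codons = [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]
        if not codons:
            continue
        score = sum(math.log(codon_freq.get(c, floor)) for c in codons) / len(codons)
        best = max(best, score)
    return best


def select_class_ii(pseudo: List[str], real: List[str],
                    codon_freq: Dict[str, float]) -> List[str]:
    """Keep pseudo exons scoring at or above the median score of real exons (assumed direction)."""
    threshold = statistics.median(codon_content_score(s, codon_freq) for s in real)
    return [s for s in pseudo if codon_content_score(s, codon_freq) >= threshold]
```

A Class III selection would follow the same pattern with a dicodon (6-mer) table and the lower quartile of real exon scores in place of the median.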

Data Division for SVM

The real exons and pseudo exons derived from the EID were randomly divided into two sets. The working set comprised approximately 60% of the real exons and 33% of the pseudo exons, giving roughly equal numbers of each. The test set contained the remainder of the real exons and a like number of randomly chosen pseudo exons. The working set was used for training and cross-validation, and the test set was used for evaluation only.
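
As an illustration of this division, the sketch below splits two lists at the stated fractions. The function name, the fixed random seed, and the use of rounding are assumptions and are not taken from the original analysis.

```python
import random
from typing import List, Sequence, Tuple


def split_working_test(real: Sequence, pseudo: Sequence,
                       real_frac: float = 0.60, pseudo_frac: float = 0.33,
                       seed: int = 0) -> Tuple[Tuple[List, List], Tuple[List, List]]:
    """Return ((working_real, working_pseudo), (test_real, test_pseudo))."""
    rng = random.Random(seed)          # fixed seed only so the sketch is reproducible
    real, pseudo = list(real), list(pseudo)
    rng.shuffle(real)
    rng.shuffle(pseudo)

    n_real = round(real_frac * len(real))
    n_pseudo = round(pseudo_frac * len(pseudo))
    working = (real[:n_real], pseudo[:n_pseudo])

    # Test set: all remaining real exons plus an equal number of pseudo exons
    # drawn from those not used in the working set.
    test_real = real[n_real:]
    test_pseudo = pseudo[n_pseudo:n_pseudo + len(test_real)]
    return working, (test_real, test_pseudo)
```

With the EID-derived counts given above (5,753 real and 9,246 pseudo exons), these fractions yield a working set of 3,452 real and 3,051 pseudo exons and a test set of 2,301 of each.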

2. Positive pentamers populating exon flanks

The table below compares how often each class of positive pentamers occurs in the 50 nt flanks of exons and pseudo exons. In every case the proportion of flanks with at least one occurrence of the pentamer class (as defined in the text) is substantially greater for real exons than for pseudo exons. Also presented is a control showing the same calculation for 50-mers randomly chosen from the interior of introns. The differences between the real and pseudo exons were typically an order of magnitude greater than the standard deviations exhibited by this control set. This control shows that in most cases not only is there an increase in the proportion of exon flanks with the indicated pentamer, but pseudo exon flanks also exhibit a deficiency in these sequences. An extreme example of the latter is the downstream “others” class: here the pentamer frequency is not different from random among real exon flanks but is significantly decreased among pseudo exon flanks. As a second control we carried out the same analysis for a set of 18 pentamers, each of which was the reverse complement of one of the members of the branch point-like group (Comp. BP). Unlike the selected pentamers, these sequences showed little difference between real exons, pseudo exons, and randomly chosen intronic 50-mers (last column).

Table. Prevalence of positive pentamer classes in real and pseudo exon flanks.

Upstream / BP-like (18) / PPT-like (20) / C-rich (5) / G-rich (7) / Other (10) / Comp. BP (18)
Real / 57.8% / 76.2% / 28.5% / 22.1% / 43.7% / 43.9%
Pseudo / 43.3% / 66.6% / 18.2% / 14.8% / 36.8% / 45.0%
Random / 48.1% / 67.1% / 20.3% / 19.0% / 37.7% / 44.9%
Random SD / 0.5% / 0.5% / 0.5% / 0.4% / 0.5% / 0.4%
Downstream / PPT-like (8) / G-rich (16) / C-rich (18) / Others (19)
Real / 38.7% / 48.8% / 37.8% / 68.4%
Pseudo / 33.4% / 32.0% / 27.9% / 61.8%
Random / 34.5% / 35.5% / 29.6% / 67.9%
Random SD / 0.4% / 0.4% / 0.4% / 0.4%

Legend to table: The numbers indicate the percentage of total exons or pseudo exons that contain at least one pentamer of the indicated class in the 50 nt flanking the exons. Approximately 15,000 real exons and 24,000 pseudo exons were analyzed. The classes correspond to the clusters of pentamers whose positional distribution is shown in Figure 3. Flanks here are defined as regions beyond the splice site sequences themselves: < -14 upstream and > 6 downstream of the exon boundaries. Random sequences consist of 50-mers culled at random from the interior of introns. The Random values represent the average of 100 collections of 15,000 50-mers; the standard deviation of these 100 values is also given. Comp. BP signifies a control set of unselected pentamers, the reverse complements of the branch point-like group.
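
The prevalence figures and the random control described in the legend could be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, flanks are taken to be pre-extracted 50 nt strings, and random 50-mers are drawn uniformly from a supplied list of intron-interior sequences.

```python
import random
import statistics
from typing import List, Sequence, Set, Tuple


def has_class_member(flank: str, pentamers: Set[str]) -> bool:
    """True if any 5-mer window of the flank belongs to the pentamer class."""
    return any(flank[i:i + 5] in pentamers for i in range(len(flank) - 4))


def prevalence(flanks: Sequence[str], pentamers: Set[str]) -> float:
    """Percentage of flanks containing at least one pentamer of the class."""
    hits = sum(has_class_member(f, pentamers) for f in flanks)
    return 100.0 * hits / len(flanks)


def random_control(intron_interiors: Sequence[str], pentamers: Set[str],
                   n_collections: int = 100, collection_size: int = 15000,
                   width: int = 50, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of prevalence over random collections of 50-mers."""
    rng = random.Random(seed)
    usable = [s for s in intron_interiors if len(s) >= width]
    values: List[float] = []
    for _ in range(n_collections):
        sample = []
        for _ in range(collection_size):
            s = rng.choice(usable)
            start = rng.randrange(len(s) - width + 1)
            sample.append(s[start:start + width])
        values.append(prevalence(sample, pentamers))
    return statistics.mean(values), statistics.stdev(values)
```

For scale, the upstream BP-like class in the table differs by 14.5 percentage points between real (57.8%) and pseudo (43.3%) exon flanks, against a control standard deviation of 0.5%.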

3. Mutual information

4. Base combinations

A normalized list of weights for all 3-way and 2-way base combinations, as promised in Results at the end of “Splice site sequences”, is included in the Excel spreadsheet files:

MutualInfo.xls

BaseCombos.xls