Title: Overview of splicing relevant databases

Pierre de la Grange

GenoSplice technology, Centre Hayem, Hôpital Saint-Louis, Paris, France

*Address correspondence to: Pierre de la Grange, GenoSplice technology, Centre Hayem, Hôpital Saint-Louis, 1 avenue Claude Vellefaux, 75010 Paris, France; tel: +33 (0) 157 276 839; fax: +33 (0) 157 276 831; E-mail:

1. Abstract

Alternative splicing is the main mechanism that increases transcriptome diversity by generating multiple RNA isoforms from a single gene. It affects more than 90% of human genes and is altered in many diseases. Other mechanisms also increase transcriptome diversity: for example, at least 81% of genes are subject to alternative transcription initiation (Denoeud et al., Genome Res. 17 (2007), pp. 746–759) and 60% to alternative polyadenylation. Around 10% of human genes may produce more than 10 different transcripts (i.e., with a different exon content). The large number and wide biological impact of alternative transcripts have created a high demand for tools enabling the identification, classification, functional annotation and expression profiling of alternative transcripts. To meet this demand, several alternative splicing databases have been developed based on large-scale mappings or assemblies of transcribed sequences.

2. Theoretical background

2.1. Alternative splicing databases: interest

Alternative splicing affects more than 90% of human genes [1] and is altered in many diseases [2]. To study the regulation of gene expression, including splicing regulation, researchers need tools and information to guide and interpret their experiments. Alternative splicing databases can meet several of these needs by gathering and organizing genomic and transcriptomic data, together with tools that predict many regulatory features (e.g., tissue expression).

2.2. Alternative splicing databases: common strategy

Most alternative splicing databases are based on the same strategy: the exon content of transcripts is retrieved by aligning the transcript sequences with each other or against the corresponding genomic sequence. Transcript sequences are downloaded from publicly available databanks: EMBL, GenBank and DDBJ [3,4,5]. Among these sequences, “full-length” complementary DNAs (flcDNA) make it possible to define the whole exon content of the different gene products. The first massive productions of Expressed Sequence Tags (ESTs) date from the beginning of the 1990s. ESTs are single reads from clone ends, obtained from collections of normal or pathological tissues, and they provide the major source of information for the computational detection of alternative splicing patterns. Other kinds of sequences are obtained by large-scale approaches: Sequence Tagged Sites (STS), Genome Survey Sequences (GSS) and High Throughput Genomic Sequences (HTGS). Transcript-to-genome alignments are performed using dedicated bioinformatics tools whose sensitivity, specificity and speed vary. The most used tools are BLAT [10], SIM4 [7], GMAP [8], SPA [9] and POA [6]. Alternative events are defined by comparing the exon content of transcripts from the same gene. Integrating these data in a user-friendly web interface is crucial to give users easy access to the information. Many other information sources are often integrated (e.g., protein information from SwissProt).
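
The comparison of exon content described above can be illustrated with a minimal Python sketch. It assumes that each transcript has already been aligned to the genome and reduced to a list of exon coordinates. The function name, the transcript names and the coordinates are hypothetical, and only one event type (cassette exon) with exactly matching exon boundaries is handled; it is not the method of any particular database.

```python
from typing import Dict, List, Set, Tuple

# An exon is a (start, end) pair of genomic coordinates on one gene.
Exon = Tuple[int, int]


def find_cassette_exons(transcripts: Dict[str, List[Exon]]) -> Set[Exon]:
    """Return internal exons of one transcript that another transcript skips.

    Simplifying assumption: exons shared between transcripts have exactly
    the same boundaries.
    """
    # Collect every pair of exons that are directly joined in some transcript.
    junctions: Set[Tuple[Exon, Exon]] = set()
    for exons in transcripts.values():
        ordered = sorted(exons)
        junctions.update(zip(ordered, ordered[1:]))

    cassette: Set[Exon] = set()
    for exons in transcripts.values():
        ordered = sorted(exons)
        # Slide over each internal exon together with its two neighbours.
        for upstream, exon, downstream in zip(ordered, ordered[1:], ordered[2:]):
            # The exon is skipped if some transcript joins its neighbours directly.
            if (upstream, downstream) in junctions:
                cassette.add(exon)
    return cassette


# Hypothetical coordinates, for illustration only.
example = {
    "transcript_1": [(100, 200), (300, 400), (500, 600)],
    "transcript_2": [(100, 200), (500, 600)],
}
print(find_cassette_exons(example))  # {(300, 400)}
```

In practice, EST alignments rarely share exact boundaries, so a real pipeline would need tolerance rules and additional event types (alternative splice sites, intron retention, alternative terminal exons).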

2.3. Description of Alternative splicing databases

More than 30 alternative splicing databases have been developed in recent years. However, each database has its own specificities and there is no “perfect database”: two or three should be used together. Table 1 presents a selection of 14 databases with a brief description, advantages and reference for each.

2.4. The UCSC genome browser

In addition to specialist databases on alternative splicing, other bioinformatics tools are very useful for studying the regulation of gene expression at the exon level. One of the best known and most useful tools is the UCSC Genome Browser [25]. This site contains the reference sequence and working draft assemblies for a large collection of genomes. The Genome Browser provides dozens of aligned annotation tracks that have been computed at UCSC or have been provided by outside collaborators. In addition to these standard tracks, users can upload their own annotation data for temporary display in the browser (see “Protocol” section).

3. Protocol

Since each database provides many different options, it is not possible to describe here how to use each of them. For most of them, detailed documentation is available on their website or in the corresponding publication (see Table 1). An example of the use of FAST DB from GenoSplice technology is provided in the next section. The remainder of this section describes how to create a custom track with the UCSC Genome Browser (explanations taken from the UCSC website).

Genome Browser annotation tracks are based on files in line-oriented format. Each line in the file defines a display characteristic for the track or defines a data item within the track. Annotation files contain three types of lines: browser lines, track lines, and data lines. To construct an annotation file and display it in the Genome Browser, follow these steps:

Step 1: Format the data set

Formulate your data set as a tab-separated file using one of the formats supported by the Genome Browser: GFF, BEDGRAPH, GTF, PSL, BED, bigBed, WIG, bigWig, MAF and microarray (see the UCSC website for more details about these formats).

Step 2: Define the Genome Browser display characteristics

Add one or more optional browser lines to the beginning of your formatted data file to configure the overall display of the Genome Browser when it initially shows your annotation data (genome.ucsc.edu/goldenPath/help/customTrack.html#lines). Browser lines allow you to configure such things as the genome position that the Genome Browser will initially open to, the width of the display, and the configuration of the other annotation tracks that are shown (or hidden) in the initial display.

Step 3: Define the annotation track display characteristics

Following the browser lines and preceding the formatted data, add a track line (genome.ucsc.edu/goldenPath/help/customTrack.html#TRACK) to define the display attributes for your annotation data set. Track lines enable you to define annotation track characteristics such as the name, description, colors, initial display mode, use score, etc.
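
The three steps above can be sketched in Python as follows. This is a minimal illustration that assumes the BED format with six fields; the function name, the track name and all coordinates are hypothetical placeholders and do not correspond to a real gene locus.

```python
from typing import Iterable, Tuple

# One BED item: (chrom, start, end, name, score, strand).
# BED starts are 0-based and ends are exclusive.
BedItem = Tuple[str, int, int, str, int, str]


def build_custom_track(
    items: Iterable[BedItem],
    position: str,
    track_name: str,
    description: str,
) -> str:
    """Assemble browser line, track line and BED data lines into one text."""
    lines = [
        # Step 2: browser line setting the position opened initially.
        f"browser position {position}",
        # Step 3: track line with display attributes of the data set.
        f'track name="{track_name}" description="{description}" '
        "visibility=2 useScore=1",
    ]
    # Step 1: tab-separated data lines in BED format.
    for chrom, start, end, name, score, strand in items:
        lines.append("\t".join([chrom, str(start), str(end), name, str(score), strand]))
    return "\n".join(lines) + "\n"


# Placeholder coordinates, for illustration only.
track_text = build_custom_track(
    items=[
        ("chr4", 1000, 1200, "exon_A", 900, "+"),
        ("chr4", 1500, 1650, "exon_B", 400, "+"),
    ],
    position="chr4:1000-2000",
    track_name="my_exons",
    description="Example exon annotations",
)
print(track_text)
```

The resulting text can be saved to a file or pasted into the custom track submission form of the Genome Browser. With useScore=1, items with a higher score (0 to 1000) are drawn in a darker shade.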

4. Example of an experiment

Figures 1 to 4 show several screenshots from FAST DB [16,26] regarding the human PDLIM5 gene. In addition to the options presented, FAST DB provides many others, such as tissue-specificity analysis using EST expression data, prediction of microRNA binding sites, prediction of NMD regulation, prediction of transcription and splicing factor binding sites, etc. FAST DB also provides direct links to many other databases (SwissProt, PubMed, Entrez Gene, OMIM, EnsEMBL, UCSC, other alternative splicing databases, etc.). Even before the FAST DB update that added several options and tools, a publication by Lerivray et al. rated it the most useful and user-friendly alternative splicing database [27].

5. Troubleshooting

Alternative splicing regulation is known to depend on developmental stages, tissues and various stimuli [28,29]. However, very few cell types, developmental stages or stimuli have been studied in a genome-wide manner. Consequently, the main limitation of the approaches used by alternative splicing databases is that the transcript sequences in publicly available databanks surely underrepresent the transcripts that exist in vivo.

Moreover, although not all possible transcript sequences are available in databanks, their number grows every day, so regularly updating specialist resources such as alternative splicing databases is crucial. However, because of technical and time limitations, many databases are not regularly updated.

The amount of information available in an alternative splicing database depends on how recently it was updated, but also on how the raw data used to define gene exon structures and alternative events were selected. For example, EST data contain a wide variety of experimental artefacts that can lead to incorrect predictions of alternative splicing. To reduce the number of such artefacts, some databases have set up several filters. Stringent filters yield fewer but higher-confidence data; conversely, weaker (or no) filters yield much more data but with many artefacts. For this reason, it is advisable to use two or three different databases to obtain an overview of all the information available on splicing events for a given gene.
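
The trade-off between filter stringency and data volume can be illustrated with a small Python sketch. The filtering criterion used here (a minimum number of supporting sequences per event) is a hypothetical choice made for illustration, as are the event names and counts; the filters actually applied by each database are described in its own documentation.

```python
from typing import Dict, List


def filter_events(events: List[Dict], min_support: int) -> List[Dict]:
    """Keep events supported by at least `min_support` transcript sequences."""
    return [e for e in events if e["supporting_sequences"] >= min_support]


# Hypothetical predicted events with the number of supporting sequences.
events = [
    {"event": "exon_skipping_A", "supporting_sequences": 12},
    {"event": "exon_skipping_B", "supporting_sequences": 2},
    {"event": "intron_retention_C", "supporting_sequences": 1},
]

# A higher threshold keeps fewer events, each with stronger support.
for threshold in (1, 2, 5):
    kept = filter_events(events, threshold)
    print(threshold, [e["event"] for e in kept])
```

A single-sequence event such as the last one may be a genuine rare isoform or an artefact, which is why comparing databases with different stringencies is informative.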

Finally, another crucial aspect, which is not limited to alternative splicing databases, concerns the standardization of data. For example, the same gene can have a different number of exons depending on the database (e.g., exon #4 of one gene in a given database corresponds to exon #6 of the same gene in another database). This becomes a real problem when researchers need to compare or share their results, in particular when publishing their data. As HUGO has done for gene names and symbols [30], efforts should be made to standardize information on the exon/intron structure and alternative events of known genes.
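
Until such a standard exists, exon numbers from two databases can be reconciled through genomic coordinates. The following Python sketch assumes that both databases report exon coordinates on the same genome assembly and strand; the function name, exon numbers and coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), both inclusive


def map_exon_numbers(
    db_a: Dict[int, Exon], db_b: Dict[int, Exon]
) -> Dict[int, List[int]]:
    """For each exon number in db_a, list the overlapping exon numbers in db_b."""
    mapping: Dict[int, List[int]] = {}
    for num_a, (start_a, end_a) in db_a.items():
        mapping[num_a] = sorted(
            num_b
            for num_b, (start_b, end_b) in db_b.items()
            # Two closed intervals overlap if each starts before the other ends.
            if start_a <= end_b and start_b <= end_a
        )
    return mapping


# Hypothetical exon tables for the same gene in two databases.
database_a = {3: (300, 400), 4: (500, 600)}
database_b = {5: (300, 400), 6: (500, 620), 7: (900, 950)}

print(map_exon_numbers(database_a, database_b))  # {3: [5], 4: [6]}
```

An exon that maps to no exon, or to several, in the other database points to a difference in gene structure that should be checked before results are compared.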

Figure legends

Table 1: Description of relevant alternative splicing databases

Figure 1: Options from FAST DB. Example with the human PDLIM5 gene

A. Main page of FAST DB for the PDLIM5 gene. Exon/intron gene structure is displayed with known alternative events in red. In particular, exons 10 to 12 are known to be multiple-cassette exons and exons 9 and 18 are two alternative terminal exons for this gene.

B. Tissue-specificity of the multiple-cassette exons 10 to 12 using EST data. These exons seem to be specifically included in muscle and heart tissues (blue bars) and skipped in the other tissues (red bars).

C. The in silico PCR option of FAST DB facilitates primer design for RT-PCR validations and predicts the expected product sizes and sequences.

D. FAST DB helps predict the functional consequences of alternative events by providing protein domain predictions through direct links to specialist databases (e.g., SMART). In this example, the short form ending in exon 9 is predicted to be translated into a protein containing one PDZ domain. The long form (ending in exon 18) is predicted to be translated into a protein containing one PDZ domain and three LIM domains.

References

[1] Wang E.T., Sandberg R., Luo S., Khrebtukova I., Zhang L., Mayr C., Kingsmore S.F., Schroth G.P., Burge C.B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature. 456(7221):470-6

[2] Venables J.P. (2004). Aberrant and alternative splicing in cancer. Cancer Res. 64(21):7647-54

[3] Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Sayers E.W. (2009). GenBank. Nucleic Acids Res. 37(Database issue):D26-31

[4] Sugawara H., Ogasawara O., Okubo K., Gojobori T., Tateno Y. (2008). DDBJ with new system and face. Nucleic Acids Res. 36(Database issue):D22-4

[5] Kulikova T., Akhtar R., Aldebert P., Althorpe N., Andersson M., Baldwin A., Bates K., Bhattacharyya S., Bower L., Browne P., et al. (2007). EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Res. 35(Database issue):D16-20

[6] Grasso C., Lee C. (2004). Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics. 20(10):1546-56

[7] Florea L., Hartzell G., Zhang Z., Rubin G.M., Miller W. (1998). A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8(9):967-74

[8] Wu T.D., Watanabe C.K. (2005). GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 21(9):1859-75

[9] Van Nimwegen E., Paul N., Sheridan R., Zavolan M. (2006). SPA: a probabilistic algorithm for spliced alignment. PLoS Genet. 2(4):e24.

[10] Kent W.J. (2002). BLAT--the BLAST-like alignment tool. Genome Res. 12(4):656-64

[11] Kim N., Alekseyenko A.V., Roy M., Lee C. (2007). The ASAP II database: analysis and comparative genomics of alternative splicing in 15 animal species. Nucleic Acids Res. 35(Database issue):D93-8

[12] Koscielny G., Le Texier V., Gopalakrishnan C., Kumanduri V., Riethoven J.J., Nardone F., Stanley E., Fallsehr C., Hofmann O., Kull M., et al. (2009). ASTD: The Alternative Splicing and Transcript Diversity database. Genomics. 93(3):213-20

[13] Dralyuk I., Brudno M., Gelfand M.S., Zorn M., Dubchak I. (2000). ASDB: database of alternatively spliced genes. Nucleic Acids Res. 28(1):296-7

[14] Nagasaki H., Arita M., Nishizawa T., Suwa M., Gotoh O. (2006). Automated classification of alternative splicing and transcriptional initiation and construction of visual database of classified patterns. Bioinformatics. 22:1211-6

[15] Kim P., Kim N., Lee Y., Kim B., Shin Y., Lee S. (2005). ECgene: genome annotation for alternative splicing. Nucleic Acids Res. 33(Database issue):D75-9

[16] de la Grange P., Dutertre M., Correa M., Auboeuf D. (2007). A new advance in alternative splicing databases: from catalogue to detailed analysis of regulation of expression and function of human alternative splicing variants. BMC Bioinformatics. 8:180.

[17] Takeda J., Suzuki Y., Nakao M., Kuroda T., Sugano S., Gojobori T., Imanishi T. (2007). H-DBAS: alternative splicing database of completely sequenced and manually annotated full-length cDNAs based on H-Invitational. Nucleic Acids Res. 35(Database issue):D104-9

[18] Holste D., Huo G., Tung V., Burge C.B. (2006). HOLLYWOOD: a comparative relational database of alternative splicing. Nucleic Acids Res. 34(Database issue):D56-62

[19] Zheng C.L., Kwon Y.S., Li H.R., Zhang K., Coutinho-Mansfield G., Yang C., Nair T.M., Gribskov M., Fu X.D. (2005). MAASE: an alternative splicing database designed for supporting splicing microarray applications. RNA. 11(12):1767-76

[20] Huang Y.H., Chen Y.T., Lai J.J., Yang S.T., Yang U.C. (2002). PALS db: Putative Alternative Splicing database. Nucleic Acids Res. 30(1):186-90

[21] Huang H.D., Horng J.T., Lee C.C., Liu B.J. (2003). ProSplicer: a database of putative alternative splicing information derived from protein, mRNA and expressed sequence tag sequence data. Genome Biol. 4(4):R29

[22] Huang H.D., Horng J.T., Lin F.M., Chang Y.C., Huang C.C. (2005). SpliceInfo: an information repository for mRNA alternative splicing in human genome. Nucleic Acids Res. 33(Database issue):D80-5

[23] Krause A., Haas S.A., Coward E., Vingron M. (2002). SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein. Nucleic Acids Res. 30(1):299-300

[24] Hiller M., Nikolajewa S., Huse K., Szafranski K., Rosenstiel P., Schuster S., Backofen R., Platzer M. (2007). TassDB: a database of alternative tandem splice sites. Nucleic Acids Res. 35(Database issue):D188-92

[25] Kuhn R.M., Karolchik D., Zweig A.S., Wang T., Smith K.E., Rosenbloom K.R., Rhead B., Raney B.J., Pohl A., Pheasant M. et al. (2009). The UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 37(Database issue):D755-61

[26] de la Grange P., Dutertre M., Martin N., Auboeuf D. (2005). FAST DB: a website resource for the study of the expression regulation of human gene products. Nucleic Acids Res. 33(13):4276-84