SACBI-140

In Silico Analysis of Pseudogenes in Arabidopsis thaliana Chromosomes I, II and III: Implications for Microarray Data Analysis

Abstract:

We surveyed three of the five chromosomes of the model plant Arabidopsis thaliana for homologues of the disabled pseudogenes recently reported in yeast. We introduce our application, Bison-Blast, describe its capabilities as a sequence analysis tool, present our findings and discuss how pseudogenes affect the analysis of microarray data. We found 381 “potential pseudogenes” in chromosome I, 463 in chromosome II and 588 in chromosome III. Our results suggest that the abundance of pseudogenes in chromosomes II and III is proportional to their size. However, the low number of “potential pseudogenes” in chromosome I of A. thaliana suggests that some chromosomes are more prone to accumulating pseudogenic sequences than others. Using the gene ontology annotation system, which contains 4696 molecular function entries, and after removing redundant sequences, we found only 19, 29 and 39 entries corresponding to chromosomes I, II and III, respectively. For most of these entries, the gene functions were putative and known, followed by unknown and hypothetical. Although we report only a preliminary genome-wide scan of A. thaliana for homologues of yeast pseudogenes, our report has relevant implications for microarray data, since many pseudogenes may express a large number of non-functional proteins and add a considerable source of noise to microarray experiments.

1. Introduction:

Over the last century, genetic studies on a small number of organisms have played an important role in the understanding of numerous biological processes. In recent years, however, genetic research has shifted from how visible traits are transmitted to the study of genome structure at the molecular level. Advances in robotics, miniaturization and parallelization of molecular biology tools have led to the development of more sensitive, ultra-high-throughput analytical devices that allow biological systems to be explored on a global scale. Genome sequencing projects have proven to be a powerful and efficient approach for accessing the complete gene structure of different organisms. Using this technology, an international consortium released the genomic sequence of all 16 chromosomes constituting the nuclear genome of yeast (Saccharomyces cerevisiae) lab strain S288C (Goffeau et al. 1996). This information initiated a quest for more complex comparative sequence tools that use the sequence, motifs and structure of known proteins and translated expressed sequence tags (ESTs), where the user queries a database and retrieves related sequences with user-specified scores. Using data from functional or comparative genomic studies over the last five years, previously non-annotated genes have been discovered in yeast (Velculescu et al. 1997; Cliften, 2001). As new sequencing techniques develop and more efficient computational tools become available, new insights into the genomic structure of yeast have been published. Recently, Kumar et al. (2002) and Harrison et al. (2002) reported a total of 137 new non-annotated genes that represent 2% of the yeast genome. Of this gene set, 104 genes were <100 codons in length. The same research group reported the existence of genomic DNA sequences that resemble normal genes but have been released from selective pressure (Kumar et al. 2002; Harrison and Gerstein, 2002; Zhang et al. 2002).
These disabled sequences (known as pseudogenes) have lost gene function at the transcription or translation level (or both), since the sequence no longer results in the production of a functional protein. A gene can be disabled in many ways, e.g. creation of premature stop codons, disruptive frameshift mutations, disablement of regulatory regions, and alterations in splice sites (Harrison and Gerstein, 2002). From homology matches it has been reported that there may be up to a further 183 un-annotated disabled pseudogenes in the S. cerevisiae strain S288C (Harrison et al. 2002; Harrison and Gerstein, 2002). These pseudogenes are characterized by the lack of introns, the presence of small flanking direct repeats and a polyadenine tail near the 3’ end (Harrison and Gerstein, 2002). Even more recent analyses suggest that the human genome could contain up to ~20,000 pseudogenes, with more than half transcribed (Zhang et al. 2002).

The genome of the flowering plant Arabidopsis thaliana has five chromosomes with low repetitive DNA content, representing a total of 120 Mbp. Chromosome I contains about 6,850 open reading frames (ORFs) covering about 300 protein families, 236 transfer RNAs (tRNAs) and 12 small nuclear RNAs (Theologis et al. 2000). Chromosome III encodes approximately 5,220 predicted genes. Sequence comparison tools show that over 60% of the predicted ORFs in Arabidopsis match a paralogue elsewhere in the genome, and these duplicated genes are organized into large syntenic blocks that might be several hundred ORFs long (Stein, 2001). Indeed, one of the big surprises that emerged during the sequencing of the mustard weed genome was evidence for several distinct large-scale duplication events in the organism's past. However, one of the main limitations of comparative analysis tools is that most of them are web based. Working through a web browser is limiting for two main reasons: 1) queries are restricted to the scope of the browser, and querying this information manually is time consuming and tedious; 2) these applications limit the efficiency of the user when questions of biological significance require querying data sets held at different locations.

Given the growing recognition of both the importance of genetic variation and the usefulness of model organisms, it is important to attempt to derive from them principles about gene product interactions that appear to be similar. Following a preliminary comparative genomic analysis, this paper analyzes the implications of disabled pseudogenes of yeast and their homologues in Arabidopsis thaliana. We introduce our application Bison-Blast, describe its capabilities as a genomic data analysis tool and discuss how our results represent a new paradigm for the analysis of microarray data, as these transcribed pseudogenes are considered an additional source of noise.

2. Methodology:

2.1. Sequence analysis:

The sequences of chromosomes I, II and III were downloaded from the Arabidopsis Sequence Initiative (ftp://tairpub:/). In addition, the sequences of 183 disabled yeast pseudogenes were retrieved from the GENESENSUS database (http://bioinfo.mbb.yale.edu/genome/) and used as input for our BISON-BLAST application. BISON-BLAST is a tool implemented in Java. It integrates, analyzes and visualizes DNA and amino acid sequences in GenBank, Swiss-Prot, EMBL or ASCII formats. BISON-BLAST can be used in both Linux and Windows environments and is designed for the analysis of medium to large numbers of sequences. Currently our tool uses the NCBI BLAST algorithm versions 2.0 and 2.2.2. In addition, BISON-BLAST runs our parallelized version of BLAST 2.0 (D-blast). Using a friendly graphical interface, the user can perform comparative sequence analyses including blastn, blastp, blastx, tblastn and blastpgp; filter and parse the results; and present them as a table that can be saved as an ASCII file and used in an SQL pipeline. Details of the BISON-BLAST GUI are presented in Figure 1a and Figure 1b.

2.2. Filtering and Molecular Function Assignment of Potential Pseudogenes:

The recent A. thaliana ontology release file, containing 4696 molecular function entries, was retrieved from the Gene Ontology Consortium (http://www.geneontology.org/cgi-bin/GO/downloadGOGA.pl/gene_association.tigr_ath). We used these entries to determine whether any type of “pseudogenic molecular function” is preferentially represented among the A. thaliana sequences that matched multiple yeast pseudogenes. Entries that duplicated both the ATH_locus and the molecular function were eliminated.
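The redundancy filter can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: it assumes the tab-delimited gene association layout in which comment lines start with "!", column 2 holds the object identifier (taken here to be the ATH_locus), column 5 the GO identifier and column 9 the aspect ("F" for molecular function). The helper names and the example loci are hypothetical.

```python
def unique_function_entries(lines):
    """Return unique (locus, GO id) pairs for molecular function entries.

    Entries that repeat both the locus and the GO term are dropped,
    keeping the first occurrence and preserving input order.
    """
    seen = set()
    unique = []
    for line in lines:
        # Skip blank lines and '!' comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep only molecular function annotations (aspect 'F').
        if len(fields) < 9 or fields[8] != "F":
            continue
        key = (fields[1], fields[4])
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def _row(locus, go_id, aspect):
    """Build a minimal 9-column example line; unused columns stay empty."""
    return "\t".join(["DB", locus, "", "", go_id, "", "", "", aspect])


# Hypothetical annotation lines, including one exact repeat.
EXAMPLE = [
    "!example comment line",
    _row("At1g00001", "GO:0003824", "F"),
    _row("At1g00001", "GO:0003824", "F"),
    _row("At1g00001", "GO:0005524", "F"),
    _row("At1g00002", "GO:0005634", "C"),
]

if __name__ == "__main__":
    for locus, go_id in unique_function_entries(EXAMPLE):
        print(locus, go_id)
```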



3. Results and Discussion:

Comparison among genomes can be used for two purposes: inferring the phylogenetic relationships of species, and estimating the number and type of genomic rearrangements that have occurred since the genomes last shared a common ancestor. Based on sequence analysis of disabled yeast pseudogenes with homologues in the A. thaliana genome, we argue that around 9% of the sequences annotated as genes are actually “potential pseudogenes”. We use the term “potential pseudogenes” because pseudogene status can only be confirmed by laboratory experiments. The number of matches of yeast pseudogenes in chromosomes I, II and III of Arabidopsis thaliana was proportional to their size (Figure 2a). We found a total of 1432 “potential disabled pseudogenes” in chromosomes I, II and III. This number includes duplicate and triplicate entries resulting from the same A. thaliana ORF matching different yeast pseudogenes. Of these, 381 are in chromosome I (5074 ORFs), 463 in chromosome II (4030 ORFs) and 588 in chromosome III (5987 ORFs) (Figure 2b). However, the low number of “potential pseudogenes” in chromosome I suggests that in Arabidopsis some chromosomes are more prone to accumulating pseudogenic sequences than others. Our assumption takes into consideration that chromosome I has a 50% gene density, while chromosome III has a 43% gene density (Theologis et al. 2000; The Arabidopsis Genome Initiative 2000). An interesting aspect of our analysis is that most A. thaliana “potential pseudogenes” are derived from in silico gene prediction and have not been experimentally verified.
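The proportions behind these figures can be checked with a short calculation. The sketch below uses only the counts reported in the preceding paragraph; the function name is hypothetical. Because the counts include duplicate and triplicate entries, the resulting percentages should be read as upper bounds on the fraction of distinct ORFs.

```python
# Counts reported in the text: (potential pseudogenes, ORFs) per chromosome.
REPORTED = {"I": (381, 5074), "II": (463, 4030), "III": (588, 5987)}


def match_percentages(counts):
    """Return the percentage of ORFs flagged per chromosome and overall."""
    result = {name: 100.0 * hits / orfs
              for name, (hits, orfs) in counts.items()}
    total_hits = sum(hits for hits, _ in counts.values())
    total_orfs = sum(orfs for _, orfs in counts.values())
    result["all"] = 100.0 * total_hits / total_orfs
    return result


if __name__ == "__main__":
    for name, pct in match_percentages(REPORTED).items():
        print(f"chromosome {name}: {pct:.2f}%")
```

With the reported counts this gives roughly 7.5% for chromosome I, 11.5% for chromosome II and 9.8% for chromosome III, and about 9.5% over the three chromosomes combined (1432 of 15091 ORFs), which agrees with the overall figure of around 9% and with the comparatively low value for chromosome I.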

Figure 2a. Number of yeast pseudogenes matching A. thaliana chromosomes I, II and III.

Figure 2b. Number of A. thaliana ORFs matching yeast pseudogenes.

Figure 3a. Distribution of gene function of the “potential pseudogenes” in the A. thaliana chromosome I.

Figure 3b. Distribution of gene function of the “potential pseudogenes” in the A. thaliana chromosome II.

Figure 3c. Distribution of gene function of the “potential pseudogenes” in the A. thaliana chromosome III.

Using the gene ontology annotation system, we found only 19, 29 and 39 entries corresponding to chromosomes I, II and III, respectively (Tables 1, 2 and 3). The redundancy of molecular functions across the chromosomes of A. thaliana can be attributed to a large number of segmental chromosomal duplications arising from four distinct large-scale duplication events. The assigned functions were mostly putative and known, followed by unknown and hypothetical (Table 4). Our analysis also agrees with previous reports about pseudogenes. For example, the P450s form one of the largest families of proteins in higher plants. It has previously been established that the A. thaliana genome contains 272 sequences with different P450 signature motifs, of which 26 appear to be pseudogenes that lack a complete open reading frame or contain frameshifts or in-frame stop codons (Werck-Reichhart et al. 2002). Our analysis of chromosomes I, II and III of this model plant identified 16 “potential pseudogenes” within this gene family.

4. Implications of Pseudogenes for the Analysis of Microarray Data:

Biological entities are the result of the complex interplay of the genetic make-up with the environment (Kiberstis and Roberts, 2002). Since the information needed to make a protein is contained in mRNA, one of the objectives of post-genomic techniques is to identify gene function by quantifying mRNA abundance. Several techniques can be used, including northern blots, RT-PCR, differential display PCR (DD-PCR), serial analysis of gene expression (SAGE), massively parallel signature sequencing (MPSS), cDNA amplified fragment length polymorphism (cDNA-AFLP), rapid analysis of gene expression (RAGE), macroarrays and microarrays. These techniques are having a considerable impact in several areas, from basic research to clinical diagnostics. Among them, DNA microarrays are becoming one of the most widely used approaches. Owing to their small size, high density and compatibility with fluorescence labeling, microarrays are well suited to comparative analysis of changes in gene expression level. In just a few years since their conception, DNA microarrays have produced a paradigm shift that is transforming the understanding of gene expression changes. However, as more laboratories use microarrays to study gene expression changes, the size and complexity of public microarray data are growing exponentially.

Various computational and statistical methods have been proposed and developed by both public and private initiatives for microarray data analysis. These tools range from simple criteria that define gene expression changes by a fold-change cut-off to complex analyses using machine learning techniques. Nevertheless, none has yet gained widespread acceptance. While clustering is an unsupervised method widely used for microarray data analysis, choosing a clustering algorithm can be a daunting task. There is no accurate way to find the true cluster structure and therefore to objectively evaluate the “best” clustering method. This is due to the large number of possible combinations of distance metrics and clustering algorithms. Supervised methods represent an alternative to unsupervised microarray data analysis because they take a different approach, one that uses previous knowledge about which genes are related to one another. By having explicit knowledge of the classes to which the different objects belong, these algorithms can perform effective feature selection. However, the more variables one models, the more difficult the modeling task becomes. This is because the space that must be searched to find the models increases exponentially with the number of model parameters and with the number of variables they contain. For some datasets, kernel-based supervised methods may not achieve a proper separation because the kernel function is improperly defined or there are problems in the training set. Also, it is often difficult to choose the kernel function, parameters and penalties.
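As an illustration of the simplest criterion mentioned above, a fold-change cut-off can be sketched as follows. The two-fold threshold, the pseudocount, the function names and the example intensities are assumptions chosen for illustration; they are not values used in this study.

```python
import math


def log2_fold_change(treated, control, pseudocount=1.0):
    """Log2 ratio of treated to control intensity.

    The pseudocount (an assumption) avoids division by zero for
    spots with no detectable signal.
    """
    return math.log2((treated + pseudocount) / (control + pseudocount))


def changed_genes(expression, cutoff=2.0):
    """Return genes whose expression changes at least `cutoff`-fold.

    `expression` maps gene name -> (control intensity, treated intensity).
    Both up- and down-regulated genes are reported.
    """
    threshold = math.log2(cutoff)
    result = {}
    for gene, (control, treated) in expression.items():
        lfc = log2_fold_change(treated, control)
        if abs(lfc) >= threshold:
            result[gene] = lfc
    return result


# Hypothetical intensities: (control, treated).
EXAMPLE = {
    "geneA": (100.0, 400.0),   # strongly up
    "geneB": (100.0, 120.0),   # essentially unchanged
    "geneC": (300.0, 50.0),    # strongly down
}

if __name__ == "__main__":
    for gene, lfc in changed_genes(EXAMPLE).items():
        print(f"{gene}: log2 fold change = {lfc:+.2f}")
```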

The difficulty of mining microarray data is exacerbated by the fact that gene interactions are not completely understood and that post-translational and folding changes occurring after mRNA synthesis may alter protein-protein interactions. Multiple proteins can arise from a single gene when the mRNA is subjected to alternative splicing or the protein to post-translational modification. The most relevant aspect of the information presented in this paper, which has not been considered in previous reports on pseudogenes, is its implications for microarray data. If a well-characterized protein is known to be involved in the initiation of a biological process, then it is likely that a protein predicted from the genomic sequence that is similar to the known protein will have the same function. Our preliminary evaluations suggest that roughly 10% of the sequences defined as genes may code for non-functional proteins. Although these transcripts can be detected using microarrays, they do not code for functional proteins, and we argue that they should not be considered in the microarray data analysis process. Using a yeast two-hybrid experiment, we eliminated disabled ORFs and noticed that the results of our predictions were significantly improved (data not shown). Our results suggest that computational approaches for microarray data need to take relevant biological information into consideration.
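The exclusion step argued for above could be applied before any downstream analysis, as in the following sketch. It assumes that the expression data are keyed by locus identifier and that a list of flagged “potential pseudogene” loci is available; the function name and the identifiers are hypothetical.

```python
def drop_potential_pseudogenes(expression, flagged_loci):
    """Split an expression table into retained and removed entries.

    `expression` maps locus identifier -> list of intensity values.
    `flagged_loci` is an iterable of loci flagged as potential pseudogenes.
    Matching is case-insensitive. Returns (kept table, sorted removed loci).
    """
    flagged = {locus.upper() for locus in flagged_loci}
    kept = {locus: values for locus, values in expression.items()
            if locus.upper() not in flagged}
    removed = sorted(locus for locus in expression
                     if locus.upper() in flagged)
    return kept, removed


# Hypothetical expression table and flag list.
EXPRESSION = {
    "At1g00001": [120.0, 135.0],
    "At1g00002": [80.0, 310.0],
    "At2g00003": [45.0, 40.0],
}
FLAGGED = ["AT1G00002"]

if __name__ == "__main__":
    kept, removed = drop_potential_pseudogenes(EXPRESSION, FLAGGED)
    print(f"kept {len(kept)} loci, removed {removed}")
```

Returning the removed loci alongside the filtered table keeps the exclusion auditable, which matters because the flagged sequences are only potential pseudogenes until they are confirmed experimentally.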