Iterative gene prediction and pseudogene removal improves genome annotation

Supplementary data

Table 1, footnote b

To determine the number of parent genes found in the external database method, transcripts of a single gene were merged. This ensured that if two pseudogenes were identified with different transcripts that belonged to the same gene, the parent was counted only once.
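The counting rule above can be sketched as follows. This is a minimal illustration, not the code used in the study; the function name, the transcript-to-gene mapping, and the identifiers are hypothetical.

```python
def count_parent_genes(parent_transcripts, transcript_to_gene):
    """Count distinct parent genes for a list of parent transcript IDs.

    Transcripts of the same gene collapse to one gene ID, so a gene that was
    hit through two different transcripts is counted only once.
    """
    genes = {transcript_to_gene[t] for t in parent_transcripts}
    return len(genes)


# Hypothetical example: two transcripts of geneA and one of geneB.
transcript_to_gene = {"tx1a": "geneA", "tx1b": "geneA", "tx2": "geneB"}
parents = ["tx1a", "tx1b", "tx2", "tx1a"]
print(count_parent_genes(parents, transcript_to_gene))  # 2
```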

When N-SCAN predictions were used as parents, overlapping transcripts did not occur. However, pseudogenes were sometimes masked out with other pseudogenes (see Bootstrapping pseudogene detection from N-SCAN predictions). Therefore, only parents that were not themselves masked out were counted.

PPFINDER and olfactory receptors

The olfactory receptor (OR) gene family is the largest multigene family in the human genome, with over 800 members, about half of which are pseudogenes (Olender et al. 2004). The mouse genome has around 1400 OR genes, of which only a quarter are pseudogenes (Niimura and Nei 2005). Gene duplications after the mouse/human split have led to these different numbers, so there is no one-to-one relationship between human OR genes and those of mouse. This could be a problem in the conserved synteny method, and we expected to find false positives. (The intron alignment method would not identify OR parents because all OR genes are single-exon genes.) However, when we used external databases for finding parent genes, most predicted OR genes were present in these databases, so they were not masked out. In the bootstrap method, PPFINDER skips hits within 1 Mb around the gene model to avoid such hits, as described before. We found that the best protein hits to predicted OR genes were almost always neighboring genes, reflecting the tandem duplications that caused OR clustering (Niimura and Nei 2003). As a result, no good parent gene was found for most of the OR genes (including the pseudogenes), and only 6 OR genes were masked out.

Bootstrapping pseudogene detection from N-SCAN predictions

We used translations of N-SCAN’s initial gene models for the synteny method and their open reading frames for the intron location method. To avoid masking correct gene models with themselves, all BLAST hits with genes predicted within 500 kb flanking the gene model were ignored. Since processed pseudogenes integrate randomly, their parents are usually not found within 500 kb.
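The flank filter described above can be sketched as follows. This is a hedged illustration only: the `Locus` type, the function names, and the choice to treat any hit that overlaps the 500 kb window as local are assumptions, not details taken from PPFINDER.

```python
from typing import NamedTuple

FLANK = 500_000  # 500 kb on each side of the gene model, as stated in the text


class Locus(NamedTuple):
    chrom: str
    start: int
    end: int


def is_local_hit(query, hit, flank=FLANK):
    """True if the hit lies on the same chromosome and overlaps the query
    gene model extended by `flank` bases on both sides (assumed rule)."""
    if query.chrom != hit.chrom:
        return False
    return hit.start <= query.end + flank and hit.end >= query.start - flank


def filter_hits(query, hits, flank=FLANK):
    """Keep only hits outside the flanking window of the query gene model."""
    return [h for h in hits if not is_local_hit(query, h, flank)]


# Hypothetical coordinates.
query = Locus("chr1", 1_000_000, 1_010_000)
hits = [
    Locus("chr1", 1_200_000, 1_205_000),  # within 500 kb: ignored
    Locus("chr1", 2_000_000, 2_004_000),  # same chromosome, far away: kept
    Locus("chr2", 1_000_000, 1_003_000),  # other chromosome: kept
]
print(filter_hits(query, hits))
```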

Not all exons masked in the bootstrap method are real pseudogenes. Before running N-SCAN, we mask the genome for interspersed repeats but not for low complexity and simple repeats, because some coding exons contain such repeats (e.g. the ataxin genes; Costa Lima and Pimentel 2004). This does lead to an overprediction of repeat-containing exons. Such exons will find matches with other repeat-containing gene models, so many of them are masked out in the pseudogene masking procedure unless they are conserved between target and informant species. The parents of these genes are not real pseudogene parents but merely identifiers of repeats in gene models.

The bootstrap procedure does not always identify the real parent of a pseudogene: Of 3,191 parent genes found, 1,475 were themselves masked out. Different retropositions of the same parent gene are highly similar, and may be incorporated into different gene models. The synteny method often identifies other pseudogene-containing gene models as parents, because they are (slightly) more similar to the pseudogene under review than the true parent is. It would be more elegant to identify the real parent from which both pseudogenes were derived, and use that for masking, but that would have added substantial complexity and computing requirements to PPFINDER. Since true parents and pseudogenes are very similar, the effects of masking with either will be similar, too. Therefore, we chose to use the putative parent found by the system for masking without attempting to determine whether it was itself a pseudogene.

Removing those pseudogene parents left 1,716 real parents. Of these, 1,503 overlap a SwissProt/TrEMBL or RefSeq entry. 45 of the remaining 213 parents contain simple repeats, as described above. Of the remaining 168 parents, 77 overlap a spliced EST, indicating that they are real genes. Since the retroposition event that leads to the formation of a processed pseudogene starts with a processed mRNA, the identification of a predicted gene as a pseudogene parent indicates that it is, or at least was, transcribed. It is also evidence that the parent gene was predicted correctly.
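The parent counts quoted in this section follow from one another by subtraction. The short check below recomputes them from the figures given in the text; the final value (parents with no database entry, simple repeat, or spliced EST) is not stated in the text and is derived here purely by arithmetic.

```python
def parent_breakdown():
    """Recompute the chain of parent-gene counts reported in the text."""
    parents_found = 3191      # parent genes found by the bootstrap procedure
    masked_parents = 1475     # parents that were themselves masked out
    real_parents = parents_found - masked_parents
    in_database = 1503        # overlap a SwissProt/TrEMBL or RefSeq entry
    not_in_database = real_parents - in_database
    simple_repeat = 45        # contain simple repeats
    remaining = not_in_database - simple_repeat
    spliced_est = 77          # overlap a spliced EST
    unsupported = remaining - spliced_est  # derived, not stated in the text
    return {
        "real_parents": real_parents,
        "not_in_database": not_in_database,
        "remaining": remaining,
        "unsupported": unsupported,
    }


print(parent_breakdown())
```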

A potential source of false positives is when N-SCAN predicts introns in different places within highly similar genes, such as those resulting from recent segmental duplications. In the intron alignment method, such genes could be incorrectly identified as parents for each other, and both could be masked out. To investigate this, we first identified 234 sets of recent segmental duplications in the human genome at least one of which contained N-SCAN predictions. Among these, we found only 4 cases where genes in one locus were masked out by genes in a duplicate locus (2 reciprocally, and 2 unidirectionally). All those cases passed the intron filter because their introns contained mostly repeat sequence.
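The distinction between reciprocal and unidirectional masking can be made explicit with a small helper. This is a sketch under assumptions: masking events are represented as hypothetical (parent locus, masked locus) pairs, which is not necessarily how the analysis was carried out.

```python
def classify_masking(pairs):
    """Split directed (parent_locus, masked_locus) pairs into reciprocal
    cases (each locus masks the other) and unidirectional cases."""
    pairs = set(pairs)
    reciprocal = {frozenset(p) for p in pairs if (p[1], p[0]) in pairs}
    unidirectional = {p for p in pairs if (p[1], p[0]) not in pairs}
    return reciprocal, unidirectional


# Hypothetical loci: A and B mask each other, C masks D only.
events = [("A", "B"), ("B", "A"), ("C", "D")]
reciprocal, unidirectional = classify_masking(events)
print(len(reciprocal), len(unidirectional))  # 1 1
```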

Because N-SCAN is run on 1 Mb fragments, rounds of masking and gene prediction could be done separately for each fragment. In some fragments no pseudogenes are masked, so the process terminates after one round of gene prediction. In others, several rounds of masking and gene prediction are required, up to a maximum of 4 iterations. It is possible to run PPFINDER on whole chromosomes, but convergence in the bootstrap method may take many more rounds. In these cases, a maximum number of iterations can be set.
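The per-fragment loop described above can be sketched as follows. The three callables stand in for N-SCAN prediction, pseudogene detection, and masking; they and the toy implementations below are hypothetical placeholders, not the actual PPFINDER interfaces.

```python
import re


def iterate_masking(sequence, predict_genes, find_pseudogenes, mask, max_rounds=4):
    """Alternate gene prediction and pseudogene masking until nothing new is
    masked or `max_rounds` rounds of gene prediction have been run."""
    rounds = 0
    while True:
        genes = predict_genes(sequence)
        rounds += 1
        if rounds >= max_rounds:
            break
        regions = find_pseudogenes(genes, sequence)
        if not regions:
            break  # converged: no pseudogenes left to mask
        sequence = mask(sequence, regions)
    return sequence, genes, rounds


# Toy stand-ins: runs of G/P are "gene models"; models containing P are "pseudogenes".
def toy_predict(seq):
    return [(m.start(), m.end()) for m in re.finditer(r"[GP]+", seq)]


def toy_find(genes, seq):
    return [(s, e) for s, e in genes if "P" in seq[s:e]]


def toy_mask(seq, regions):
    chars = list(seq)
    for s, e in regions:
        chars[s:e] = "N" * (e - s)
    return "".join(chars)


print(iterate_masking("GGG..PPP..GG", toy_predict, toy_find, toy_mask))
```

In the toy run, the first round predicts three models, one is masked, and the second round finds nothing further to mask, so the loop stops after two rounds of prediction.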

Applying PPFINDER to unannotated genomes

We applied this bootstrap method to the dog genome (Lindblad-Toh et al. 2005) with human as informant. In total, 6,981 predicted exons were masked out (vs. 4,530 in human). As seen in human, the number of gene models decreases substantially (dog: 24,405 to 22,392, human: 24,712 to 21,511). The number of predicted exons was reduced by 5,024 (9,944 in human), and the average number of exons per gene increased from 7.4 to 7.8 (8.2 to 8.9 in human). Because of the absence of a validated gene set for the dog genome, it is impossible to verify directly that pseudogene masking improves gene prediction in the dog. However, the effect of masking on the statistical characteristics of the predicted gene set is comparable between dog and human, suggesting that the method can be used successfully on an unannotated mammalian genome.
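To put the two species on the same scale, the reduction in gene models can be expressed as a percentage of the initial count. The snippet below does this using only the gene-model counts quoted above; the percentages themselves are derived here and are not reported in the text.

```python
def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


dog = percent_decrease(24405, 22392)
human = percent_decrease(24712, 21511)
print(f"dog: {dog:.1f}%  human: {human:.1f}%")
```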

In order to verify that we could not achieve the same results using expressed sequences from dog, we ran PPFINDER with a set of dog transcripts as parent database. There are very few full-ORF cDNAs or RefSeq transcripts for dog and no mapped SwissProt/TrEMBL entries. We therefore used 1,353 dog mRNAs from GenBank as input database for the intron location method and found a much smaller effect: exons of 546 gene models were masked out, and after re-predicting on the masked genome the predicted gene number was still 24,393 as compared to 22,392 after masking with N-SCAN predictions. The average number of exons per gene was 7.3 after masking, as compared to 7.8 after masking with N-SCAN predictions. This indicates that most pseudogene parents in dog are not yet present in GenBank.

METHODS

Sequences

All human predictions were made on NCBI Build 35 (May 2004 data freeze) of the human genome sequence (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes) (Lander et al. 2001). The downloaded sequences were RepeatMasked (A. Smit, unpublished; http://www.repeatmasker.org) for all interspersed repeats based on the RepeatMasker tables from UCSC. The sequences were divided into nonoverlapping 1 Mb segments for both the BLAST and N-SCAN portions of the analysis. Mouse Build 33 genomic sequences were used as the informant database for the gene predictions (downloaded from ftp://hgdownload.cse.ucsc.edu/goldenPath/mm5/chromosomes).
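Dividing a chromosome into nonoverlapping 1 Mb segments can be sketched as below. The function name and the 0-based, half-open coordinate convention are assumptions made for illustration; the text does not specify how segment boundaries were represented.

```python
def split_into_segments(length, size=1_000_000):
    """Return nonoverlapping (start, end) windows covering [0, length).

    Coordinates are 0-based and half-open; the last segment may be shorter.
    """
    return [(start, min(start + size, length)) for start in range(0, length, size)]


# Hypothetical 2.5 Mb sequence.
print(split_into_segments(2_500_000))
```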

Dog WGS assembly v1.0 was retrieved from ftp://hgdownload.cse.ucsc.edu/goldenPath/canFam1/chromosomes.

Cleaning the RefSeq set

RefSeqs were removed from the set for the following reasons: coding region length was not evenly divisible by three, transcript did not translate on NCBI Build 35, initial codon was not ATG, and/or stop codon was not TAA, TGA, or TAG. Sequences were also removed if they mapped to chrN_random or chrM, or if they were identical to other RefSeqs. The remaining 17,820 sequences were formatted for BLASTn.
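Most of these criteria are simple sequence checks, sketched below. This is an illustration under assumptions: the record layout and accession names are hypothetical, "identical" is interpreted as identical coding sequence, and the check that a transcript translates on the genome assembly is not modeled.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}


def passes_cds_checks(cds):
    """Length divisible by three, ATG start codon, and a standard stop codon."""
    cds = cds.upper()
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def clean_refseqs(records):
    """Filter (name, chrom, cds) records; return the names that are kept."""
    seen = set()
    kept = []
    for name, chrom, cds in records:
        if chrom == "chrM" or chrom.endswith("_random"):
            continue  # mitochondrial or unplaced sequence
        if not passes_cds_checks(cds):
            continue
        key = cds.upper()
        if key in seen:
            continue  # identical to a sequence already kept (assumed meaning)
        seen.add(key)
        kept.append(name)
    return kept


# Hypothetical records.
records = [
    ("NM_ok", "chr1", "ATGAAATAA"),
    ("NM_frame", "chr1", "ATGAATAA"),
    ("NM_start", "chr2", "CTGAAATAA"),
    ("NM_stop", "chr2", "ATGAAAGGG"),
    ("NM_mito", "chrM", "ATGAAATGA"),
    ("NM_rand", "chr3_random", "ATGCCCTAG"),
    ("NM_dup", "chr5", "ATGAAATAA"),
    ("NM_ok2", "chr7", "ATGCCCTGA"),
]
print(clean_refseqs(records))  # ['NM_ok', 'NM_ok2']
```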

BLAST

We used WU-BLAST version 2.0 (http://blast.wustl.edu, W. Gish, unpublished) for all BLAST searches in this paper, using default parameters unless otherwise indicated.

For the intron location method, BLASTn was performed with E=0.0001 M=1 N=-1 Q=2 R=2.

For the conserved synteny method, BLASTp parameters were E=0.01 Q=50. The Q parameter is set high to avoid gappy extensions of hits, which will unnecessarily lower the overall score. For tBLASTn of gene prediction to the syntenic region in the mouse, E was set to 0.01, to increase the chance of catching low scoring exon hits.
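For reference, the parameter sets above can be collected and turned into command lines as in the sketch below. It assumes the usual WU-BLAST invocation layout of program, database, query file, then NAME=value options; the database and query file names are hypothetical, and no other options used in the study are implied.

```python
# Parameter values as quoted in the text.
INTRON_LOCATION_BLASTN = {"E": "0.0001", "M": "1", "N": "-1", "Q": "2", "R": "2"}
CONSERVED_SYNTENY_BLASTP = {"E": "0.01", "Q": "50"}


def wu_blast_command(program, database, query, params):
    """Build an argument list: program, database, query, then NAME=value options."""
    return [program, database, query] + [f"{k}={v}" for k, v in params.items()]


# Hypothetical database and query names.
print(" ".join(wu_blast_command("blastn", "refseq_db", "segment.fa", INTRON_LOCATION_BLASTN)))
print(" ".join(wu_blast_command("blastp", "protein_db", "predictions.fa", CONSERVED_SYNTENY_BLASTP)))
```

The resulting list can be passed to `subprocess.run` if the corresponding WU-BLAST executables are installed.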

Alignment programs

In the intron location method, the EST_GENOME program (Mott 1997) was used. This program tries to create an alignment containing splice consensus sequences near the beginnings and ends of gaps. Because the program uses an optimal dynamic programming algorithm, it is computationally intensive. For filtering the pseudogenic regions, we used the Sim4 program (Florea et al. 1998). Sim4 is faster, but we found it slightly less sensitive in determining the correct splice sites (Manimozhyian et al, submitted).

N-SCAN

For human gene prediction, an 8-way, human-referenced MULTIZ multiple-genome alignment was downloaded from UCSC. A pair alignment with human (build 35) as target and mouse (build 33) as informant was extracted from this multiple-genome alignment and used as input for N-SCAN’s parameter estimation and gene prediction. For dog gene prediction, a 3-way, dog-referenced MULTIZ multiple-genome alignment was downloaded from UCSC. A pair alignment with dog (WGS assembly v1.0) as target and human (build 35) as informant was extracted from this multiple-genome alignment and used as input to N-SCAN’s gene prediction. Because too few annotated dog genes are available for training, parameters were estimated with human as target and dog as informant, using the previously referenced 8-way MULTIZ multiple-genome alignment, and these parameters were used for the dog-target, human-informant gene prediction runs.

FIGURES

Supplemental Figure 1. Exons per gene

Comparison of the distribution of coding exons per transcript in the N-SCAN predictions (red) and human CCDS annotations (blue). The last data point includes all transcripts containing >20 coding exons. Masking pseudogenes (pink for external databases and green for bootstrap method) and re-prediction results in a distribution that more closely resembles the RefSeq distribution. This figure was modified from (Flicek et al. 2003).

References

Costa Lima, M.A. and M.M. Pimentel. 2004. Dynamic mutation and human disorders: the spinocerebellar ataxias (review). Int J Mol Med 13: 299-302.

Flicek, P., E. Keibler, P. Hu, I. Korf, and M.R. Brent. 2003. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res 13: 46-54.

Florea, L., G. Hartzell, Z. Zhang, G.M. Rubin, and W. Miller. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 8: 967-974.

Lander, E.S., L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J.P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J.C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R.H. Waterston, R.K. Wilson, L.W. Hillier, J.D. McPherson, M.A. Marra, E.R. Mardis, L.A. Fulton, A.T. Chinwalla, K.H. Pepin, W.R. Gish, S.L. Chissoe, M.C. Wendl, K.D. Delehaunty, T.L. Miner, A. Delehaunty, J.B. Kramer, L.L. Cook, R.S. Fulton, D.L. Johnson, P.J. Minx, S.W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J.F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R.A. Gibbs, D.M. Muzny, S.E. Scherer, J.B. Bouck, E.J. Sodergren, K.C. Worley, C.M. Rives, J.H. Gorrell, M.L. Metzker, S.L. Naylor, R.S. Kucherlapati, D.L. Nelson, G.M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier, C. Robert, P. Wincker, D.R. Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, H.M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, R.W. Davis, N.A. Federspiel, A.P. Abola, M.J. Proctor, R.M. Myers, J. Schmutz, M. Dickson, J. Grimwood, D.R. Cox, M.V. Olson, R. Kaul, N. Shimizu, K. Kawasaki, S. Minoshima, G.A. Evans, M. Athanasiou, R. Schultz, B.A. Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W.R. McCombie, M. de la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J.A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D.G. Brown, C.B. Burge, L. Cerutti, H.C. Chen, D. Church, M. Clamp, R.R. Copley, T. Doerks, S.R. Eddy, E.E. Eichler, T.S. Furey, J. Galagan, J.G. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L.S. Johnson, T.A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W.J. Kent, P. Kitts, E.V. Koonin, I. Korf, D. Kulp, D. Lancet, T.M. Lowe, A. McLysaght, T. Mikkelsen, J.V. Moran, N. Mulder, V.J. Pollara, C.P. Ponting, G. Schuler, J. Schultz, G. Slater, A.F. Smit, E. Stupka, J. Szustakowski, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R. Wheeler, A. Williams, Y.I. Wolf, K.H. Wolfe, S.P. Yang, R.F. Yeh, F. Collins, M.S. Guyer, J. Peterson, A. Felsenfeld, K.A. Wetterstrand, A. Patrinos, M.J. Morgan, P. de Jong, J.J. Catanese, K. Osoegawa, H. Shizuya, S. Choi, and Y.J. Chen. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.