Iterative gene prediction and pseudogene removal improves genome annotation

Marijke J van Baren¹, Michael R Brent

PMID: 16651666
PMCID: PMC1457044
DOI: 10.1101/gr.4766206

Marijke J van Baren et al. Genome Res. 2006 May;16(5):678-85. doi: 10.1101/gr.4766206.

Affiliation

¹ Laboratory for Computational Genomics, Department of Computer Science, Washington University, Saint Louis, Missouri 63130, USA.

Abstract

Correct gene prediction is impaired by the presence of processed pseudogenes: nonfunctional, intronless copies of real genes found elsewhere in the genome. Gene prediction programs frequently mistake processed pseudogenes for real genes or exons, leading to biologically irrelevant gene predictions. While methods exist to identify processed pseudogenes in genomes, no attempt has been made to integrate pseudogene removal with gene prediction, or even to provide a freestanding tool that identifies such erroneous gene predictions. We have created PPFINDER (for Processed Pseudogene finder), a program that integrates several methods of processed pseudogene finding in mammalian gene annotations. We used PPFINDER to remove pseudogenes from N-SCAN gene predictions, and show that gene prediction improves substantially when gene prediction and pseudogene masking are interleaved. In addition, we used PPFINDER with gene predictions as a parent database, eliminating the need for libraries of known genes. This allows us to run the gene prediction/PPFINDER procedure on newly sequenced genomes for which few genes are known.

Figures

**Figure 1.**
The intron location method of pseudogene finding. (A) Flow diagram of the method. See text for details. (B) All predicted gene models are used for BLASTn against a database of known genes. When a pseudogene is incorporated in the gene model, it will hit its parent gene in the BLAST search (*left* side of diagram). Alignment with the genomic location of the parent gene will usually show intron gaps. Gene model segments that are not derived from pseudogenes may hit a family member elsewhere in the genome (*right* side of diagram). In this case, alignment of the prediction to the genomic region of the parent will typically include gaps where introns are predicted in the gene model.
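The decision described in this caption can be illustrated with a small sketch. It is not PPFINDER's implementation: the function names, the transcript-coordinate representation of gaps, and the `tolerance` parameter are all assumptions made for illustration. The idea is that gaps in the alignment to the parent's genomic locus (the parent's introns) either coincide with introns predicted in the gene model, suggesting a family member, or fall inside a predicted exon, suggesting intronless pseudogene sequence.

```python
from itertools import accumulate


def predicted_intron_positions(exon_lengths):
    """Transcript coordinates at which the gene model places an intron."""
    # Cumulative exon lengths mark exon-exon junctions; the last value is
    # the transcript end, not a junction, so it is dropped.
    return list(accumulate(exon_lengths))[:-1]


def classify_hit(exon_lengths, parent_gap_positions, tolerance=3):
    """Classify one gene model from its alignment to the parent locus.

    exon_lengths: lengths of the predicted exons, in transcript order.
    parent_gap_positions: transcript coordinates where the alignment to the
        parent's genomic region opens an intron-sized gap (hypothetical input
        format, assumed to be produced by an upstream alignment step).
    tolerance: allowed offset in bases between a gap and a predicted intron
        (an assumed value, not taken from the paper).
    """
    if not parent_gap_positions:
        return "no_evidence"
    introns = predicted_intron_positions(exon_lengths)
    # A gap is "unexplained" if no predicted intron lies close to it, i.e.
    # the parent has an intron where the prediction has continuous exon.
    unexplained = [
        gap for gap in parent_gap_positions
        if not any(abs(gap - intron) <= tolerance for intron in introns)
    ]
    return "pseudogene_candidate" if unexplained else "family_member"
```

For example, a single 900-base predicted exon whose alignment to the parent locus opens gaps at transcript positions 300 and 600 would be flagged as a pseudogene candidate, whereas a three-exon model with junctions at the same positions would be treated as a family member.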

**Figure 2.**
Flow diagram for the conserved synteny method. See text for details.

**Figure 3.**
Use of conserved synteny in pseudogene finding. (A) A pseudogene (pink) on human chromosome 7 was inserted between two genes (blue and orange). (B) This part of human chromosome 7 is orthologous to a region on mouse chromosome 5. If the processed pseudogene in human was generated after the mouse–human split, it will not be present in the orthologous region in the mouse. Instead, the best match in the mouse genome is the location that is orthologous to the parent of the pseudogene.
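The synteny test in this caption reduces to a comparison of two genomic intervals. The sketch below is a minimal illustration under stated assumptions: the `Region` type, the half-open coordinate convention, the label strings, and the coordinates in the usage note are hypothetical, and the inputs (the best mouse match and the mouse region orthologous to the prediction's human locus) are assumed to come from upstream alignment steps not shown here.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """A genomic interval with half-open coordinates [start, end)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def synteny_call(best_mouse_hit: Optional[Region],
                 syntenic_region: Optional[Region]) -> str:
    """Compare the best mouse match of a human prediction with the mouse
    region orthologous to the prediction's own human locus."""
    if best_mouse_hit is None or syntenic_region is None:
        return "undetermined"
    if overlaps(best_mouse_hit, syntenic_region):
        # The best match lies where conserved synteny predicts it.
        return "consistent_with_ortholog"
    # The best match lies elsewhere, as expected if it is orthologous to
    # the parent gene rather than to the human locus itself.
    return "pseudogene_candidate"
```

With made-up coordinates, a prediction whose syntenic region is `Region("chr5", 1000, 5000)` but whose best mouse match is `Region("chr11", 200, 900)` would be returned as a pseudogene candidate.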

**Figure 4.**
Flow diagram for the bootstrap method that combines pseudogene finding with gene prediction. To iteratively mask pseudogenes and rerun gene prediction, PPFINDER is run with a masking step after each of the methods (conserved synteny and intron alignment). This nested looping is done to remove redundancy, because many pseudogenes will be found by both methods. First, the cycle of pseudogene finding and masking is run using the conserved synteny method, until no more pseudogenes are found. Then the same is done using the intron alignment method. PPFINDER will keep looping through both methods until neither finds any more pseudogenes. One masking/gene prediction loop is called one round.
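The nested looping described in this caption can be sketched as a generic driver. This is an illustration of the control flow only, assuming hypothetical callables `predict`, `mask`, and a list of `finders`; it does not reproduce N-SCAN or PPFINDER themselves.

```python
def bootstrap(predict, finders, mask, genome):
    """Interleave gene prediction with pseudogene masking.

    predict(genome) -> predictions
    finders: pseudogene-finding callables, tried in order (for example the
        conserved synteny method, then the intron alignment method);
        each takes predictions and returns the pseudogenes it found.
    mask(genome, pseudogenes) -> genome with those pseudogenes masked
    Returns the final predictions and the number of rounds, where one
    masking/gene prediction loop counts as one round.
    """
    predictions = predict(genome)
    rounds = 0
    found_any = True
    # Outer loop: keep cycling through all methods until none finds anything.
    while found_any:
        found_any = False
        for find in finders:
            # Inner loop: repeat one method until it finds nothing more.
            while True:
                pseudogenes = find(predictions)
                if not pseudogenes:
                    break
                genome = mask(genome, pseudogenes)
                predictions = predict(genome)
                rounds += 1
                found_any = True
    return predictions, rounds


# Toy usage: the "genome" is a set of feature names and prediction simply
# reports every unmasked feature. All names here are made up.
toy_genome = frozenset({"geneA", "geneB", "pp_syn1", "pp_int1"})
toy_predictions, toy_rounds = bootstrap(
    predict=lambda g: sorted(g),
    finders=[
        lambda preds: [p for p in preds if p.startswith("pp_syn")],
        lambda preds: [p for p in preds if p.startswith("pp_int")],
    ],
    mask=lambda g, found: g - set(found),
    genome=toy_genome,
)
```

Running each method to exhaustion before moving to the next mirrors the caption's rationale: a pseudogene masked by the first method is no longer present for the second, which avoids reporting it twice.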

**Figure 5.**
Improvement of gene prediction after pseudogene masking. (A) After masking out a pseudogene incorporated in the original gene model, the gene is predicted correctly. (B) A single gene model is split into two correct models after a pseudogene exon is masked. (C) Two gene models are merged into one correct model after masking the pseudogene in an intron of SLC16A1. UTRs are shown as thin blocks, coding exons as thicker blocks, and introns as lines. A gene is considered correctly predicted when the coding sequence is correct. This figure was modified from a screen shot of the UCSC Genome Browser at http://genome.ucsc.edu (Kent et al. 2002).

References

1. Alexandersson M., Cawley S., Pachter L. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 2003;13:496–502.
1. Ashurst J.L., Chen C.K., Gilbert J.G., Jekosch K., Keenan S., Meidl P., Searle S.M., Stalker J., Storey R., Trevanion S., et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 2005;33:D459–D465.
1. Blanco E., Parra G., Guigo R. 2003. Using geneid to identify genes. In Current protocols in bioinformatics (ed. D.B. Davison), pp. Unit 4.3. John Wiley & Sons Inc., New York.
1. Burge C., Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997;268:78–94.
1. Buzdin A.A. Retroelements and formation of chimeric retrogenes. Cell. Mol. Life Sci. 2004;61:2046–2059.