Genome Res. 2006 May;16(5):678-85.
doi: 10.1101/gr.4766206.

Iterative gene prediction and pseudogene removal improves genome annotation

Marijke J van Baren et al. Genome Res. 2006 May.

Abstract

Correct gene prediction is impaired by the presence of processed pseudogenes: nonfunctional, intronless copies of real genes found elsewhere in the genome. Gene prediction programs frequently mistake processed pseudogenes for real genes or exons, leading to biologically irrelevant gene predictions. While methods exist to identify processed pseudogenes in genomes, no attempt has been made to integrate pseudogene removal with gene prediction, or even to provide a freestanding tool that identifies such erroneous gene predictions. We have created PPFINDER (for Processed Pseudogene finder), a program that integrates several methods of processed pseudogene finding in mammalian gene annotations. We used PPFINDER to remove pseudogenes from N-SCAN gene predictions, and show that gene prediction improves substantially when gene prediction and pseudogene masking are interleaved. In addition, we used PPFINDER with gene predictions as a parent database, eliminating the need for libraries of known genes. This allows us to run the gene prediction/PPFINDER procedure on newly sequenced genomes for which few genes are known.

Figures

Figure 1.
The intron location method of pseudogene finding. (A) Flow diagram of the method. See text for details. (B) All predicted gene models are used for BLASTn against a database of known genes. When a pseudogene is incorporated in the gene model, it will hit its parent gene in the BLAST search (left side of diagram). Alignment with the genomic location of the parent gene will usually show intron gaps. Gene model segments that are not derived from pseudogenes may hit a family member elsewhere in the genome (right side of diagram). In this case, alignment of the prediction to the genomic region of the parent will typically include gaps where introns are predicted in the gene model.
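
The decision described in panel B can be illustrated with a minimal Python sketch. It is not PPFINDER's implementation: the function names, the representation of gaps and junctions as positions along the predicted transcript, and the `tolerance` value are all assumptions made for illustration. The idea is that an alignment gap with no predicted intron nearby suggests an intronless (processed pseudogene) segment, while gaps that line up with predicted introns suggest a genuine family member.

```python
from typing import List


def gaps_without_predicted_intron(
    gap_positions: List[int],
    predicted_junctions: List[int],
    tolerance: int = 3,  # hypothetical slack, in bases
) -> List[int]:
    """Return alignment gaps that do not coincide with a predicted intron.

    gap_positions: transcript coordinates where the alignment to the
        parent's genomic locus opens an intron-sized gap.
    predicted_junctions: transcript coordinates of exon-exon junctions
        in the predicted gene model.
    """
    return [
        gap
        for gap in gap_positions
        if all(abs(gap - junction) > tolerance for junction in predicted_junctions)
    ]


def looks_like_processed_pseudogene(
    gap_positions: List[int],
    predicted_junctions: List[int],
    tolerance: int = 3,
) -> bool:
    """Flag a segment when any parent intron falls inside a predicted exon."""
    unexplained = gaps_without_predicted_intron(
        gap_positions, predicted_junctions, tolerance
    )
    return len(unexplained) > 0
```

A segment whose alignment shows no intron-sized gaps at all is not flagged by this sketch, since it gives no evidence either way.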
Figure 2.
Flow diagram for the conserved synteny method. See text for details.
Figure 3.
Use of conserved synteny in pseudogene finding. (A) A pseudogene (pink) on human chromosome 7 was inserted between two genes (blue and orange). (B) This part of human chromosome 7 is orthologous to a region on mouse chromosome 5. If the processed pseudogene in human was generated after the mouse–human split, it will not be present in the orthologous region in the mouse. Instead, the best match in the mouse genome is the location that is orthologous to the parent of the pseudogene.
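
As a rough illustration of the test this caption describes, the Python sketch below flags a candidate when its best match in the other genome lies outside the region orthologous to its flanking genes. The `Region` type, the function name, and the coordinates are hypothetical; how the orthologous region and the best match are obtained is outside the sketch and is not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A half-open genomic interval [start, end) on one chromosome."""
    chrom: str
    start: int
    end: int

    def overlaps(self, other: "Region") -> bool:
        return (
            self.chrom == other.chrom
            and self.start < other.end
            and other.start < self.end
        )


def breaks_conserved_synteny(orthologous_region: Region, best_match: Region) -> bool:
    """True when the best match lies outside the orthologous region.

    orthologous_region: region of the second genome (e.g. mouse) that is
        orthologous to the neighbourhood of the candidate.
    best_match: location of the candidate's best hit in that genome.
    """
    return not orthologous_region.overlaps(best_match)
```

In the caption's example, a human candidate whose flanking genes map to mouse chromosome 5 but whose best mouse match lies elsewhere would be flagged.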
Figure 4.
Flow diagram for the bootstrap method that combines pseudogene finding with gene prediction. To iteratively mask pseudogenes and rerun gene prediction, PPFINDER is run with a masking step after each of the methods (conserved synteny and intron alignment). This nested looping is done to remove redundancy, because many pseudogenes will be found by both methods. First, the cycle of pseudogene finding and masking is run using the conserved synteny method, until no more pseudogenes are found. Then the same is done using the intron alignment method. PPFINDER will keep looping through both methods until neither finds any more pseudogenes. One masking/gene prediction loop is called one round.
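
The nested looping described in this caption can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the authors' code: `predict_genes` stands in for a gene predictor such as N-SCAN, each entry of `finders` stands in for one pseudogene-finding method, and `max_rounds` is a hypothetical safety limit. All of these names are invented for the example.

```python
from typing import Callable, Hashable, List, Sequence, Set, Tuple

Prediction = Hashable
Predictor = Callable[[Set[Prediction]], List[Prediction]]
Finder = Callable[[List[Prediction]], Set[Prediction]]


def iterate_masking(
    predict_genes: Predictor,
    finders: Sequence[Finder],
    max_rounds: int = 50,  # hypothetical safety limit
) -> Tuple[List[Prediction], Set[Prediction], int]:
    """Alternate pseudogene finding, masking, and gene prediction.

    Each finder (e.g. conserved synteny, then intron alignment) is run
    repeatedly until it finds nothing new; the outer loop repeats until
    a full pass over all finders finds nothing. One masking/prediction
    cycle counts as one round.
    """
    masked: Set[Prediction] = set()
    rounds = 0
    found_in_pass = True
    while found_in_pass:
        found_in_pass = False
        for find in finders:
            while True:
                if rounds >= max_rounds:
                    raise RuntimeError("masking did not converge")
                genes = predict_genes(masked)
                new_pseudogenes = find(genes) - masked
                if not new_pseudogenes:
                    break
                masked |= new_pseudogenes  # mask, then re-predict
                rounds += 1
                found_in_pass = True
    return predict_genes(masked), masked, rounds
```

Running each method to exhaustion before moving to the next mirrors the order given in the caption (conserved synteny first, then intron alignment).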
Figure 5.
Improvement of gene prediction after pseudogene masking. (A) After masking out a pseudogene incorporated in the original gene model, the gene is predicted correctly. (B) A single gene model is split into two correct models after a pseudogene exon is masked. (C) Two gene models are merged into one correct model after masking the pseudogene in an intron of SLC16A1. UTRs are shown as thin blocks, coding exons as thicker blocks, and introns as lines. A gene is considered correctly predicted when the coding sequence is correct. This figure was modified from a screen shot of the UCSC Genome Browser at http://genome.ucsc.edu (Kent et al. 2002).

References

    1. Alexandersson M., Cawley S., Pachter L. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 2003;13:496–502.
    2. Ashurst J.L., Chen C.K., Gilbert J.G., Jekosch K., Keenan S., Meidl P., Searle S.M., Stalker J., Storey R., Trevanion S., et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 2005;33:D459–D465.
    3. Blanco E., Parra G., Guigo R. 2003. Using geneid to identify genes. In Current protocols in bioinformatics (ed. D.B. Davison), pp. Unit 4.3. John Wiley & Sons Inc., New York.
    4. Burge C., Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997;268:78–94.
    5. Buzdin A.A. Retroelements and formation of chimeric retrogenes. Cell. Mol. Life Sci. 2004;61:2046–2059.
