Genome Res. 2004 Nov;14(11):2330-5.
doi: 10.1101/gr.2816704. Epub 2004 Oct 12.

Gene prediction and verification in a compact genome with numerous small introns

Aaron E Tenney et al. Genome Res. 2004 Nov.

Abstract

The genomes of clusters of related eukaryotes are now being sequenced at an increasing rate, creating a need for accurate, low-cost annotation of exon-intron structures. In this paper, we demonstrate that reverse transcription-polymerase chain reaction (RT-PCR) and direct sequencing based on predicted gene structures satisfy this need, at least for single-celled eukaryotes. The TWINSCAN gene prediction algorithm was adapted for the fungal pathogen Cryptococcus neoformans by using a precise model of intron lengths in combination with ungapped alignments between the genome sequences of the two closely related Cryptococcus varieties. This approach resulted in approximately 60% of known genes being predicted exactly right at every coding base and splice site. When previously unannotated TWINSCAN predictions were tested by RT-PCR and direct sequencing, 75% of targets spanning two predicted introns were amplified and produced high-quality sequence. When targets spanning the complete predicted open reading frame were tested, 72% of them amplified and produced high-quality sequence. We conclude that sequencing a small number of expressed sequence tags (ESTs) to provide training data, running TWINSCAN on an entire genome, and then performing RT-PCR and direct sequencing on all of its predictions would be a cost-effective method for obtaining an experimentally verified genome annotation.
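The "exactly right at every coding base and splice site" criterion in the abstract can be illustrated with a minimal sketch. The gene representation used here (sequence name, strand, and a tuple of coding exon coordinates) and the function name `exact_gene_sensitivity` are assumptions made for illustration; they are not taken from the paper or from the TWINSCAN software, and the coordinates are made up.

```python
# Hypothetical representation: a gene is (sequence_name, strand, exons),
# where exons is a tuple of (start, end) coding intervals in genome order.

def exact_gene_sensitivity(annotated_genes, predicted_genes):
    """Fraction of annotated genes whose full exon structure is predicted exactly.

    A gene counts as correct only if a prediction on the same sequence and
    strand has identical start and end coordinates for every coding exon,
    which implies every coding base and every splice site matches.
    """
    predicted_structures = set(predicted_genes)
    if not annotated_genes:
        return 0.0
    exact = sum(1 for gene in annotated_genes if gene in predicted_structures)
    return exact / len(annotated_genes)


# Toy data (invented coordinates, for illustration only).
annotated = [
    ("chr1", "+", ((100, 250), (320, 480))),
    ("chr1", "-", ((900, 1010), (1075, 1300), (1362, 1500))),
    ("chr2", "+", ((40, 400),)),
]
predicted = [
    ("chr1", "+", ((100, 250), (320, 480))),                    # exact match
    ("chr1", "-", ((900, 1010), (1080, 1300), (1362, 1500))),   # one splice site off
    ("chr2", "+", ((40, 400),)),                                # exact match
]

sensitivity = exact_gene_sensitivity(annotated, predicted)
print(f"Exact gene sensitivity: {sensitivity:.2f}")
```

Under this strict criterion a single misplaced splice site makes the whole gene count as wrong, which is why exact-gene accuracy is a demanding measure.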

Figures

Figure 1.
Intron length probability distributions used. The smoothed empirical distribution (purple, solid line) closely mirrors the observed intron lengths in the training set (not shown). The geometric distribution (black, dashed line) is the unique member of the geometric family with mean intron length equal to that of the training set (68 bp), but it is clearly a poor fit to the observed distribution. (A) Linear scales; (B) log scale on probability axis.
Figure 2.
Accuracy of the gene prediction set generated without genome comparison and the final prediction set, which was generated using comparison to the Serotype A genome. The smoothed empirical intron length model was used in generating both sets.
Figure 3.
Comparison of TWINSCAN predictions generated using the geometric intron length model to the final TWINSCAN prediction set, which was generated using the smoothed empirical length model. Comparison to the Serotype A genome was used in generating both sets.
Figure 4.
A curated annotation (blue), blocks of ungapped alignment from the genome sequence of Serotype A Strain H99 (black), the TWINSCAN prediction (red), and the PCR primers and experimental sequence aligned back to the genome (green). TWINSCAN's prediction of the missing exon is influenced by both the long ungapped alignment from H99 and the unusually (though not impossibly) long intron in the curated gene structure.
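The contrast described in the Figure 1 caption can be sketched numerically. A geometric distribution on lengths 1, 2, 3, ... with success probability p has mean 1/p, so the member with mean 68 bp has p = 1/68, and its probability decreases monotonically from length 1. The smoothing below (a moving average over a histogram with a small pseudocount) is an assumption chosen for illustration; the caption does not say how the paper's smoothed empirical distribution was computed, and the toy intron lengths are invented.

```python
from collections import Counter


def geometric_pmf(length, mean_length):
    """P(L = length) for a geometric distribution on 1, 2, 3, ... with the given mean."""
    if length < 1:
        return 0.0
    p = 1.0 / mean_length
    return p * (1.0 - p) ** (length - 1)


def smoothed_empirical_pmf(observed_lengths, max_length, half_window=2, pseudocount=0.01):
    """Moving-average smoothing of a length histogram (illustrative assumption only)."""
    counts = Counter(observed_lengths)
    raw = [counts.get(k, 0) + pseudocount for k in range(1, max_length + 1)]
    smoothed = []
    for i in range(len(raw)):
        lo = max(0, i - half_window)
        hi = min(len(raw), i + half_window + 1)
        smoothed.append(sum(raw[lo:hi]) / (hi - lo))
    total = sum(smoothed)
    return {k: v / total for k, v in zip(range(1, max_length + 1), smoothed)}


# Invented intron lengths clustered around a short typical length.
toy_lengths = [52, 55, 58, 60, 61, 63, 66, 70, 74, 80, 95, 120]
empirical = smoothed_empirical_pmf(toy_lengths, max_length=300)

# The geometric model always puts its highest probability on the shortest length,
# whereas the smoothed histogram peaks where the observed lengths cluster.
geometric_mode = max(range(1, 301), key=lambda k: geometric_pmf(k, 68))
empirical_mode = max(empirical, key=empirical.get)
print("geometric mode:", geometric_mode)
print("smoothed empirical mode:", empirical_mode)
```

Because the geometric family is monotonically decreasing, matching the mean alone cannot reproduce a length distribution that peaks away from the minimum, which is consistent with the poor fit the caption describes.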

