Comparative Study
Genome Res. 2005 Apr;15(4):577-82.
doi: 10.1101/gr.3329005.

Closing in on the C. elegans ORFeome by cloning TWINSCAN predictions

Chaochun Wei et al. Genome Res. 2005 Apr.

Abstract

The genome of Caenorhabditis elegans was the first animal genome to be sequenced. Although considerable effort has been devoted to annotating it, the standard WormBase annotation contains thousands of predicted genes for which there is no cDNA or EST evidence. We hypothesized that a more complete experimental annotation could be obtained by creating a more accurate gene-prediction program and then amplifying and sequencing predicted genes. Our approach was to adapt the TWINSCAN gene prediction system to C. elegans and C. briggsae and to improve its splice site and intron-length models. The resulting system has 60% sensitivity and 58% specificity in exact prediction of open reading frames (ORFs), and hence proteins, the best results we are aware of for any multicellular organism. We then attempted to amplify, clone, and sequence 265 TWINSCAN-predicted ORFs that did not overlap WormBase gene annotations. The success rate was 55%, adding 146 genes that were completely absent from WormBase to the ORF clone collection (ORFeome). The same procedure had a 7% success rate on 90 WormBase "predicted" genes that do not overlap TWINSCAN predictions. These results indicate that the accuracy of WormBase could be significantly increased by replacing its partially curated predicted genes with TWINSCAN predictions. The technology described in this study will continue to drive the C. elegans ORFeome toward completion and contribute to the annotation of the three Caenorhabditis species currently being sequenced. The results also suggest that this technology can significantly improve our knowledge of the "parts list" for even the best-studied model organisms.

Figures

Figure 1.
Empirical, smoothed empirical, and geometric intron length distributions up to 4000 nt, on a log–log scale. The smoothed empirical length distribution is very close to the distribution observed in the 3889 fully confirmed genes of WS100. The geometric distribution, in contrast, assigns far too much probability to very short introns, too little to introns of the most common lengths, too much to introns between 100 and 1000 nt, and too little to introns longer than 2000 nt.
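The contrast described in the Figure 1 caption can be illustrated with a minimal sketch. It builds an empirical length distribution from a list of intron lengths, smooths it, and compares both to a geometric distribution with the same mean. The toy lengths, the moving-average smoother, and all function names are assumptions made for illustration; the smoothing method actually used in the paper is not given in this record.

```python
from collections import Counter

def empirical_distribution(lengths, max_len):
    """P(L = l) for l = 1..max_len, estimated by simple counting."""
    counts = Counter(l for l in lengths if 1 <= l <= max_len)
    total = sum(counts.values())
    return [counts.get(l, 0) / total for l in range(1, max_len + 1)]

def smooth(probs, window=5):
    """Moving-average smoothing (an assumed method), renormalised to sum to 1."""
    half = window // 2
    smoothed = []
    for i in range(len(probs)):
        lo, hi = max(0, i - half), min(len(probs), i + half + 1)
        smoothed.append(sum(probs[lo:hi]) / (hi - lo))
    total = sum(smoothed)
    return [p / total for p in smoothed]

def geometric_distribution(lengths, max_len):
    """Geometric model with the same mean as the data: P(L = l) = p(1 - p)^(l - 1)."""
    p = len(lengths) / sum(lengths)  # p = 1 / mean length
    return [p * (1 - p) ** (l - 1) for l in range(1, max_len + 1)]

# Hypothetical intron lengths (nt): a sharp peak near 47-50 nt plus a long tail.
toy_lengths = [47] * 50 + [48] * 40 + [50] * 30 + [300] * 10 + [2500] * 5
MAX_LEN = 4000

emp = empirical_distribution(toy_lengths, MAX_LEN)
smo = smooth(emp)
geo = geometric_distribution(toy_lengths, MAX_LEN)

# Index l - 1 holds the probability of length l.
for length in (1, 47, 2500):
    i = length - 1
    print(f"{length:>5} nt  empirical={emp[i]:.4f}  smoothed={smo[i]:.4f}  geometric={geo[i]:.4f}")
```

On this toy data the geometric model puts more mass on 1 nt introns than the data supports and far less on the 47 nt peak, which is the qualitative mismatch the caption describes.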
Figure 2.
Accuracy of TWINSCAN and GENEFINDER on C. elegans estimated by comparison to 5569 fully confirmed ORFs (WS130). TWINSCAN uses alignments to the C. briggsae genome, an empirical intron-length model, and a model of GC splice donors. (Gene Sn) Percentage of loci with fully confirmed ORFs, at which TWINSCAN predicts one confirmed ORF exactly right. (Gene Sp) Percentage of TWINSCAN predictions that exactly match fully confirmed ORFs. Predictions that do not overlap any confirmed ORF are not counted. (Exon Sn and Exon Sp) Exact matches to coding regions of exons in fully confirmed ORFs.
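The gene-level measures defined in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the Eval software: an ORF is represented as a hypothetical (chromosome, strand, exons) tuple with inclusive coordinates, a locus is a list of confirmed ORFs, and overlap is assumed to be strand-specific.

```python
def span(orf):
    """Return (chrom, strand, start, end); exons are assumed sorted by position."""
    chrom, strand, exons = orf
    return chrom, strand, exons[0][0], exons[-1][1]

def overlaps(a, b):
    """True if two ORFs share a chromosome and strand and their spans intersect."""
    ca, sa, s1, e1 = span(a)
    cb, sb, s2, e2 = span(b)
    return ca == cb and sa == sb and s1 <= e2 and s2 <= e1

def gene_sn_sp(loci, predictions):
    """Gene Sn: fraction of loci where some confirmed ORF is predicted exactly.
    Gene Sp: fraction of predictions overlapping a confirmed ORF that match one
    exactly; predictions overlapping no confirmed ORF are not counted."""
    predicted = set(predictions)
    confirmed = [orf for locus in loci for orf in locus]
    confirmed_set = set(confirmed)
    hit_loci = sum(1 for locus in loci if any(orf in predicted for orf in locus))
    counted = [p for p in predictions if any(overlaps(p, c) for c in confirmed)]
    exact = sum(1 for p in counted if p in confirmed_set)
    sn = hit_loci / len(loci) if loci else 0.0
    sp = exact / len(counted) if counted else 0.0
    return sn, sp

# Hypothetical toy data: three loci, the second with two confirmed isoforms.
loci = [
    [("I", "+", ((100, 200), (300, 400)))],
    [("I", "+", ((1000, 1100), (1200, 1300))),
     ("I", "+", ((1000, 1100), (1250, 1300)))],
    [("II", "-", ((500, 600),))],
]
predictions = [
    ("I", "+", ((100, 200), (300, 400))),      # exact match to the first locus
    ("I", "+", ((1000, 1100), (1200, 1290))),  # overlaps the second locus, not exact
    ("III", "+", ((10, 90),)),                 # overlaps nothing, excluded from Sp
]

sn, sp = gene_sn_sp(loci, predictions)
print(f"Gene Sn = {sn:.2f}, Gene Sp = {sp:.2f}")
```

Here one of three loci is predicted exactly and one of the two counted predictions is an exact match, so Sn is 1/3 and Sp is 1/2.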
Figure 3.
Comparison of the accuracy of GAZE (with its trans-splicing model), FGENESH, and TWINSCAN on the GAZE test set. Numbers for GAZE and FGENESH are taken from Howe et al. (2002).
Figure 4.
Breakdown of genome-wide predictions by TWINSCAN 2.01 in comparison to the WS130 annotations. (Row 1) Total number of WormBase annotations and TWINSCAN predictions. (Row 2) Breakdown of TWINSCAN predictions into those that are identical to fully confirmed WormBase predictions, those that overlap but are not identical, and those that do not overlap (orange). (Row 3) Breakdown of TWINSCAN predictions that do not overlap fully cDNA-confirmed ORFs by comparison to the partially cDNA confirmed WormBase ORFs. (Row 4) Breakdown of TWINSCAN predictions that do not overlap fully or partially confirmed WormBase ORFs by comparison to predicted WormBase ORFs. (Row 5) Breakdown of TWINSCAN predictions that do not overlap any of the above into single exon (beige) and multiexon (orange) predictions. (Row 6) Breakdown of novel multiexon TWINSCAN predictions into those that are shorter than 200 amino acids (pink) and those that are at least 200 amino acids (red). Analysis of predictions by an earlier and slightly less accurate version of TWINSCAN (2.0α), by comparison to WS100 ORFs, placed 265 novel ORFs of at least 200 amino acids in the red box, all of which were tested experimentally.
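The row-by-row breakdown in the Figure 4 caption amounts to a cascade of filters, which can be sketched as below. The category names, the single-chromosome coordinate model, and the protein-length estimate (coding nucleotides divided by three, ignoring the stop codon) are simplifying assumptions for illustration and are not taken from the paper.

```python
from collections import Counter

def overlaps(a, b):
    """True if the spans of two genes intersect (inclusive coordinates, one chromosome)."""
    return a[0][0] <= b[-1][1] and b[0][0] <= a[-1][1]

def protein_length(exons):
    """Approximate length in amino acids: coding nucleotides // 3 (assumed)."""
    return sum(end - start + 1 for start, end in exons) // 3

def classify(pred, full, partial, predicted, min_aa=200):
    """Assign a prediction to the first category it falls into, following the rows."""
    if pred in full:
        return "identical_to_fully_confirmed"
    if any(overlaps(pred, g) for g in full):
        return "overlaps_fully_confirmed"
    if any(overlaps(pred, g) for g in partial):
        return "overlaps_partially_confirmed"
    if any(overlaps(pred, g) for g in predicted):
        return "overlaps_predicted"
    if len(pred) == 1:
        return "novel_single_exon"
    if protein_length(pred) < min_aa:
        return "novel_multiexon_short"
    return "novel_multiexon_long"  # candidates for experimental testing

# Hypothetical annotations; each gene is a tuple of (start, end) coding exons.
full = [((100, 400), (500, 800))]
partial = [((2000, 2300),)]
predicted = [((5000, 5600),)]

preds = [
    ((100, 400), (500, 800)),
    ((100, 400), (500, 700)),
    ((2100, 2200),),
    ((5500, 5700), (5800, 5900)),
    ((9000, 9500),),
    ((10000, 10149), (10300, 10449)),
    ((20000, 20299), (20400, 20799)),
]

labels = [classify(p, full, partial, predicted) for p in preds]
for label, count in Counter(labels).items():
    print(f"{label}: {count}")
```

Only predictions that reach the final category correspond to the novel multiexon ORFs of at least 200 amino acids that the caption says were tested experimentally.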

References

    1. Brent, M.R. and Guigó, R. 2004. Recent advances in gene structure prediction. Curr. Opin. Struct. Biol. 14: 264-272.
    2. Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
    3. C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282: 2012-2018.
    4. Cho, S., Suk-Won, J., Cohen, A., and Ellis, R. 2004. A phylogeny of Caenorhabditis reveals frequent loss of introns during nematode evolution. Genome Res. 14: 1209-1220.
    5. Flicek, P., Keibler, E., Hu, P., Korf, I., and Brent, M.R. 2003. Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map. Genome Res. 13: 46-54.

Web site references

    1. http://www.girinst.org/server/RepBase/repeatmaskerlibraries/repeatmasker...; Repeat libraries used in the foregoing analysis.
    2. http://www.sanger.ac.uk/Software/analysis/GAZE; GAZE data set.
    3. http://genes.cse.wustl.edu/eval/; Eval software.
    4. http://genes.cse.wustl.edu/wei-2005/; Predictions, primers, experimental sequences and traces, and genome alignments.
    5. http://blast.wustl.edu; Washington University BLAST archives.
