BMC Bioinformatics. 2006 Jul 3:7:327.

doi: 10.1186/1471-2105-7-327.

Using ESTs to improve the accuracy of de novo gene prediction

Chaochun Wei¹, Michael R Brent

Affiliation

¹ Laboratory for Computational Genomics and Department of Computer Science and Engineering, Washington University, One Brookings Drive, St. Louis, MO 63130, USA. wei@cse.wustl.edu

PMID: 16817966
PMCID: PMC1534067
DOI: 10.1186/1471-2105-7-327

Chaochun Wei et al. BMC Bioinformatics. 2006.

Abstract

Background: ESTs are a tremendous resource for determining the exon-intron structures of genes, but even extensive EST sequencing tends to leave many exons and genes untouched. Gene prediction systems based exclusively on EST alignments miss these exons and genes, leading to poor sensitivity. De novo gene prediction systems, which ignore ESTs in favor of genomic sequence, can predict such "untouched" exons, but they are less accurate when predicting exons to which ESTs align. TWINSCAN is the most accurate de novo gene finder available for nematodes and N-SCAN is the most accurate for mammals, as measured by exact CDS gene prediction and exact exon prediction.

Results: TWINSCAN_EST is a new system that successfully combines EST alignments with TWINSCAN. On the whole C. elegans genome TWINSCAN_EST shows 14% improvement in sensitivity and 13% in specificity in predicting exact gene structures compared to TWINSCAN without EST alignments. Not only are the structures revealed by EST alignments predicted correctly, but these also constrain the predictions without alignments, improving their accuracy. For the human genome, we used the same approach with N-SCAN, creating N-SCAN_EST. On the whole genome, N-SCAN_EST produced a 6% improvement in sensitivity and 1% in specificity of exact gene structure predictions compared to N-SCAN.

Conclusion: TWINSCAN_EST and N-SCAN_EST are more accurate than TWINSCAN and N-SCAN, while retaining their ability to discover novel genes to which no ESTs align. Thus, we recommend using the EST versions of these programs to annotate any genome for which EST information is available. TWINSCAN_EST and N-SCAN_EST are part of the TWINSCAN open source software package, available at http://genes.cse.wustl.edu/distribution/download_TS.html.

Figures

**Figure 1**
Construction of ESTseq from EST alignments. Each row of purple bars represents the aligned blocks of one EST, while the thin lines connecting the bars represent implied introns. The ESTseq representation contains an "E" for each base that is indicated as exonic (red), an "I" for each base that is indicated as intronic (yellow), and an "N" for each base that lies outside of all the alignments (gray). Regions that are indicated as intronic by some alignments and exonic by others are also labeled "N".
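
The labeling rule in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TWINSCAN_EST implementation: the function name `build_estseq`, the 0-based half-open block coordinates, and the input layout (one list of aligned blocks per EST) are all hypothetical choices made for the example.

```python
def build_estseq(length, est_alignments):
    """Return a string of 'E', 'I', and 'N' labels for a genomic region.

    est_alignments holds one entry per EST; each entry is a list of
    (start, end) aligned blocks in 0-based, half-open coordinates
    (an assumed layout, not taken from the paper).
    """
    exonic = [False] * length
    intronic = [False] * length

    for blocks in est_alignments:
        blocks = sorted(blocks)
        # Bases covered by an aligned block are indicated as exonic.
        for start, end in blocks:
            for pos in range(start, end):
                exonic[pos] = True
        # Gaps between consecutive blocks of one EST are implied introns.
        for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
            for pos in range(prev_end, next_start):
                intronic[pos] = True

    labels = []
    for is_exon, is_intron in zip(exonic, intronic):
        if is_exon and not is_intron:
            labels.append("E")
        elif is_intron and not is_exon:
            labels.append("I")
        else:
            # Either no alignment touches the base, or alignments conflict.
            labels.append("N")
    return "".join(labels)
```

For example, with one EST aligned as blocks `(1, 3)` and `(6, 8)` and a second EST aligned as a single block `(5, 9)`, position 5 is intronic according to the first EST and exonic according to the second, so it is labeled "N".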

**Figure 2**
Results on the whole *C. elegans* genome (version WS130) using *C. briggsae* (version cb25.apg8) as the informant database and *C. elegans* ESTs from dbEST. The sensitivities are based on the 4,705 fully confirmed genes from WS130 and the specificities are based on those predictions that overlap with fully confirmed genes.
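
As a rough illustration of how exact gene-level accuracy of this kind is usually computed, the sketch below assumes the convention common in gene prediction, where sensitivity is the fraction of annotated genes predicted exactly and specificity is the fraction of predictions that exactly match an annotated gene. The function name `exact_gene_accuracy` and the representation of a gene as a tuple of CDS exon coordinates are assumptions for the example, and the predictions are assumed to have been filtered already to those overlapping confirmed genes.

```python
def exact_gene_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for exact gene structure matches.

    Each gene is a hashable structure, here a tuple of (start, end) CDS
    exons (an assumed representation). A prediction counts as correct
    only if every exon boundary matches an annotated gene exactly.
    """
    annotated_set = set(annotated)
    predicted_set = set(predicted)
    correct = len(annotated_set & predicted_set)

    sensitivity = correct / len(annotated_set) if annotated_set else 0.0
    specificity = correct / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity
```

Under this convention a prediction with a single misplaced splice site counts as wrong for the whole gene, which is why exact gene accuracy is a much stricter measure than exon-level accuracy.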

**Figure 3**
Accuracy on GAZE merged data set. Both GAZE_EST and TWINSCAN_EST used the same BLAT alignments of *C. elegans* ESTs from dbEST (1/20/2005). Informant database for TWINSCAN_EST is the *C. briggsae* genome (version cb25.apg8). 305 of the 325 gene loci have at least one EST alignment.

**Figure 4**
Accuracy of TWINSCAN, TWINSCAN_EST, N-SCAN and N-SCAN_EST on the human genome. For TWINSCAN and TWINSCAN_EST, the mouse genome sequence is used as the informant database. For N-SCAN and N-SCAN_EST, mouse, rat and chicken genomes are used as the informant databases. Human ESTs are from dbEST. For all methods, pseudogenes are masked out first [41].

**Figure 5**
Trainability of ESTseq parameters. The human and worm genes were each divided into two halves, one for training and one for testing. ESTseq parameters were estimated separately from half the human genes and half the worm genes. Each set of parameters was then tested separately on the other half of the human genes and the other half of the worm genes. The same models were used for both human and worm ESTseqs (5th-order Markov models for the coding regions, UTRs, introns and intergenic regions, a 43-base-long 2nd-order WAM for splice acceptor sites and a 9-base-long 2nd-order WAM for splice donor sites).

**Figure 6**
Accuracy of Pairagon cDNA alignments alone compared to Pairagon+N-SCAN_EST as a function of the number of cDNAs used. A total of 445 cDNAs aligned to the 31 human ENCODE test regions. The x axis shows the percentage of these 445 that were used. From left to right, 5% of unused cDNAs were randomly picked and added to those used previously.
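
The incremental sampling described in this caption can be sketched as a series of nested random subsets. This is only one plausible reading of the procedure: the function name `nested_subsets`, the fixed seed, and the choice to make each step 5% of the full set (rather than 5% of whatever remains unused) are assumptions, not details confirmed by the source.

```python
import random


def nested_subsets(items, step_fraction=0.05, seed=0):
    """Return a list of growing subsets, each containing the previous one.

    At every step a fixed number of not-yet-used items is drawn at random
    and added to the items used so far. The step size is a fraction of the
    full set (an assumed interpretation of the caption).
    """
    rng = random.Random(seed)
    unused = list(items)
    rng.shuffle(unused)  # a random order makes each slice a random draw
    step = max(1, round(len(unused) * step_fraction))

    used = []
    subsets = []
    while unused:
        used.extend(unused[:step])
        unused = unused[step:]
        subsets.append(list(used))
    return subsets
```

Because every subset contains the previous one, a change in accuracy between adjacent points on the x axis reflects only the newly added cDNAs, not a reshuffling of the whole sample.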

References

1. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004;14:988–995. doi: 10.1101/gr.1865504.
2. Brent MR. Genome annotation past, present and future: How to define an ORF at each locus. Genome Res. 2005;15:1777–1786. doi: 10.1101/gr.3866105.
3. Guigó R, Dermitzakis ET, Agarwal P, Ponting C, Parra G, Reymond A, Abril JF, Keibler E, Lyle R, Ucla C, Antonarakis SE, Brent MR. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A. 2003;100:1140–1145. doi: 10.1073/pnas.0337561100.
4. The MGC Project Team. The Status, Quality, and Expansion of the NIH Full-Length cDNA Project: The Mammalian Gene Collection (MGC). Genome Res. 2004;14:2121–2127. doi: 10.1101/gr.2596504.
5. Howe KL, Chothia T, Durbin R. GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res. 2002;12:1418–1427. doi: 10.1101/gr.149502.
