Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus
- PMID: 14764557
- DOI: 10.1093/bioinformatics/bth058
Abstract
Motivation: Accurate gene structure annotation is a challenging computational problem in genomics. The best results are achieved with spliced alignment of full-length cDNAs or multiple expressed sequence tags (ESTs) with sufficient overlap to cover the entire gene. For most species, cDNA and EST collections are far from comprehensive. We sought to overcome this bottleneck by exploring the possibility of using combined EST resources from fairly diverged species that still share a common gene space. Previous spliced alignment tools were found inadequate for this task because they rely on very high sequence similarity between the ESTs and the genomic DNA.
Results: We have developed a computer program, GeneSeqer, which is capable of aligning thousands of ESTs with a long genomic sequence in a reasonable amount of time. The algorithm is uniquely designed to tolerate a high percentage of mismatches and insertions or deletions in the EST relative to the genomic template. This feature allows use of non-cognate ESTs for gene structure prediction, including ESTs derived from duplicated genes and homologous genes from related species. The increased gene prediction sensitivity results in part from novel splice site prediction models that are also available as a stand-alone splice site prediction tool. We assessed GeneSeqer performance relative to a standard Arabidopsis thaliana gene set and demonstrate its utility for plant genome annotation. In particular, we propose that this method provides a timely tool for the annotation of the rice genome, using abundant ESTs from other cereals and plants.
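The idea of deriving one gene structure from several ESTs aligned to the same locus can be illustrated with a toy example. The sketch below is not GeneSeqer's algorithm: it assumes the spliced alignments have already been computed and reduced to exon intervals, and it simply takes the union of overlapping exons. The names `merge_exons` and `introns_between`, and the coordinates, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# An exon as (start, end) in 1-based inclusive genomic coordinates (assumed convention).
Interval = Tuple[int, int]


def merge_exons(est_alignments: List[List[Interval]]) -> List[Interval]:
    """Union overlapping exon intervals collected from several EST alignments.

    This is a naive consensus: it ignores alignment scores and splice site
    quality, which a real spliced alignment tool would take into account.
    """
    exons = sorted(iv for est in est_alignments for iv in est)
    merged: List[Interval] = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous exon: extend it if this one reaches further.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def introns_between(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons as intron intervals."""
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    ]


# Three hypothetical ESTs, each covering only part of the same locus.
est_a = [(100, 200), (300, 400)]
est_b = [(350, 400), (500, 620)]
est_c = [(150, 200), (300, 400), (500, 560)]

consensus = merge_exons([est_a, est_b, est_c])
introns = introns_between(consensus)
print(consensus)  # [(100, 200), (300, 400), (500, 620)]
print(introns)    # [(201, 299), (401, 499)]
```

No single EST in this example spans all three exons, but together they do, which is the situation the abstract describes for partial, overlapping ESTs. A plain interval union would mishandle cases such as retained introns or conflicting splice sites, so it should be read only as an illustration of the input and output shapes.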
Availability: The source code is available for download at http://bioinformatics.iastate.edu/bioinformatics2go/gs/download.html. Web servers for Arabidopsis and other plant species are accessible at http://www.plantgdb.org/cgi-bin/AtGeneSeqer.cgi and http://www.plantgdb.org/cgi-bin/GeneSeqer.cgi, respectively. For non-plant species, use http://bioinformatics.iastate.edu/cgi-bin/gs.cgi. The splice site prediction tool (SplicePredictor) is distributed with the GeneSeqer code. A SplicePredictor web server is available at http://bioinformatics.iastate.edu/cgi-bin/sp.cgi.