Pairagon+N-SCAN_EST: a model-based gene annotation pipeline
- PMID: 16925839
- PMCID: PMC1810554
- DOI: 10.1186/gb-2006-7-s1-s5
Abstract
Background: This paper describes Pairagon+N-SCAN_EST, a gene annotation pipeline that uses only native alignments. For each expressed sequence it chooses the best genomic alignment. Systems like ENSEMBL and ExoGean rely on trans alignments, in which expressed sequences are aligned to the genomic loci of putative homologs. Trans alignments contain a high proportion of mismatches, gaps, and/or apparently unspliceable introns, compared to alignments of cDNA sequences to their native loci. The Pairagon+N-SCAN_EST pipeline's first stage is Pairagon, a cDNA-to-genome alignment program based on a PairHMM probability model. This model relies on prior knowledge, such as the fact that introns must begin with GT, GC, or AT and end with AG or AC. It produces very precise alignments of high-quality cDNA sequences. In the genomic regions between Pairagon's cDNA alignments, the pipeline combines EST alignments with de novo gene prediction by using N-SCAN_EST. N-SCAN_EST is based on a generalized HMM probability model augmented with a phylogenetic conservation model and EST alignments. It can predict complete transcripts by extending or merging EST alignments, but it can also predict genes in regions without EST alignments. Because they are based on probability models, both Pairagon and N-SCAN_EST can be trained automatically for new genomes and data sets.
Results: On the ENCODE regions of the human genome, Pairagon+N-SCAN_EST was as accurate as any other system tested in the EGASP assessment, including ENSEMBL and ExoGean.
Conclusion: With sufficient mRNA/EST evidence, genome annotation without trans alignments can compete successfully with systems like ENSEMBL and ExoGean, which use trans alignments.
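The intron-boundary constraint mentioned in the Background can be illustrated with a small check. This is a minimal sketch, not Pairagon's implementation: the function name, the half-open coordinate convention, and the minimum-length guard are assumptions made for illustration, and only the dinucleotide sets come from the abstract. The abstract does not say whether the model restricts which start is paired with which end, so the sketch tests each end independently.

```python
# Dinucleotides taken from the abstract; everything else here is hypothetical.
INTRON_STARTS = {"GT", "GC", "AT"}
INTRON_ENDS = {"AG", "AC"}


def intron_has_allowed_boundaries(genome: str, start: int, end: int) -> bool:
    """Check whether genome[start:end] begins and ends with allowed dinucleotides.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    intron = genome[start:end].upper()
    # An intron shorter than four bases cannot carry both boundary dinucleotides.
    if len(intron) < 4:
        return False
    return intron[:2] in INTRON_STARTS and intron[-2:] in INTRON_ENDS


if __name__ == "__main__":
    native = "CCGTAAAAAGCC"   # candidate intron at [2, 10) is GTAAAAAG
    shifted = "CCTTAAAAAGCC"  # candidate intron at [2, 10) is TTAAAAAG
    print(intron_has_allowed_boundaries(native, 2, 10))
    print(intron_has_allowed_boundaries(shifted, 2, 10))
```

A filter like this only flags candidate introns with impossible boundaries; a probability model such as a PairHMM would instead fold the same knowledge into its state transitions and emissions.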