CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes
- PMID: 17332020
- DOI: 10.1093/bioinformatics/btm071
Abstract
Motivation: The numbers of finished and ongoing genome projects are increasing at a rapid rate, and providing the catalog of genes for these new genomes is a key challenge. Obtaining a set of well-characterized genes is a basic requirement in the initial steps of any genome annotation process. An accurate set of genes is needed in order to learn about species-specific properties, to train gene-finding programs, and to validate automatic predictions. Unfortunately, many new genome projects lack comprehensive experimental data to derive a reliable initial set of genes.
Results: In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of conserved protein families that occur in a wide range of eukaryotes, and present a mapping procedure that accurately identifies their exon-intron structures in a novel genomic sequence. CEGMA includes the use of profile-hidden Markov models to ensure the reliability of the gene structures. Our procedure allows one to build an initial set of reliable gene annotations in potentially any eukaryotic genome, even those in draft stages.
Availability: Software and data sets are available online at http://korflab.ucdavis.edu/Datasets.
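The idea of defining "a set of conserved protein families that occur in a wide range of eukaryotes" can be illustrated with a minimal sketch. It is not the CEGMA implementation: the function name `select_core_families`, the `min_fraction` parameter, the family IDs and the species labels are all hypothetical, and the occurrence table is toy data. The sketch only shows the filtering step, keeping families that are present in a chosen fraction of reference species; it does not attempt the exon-intron mapping or the profile-HMM checks.

```python
from typing import Dict, List, Set


def select_core_families(
    occurrence: Dict[str, Set[str]],
    species: Set[str],
    min_fraction: float = 1.0,
) -> List[str]:
    """Return IDs of families found in at least `min_fraction` of `species`.

    `occurrence` maps a family ID to the set of species in which a member
    of that family was observed. All names here are illustrative only.
    """
    if not species:
        raise ValueError("species must not be empty")
    core = []
    for family, present_in in occurrence.items():
        # Fraction of the reference species that contain this family.
        covered = len(present_in & species) / len(species)
        if covered >= min_fraction:
            core.append(family)
    return sorted(core)


# Toy occurrence table (hypothetical family IDs and species labels).
occurrence = {
    "FAM001": {"sp_a", "sp_b", "sp_c"},
    "FAM002": {"sp_a", "sp_c"},
    "FAM003": {"sp_a", "sp_b", "sp_c", "sp_d"},
}
species = {"sp_a", "sp_b", "sp_c"}

core_families = select_core_families(occurrence, species)
print(core_families)  # ['FAM001', 'FAM003']
```

Requiring presence in every reference species (`min_fraction=1.0`) is the strictest choice; lowering the threshold trades reliability for a larger candidate set.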