Missing genes in the annotation of prokaryotic genomes
- PMID: 20230630
- PMCID: PMC3098052
- DOI: 10.1186/1471-2105-11-131
Abstract
Background: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However, there have been reports that prokaryotic gene finders have problems with small genes, either over-predicting or under-predicting them. The question therefore arises whether current genome annotations are systematically missing small genes.
Results: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs of at least 33 aa from all fully sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, which implies that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs that are readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (fewer than 100 aa). A comparison of selected examples with GeneMark, EasyGene and Glimmer predictions provides evidence that some of these genes escape detection by these programs.
Conclusions: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
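As an illustration of the first step described in the Results (collecting all ORFs of at least 33 aa), the following is a minimal sketch and not the authors' pipeline. The abstract does not state how an ORF is delimited, so this sketch assumes an ORF runs from an ATG start codon to the next in-frame stop codon, scans all six reading frames, and counts length in codons excluding the stop. The names `find_orfs`, `reverse_complement` and `min_aa` are hypothetical.

```python
# Minimal six-frame ORF scan. Assumptions (not from the paper): ORFs start at
# ATG, end at the first in-frame stop codon, and length excludes the stop.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=33):
    """Return (strand, frame, start, end, length_aa) for each ORF >= min_aa.

    Coordinates are 0-based, end-exclusive, and relative to the strand that
    was scanned (for "-", relative to the reverse complement).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    length_aa = (i - start) // 3
                    if length_aa >= min_aa:
                        orfs.append((strand, frame, start, i + 3, length_aa))
                    start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG plus 32 alanine codons, then a stop (33 aa in total).
    toy = "ATG" + "GCT" * 32 + "TAA"
    for orf in find_orfs(toy):
        print(orf)
```

In the study itself, ORFs collected this way would then be compared across genomes and filtered by taxonomic diversity; that comparison step is not sketched here because the abstract does not give its parameters.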