Automatic assignment of prokaryotic genes to functional categories using literature profiling
- PMID: 23077617
- PMCID: PMC3471813
- DOI: 10.1371/journal.pone.0047436
Abstract
In recent years, there has been an exponential increase in the number of publicly available genomes. Once finished, most genome projects lack financial support to review annotations. A few of these gene annotations are based on a combination of bioinformatics evidence; however, in most cases, annotations are based solely on sequence similarity to a previously known gene, which was most probably annotated in the same way. As a result, a large number of predicted genes remain unassigned to any functional category even though there is enough evidence in the literature to predict their function. We developed a classifier trained with term-frequency vectors automatically extracted from text corpora of an ensemble of genes representative of each functional category of the J. Craig Venter Institute Comprehensive Microbial Resource (JCVI-CMR) ontology. The classifier achieved up to 84% precision with 68% recall (for confidence ≥ 0.4), F-measure 0.76 (recall and precision equally weighted), in an independent set of 2,220 genes from 13 bacterial species, previously classified by JCVI-CMR into unambiguous categories of its ontology. Finally, the classifier assigned (confidence ≥ 0.7) to functional categories a total of 5,235 of the ~24,000 genes previously in the categories "Unknown function" or "Unclassified" for which there is literature in MEDLINE. Two biologists reviewed the literature of 100 of these genes, randomly picked, and assigned them to the same functional categories predicted by the automatic classifier. Our results confirmed the hypothesis that it is possible to confidently assign genes of a real-world repository to functional categories, based exclusively on the automatic profiling of their associated literature. The LitProf--Gene Classifier web server is accessible at: www.cebio.org/litprofGC.
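The idea of classifying genes from term-frequency vectors with a confidence cutoff can be illustrated with a minimal sketch. The abstract does not specify the learning algorithm or how confidence is computed, so everything below is an assumption made for illustration: a nearest-profile classifier using cosine similarity, a confidence defined as the best category's share of the total similarity, toy category corpora, and the hypothetical function names term_frequencies, cosine, train, and classify. It is not the LitProf implementation.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Return a relative term-frequency vector (term -> frequency).

    Tokenization is deliberately naive: lowercase, split on whitespace,
    and keep purely alphabetic tokens only.
    """
    tokens = [t for t in text.lower().split() if t.isalpha()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(category_texts):
    """Build one term-frequency profile per category.

    category_texts maps a category name to a list of literature snippets
    for genes already assigned to that category.
    """
    return {
        category: term_frequencies(" ".join(texts))
        for category, texts in category_texts.items()
    }


def classify(text, profiles, min_confidence=0.4):
    """Return (category, confidence); category is None below the cutoff.

    ASSUMPTION: confidence is the best category's share of the summed
    similarities. The source does not define its confidence score.
    """
    vector = term_frequencies(text)
    scores = {cat: cosine(vector, prof) for cat, prof in profiles.items()}
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    if confidence >= min_confidence:
        return best, confidence
    return None, confidence


# Toy training corpora (invented for illustration only).
profiles = train({
    "Transport": [
        "membrane transporter permease uptake",
        "abc transporter substrate uptake",
    ],
    "DNA metabolism": [
        "dna replication polymerase repair",
        "recombination repair helicase dna",
    ],
})

assigned = classify("putative abc transporter permease", profiles)
unassigned = classify("hypothetical protein", profiles)
```

In this sketch, raising min_confidence (for example from 0.4 to 0.7) leaves more genes unassigned in exchange for fewer wrong assignments, which mirrors the precision and recall trade-off the abstract reports at its two confidence thresholds.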