PLoS One. 2012;7(10):e47436.
doi: 10.1371/journal.pone.0047436. Epub 2012 Oct 15.

Automatic assignment of prokaryotic genes to functional categories using literature profiling

Raul Torrieri et al. PLoS One. 2012.

Abstract

In recent years, the number of publicly available genomes has increased exponentially. Once finished, most genome projects lack the financial support to review their annotations. A few gene annotations are based on a combination of bioinformatics evidence; in most cases, however, annotations are based solely on sequence similarity to a previously known gene, which was most probably annotated in the same way. As a result, a large number of predicted genes remain unassigned to any functional category even though there is enough evidence in the literature to predict their function. We developed a classifier trained with term-frequency vectors automatically extracted from the text corpora of an ensemble of genes representative of each functional category of the J. Craig Venter Institute Comprehensive Microbial Resource (JCVI-CMR) ontology. The classifier achieved up to 84% precision with 68% recall (for confidence ≥ 0.4) and an F-measure of 0.76 (recall and precision equally weighted) on an independent set of 2,220 genes from 13 bacterial species, previously classified by JCVI-CMR into unambiguous categories of its ontology. Finally, the classifier assigned (confidence ≥ 0.7) to functional categories a total of 5,235 of the ∼24 thousand genes previously in the categories "Unknown function" or "Unclassified" for which there is literature in MEDLINE. Two biologists reviewed the literature of 100 of these genes, randomly picked, and assigned them to the same functional categories predicted by the automatic classifier. Our results confirmed the hypothesis that it is possible to confidently assign genes of a real-world repository to functional categories based exclusively on the automatic profiling of their associated literature. The LitProf--Gene Classifier web server is accessible at: www.cebio.org/litprofGC.
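The general idea of literature profiling with a confidence cutoff can be illustrated with a minimal sketch. This is not the LitProf implementation: the abstract does not state the learning algorithm, so the sketch assumes a simple nearest-profile classifier with cosine similarity, and defines confidence as the winning category's share of the total similarity. All function names and the toy corpora are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Return a relative term-frequency vector (term -> fraction of tokens)."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {term: n / total for term, n in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def train(corpora_by_category):
    """Build one term-frequency profile per category from its genes' literature."""
    return {cat: term_frequencies(" ".join(texts))
            for cat, texts in corpora_by_category.items()}

def classify(profiles, text, min_confidence=0.4):
    """Return (category or None, confidence); None means 'left unassigned'.

    Assumption: confidence is the best score divided by the sum of all scores.
    """
    vec = term_frequencies(text)
    scores = {cat: cosine(vec, prof) for cat, prof in profiles.items()}
    total = sum(scores.values())
    if total == 0.0:
        return None, 0.0
    best = max(scores, key=scores.get)
    confidence = scores[best] / total
    return (best if confidence >= min_confidence else None), confidence

# Toy corpora (invented for illustration, not taken from JCVI-CMR or MEDLINE).
training = {
    "Transport": ["membrane transporter permease uptake substrate",
                  "abc transporter membrane import"],
    "DNA metabolism": ["dna replication polymerase repair helicase",
                       "recombination dna repair nuclease"],
}
profiles = train(training)
category, confidence = classify(
    profiles, "putative membrane permease involved in sugar uptake")
```

Raising `min_confidence` (for example from 0.4 to 0.7) leaves more genes unassigned in exchange for fewer wrong assignments, which is the trade-off the two thresholds in the abstract reflect.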

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Recall vs. precision of the classifier.
The red line represents the average performance of the initial classifier trained with the original categories of the JCVI-CMR ontology. The blue line represents the average performance of the final classifier trained with a rearranged version of the ontology in which noisy subcategories were merged to create the Mix Category. For the red and blue lines, the average was calculated from 100 replicates of 10-fold cross-validation. The green line represents the performance of the final classifier on an independent gene set. Horizontal bars represent the standard deviations of recall. The dashed lines represent the standard deviation of precision for the blue curve.
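Each point on a recall vs. precision curve of this kind corresponds to one confidence cutoff. The sketch below shows how such points and the F-measure could be computed. It assumes that precision is the fraction of assigned genes that are correct and that recall is the fraction of all genes in the set that are correctly assigned; the source does not spell out its definitions. The function names and the toy predictions are hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; beta=1.0 weights recall and precision equally."""
    denom = beta * beta * precision + recall
    if denom == 0.0:
        return 0.0
    return (1.0 + beta * beta) * precision * recall / denom

def precision_recall(predictions, threshold):
    """predictions: list of (confidence, predicted_category, true_category).

    Genes whose confidence falls below the threshold are left unassigned.
    """
    assigned = [(p, t) for c, p, t in predictions if c >= threshold]
    correct = sum(1 for p, t in assigned if p == t)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    return precision, recall

# Toy predictions (invented for illustration).
toy = [
    (0.9, "A", "A"),
    (0.8, "B", "B"),
    (0.6, "A", "B"),
    (0.5, "B", "B"),
    (0.2, "A", "B"),
]

# One (threshold, precision, recall, F) point per cutoff.
curve = []
for threshold in (0.4, 0.7):
    p, r = precision_recall(toy, threshold)
    curve.append((threshold, p, r, f_measure(p, r)))
```

On the toy data, the stricter cutoff raises precision and lowers recall. Applied to the rounded figures in the abstract, `f_measure(0.84, 0.68)` gives about 0.75; the reported 0.76 presumably reflects unrounded precision and recall.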
