Automatic document classification of biological literature

David Chen¹, Hans-Michael Müller, Paul W Sternberg

PMID: 16893465
PMCID: PMC1559726
DOI: 10.1186/1471-2105-7-370

David Chen et al. BMC Bioinformatics. 2006 Aug 7;7:370.

Affiliation

¹ Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, USA. davidc@caltech.edu

Abstract

Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.

Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first applies a classifier trained with a support vector machine (SVM), then a novel phrase-based clustering algorithm. The clustering step autonomously creates cluster labels that are descriptive and understandable by humans. On a standard test set (Reuters 21578), the clustering engine outperformed previously published results (F-value of 0.55 vs. 0.49) while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
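
The idea of phrase-based clustering with self-describing labels can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given in the abstract; it assumes that a "phrase" is a word bigram and that a cluster is any phrase shared by at least two documents. The names `word_ngrams`, `phrase_clusters`, and the sample `docs` are hypothetical.

```python
from collections import defaultdict


def word_ngrams(text, n=2):
    """Return the set of lowercase word n-grams found in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def phrase_clusters(docs, n=2, min_docs=2):
    """Group document ids by shared phrases.

    Assumption for illustration: the shared phrase itself serves as the
    human-readable cluster label, and a phrase forms a cluster only if
    it occurs in at least `min_docs` documents.
    """
    members = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in word_ngrams(text, n):
            members[phrase].add(doc_id)
    return {
        phrase: sorted(ids)
        for phrase, ids in members.items()
        if len(ids) >= min_docs
    }


# Hypothetical toy corpus; real input would be full-text papers.
docs = {
    "paper1": "sex determination in the hermaphrodite germline",
    "paper2": "genes controlling sex determination",
    "paper3": "vulval induction signaling",
}

clusters = phrase_clusters(docs)
for label, ids in clusters.items():
    print(label, ids)
```

In this toy corpus only the phrase "sex determination" is shared, so it becomes the label of a cluster containing `paper1` and `paper2`, while `paper3` stays unclustered. In a two-step setup like the one described, such grouping would run within each category produced by the SVM step.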

Conclusion: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

Figures

**Figure 1**
**An example of the clustering results from the Sex Determination category**. An intuitive interface allows users to quickly locate the topic of interest. The topics listed were generated automatically during the phrase-based clustering step.

**Figure 2**
**Overview of the classification process**. Full-text papers are taken from the Textpresso corpus and processed via SVM and phrase-based clustering. The end result is a large set of HTML files displaying the paper taxonomy.
