BMC Bioinformatics. 2011 Dec 19:12:482.

doi: 10.1186/1471-2105-12-482.

Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation

Emily Seymour¹, Rohini Damle, Alessandro Sette, Bjoern Peters

PMID: 22182279
PMCID: PMC3314711
DOI: 10.1186/1471-2105-12-482

Emily Seymour et al. BMC Bioinformatics. 2011.

Affiliation

¹ The La Jolla Institute for Allergy and Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA.

Abstract

Background: The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed, and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently manually classified into categories corresponding to different disease domains. Over the past four years, we have examined how to further improve this approach in order to enhance classification performance and to reduce the need for manual intervention.

Results: Utilizing 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that an SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions, with AUCs of 0.899 and 0.854, respectively. Next, we trained SVM classifiers on 22,833 curatable abstracts that had been manually classified into three levels of disease-specific categories, and demonstrated that a hierarchical application of these classifiers outperformed a non-hierarchical one for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, cost sensitivity functions were developed to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved overall category prediction accuracies of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively.

Conclusions: A hierarchical application of SVM algorithms with cost sensitive output weighting enabled high quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema, and the datasets we make available provide large-scale, real-life benchmark sets for method developers.
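The idea behind cost sensitive output weighting can be illustrated with a minimal sketch. The abstract does not give the actual cost functions, so the three-category subset, the probabilities, and every cost value below are hypothetical. The sketch only shows the general technique of choosing the category with the lowest expected misclassification cost instead of the highest score.

```python
# Hypothetical sketch of cost-sensitive category selection.
# All probabilities and cost values are illustrative assumptions,
# not the cost sensitivity functions used in the paper.

CATEGORIES = ["Allergy", "Cancer", "Other"]

# COST[true][predicted]: penalty for predicting `predicted` when the
# reference really belongs to `true`. Sending a high-priority Allergy
# reference to a low-priority category is assumed to be the serious error.
COST = {
    "Allergy": {"Allergy": 0.0, "Cancer": 5.0, "Other": 5.0},
    "Cancer":  {"Allergy": 1.0, "Cancer": 0.0, "Other": 1.0},
    "Other":   {"Allergy": 1.0, "Cancer": 1.0, "Other": 0.0},
}


def expected_cost(probs, predicted):
    """Expected penalty of predicting `predicted` given class probabilities."""
    return sum(probs[true] * COST[true][predicted] for true in CATEGORIES)


def cost_sensitive_choice(probs):
    """Pick the category that minimizes the expected penalty."""
    return min(CATEGORIES, key=lambda c: expected_cost(probs, c))


# Hypothetical classifier output for one abstract.
probs = {"Allergy": 0.35, "Cancer": 0.45, "Other": 0.20}

plain_choice = max(CATEGORIES, key=probs.get)   # highest probability wins
weighted_choice = cost_sensitive_choice(probs)  # lowest expected cost wins
```

With these assumed numbers the highest-probability category is Cancer, while the cost-weighted choice is Allergy, because missing a high-priority reference is penalized more heavily than a false alarm.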

Figures

**Figure 1**
**IEDB PubMed abstract triaging process**. Abstracts of references retrieved from the PubMed queries that have not been introduced to the IEDB's database and curation pipeline proceed to at least one of four hierarchical levels of classification. At Level 0, an abstract is evaluated for epitope-specific content. Abstracts that contain epitope-specific data are assigned to one of the seven Level 1 categories. References receive increasingly specific category assignments at Levels 2 and 3. High IEDB priority categories are Allergy, Autoimmunity, Infectious Disease, and Transplantation. Low IEDB priority categories are Cancer, HIV, and Other. Transplantation and Cancer references are not assigned Level 2 categories. HIV references do not receive Level 2 or 3 category assignments.
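The routing rules in this caption can be sketched as a small lookup table plus a traversal function. This is only an illustration: the classifier callables are hypothetical stand-ins for trained models, and the levels assigned to the Other category are an assumption, since the caption does not state them.

```python
# Sketch of the level routing described in the Figure 1 caption.
# Only the routing rules come from the caption; the classifier
# callables passed to triage() are hypothetical placeholders.

LEVELS_BY_CATEGORY = {
    "Allergy": (2, 3),
    "Autoimmunity": (2, 3),
    "Infectious Disease": (2, 3),
    "Transplantation": (3,),  # no Level 2 category
    "Cancer": (3,),           # no Level 2 category
    "HIV": (),                # no Level 2 or Level 3 category
    "Other": (2, 3),          # assumption: not stated in the caption
}


def triage(abstract, is_curatable, classify_level):
    """Return a dict mapping each visited level to its assignment."""
    if not is_curatable(abstract):  # Level 0: epitope-specific content?
        return {0: "uncuratable"}
    result = {0: "curatable"}
    result[1] = classify_level(abstract, 1, None)
    parent = result[1]
    for level in LEVELS_BY_CATEGORY[result[1]]:
        result[level] = classify_level(abstract, level, parent)
        parent = result[level]
    return result


# Toy stand-ins so the sketch runs end to end (keyword rules, not SVMs).
def demo_is_curatable(abstract):
    return "epitope" in abstract.lower()


def demo_classify_level(abstract, level, parent):
    if level == 1:
        return "HIV" if "hiv" in abstract.lower() else "Cancer"
    return f"{parent} / level-{level} subcategory"


hiv_result = triage("HIV epitope mapping", demo_is_curatable, demo_classify_level)
cancer_result = triage("Tumor epitope study", demo_is_curatable, demo_classify_level)
skipped = triage("Unrelated chemistry paper", demo_is_curatable, demo_classify_level)
```

In this toy run the HIV abstract stops after Level 1, the Cancer abstract skips Level 2 and receives a Level 3 assignment, and the unrelated abstract is rejected at Level 0.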

**Figure 2**
**Comparison of Naïve Bayes and SVM algorithms at training Level 0**. The performance of the Naïve Bayes and SVM classifiers was evaluated with 10-fold cross-validation. As is shown in the ROC curve, the SVM classifier outperformed the Naïve Bayes classifier on curatability predictions for the cross-validation dataset of 89,884 abstracts. The AUC value for the SVM classifier was 0.899 and the AUC value for the Naïve Bayes classifier was 0.854. At the 5% false negative rate for the curatability decision, the SVM classifier had a true positive rate of 41.4% and the Naïve Bayes classifier had a true positive rate of 33.5%.
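For readers unfamiliar with the metrics in this caption, the following minimal sketch shows one common way to compute an AUC and to form 10-fold cross-validation splits. The scores are made-up toy values, the function names are hypothetical, and nothing here reproduces the paper's own evaluation code.

```python
# Toy illustration of AUC and 10-fold splitting; all data are invented.

def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly.

    Equivalent to the area under the ROC curve; ties count as half.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def ten_fold_indices(n_items, n_folds=10):
    """Yield (train, test) index lists, placing item i in fold i % n_folds.

    A real evaluation would normally shuffle the items first.
    """
    for fold in range(n_folds):
        test = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test


# Hypothetical classifier scores for curatable and uncuratable abstracts.
curatable_scores = [0.9, 0.8, 0.4]
uncuratable_scores = [0.7, 0.3, 0.2, 0.1]
toy_auc = auc(curatable_scores, uncuratable_scores)  # 11 of 12 pairs ordered correctly

folds = list(ten_fold_indices(25))
```

The pairwise formulation is quadratic in the number of abstracts, so it is suitable only for small examples; rank-based implementations are the usual choice for datasets of the size reported here.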

Grants and funding

HHSN266200400006C/PHS HHS/United States
