Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation
- PMID: 22182279
- PMCID: PMC3314711
- DOI: 10.1186/1471-2105-12-482
Abstract
Background: The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed, and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently manually classified into categories corresponding to different disease domains. Over the past four years, we have examined how to further improve this approach in order to enhance classification performance and to reduce the need for manual intervention.
Results: Using 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that an SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions, with AUCs of 0.899 and 0.854, respectively. Next, we compared a non-hierarchical and a hierarchical application of SVM classifiers, both trained on 22,833 curatable abstracts manually classified into three levels of disease-specific categories, and found that the hierarchical application outperformed the non-hierarchical one for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, cost sensitivity functions were developed to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved overall category prediction accuracies of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively.
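The AUC figures quoted above compare how well each classifier ranks curatable abstracts ahead of uncuratable ones. As a minimal sketch of that metric (not the authors' evaluation code), the function below computes AUC as the fraction of positive/negative pairs that are ranked correctly, counting ties as half. The scores and labels are made-up toy values, and the names `auc`, `scores_a`, and `scores_b` are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed pairwise.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed, not from the paper): 1 = curatable, 0 = uncuratable.
labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # hypothetical classifier A
scores_b = [0.9, 0.4, 0.7, 0.6, 0.3, 0.5]  # hypothetical classifier B

print(f"AUC A: {auc(scores_a, labels):.3f}")
print(f"AUC B: {auc(scores_b, labels):.3f}")
```

The pairwise form is quadratic in the number of examples, so for a collection the size of the one described here a rank-based implementation from an established library would be the practical choice.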
Conclusions: A hierarchical application of SVM algorithms with cost-sensitive output weighting enabled high-quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema, and the datasets we make available provide large-scale, real-life benchmark sets for method developers.
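The abstract does not specify the cost sensitivity functions, so the following is only an illustrative sketch of the general idea, under stated assumptions: each category is a path in a hierarchy, the cost of an error grows the higher up the two paths diverge, and the predicted category is the one with the lowest expected cost rather than the highest score. The category names, the probabilities, the two-level depth, and the functions `tree_cost` and `min_expected_cost` are all hypothetical.

```python
def tree_cost(true_path, pred_path):
    """Assumed cost: number of levels below the point where the paths diverge.

    Identical paths cost 0; paths that differ at the top level cost the
    full depth, so errors across top-level categories are the most serious.
    """
    shared = 0
    for t, p in zip(true_path, pred_path):
        if t != p:
            break
        shared += 1
    return len(true_path) - shared


def min_expected_cost(probs):
    """Pick the category path with the lowest expected misclassification cost.

    `probs` maps each category path (a tuple) to its estimated probability.
    """
    def expected_cost(candidate):
        return sum(p * tree_cost(true, candidate) for true, p in probs.items())

    return min(probs, key=expected_cost)


# Hypothetical classifier output for one abstract.
probs = {
    ("Infectious", "Virus"): 0.35,
    ("Infectious", "Bacteria"): 0.25,
    ("Allergy", "Food"): 0.40,
}

most_probable = max(probs, key=probs.get)
least_costly = min_expected_cost(probs)
print("highest score:       ", most_probable)
print("lowest expected cost:", least_costly)
```

In this toy case the single most probable category is under "Allergy", but most of the probability mass sits under "Infectious", so the cost-weighted choice stays within that branch and avoids the more serious top-level error.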
Similar articles
- Automating document classification for the Immune Epitope Database. BMC Bioinformatics. 2007 Jul 26;8:269. doi: 10.1186/1471-2105-8-269. PMID: 17655769. Free PMC article.
- Classifying injury narratives of large administrative databases for surveillance-A practical approach combining machine learning ensembles and human review. Accid Anal Prev. 2017 Jan;98:359-371. doi: 10.1016/j.aap.2016.10.014. Epub 2016 Nov 15. PMID: 27863339.
- Prediction of heart disease and classifiers' sensitivity analysis. BMC Bioinformatics. 2020 Jul 2;21(1):278. doi: 10.1186/s12859-020-03626-y. PMID: 32615980. Free PMC article.
- The Immune Epitope Database and Analysis Resource Program 2003-2018: reflections and outlook. Immunogenetics. 2020 Feb;72(1-2):57-76. doi: 10.1007/s00251-019-01137-6. Epub 2019 Nov 25. PMID: 31761977. Free PMC article. Review.
- Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets. J Theor Biol. 2017 Dec 21;435:208-217. doi: 10.1016/j.jtbi.2017.09.018. Epub 2017 Sep 20. PMID: 28941868. Review.
Cited by
- The Cancer Epitope Database and Analysis Resource: A Blueprint for the Establishment of a New Bioinformatics Resource for Use by the Cancer Immunology Community. Front Immunol. 2021 Aug 24;12:735609. doi: 10.3389/fimmu.2021.735609. eCollection 2021. PMID: 34504503. Free PMC article.
- TANTIGEN: a comprehensive database of tumor T cell antigens. Cancer Immunol Immunother. 2017 Jun;66(6):731-735. doi: 10.1007/s00262-017-1978-y. Epub 2017 Mar 9. PMID: 28280852. Free PMC article.
- Automatic Generation of Validated Specific Epitope Sets. J Immunol Res. 2015;2015:763461. doi: 10.1155/2015/763461. Epub 2015 Oct 19. PMID: 26568965. Free PMC article.
- A behind-the-scenes tour of the IEDB curation process: an optimized process empirically integrating automation and human curation efforts. Immunology. 2020 Oct;161(2):139-147. doi: 10.1111/imm.13234. Epub 2020 Jul 26. PMID: 32615639. Free PMC article.
- The Immune Epitope Database: How Data Are Entered and Retrieved. J Immunol Res. 2017;2017:5974574. doi: 10.1155/2017/5974574. Epub 2017 May 29. PMID: 28634590. Free PMC article.