BMC Bioinformatics. 2007 Jul 26:8:269.
doi: 10.1186/1471-2105-8-269.

Automating document classification for the Immune Epitope Database

Peng Wang et al. BMC Bioinformatics.

Abstract

Background: The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. Like similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose.

Results: We report our experience in automating this process using Naïve Bayes classifiers trained on 20,910 abstracts classified by domain experts. We improved on the basic classifier's performance by a) utilizing information stored in PubMed beyond the abstract itself, b) applying standard feature selection criteria, and c) extracting domain-specific feature patterns that, for example, identify peptide sequences. We have integrated the classifier into the curation process, where it determines whether abstracts are clearly relevant, clearly irrelevant, or cannot be classified with certainty, in which case they are classified manually. Testing this classification scheme on an independent dataset, we achieve 95% sensitivity and specificity in the 51.1% of abstracts that were automatically classified.
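
The three-way decision described above (clearly relevant, clearly irrelevant, or sent to manual review) can be sketched with a small multinomial Naïve Bayes model. This is a minimal illustration, not the authors' implementation: the function names, the add-one smoothing, and the `low`/`high` probability cutoffs are assumptions chosen for the example, and the abstract does not state the actual cutoffs.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model.

    docs: list of token lists; labels: 1 = relevant, 0 = irrelevant.
    Assumes both classes occur at least once in the training data.
    """
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for tokens, y in zip(docs, labels):
        counts[y].update(tokens)
    vocab = set(counts[0]) | set(counts[1])
    totals = {y: sum(counts[y].values()) for y in (0, 1)}
    return {"counts": counts, "totals": totals, "vocab": vocab, "n_docs": n_docs}


def prob_relevant(model, tokens):
    """Posterior probability that a tokenised abstract is relevant."""
    v = len(model["vocab"])
    n = sum(model["n_docs"].values())
    log_post = {}
    for y in (0, 1):
        lp = math.log(model["n_docs"][y] / n)  # class prior
        for t in tokens:
            if t in model["vocab"]:  # skip words never seen in training
                # Add-one (Laplace) smoothing: an assumption of this sketch.
                lp += math.log((model["counts"][y][t] + 1) / (model["totals"][y] + v))
        log_post[y] = lp
    # Normalise in log space to avoid numerical underflow.
    m = max(log_post.values())
    e0 = math.exp(log_post[0] - m)
    e1 = math.exp(log_post[1] - m)
    return e1 / (e0 + e1)


def triage(p, low=0.1, high=0.9):
    """Map a probability to a decision; the cutoffs are hypothetical."""
    if p >= high:
        return "relevant"
    if p <= low:
        return "irrelevant"
    return "manual"
```

Widening the gap between `low` and `high` sends more abstracts to manual review and raises accuracy on the automatically classified remainder, which is the trade-off behind the reported 51.1% automation rate.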

Conclusion: By implementing text classification, we have sped up the reference selection process without sacrificing the sensitivity or specificity of the human expert classification. This study provides both practical recommendations for users of text classification tools and a large dataset that can serve as a benchmark for tool developers.

Figures

Figure 1
Comparison of popular text classification algorithms. The performance of four well-known text classification algorithms was evaluated on our dataset via 10-fold cross-validation. The ROC curves show that the Naïve Bayes classifier performs best on our dataset. The AUC values for the four classifiers are: Naïve Bayes classifier, 0.838; neural network, 0.831; support vector machine, 0.825; decision tree, 0.809.
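
For readers unfamiliar with the evaluation in Figure 1, the sketch below shows one way to compute an AUC and to split a dataset into cross-validation folds. It is an illustration under stated assumptions: `roc_auc` and `kfold_indices` are hypothetical helper names, the folds are contiguous rather than shuffled, and the pairwise AUC is quadratic in the number of documents, so it suits small examples only.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n).

    Fold sizes differ by at most one when n is not divisible by k.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

In a cross-validation run, each classifier would be trained on `train` and scored on `test` for every fold, with `roc_auc` computed per fold or on the pooled scores.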
Figure 2
Comparison of Naïve Bayes classifier performance in cross-validation. AUCs of Naïve Bayes classifiers incorporating various dimensionality reduction techniques were compared side by side in each round of the 10-fold cross-validation. Abstract: AUC of the classifier trained on the raw words of abstracts. PubMed: AUC of the classifier trained on raw words in the abstract, MeSH heading, title, author, etc. PubMed+FS: AUC of the classifier trained on a subset of raw words selected from the abstract, MeSH heading, title, author, etc., using a combined cutoff of IG > 2.00e-05 and DF > 3. PubMed+FS+FE: AUC of the classifier trained on a subset of features generated from raw words in the abstract, MeSH heading, title, author, etc., by first applying feature extraction and then feature selection with the same combined cutoff of IG > 2.00e-05 and DF > 3.
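
The feature extraction (FE) step is described only as finding domain-specific patterns such as peptide sequences. One plausible reading, sketched below, collapses every token that looks like a peptide into a single placeholder feature. The regular expression, the minimum length of 8 residues, and the `peptide_seq` placeholder are all assumptions made for illustration; the source does not specify the actual patterns.

```python
import re

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical pattern: an all-caps run of at least 8 amino-acid letters.
PEPTIDE_RE = re.compile(r"\b[" + AMINO_ACIDS + r"]{8,}\b")

PLACEHOLDER = "peptide_seq"


def extract_features(text):
    """Replace peptide-like tokens with one shared feature, then tokenise."""
    text = PEPTIDE_RE.sub(PLACEHOLDER, text)
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9_]+", text)]
```

Mapping many distinct sequences to one feature lets the classifier learn that "contains a peptide" is informative even when each individual sequence is rare. A pattern this simple will also match other all-caps strings built from the same letters, such as nucleotide sequences, so a production version would need stricter rules.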
Figure 3
Effects of feature selection on Naïve Bayes classifier performance. The performance of the Naïve Bayes classifier (measured as AUC) is plotted against the number of features used in training. Feature selection based on IG (information gain) and on DF (document frequency) has a similar effect on classifier performance. Reducing the number of features to the top 20,000 by either measure leads to a small increase in performance. Using even fewer features decreases performance, but notably the top 100 features in terms of information gain are sufficient to reach an AUC of 0.82.
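
The two selection criteria can be made concrete with a short sketch. It treats each term as a binary feature (present or absent in a document) and keeps terms that pass both cutoffs quoted in Figure 2 (IG > 2.00e-05 and DF > 3). The function names are hypothetical, and the use of base-2 logarithms is an assumption, since the captions do not state the unit of IG.

```python
import math
from collections import Counter


def entropy(pos, neg):
    """Binary entropy in bits for a group with pos/neg members (0 if empty)."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h


def select_features(docs, labels, min_ig=2.0e-5, min_df=3):
    """Return {term: information gain} for terms with IG > min_ig and DF > min_df.

    docs: list of token lists; labels: 1 = relevant, 0 = irrelevant.
    """
    n = len(docs)
    n_pos = sum(labels)
    df = Counter()      # number of documents containing the term
    df_pos = Counter()  # number of relevant documents containing the term
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == 1:
                df_pos[t] += 1
    h_class = entropy(n_pos, n - n_pos)
    selected = {}
    for t, d in df.items():
        with_pos = df_pos[t]
        with_neg = d - with_pos
        without_pos = n_pos - with_pos
        without_neg = (n - n_pos) - with_neg
        # Class entropy remaining after splitting on presence of the term.
        h_cond = (d / n) * entropy(with_pos, with_neg) \
            + ((n - d) / n) * entropy(without_pos, without_neg)
        ig = h_class - h_cond
        if ig > min_ig and d > min_df:
            selected[t] = ig
    return selected
```

A term that appears in every document has zero information gain and is dropped, while a term that appears in only a handful of documents fails the document frequency cutoff regardless of how well it separates the classes.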
Figure 4
ROC and precision-recall curves of malaria abstracts. Newly acquired malaria abstracts were classified with the Naïve Bayes classifier trained on all previously expert-classified abstracts. Figure 4a shows the ROC curve; the horizontal line is the cutoff for "irrelevant" abstracts and the vertical line is the cutoff for "relevant" abstracts. Figure 4b shows the precision-recall curve: at 95% precision, we achieved a recall of 36.4%.
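
A figure such as "recall at 95% precision" can be read off a ranked list of classifier scores. The sketch below is one way to do that, assuming higher scores mean "more likely relevant"; `recall_at_precision` is a hypothetical helper name, and documents with tied scores are admitted together so that the result does not depend on sort order.

```python
def recall_at_precision(scores, labels, min_precision=0.95):
    """Highest recall reachable by any score cutoff whose precision is at
    least min_precision. Returns 0.0 if no cutoff qualifies."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Admit every document tied at the current score as one step.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if tp / (tp + fp) >= min_precision:
            best = max(best, tp / n_pos)
        i = j
    return best
```

Lowering `min_precision` can only keep or raise the returned recall, which mirrors moving along the precision-recall curve in Figure 4b.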

