A High Recall Classifier for Selecting Articles for MEDLINE Indexing

Alastair R Rae¹, Max E Savery¹, James G Mork¹, Dina Demner-Fushman¹

PMID: 32308868
PMCID: PMC7153058

Alastair R Rae et al. AMIA Annu Symp Proc. 2020 Mar 4;2019:727-734. eCollection 2019.

Affiliation

¹ Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD.

Abstract

MEDLINE is the National Library of Medicine's premier bibliographic database for biomedical literature. A highly valuable feature of the database is that each record is manually indexed with a controlled vocabulary called MeSH. Most MEDLINE journals are indexed cover-to-cover, but about 200 journals are selectively indexed: only their articles related to biomedicine and the life sciences are indexed. In recent years, the selection process has become an increasing burden for indexing staff. This paper presents a machine-learning-based system that offers substantial time savings by semi-automating the task. At the core of the system is a high-recall classifier that identifies journal articles that are in scope for MEDLINE. The system is shown to reduce the number of articles requiring manual review by 54%, equivalent to approximately 40,000 articles per year.
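
To make the idea of a high-recall operating point concrete, the following is a minimal sketch, not the authors' implementation. It assumes a classifier that outputs one in-scope score per article and a simple scheme in which articles scoring below a threshold are skipped from manual review; the function names, the toy scores, and the single-threshold scheme are all hypothetical.

```python
import math
from typing import List, Tuple


def threshold_for_recall(scores: List[float], labels: List[int],
                         target_recall: float) -> float:
    """Return the highest threshold whose recall is at least target_recall.

    An article is predicted in-scope when its score >= threshold.
    Labels are 1 for in-scope articles and 0 for out-of-scope articles.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                       reverse=True)
    if not positives:
        raise ValueError("need at least one in-scope example")
    # Number of in-scope articles that must stay above the threshold.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def review_reduction(scores: List[float], labels: List[int],
                     threshold: float) -> Tuple[float, float]:
    """Return (recall, fraction of articles skipped from manual review)."""
    kept_positives = sum(1 for s, y in zip(scores, labels)
                         if y == 1 and s >= threshold)
    total_positives = sum(labels)
    skipped = sum(1 for s in scores if s < threshold)
    return kept_positives / total_positives, skipped / len(scores)


# Toy data (hypothetical): ten articles, four of them in scope.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

threshold = threshold_for_recall(scores, labels, target_recall=1.0)
recall, skipped_fraction = review_reduction(scores, labels, threshold)
print(threshold, recall, skipped_fraction)
```

The threshold is chosen from the recall requirement first, and the workload reduction is whatever follows from it, which is the trade-off a high-recall classifier makes. Taking the abstract's figures at face value, a 54% reduction equal to about 40,000 articles implies roughly 40,000 / 0.54 ≈ 74,000 articles per year that would otherwise need manual review.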

Figures

**Figure 2:**
Illustration of the special encoding used for year inputs. The example shows how years between 2014 and 2018 would be encoded.

**Figure 3:**
Precision-recall curves for the ensemble of traditional machine learning algorithms, the CNN, and the combined model: (a) full plot; (b) zoomed-in plot showing precision at high recall.

**Figure 4:**
Precision-recall curves for combined model by journal group.

**Figure 5:**
Fraction of indexed articles from selectively indexed journals against publication year. Shows the actual fraction and the fraction predicted by the CNN model.
