Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach

Byron C Wallace et al. J Am Med Inform Assoc. 2017 Nov 1;24(6):1165-1168. doi: 10.1093/jamia/ocx053.

Abstract

Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML.

Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise.
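As a concrete illustration of this triage strategy, the sketch below pairs a probabilistic text classifier with a confidence threshold, auto-excluding low-scoring citations and deferring the rest to crowdworkers. The TF-IDF + logistic regression pipeline, the toy citations, and the threshold value are placeholder assumptions, not the authors' implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (invented for illustration): citation text plus a
# 1/0 label indicating whether it describes an RCT.
train_texts = [
    "A randomized controlled trial of drug X versus placebo",
    "Double-blind randomised trial of intervention Z in adults",
    "A retrospective cohort study of outcome Y",
]
train_labels = [1, 1, 0]

vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(train_texts), train_labels)

def triage(citations, t=0.05):
    """Auto-exclude citations the classifier scores below threshold t;
    defer everything else to crowdworkers. The value of t is a placeholder,
    not the threshold used in the paper."""
    probs = clf.predict_proba(vectorizer.transform(citations))[:, 1]
    excluded = [c for c, p in zip(citations, probs) if p < t]
    to_crowd = [c for c, p in zip(citations, probs) if p >= t]
    return excluded, to_crowd
```

Lowering t trades effort for recall: fewer citations are auto-excluded, so crowdworkers screen more, but fewer true RCTs are missed.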

Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone.

Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.

Keywords: crowdsourcing; evidence-based medicine; human computation; machine learning; natural language processing.


Figures

Figure 1. Left: Receiver operating characteristic (ROC) curve showing the performance of our RCT classifier, trained on a subset of the Embase dataset. Right: ROC curve showing the performance of our pretrained RCT classifier on the entire Embase dataset.
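For readers who want to reproduce this kind of evaluation, an ROC curve can be computed as in the sketch below. This assumes scikit-learn, with synthetic data standing in for the Embase citations and the paper's actual classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, class-imbalanced data: most citations are not RCTs.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Each (fpr, tpr) pair is one point on the ROC curve, traced out by
# sweeping the decision threshold over the classifier's scores.
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```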
Figure 2. A scatterplot of recall vs (simulated) total expended effort for varying values of the confidence threshold t. As noted in the text, effort is modeled as unit costs, where 1 novice screening decision = 1 unit, 1 expert decision = 2 units, and 1 resolver decision = 4 units.
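A worked instance of this unit-cost effort model follows; the decision counts are invented for illustration and are not reported in the paper:

```python
# Unit costs from the Figure 2 caption.
NOVICE_COST, EXPERT_COST, RESOLVER_COST = 1, 2, 4

def total_effort(n_novice: int, n_expert: int, n_resolver: int) -> int:
    """Total simulated screening effort, in units, under the caption's model."""
    return (n_novice * NOVICE_COST
            + n_expert * EXPERT_COST
            + n_resolver * RESOLVER_COST)

# Hypothetical counts: 1,000 novice, 100 expert, and 25 resolver decisions
# yield 1,000 + 200 + 100 = 1,300 units of effort.
print(total_effort(1000, 100, 25))  # -> 1300
```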
