Randomized Controlled Trial
J Biomed Inform. 2024 Apr;152:104628. doi: 10.1016/j.jbi.2024.104628. Epub 2024 Mar 26.

Automatic categorization of self-acknowledged limitations in randomized controlled trial publications


Mengfei Lan et al. J Biomed Inform. 2024 Apr.

Abstract

Objective: Acknowledging study limitations in a scientific publication is a crucial element in scientific transparency and progress. However, limitation reporting is often inadequate. Natural language processing (NLP) methods could support automated reporting checks, improving research transparency. In this study, our objective was to develop a dataset and NLP methods to detect and categorize self-acknowledged limitations (e.g., sample size, blinding) reported in randomized controlled trial (RCT) publications.

Methods: We created a data model of limitation types in RCT studies and annotated a corpus of 200 full-text RCT publications using this data model. We fine-tuned BERT-based sentence classification models to recognize limitation sentences and their types. To address the small size of the annotated corpus, we experimented with data augmentation approaches, including Easy Data Augmentation (EDA) and Prompt-Based Data Augmentation (PromDA). We applied the best-performing model to a set of about 12K RCT publications to characterize self-acknowledged limitations at a larger scale.
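EDA augments a small training set with token-level perturbations of existing sentences. A minimal stdlib-only sketch of two of its four operations (random swap and random deletion; the full method also uses WordNet-based synonym replacement and random insertion, and this is an illustration, not the authors' implementation) might look like:

```python
import random

def random_swap(tokens, n=1):
    """Swap two randomly chosen token positions, n times."""
    tokens = list(tokens)
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(list(tokens))]

def eda_augment(sentence, n_aug=4, p_del=0.1):
    """Return n_aug perturbed copies of a training sentence."""
    tokens = sentence.split()
    variants = []
    for _ in range(n_aug):
        if random.random() < 0.5:
            new_tokens = random_swap(tokens)
        else:
            new_tokens = random_deletion(tokens, p_del)
        variants.append(" ".join(new_tokens))
    return variants
```

Each augmented sentence inherits the label of its source, which is why EDA suits label-preserving tasks such as limitation sentence classification.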

Results: Our data model consists of 15 categories and 24 sub-categories (e.g., Population and its sub-category DiagnosticCriteria). We annotated 1090 instances of limitation types in 952 sentences (4.8 limitation sentences and 5.5 limitation types per article). A fine-tuned PubMedBERT model for limitation sentence classification improved upon our earlier model by about 1.5 absolute percentage points in F1 score (0.821 vs. 0.8) with statistical significance (p<.001). Our best-performing limitation type classification model, PubMedBERT fine-tuning with PromDA (Output View), achieved an F1 score of 0.7, improving upon the vanilla PubMedBERT model by 2.7 percentage points, with statistical significance (p<.001).
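The scores above are F1 values. As a reference point, a minimal sketch of per-class (binary) F1, the harmonic mean of precision and recall, is shown below; how the paper averages over limitation types is not specified in the abstract, so any macro- or micro-averaging on top of this is an assumption:

```python
def binary_f1(y_true, y_pred):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For multi-label type classification, this per-class score would typically be computed for each SAL type and then averaged.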

Conclusion: The model could support automated screening tools that journals can use to draw authors' attention to reporting issues. Automatic extraction of limitations from RCT publications could benefit peer review and evidence synthesis, and could support advanced methods for searching and aggregating evidence from the clinical trial literature.

Keywords: Large language models; Natural language processing; Randomized controlled trials; Reporting quality; Self-acknowledged limitations; Text classification.


Conflict of interest statement

Declaration of competing interest The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Halil Kilicoglu reports financial support was provided by National Library of Medicine.

Figures

Fig. 1. Overview of our soft-prompt-based data augmentation (PromDA).

Fig. 2. The sentence-level distribution of SAL types on the manually annotated dataset. Note that in some cases, the total number of fine-grained labels in a top-level category exceeds the total number for the top-level category, because the same sentence could be labeled with a top-level category as well as a fine-grained label belonging to the same top-level category (e.g., 10.6% + 7.5% > 17% for the UnderpoweredStudy category). The document-level distribution of SAL types on this dataset is provided in Appendix F.

Fig. 3. Document-level distribution of SAL types on the large-scale RCT dataset. The x-axis shows the number of articles that contain a specific SAL type. The sentence-level distribution of SAL types on this dataset is provided in Appendix G.

Fig. E.4. A sample generated by output-view.

Fig. F.5. Document-level distribution of SAL types on the manually annotated dataset. The x-axis shows the number of documents that contain a specific SAL type.

Fig. G.6. Sentence-level distribution of SAL types on the large-scale RCT dataset. The x-axis shows the total number of sentences with the SAL types in the dataset.

