Publication Type Tagging using Transformer Models and Multi-Label Classification

Joe D Menke et al. AMIA Annu Symp Proc. 2025 May 22;2024:818-827. eCollection 2024.

Abstract

Indexing articles by their publication type and study design is essential for efficient search and filtering of the biomedical literature, but is understudied compared to indexing by MeSH topical terms. In this study, we leveraged the human-curated publication types and study designs in PubMed to generate a dataset of more than 1.2M articles (titles and abstracts) and used state-of-the-art Transformer-based models for automatic tagging of publication types and study designs. Specifically, we trained PubMedBERT-based models using a multi-label classification approach, and explored undersampling, feature verbalization, and contrastive learning to improve model performance. Our results show that PubMedBERT provides a strong baseline for publication type and study design indexing; undersampling, feature verbalization, and unsupervised contrastive loss have a positive impact on performance, whereas supervised contrastive learning degrades performance. We obtained the best overall performance with 80% undersampling and feature verbalization (0.632 macro-F1, 0.969 macro-AUC). The model outperformed previous models (MultiTagger) across all metrics, and the performance difference was statistically significant (p < 0.001). Despite its stronger performance, the model still has room for improvement, and future work could explore full-text-based features as well as model interpretability. We make our data and code available at https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/AMIA.
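The multi-label setup the abstract describes (a PubMedBERT encoder with one sigmoid output per label, trained with binary cross-entropy against a multi-hot label vector) can be summarized in a short sketch. This is illustrative only; the Hugging Face model name, label count, and example input are assumptions, not values confirmed by the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint name; the paper only says "PubMedBERT-based models".
MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

class MultiLabelTagger(torch.nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        self.classifier = torch.nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token embedding
        return self.classifier(cls)         # one logit per publication type / study design

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MultiLabelTagger(num_labels=50)     # hypothetical label count
loss_fn = torch.nn.BCEWithLogitsLoss()      # independent sigmoid per label

batch = tokenizer(["Title. Abstract text ..."], return_tensors="pt",
                  truncation=True, padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
targets = torch.zeros_like(logits)          # multi-hot gold label vector (toy)
loss = loss_fn(logits, targets)
```

Unlike single-label softmax classification, each label is scored independently here, so an article can receive several publication types at once.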


Figures

Figure 1:
Flow diagram of experiments including data undersampling, feature augmentation, and contrastive learning. The dense layer used for contrastive learning experiments and the linear layer used for label predictions utilize the [CLS] token’s embedding from the last hidden state layer within PubMedBERT.
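A minimal sketch of the two heads the caption describes, both reading the [CLS] embedding from the encoder's last hidden state. The projection size, label count, and temperature are illustrative assumptions, and the unsupervised contrastive loss shown is a standard in-batch (SimCSE-style) formulation that may differ in detail from the paper's:

```python
import torch
import torch.nn.functional as F

hidden_size, num_labels, proj_dim = 768, 50, 128   # illustrative sizes, not the paper's

dense = torch.nn.Linear(hidden_size, proj_dim)     # head used for contrastive learning
linear = torch.nn.Linear(hidden_size, num_labels)  # head used for label predictions

def heads(last_hidden_state):
    cls = last_hidden_state[:, 0]                  # [CLS] embedding, last hidden layer
    z = F.normalize(dense(cls), dim=-1)            # unit-norm contrastive representation
    return z, linear(cls)                          # (contrastive view, label logits)

def unsup_contrastive_loss(z1, z2, tau=0.05):
    """In-batch (SimCSE-style) loss: z1[i] and z2[i] are two dropout
    views of the same input; all other pairs in the batch are negatives."""
    sim = z1 @ z2.T / tau                          # cosine similarities (z's are normalized)
    return F.cross_entropy(sim, torch.arange(z1.size(0)))
```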
Figure 2:
The left sub-figure shows the publication type (PT) label distribution for all articles in the dataset. The right sub-figure shows per-label performance of the best-performing model (80% undersampling and verbalized feature augmentation) on the test set.
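For reference, the macro-averaged metrics reported in the abstract (and broken out per label in the right sub-figure) score each label independently and then average. A quick sketch with scikit-learn, using random toy arrays in place of real model outputs:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 50))   # multi-hot gold labels (toy)
y_prob = rng.random((1000, 50))                # predicted probabilities (toy)
y_pred = (y_prob >= 0.5).astype(int)           # thresholded predictions

macro_f1 = f1_score(y_true, y_pred, average="macro")
macro_auc = roc_auc_score(y_true, y_prob, average="macro")
print(f"macro-F1={macro_f1:.3f}  macro-AUC={macro_auc:.3f}")
```

Macro averaging weights every label equally, which is why rare publication types with imbalanced data can drag the macro-F1 down even when overall accuracy is high.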
