This is a preprint.
Biomedical Text Normalization through Generative Modeling
- PMID: 40093227
- PMCID: PMC11908301
- DOI: 10.1101/2024.09.30.24314663
Update in
- Biomedical text normalization through generative modeling. J Biomed Inform. 2025 Jul;167:104850. doi: 10.1016/j.jbi.2025.104850. Epub 2025 May 15. PMID: 40381869
Abstract
Objective: Around 80% of electronic health record (EHR) data consists of unstructured medical language text. The formatting of this text is often flexible and inconsistent, making it challenging to use for predictive modeling, clinical decision support, and data mining. Large language models' (LLMs) ability to understand context and semantic variation makes them promising tools for standardizing medical text. In this study, we develop and assess clinical text normalization pipelines built using LLMs.
Methods: We implemented four LLM-based normalization strategies (Zero-Shot Recall, Prompt Recall, Semantic Search, and Retrieval-Augmented Generation-based normalization [RAGnorm]) and one baseline approach using TF-IDF-based string matching. We evaluated performance across three datasets of SNOMED-mapped condition terms: (1) an oncology-specific dataset, (2) a representative sample of institutional medical conditions, and (3) a dataset of commonly occurring condition codes (>1000 uses) from our institution. We measured performance by recording the mean shortest path length between predicted and true SNOMED CT terms. Additionally, we benchmarked our models against the TAC 2017 drug label annotations, which normalize terms to Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms.
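The shortest-path evaluation metric described above can be sketched as follows. This is an illustrative example, not the authors' code: SNOMED CT is a licensed terminology, so a tiny hypothetical concept graph with made-up node names stands in for it here.

```python
# Sketch of the paper's evaluation metric: mean shortest-path length between
# predicted and true concepts in a terminology graph. The graph below is a
# hypothetical stand-in for a fragment of the SNOMED CT is-a hierarchy.
from collections import deque

# Toy undirected concept graph (hypothetical nodes, not real SNOMED CT codes).
EDGES = [
    ("Disease", "Neoplasm"),
    ("Neoplasm", "Malignant neoplasm"),
    ("Malignant neoplasm", "Lung cancer"),
    ("Malignant neoplasm", "Breast cancer"),
]
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

def shortest_path_length(graph, start, goal):
    """Breadth-first search distance between two concepts."""
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    raise ValueError("concepts are not connected")

def mean_shortest_path(pairs, graph):
    """Average graph distance over (predicted, true) concept pairs.
    An exact normalization (same concept) contributes 0."""
    return sum(shortest_path_length(graph, p, t) for p, t in pairs) / len(pairs)

pairs = [("Lung cancer", "Lung cancer"),    # exact match  -> distance 0
         ("Breast cancer", "Lung cancer")]  # sibling term -> distance 2
print(mean_shortest_path(pairs, GRAPH))     # prints 1.0
```

Under this metric, lower is better: a score of 0 means every prediction landed on the exact gold concept, while small positive scores indicate near-misses that are close in the hierarchy (e.g. a sibling or parent concept).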
Results: We found that RAGnorm was the most effective across all three datasets, achieving a mean shortest path length of 0.21 on the domain-specific dataset, 0.58 on the sampled dataset, and 0.90 on the top-terms dataset. It achieved a micro F1 score of 88.01 on task 4 of TAC 2017, surpassing all other models without viewing the provided training data.
Conclusion: We find that retrieval-focused approaches overcome traditional LLM limitations for this task. RAGnorm and related retrieval techniques should be explored further for the normalization of biomedical free text.
Keywords: clinical text normalization; large language models; prompt engineering; retrieval-augmented generation.