This is a preprint.
Biomedical Text Normalization through Generative Modeling
- PMID: 40093227
- PMCID: PMC11908301
- DOI: 10.1101/2024.09.30.24314663
Update in
- Biomedical text normalization through generative modeling. J Biomed Inform. 2025 Jul;167:104850. doi: 10.1016/j.jbi.2025.104850. Epub 2025 May 15. PMID: 40381869
Abstract
Objective: Around 80% of electronic health record (EHR) data consists of unstructured medical language text. The formatting of this text is often flexible and inconsistent, making it challenging to use for predictive modeling, clinical decision support, and data mining. Large language models' (LLMs) ability to understand context and semantic variation makes them promising tools for standardizing medical text. In this study, we develop and assess clinical text normalization pipelines built using LLMs.
Methods: We implemented four LLM-based normalization strategies (Zero-Shot Recall, Prompt Recall, Semantic Search, and Retrieval-Augmented Generation-based normalization [RAGnorm]) and one baseline approach using TF-IDF-based string matching. We evaluated performance across three datasets of SNOMED-mapped condition terms: (1) an oncology-specific dataset, (2) a representative sample of institutional medical conditions, and (3) a dataset of commonly occurring condition codes (>1000 uses) from our institution. We measured performance by recording the mean shortest path length between predicted and true SNOMED CT terms. Additionally, we benchmarked our models against the TAC 2017 drug label annotations, which normalize terms to Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms.
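The shortest-path evaluation metric described above can be sketched as follows. This is an illustrative example, not the authors' code: SNOMED CT is a licensed terminology, so a tiny hypothetical concept graph with made-up node names stands in for it here.

```python
# Sketch of the paper's evaluation metric: mean shortest-path length between
# predicted and true concepts in a terminology graph. The graph below is a
# hypothetical stand-in for a fragment of the SNOMED CT is-a hierarchy.
from collections import deque

# Toy undirected concept graph (hypothetical nodes, not real SNOMED CT codes).
EDGES = [
    ("Disease", "Neoplasm"),
    ("Neoplasm", "Malignant neoplasm"),
    ("Malignant neoplasm", "Lung cancer"),
    ("Malignant neoplasm", "Breast cancer"),
]
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

def shortest_path_length(graph, start, goal):
    """Breadth-first search distance between two concepts."""
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    raise ValueError("concepts are not connected")

def mean_shortest_path(pairs, graph):
    """Average graph distance over (predicted, true) concept pairs.
    An exact normalization (same concept) contributes 0."""
    return sum(shortest_path_length(graph, p, t) for p, t in pairs) / len(pairs)

pairs = [("Lung cancer", "Lung cancer"),    # exact match  -> distance 0
         ("Breast cancer", "Lung cancer")]  # sibling term -> distance 2
print(mean_shortest_path(pairs, GRAPH))     # prints 1.0
```

Under this metric, lower is better: a score of 0 means every prediction landed on the exact gold concept, while small positive scores indicate near-misses that are close in the hierarchy (e.g. a sibling or parent concept).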
Results: We found that RAGnorm was the most effective across all three datasets, achieving a mean shortest path length of 0.21 on the domain-specific dataset, 0.58 on the sampled dataset, and 0.90 on the top-terms dataset. It achieved a micro F1 score of 88.01 on task 4 of TAC 2017, surpassing all other models without viewing the provided training data.
Conclusion: We find that retrieval-focused approaches overcome traditional LLM limitations for this task. RAGnorm and related retrieval techniques should be explored further for the normalization of biomedical free text.
Keywords: clinical text normalization; large language models; prompt engineering; retrieval-augmented generation.