medRxiv [Preprint]. 2025 Mar 5:2024.09.30.24314663. doi: 10.1101/2024.09.30.24314663.

Biomedical Text Normalization through Generative Modeling


Jacob S Berkowitz et al.

Abstract

Objective: Around 80% of electronic health record (EHR) data consists of unstructured medical text. The formatting of this text is often flexible and inconsistent, making it challenging to use for predictive modeling, clinical decision support, and data mining. The ability of large language models (LLMs) to understand context and semantic variations makes them promising tools for standardizing medical text. In this study, we develop and assess clinical text normalization pipelines built using LLMs.

Methods: We implemented four LLM-based normalization strategies (Zero-Shot Recall, Prompt Recall, Semantic Search, and Retrieval-Augmented Generation-based normalization [RAGnorm]) and one baseline approach using TF-IDF-based string matching. We evaluated performance across three datasets of SNOMED-mapped condition terms: (1) an oncology-specific dataset, (2) a representative sample of institutional medical conditions, and (3) a dataset of commonly occurring condition codes (>1,000 uses) at our institution. We measured performance as the mean shortest path length between predicted and true SNOMED CT terms. Additionally, we benchmarked our models against the TAC 2017 drug label annotations, which normalize terms to Medical Dictionary for Regulatory Activities (MedDRA) Preferred Terms.
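A minimal sketch of the evaluation metric described above, assuming the SNOMED CT "is-a" hierarchy has already been loaded as an undirected networkx graph keyed by concept ID; the graph construction and the skipping of unresolvable predictions are assumptions for illustration, not the authors' implementation:

    # Sketch only: mean shortest path length between predicted and true SNOMED CT
    # concepts, treating the is-a hierarchy as an undirected graph so that both
    # ancestor/descendant and sibling errors contribute finite path lengths.
    import networkx as nx

    def mean_shortest_path_length(graph: nx.Graph, pairs) -> float:
        """graph: undirected graph over SNOMED CT concept IDs (is-a edges).
        pairs: iterable of (predicted_id, true_id) tuples."""
        lengths = []
        for predicted, true in pairs:
            if predicted == true:
                lengths.append(0)  # exact match contributes zero distance
            elif graph.has_node(predicted) and graph.has_node(true):
                lengths.append(nx.shortest_path_length(graph, predicted, true))
            # Predictions that cannot be resolved to a concept are skipped here;
            # how such cases were handled in the study is not specified.
        return sum(lengths) / len(lengths) if lengths else float("nan")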

Results: We found that RAGnorm was the most effective approach across all three datasets, achieving a mean shortest path length of 0.21 on the domain-specific dataset, 0.58 on the sampled dataset, and 0.90 on the top-terms dataset. It achieved a micro F1 score of 88.01 on Task 4 of TAC 2017, surpassing all other models without using the provided training data.

Conclusion: We find that retrieval-focused approaches overcome traditional LLM limitations for this task. RAGnorm and related retrieval techniques should be explored further for the normalization of biomedical free text.

Keywords: clinical text normalization; large language models; prompt engineering; retrieval-augmented generation.

Figures

Figure 1:
Methodology Flowchart. Step-by-step approach for the four normalization methods: (A) Zero-Shot Recall uses a single prompt to elicit the correct term from the model without any prior examples or fine-tuning. (B) Prompt Recall feeds the model a comprehensive list of terms, prompting it to select the most appropriate one based on the input context. (C) Semantic Search matches input terms with their closest semantic equivalents using a precomputed vector space of embeddings. (D) RAGnorm first retrieves the most semantically relevant terms and then uses a generative decoder to choose the best-matching term. An embedding-space visualization illustrates the differences between Semantic Search (E) and RAGnorm (F).
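The retrieval step that separates panels C and D can be sketched as follows; the embedding model, the candidate vocabulary, and the llm_generate callable standing in for the generative decoder are illustrative assumptions rather than the pipeline evaluated in the paper:

    # Sketch only: Semantic Search returns the nearest vocabulary term by embedding
    # similarity; a RAGnorm-style variant passes the top-k candidates to an LLM.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

    def top_k_candidates(mention, vocabulary, k=5):
        vocab_vecs = encoder.encode(vocabulary, normalize_embeddings=True)
        mention_vec = encoder.encode([mention], normalize_embeddings=True)[0]
        scores = vocab_vecs @ mention_vec              # cosine similarity (normalized)
        return [vocabulary[i] for i in np.argsort(scores)[::-1][:k]]

    def semantic_search_normalize(mention, vocabulary):
        # Panel C: take the single closest term in embedding space.
        return top_k_candidates(mention, vocabulary, k=1)[0]

    def ragnorm_style_normalize(mention, vocabulary, llm_generate):
        # Panel D: retrieve candidates, then let a generative model pick among them.
        # llm_generate is a hypothetical callable wrapping whichever LLM is available.
        candidates = top_k_candidates(mention, vocabulary, k=5)
        prompt = ("Normalize the clinical mention '" + mention + "' to exactly one "
                  "of the following SNOMED CT terms: " + "; ".join(candidates))
        return llm_generate(prompt)

In practice the vocabulary embeddings would be precomputed once and cached rather than re-encoded for every mention.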
Figure 2:
Embedding Evaluation. Comparison of Semantic Search text normalization accuracy across each assessed embedding model, with models grouped by the number of embedding dimensions used.
Figure 3:
Violin plot of the mean shortest path length for five normalization methods (TF-IDF String Matching, Zero-Shot Prompting, Prompt Recall, Semantic Search, and RAGnorm) across three cohorts: 106 domain-specific SNOMED CT terms, 750 randomly sampled SNOMED CT terms, and high-frequency SNOMED CT terms (above 1,000 uses). The best-performing approach in each dataset grouping is bolded.

