J Med Internet Res. 2025 Mar 18;27:e66279.
doi: 10.2196/66279.

Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study

Hendrik Šuvalov et al. J Med Internet Res.

Abstract

Background: Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have shown promise in understanding text across languages and domains.

Objective: This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy.

Methods: Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a locally trained GPT-2 model on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are then used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance.
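
For illustration, a minimal Python sketch of the annotation step (step 2) is given below. This is not the authors' code: the prompt wording, the JSON output format, the DRUG/PROCEDURE label set, and the use of the OpenAI chat completions client are assumptions made for the example.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Annotate the following Estonian clinical note. "
    "Return a JSON list of objects with 'text' and 'label' fields, "
    "where 'label' is DRUG or PROCEDURE.\n\nNote:\n{note}"
)

def annotate(note, model="gpt-4"):
    # Ask the LLM to extract drug and procedure mentions from one synthetic note.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

Under these assumptions, the annotated synthetic notes would then serve as training examples for fine-tuning the NER model in step 3.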

Results: The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F1-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate strong performance in recognizing certain entity types while highlighting the difficulty of extracting procedures.
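
As a point of reference, entity-level F1 is typically computed from precision and recall over predicted versus gold entity spans. The sketch below assumes exact span-and-label matching; the paper's exact evaluation protocol may differ.

def entity_f1(gold, predicted):
    # gold and predicted are sets of (start, end, label) tuples;
    # a prediction counts as correct only on an exact match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)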

Conclusions: This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method's applicability to other domains and languages.

Keywords: Estonian; LLM; NER; NLP; annotated data; artificial intelligence; clinical decision support; data annotation; data mining; health care data; language model; large language model; machine learning; medical entity; named entity recognition; natural language processing; synthetic data.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Illustration of the pipeline.
Figure 2. Annotation scores on validation data (N=300) for models trained on different amounts of training data.
