J Med Internet Res. 2025 Mar 18;27:e66279.
doi: 10.2196/66279.

Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study

Hendrik Šuvalov et al. J Med Internet Res.

Abstract

Background: Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have shown promise in understanding text across languages and domains.

Objective: This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy.

Methods: Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a locally trained GPT-2 model on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are then used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance.
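
For illustration, a minimal Python sketch of the annotation step (step 2) is given below. This is not the authors' code: the prompt wording, the JSON output format, the DRUG/PROCEDURE label set, and the use of the OpenAI chat completions client are assumptions made for the example.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Annotate the following Estonian clinical note. "
    "Return a JSON list of objects with 'text' and 'label' fields, "
    "where 'label' is DRUG or PROCEDURE.\n\nNote:\n{note}"
)

def annotate(note, model="gpt-4"):
    # Ask the LLM to extract drug and procedure mentions from one synthetic note.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

Under these assumptions, the annotated synthetic notes would then serve as training examples for fine-tuning the NER model in step 3.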

Results: The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F1-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate strong performance in recognizing certain entity types while highlighting the difficulty of extracting procedures.
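
As a point of reference, entity-level F1 is typically computed from precision and recall over predicted versus gold entity spans. The sketch below assumes exact span-and-label matching; the paper's exact evaluation protocol may differ.

def entity_f1(gold, predicted):
    # gold and predicted are sets of (start, end, label) tuples;
    # a prediction counts as correct only on an exact match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)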

Conclusions: This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method's applicability to other domains and languages.

Keywords: Estonian; LLM; NER; NLP; annotated data; artificial intelligence; clinical decision support; data annotation; data mining; health care data; language model; large language model; machine learning; medical entity; named entity recognition; natural language processing; synthetic data.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Illustration of the pipeline.
Figure 2. Annotation scores on validation data (N=300) for models trained on different amounts of training data.
