Front Digit Health. 2025 May 30:7:1497130.
doi: 10.3389/fdgth.2025.1497130. eCollection 2025.

Synthetic4Health: generating annotated synthetic clinical letters

Libo Ren et al.

Abstract

Clinical letters contain sensitive information, which limits their use in model training, medical research, and education. This study aims to generate reliable, diverse, and de-identified synthetic clinical letters to support these tasks. We investigated multiple pre-trained language models for text masking and generation, focusing on Bio_ClinicalBERT, and applied different masking strategies. Evaluation included qualitative and quantitative assessments, downstream named entity recognition (NER) tasks, and clinically focused evaluations using BioGPT and GPT-3.5-turbo. The experiments show that: (1) encoder-only models perform better than encoder-decoder models; (2) models trained on general corpora perform comparably to clinical-domain models if clinical entities are preserved; (3) preserving clinical entities and document structure aligns with the task objectives; (4) masking strategies have a noticeable impact on the quality of synthetic clinical letters: masking stopwords has a positive effect, while masking nouns or verbs has a negative one; (5) BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references; (6) contextual information has only a limited effect on the models' understanding, suggesting that synthetic letters can effectively substitute for real ones in downstream NER tasks; (7) although the model occasionally generates hallucinated content, this appears to have little effect on overall clinical performance. Unlike previous research, which primarily focuses on reconstructing original letters by training language models, this paper provides a foundational framework for generating diverse, de-identified clinical letters. It offers a direction for applying the model to real-world clinical letters, thereby helping to expand datasets in the clinical domain. Our code and trained models are available at https://github.com/HECTA-UoM/Synthetic4Health.
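To illustrate the mask-and-fill idea described in the abstract, the following is a minimal sketch, not the authors' exact pipeline: it masks a non-clinical token (a stopword) in a clinical sentence while leaving clinical entities intact, asks Bio_ClinicalBERT to propose fillers, and compares each synthetic variant against the original with BERTScore. The example sentence, the choice of masked token, and the use of the Hugging Face fill-mask pipeline with the "emilyalsentzer/Bio_ClinicalBERT" checkpoint and the bert-score package are assumptions for illustration.

    # Minimal sketch of mask-and-fill synthetic generation (illustrative only,
    # not the authors' exact pipeline). Assumes the Hugging Face checkpoint
    # "emilyalsentzer/Bio_ClinicalBERT" and the bert-score package.
    from transformers import pipeline
    from bert_score import score

    fill = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")

    # Hypothetical sentence: clinical entities ("chest pain", "aspirin") are
    # preserved; a stopword ("with") is replaced by the model's mask token.
    original = "The patient was admitted with chest pain and started on aspirin."
    masked = original.replace("with", fill.tokenizer.mask_token, 1)

    # The model proposes fillers; each candidate yields one synthetic variant.
    candidates = fill(masked, top_k=3)
    synthetic = [c["sequence"] for c in candidates]
    for c in candidates:
        print(f"{c['score']:.3f}  {c['sequence']}")

    # BERTScore (the paper's primary quantitative metric) compares each
    # synthetic sentence with the original; F1 near 1 indicates high
    # semantic similarity.
    P, R, F1 = score(synthetic, [original] * len(synthetic), lang="en")
    print([round(f.item(), 3) for f in F1])

In a full pipeline one would mask many tokens per letter according to a chosen strategy (e.g. stopwords only) and keep document structure intact; this sketch shows only a single-token fill.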

Keywords: clinical NLP (natural language processing); encoder-only models; encoder–decoder models; masking and generating; named entity recognition; pre-trained language models (PLMs); synthetic data creation.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. An example of the objective: sentence/segment-level generations.
Figure 2. Overall investigation workflow for Synthetic4Health.
Figure 3. Pre-processing pipeline.
Figure 4. Text chunking workflow.
Figure 5. Comparison of encoder-only and encoder–decoder model architectures.
Figure 6. Evaluation pipeline.
Figure 7. Workflow of the downstream NER task.
Figure 8. GPT-3.5-turbo prompt for clinical evaluation.
Figure 9. Original unprocessed example sentence (–8) (“note_id”: “10807423-DS-19”) (the circled tokens will be masked).
Figure 10. An example of the masked sentence.
Figure 11. Example sentence generated by Bio_ClinicalBERT.
Figure 12. Example sentence generated by T5-base.
Figure 13. Example sentence 1 with different masked tokens.
Figure 14. Post-processing results with BERT-Base.
