Front Digit Health. 2025 May 30:7:1497130.
doi: 10.3389/fdgth.2025.1497130. eCollection 2025.

Synthetic4Health: generating annotated synthetic clinical letters

Libo Ren et al.

Abstract

Clinical letters contain sensitive information, which limits their use in model training, medical research, and education. This study aims to generate reliable, diverse, and de-identified synthetic clinical letters to support these tasks. We investigated multiple pre-trained language models for text masking and generation, focusing on Bio_ClinicalBERT, and applied different masking strategies. Evaluation included qualitative and quantitative assessments, downstream named entity recognition (NER) tasks, and clinically focused evaluations using BioGPT and GPT-3.5-turbo. The experiments show that: (1) encoder-only models perform better than encoder-decoder models; (2) models trained on general corpora perform comparably to clinical-domain models if clinical entities are preserved; (3) preserving clinical entities and document structure aligns with the task objectives; (4) masking strategies have a noticeable impact on the quality of synthetic clinical letters: masking stopwords has a positive effect, while masking nouns or verbs has a negative one; (5) BERTScore should be the primary quantitative evaluation metric, with other metrics serving as supplementary references; (6) contextual information has only a limited effect on the models' understanding, suggesting that synthetic letters can effectively substitute for real ones in downstream NER tasks; (7) although the model occasionally generates hallucinated content, this appears to have little effect on overall clinical performance. Unlike previous research, which primarily focuses on reconstructing original letters by training language models, this paper provides a foundational framework for generating diverse, de-identified clinical letters. It offers a direction for applying the model to real-world clinical letters, thereby helping to expand datasets in the clinical domain. Our code and trained models are available at https://github.com/HECTA-UoM/Synthetic4Health.
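To illustrate the mask-and-fill idea described in the abstract, the following is a minimal sketch, not the authors' exact pipeline: it masks a non-clinical token (a stopword) in a clinical sentence while leaving clinical entities intact, asks Bio_ClinicalBERT to propose fillers, and compares each synthetic variant against the original with BERTScore. The example sentence, the choice of masked token, and the use of the Hugging Face fill-mask pipeline with the "emilyalsentzer/Bio_ClinicalBERT" checkpoint and the bert-score package are assumptions for illustration.

    # Minimal sketch of mask-and-fill synthetic generation (illustrative only,
    # not the authors' exact pipeline). Assumes the Hugging Face checkpoint
    # "emilyalsentzer/Bio_ClinicalBERT" and the bert-score package.
    from transformers import pipeline
    from bert_score import score

    fill = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")

    # Hypothetical sentence: clinical entities ("chest pain", "aspirin") are
    # preserved; a stopword ("with") is replaced by the model's mask token.
    original = "The patient was admitted with chest pain and started on aspirin."
    masked = original.replace("with", fill.tokenizer.mask_token, 1)

    # The model proposes fillers; each candidate yields one synthetic variant.
    candidates = fill(masked, top_k=3)
    synthetic = [c["sequence"] for c in candidates]
    for c in candidates:
        print(f"{c['score']:.3f}  {c['sequence']}")

    # BERTScore (the paper's primary quantitative metric) compares each
    # synthetic sentence with the original; F1 near 1 indicates high
    # semantic similarity.
    P, R, F1 = score(synthetic, [original] * len(synthetic), lang="en")
    print([round(f.item(), 3) for f in F1])

In a full pipeline one would mask many tokens per letter according to a chosen strategy (e.g. stopwords only) and keep document structure intact; this sketch shows only a single-token fill.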

Keywords: clinical NLP (natural language processing); encoder-only models; encoder–decoder models; masking and generating; named entity recognition; pre-trained language models (PLMs); synthetic data creation.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. An example of the objective: sentence/segment-level generations.
Figure 2. Overall investigation workflow for Synthetic4Health.
Figure 3. Pre-processing pipeline.
Figure 4. Text chunking workflow.
Figure 5. Comparison of encoder-only and encoder–decoder model architectures.
Figure 6. Evaluation pipeline.
Figure 7. Workflow of the downstream NER task.
Figure 8. GPT-3.5-turbo prompt for clinical evaluation.
Figure 9. Original unprocessed example sentence (–8) (“note_id”: “10807423-DS-19”) (the circled tokens will be masked).
Figure 10. An example of the masked sentence.
Figure 11. Example sentence generated by Bio_ClinicalBERT.
Figure 12. Example sentence generated by T5-base.
Figure 13. Example sentence 1 with different masked tokens.
Figure 14. Post-processing results with BERT-Base.
