Privacy-, linguistic-, and information-preserving synthesis of clinical documentation through generative agents
- PMID: 41036253
- PMCID: PMC12479492
- DOI: 10.3389/frai.2025.1644084
Abstract
The widespread adoption of generative agents (GAs) is reshaping the healthcare landscape. Nonetheless, broad utilization is impeded by restricted access to high-quality, interoperable clinical documentation from electronic health records (EHRs) due to persistent legal, ethical, and technical barriers. Synthetic health data generation (SHDG), leveraging pre-trained large language models (LLMs) instantiated as GAs, could offer a practical solution by creating synthetic patient information that mimics genuine EHRs. The use of LLMs, however, is not without issues; significant concerns remain regarding privacy, potential bias propagation, the risk of generating inaccurate or misleading content, and the lack of transparency in how these models make decisions. We therefore propose a privacy-, linguistic-, and information-preserving SHDG protocol that employs multiple context-aware, role-specific GAs. Guided by targeted prompting and authentic EHRs (serving as structural and linguistic templates), role-specific GAs can, in principle, operate collaboratively through multi-turn interactions. We theorized that utilizing GAs in this fashion permits LLMs not only to produce synthetic EHRs that are accurate, consistent, and contextually appropriate, but also to expose the underlying decision-making process. To test this hypothesis, we developed a no-code GA-driven SHDG workflow as a proof of concept, which was implemented within a predefined, multi-layered data science infrastructure (DSI) stack, an integrated ensemble of software and hardware designed to support rapid prototyping and deployment. The DSI stack streamlines implementation for healthcare professionals, improving accessibility, usability, and cybersecurity.
To deploy and validate GA-assisted workflows, we implemented a fully automated SHDG evaluation framework, co-developed with GenAI technology, which holistically compares the informational and linguistic features of synthetic, anonymized, and real EHRs at both the document and corpus levels. Our findings highlight that SHDG implemented through GAs offers a scalable, transparent, and reproducible methodology for unlocking the potential of clinical documentation to drive innovation, accelerate research, and advance the development of learning health systems. The source code, synthetic datasets, toolchains and prompts created for this study can be accessed at the GitHub repository: https://github.com/HR-DataLab-Healthcare/RESEARCH_SUPPORT/tree/main/PROJECTS/Generative_Agent_based_Data-Synthesis.
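As a rough illustration of what a corpus-level informational comparison could look like, the sketch below computes the Shannon entropy of a corpus's word distribution and the Jensen-Shannon divergence between two corpora. This is not the evaluation framework from the study (see the repository above for that); the function names, the simple word tokenizer, and the toy strings are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def token_distribution(docs):
    """Relative frequency of lowercase word tokens across a list of documents."""
    counts = Counter(tok for doc in docs for tok in re.findall(r"\w+", doc.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}


def shannon_entropy(dist):
    """Shannon entropy (in bits) of a token distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two token distributions.

    0.0 means identical distributions; 1.0 means no shared tokens.
    """
    vocab = set(p) | set(q)
    # Mixture distribution over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


if __name__ == "__main__":
    # Hypothetical toy strings standing in for real and synthetic notes.
    real_docs = ["patient reports knee pain after running"]
    synthetic_docs = ["patient reports shoulder pain after swimming"]

    p = token_distribution(real_docs)
    q = token_distribution(synthetic_docs)
    print("entropy (real):", shannon_entropy(p))
    print("entropy (synthetic):", shannon_entropy(q))
    print("JS divergence:", js_divergence(p, q))
```

A low divergence between synthetic and real corpora would suggest similar vocabulary usage, while comparable entropies would suggest similar lexical diversity; a document-level comparison could apply the same functions to single notes instead of whole corpora.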
Keywords: clinical natural language processing (NLP); data synthesis; generative agents; healthcare; information theory; linguistics; privacy; synthetic health data generation (SHDG).
Copyright © 2025 van Velzen, van der Willigen, de Beer, de Graaf-Waar, Janssen, van Leeuwen, van der Willigen, van der Willigen, Renardus, El Maaroufi, Satimin, Hartog, Hulsen, van Meeteren and Scheper.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
