Toward Identifying New Risk Aversions and Subsequent Limitations and Biases When Making De-identified Structured Data Sets Openly Available in a Post-LLM world
- PMID: 40417480
- PMCID: PMC12099381
Abstract
Making clinical datasets openly available is critical to promoting the reproducibility and transparency of scientific research. Currently, few datasets are accessible to the public. To support the open science initiative, we plan to release the structured clinical datasets from the CONCERN study. In this paper, we present our de-identification approaches for structured data, taking into account the future inclusion of de-identified narrative notes and re-identification risks in the LLM era. Through literature review and collaborative consensus sessions, our team made informed decisions regarding dataset release, weighing the pros and cons of each choice and outlining the limitations and biases introduced by the de-identification algorithm. To the best of our knowledge, this is the first study to describe the rationale behind de-identification decisions in the LLM era and to delineate the consequent problems that should be considered when using our dataset. We advocate for transparent disclosure of de-identification decisions and their associated limitations and biases with all openly available datasets.
©2024 AMIA - All rights reserved.