AMIA Annu Symp Proc. 2025 May 22;2024:262-270.
eCollection 2024.

Toward Identifying New Risk Aversions and Subsequent Limitations and Biases When Making De-identified Structured Data Sets Openly Available in a Post-LLM world

Fangyi Chen et al. AMIA Annu Symp Proc.

Abstract

Making clinical datasets openly available is critical to promoting the reproducibility and transparency of scientific research. Currently, few such datasets are accessible to the public. To support the open science initiative, we plan to release the structured clinical datasets from the CONCERN study. In this paper, we present our de-identification approaches for structured data, taking into account the future inclusion of de-identified narrative notes and re-identification risks in the LLM era. Through literature review and collaborative consensus sessions, our team made informed decisions regarding dataset release, weighing the pros and cons of each choice and outlining the limitations and biases introduced by the de-identification algorithm. To the best of our knowledge, this is the first study to describe the rationales behind de-identification decisions in the LLM era and to delineate the consequent problems that should be considered when using our dataset. We advocate for transparent disclosure of de-identification decisions and their associated limitations and biases with all openly available datasets.

Figures

Figure 1: An overview of the data elements. The final version is subject to change and might differ from the preliminary version depicted above.
