AMIA Annu Symp Proc. 2025 May 22;2024:1313-1322. eCollection 2024.

Exploring the Utilization of Synthetic Data in Unsupervised Clustering for Opioid Misuse Analysis

Yili Zhang et al. AMIA Annu Symp Proc.

Abstract

Privacy and security restrictions on medical data pose challenges to collaborative research, making synthetic data an increasingly attractive solution. Recent advances in generative AI, such as generative adversarial network (GAN) models, have improved synthetic data generation. This study investigates the use of synthetic data in clustering models for opioid misuse analysis. We generated a synthetic dataset that replicates real-world data from 2017 to 2019, including demographics and diagnosis codes. Because the synthetic data preserves patient privacy, it enables comprehensive analysis without compromising security. We developed unsupervised clustering models to identify opioid misuse patterns and assessed the effectiveness of synthetic data across four scenarios: training on the real dataset and testing on the real dataset (TRTR), training on the real dataset and testing on the synthetic dataset (TRTS), training on the synthetic dataset and testing on the real dataset (TSTR), and training on the synthetic dataset and testing on the synthetic dataset (TSTS). Results demonstrate that synthetic data, when used as a training set, can replicate the distributions and clustering characteristics of real data, offering significant potential for collaborative model development and optimization without introducing privacy or security risks.
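
The four train/test scenarios can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the clustering algorithm, so a small hand-written k-means is assumed here, and both the "real" and "synthetic" datasets are hypothetical toy Gaussian samples standing in for patient records and GAN output. All function names (`kmeans`, `assign`, `cluster_shares`, `make_toy`) are illustrative.

```python
import random
from itertools import product


def assign(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means (assumed algorithm; the abstract does not specify one)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep the old
        # centroid if a group ends up empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def cluster_shares(points, centroids):
    """Fraction of `points` assigned to each cluster."""
    counts = [0] * len(centroids)
    for p in points:
        counts[assign(p, centroids)] += 1
    return [c / len(points) for c in counts]


def make_toy(seed, n=200):
    """Hypothetical two-group toy data; a stand-in for patient features."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in ((0.0, 0.0), (5.0, 5.0))
        for _ in range(n // 2)
    ]


real = make_toy(seed=1)
synthetic = make_toy(seed=2)  # stand-in for GAN-generated records

datasets = {"R": real, "S": synthetic}
results = {}
# "Train" = fit centroids on one dataset; "test" = assign the other dataset
# to those centroids and summarize the resulting cluster composition.
for train, test in product("RS", repeat=2):
    centroids = kmeans(datasets[train], k=2)
    results[f"T{train}T{test}"] = cluster_shares(datasets[test], centroids)

for scenario, shares in results.items():
    print(scenario, [round(s, 3) for s in shares])
```

Comparing the cluster shares of TSTR against TRTR gives a rough sense of whether a model fitted on synthetic data groups real records the way a model fitted on real data does. In practice one would compare richer cluster characteristics, such as the demographic and comorbidity distributions shown in the figures.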


Figures

Figure 1.
Real Data Preparation and Synthetic Data Generation Workflow.
Figure 2.
Percentage of Opioid Misuses and Comorbidities Across Clustering Groups under Different Scenarios in Each Year. Percentage values for scenarios TRTR and TSTR are annotated below and to the right of the markers, corresponding to green and blue lines. Percentage values for scenarios TRTS and TSTS are annotated above and to the left of the markers, corresponding to yellow and red lines.
Figure 3.
Distribution of Patient Demographics Across Clustering Groups under Different Scenarios. Figures (a), (b), and (c) use a common legend.
