Sci Rep. 2025 May 16;15(1):17094.
doi: 10.1038/s41598-025-01873-8.

Application of the joint clustering algorithm based on Gaussian kernels and differential privacy in lung cancer identification

Hang Yanping et al. Sci Rep. 2025.

Abstract

In the age of big data, privacy, particularly medical data privacy, is becoming increasingly important. Differential privacy (DP) has emerged as a key method for safeguarding privacy during data analysis and publishing. Cancer identification and classification play a vital role in early detection and treatment. This paper introduces a novel algorithm, DPFCM_GK, which combines differential privacy with fuzzy c-means (FCM) clustering using a Gaussian kernel function. The algorithm enhances cancer detection while ensuring data privacy. Three publicly available lung cancer datasets, along with a dataset from our hospital, are used to test and demonstrate the effectiveness of DPFCM_GK. The experimental results show that DPFCM_GK achieves high clustering accuracy under differential privacy, with accuracy improving as the privacy budget (ε) increases. For the UCIML, NLST, and NSCLC datasets, it reaches its best results at lower ε values (1.52, 1.24, and 2.32, respectively) than DPFCM. In the lung cancer dataset, DPFCM_GK outperforms DPFCM within 0.05 ≤ ε ≤ 2.5, with significant differences (χ² = 4.54 to 29.12; P < 0.05), and both methods converge to an accuracy of 94.5% as ε increases. Although differential privacy initially increases iteration counts, DPFCM_GK converges faster and needs fewer iterations than DPFCM, with significant reductions (T = 23.08, 43.47, and 48.93; P < 0.05). For the UCIML dataset, DPFCM_GK significantly reduces runtime compared to other models (DPFCM, LDP-SGD, LDP-Fed, LDP-FedSGD, MGM-DPL, LDP-FL) under the same privacy budget (T = 21.08, 316.24, 102.35, 222.37, 162.23, 159.25; P < 0.05). DPFCM_GK also maintains excellent time efficiency when applied to the NLST and NSCLC datasets (P < 0.05). For the LLCS dataset, DPFCM_GK demonstrates significant improvement as the privacy budget increases, especially in low-budget scenarios, where the performance gap is most pronounced (T = 4.20, 8.44, 10.92, 3.95, 7.16, 8.51; P < 0.05). These results confirm DPFCM_GK as a practical solution for medical data analysis, balancing accuracy, privacy, and efficiency.

Keywords: Big data; DPFCM_GK; Differential privacy; Gaussian kernel function; Privacy budget; Privacy-preserving.
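
The abstract names the ingredients of DPFCM_GK (fuzzy c-means, a Gaussian kernel, and differential privacy) but not how they are combined. The following is a minimal illustrative sketch, not the authors' implementation. It assumes the standard kernel FCM updates, Laplace noise added to every cluster-center coordinate at each iteration, and an even split of ε across iterations. The function names, the farthest-point initialization, and the caller-supplied `sensitivity` parameter are all hypothetical; a real privacy guarantee would require deriving the sensitivity of the center update.

```python
import math
import random


def gaussian_kernel(x, v, sigma):
    """Gaussian kernel K(x, v) = exp(-||x - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    if scale == 0.0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def farthest_point_init(data, n_clusters):
    """Deterministic start (a choice made for this sketch): take the first
    point, then repeatedly the point farthest from the centers chosen so far."""
    centers = [list(data[0])]
    while len(centers) < n_clusters:
        nxt = max(data, key=lambda x: min(math.dist(x, c) for c in centers))
        centers.append(list(nxt))
    return centers


def memberships(data, centers, sigma, m):
    """Kernel FCM memberships: u_ik is proportional to
    (1 - K(x_k, v_i)) ** (-1 / (m - 1)), normalised to sum to 1 per point."""
    rows = []
    for x in data:
        w = [(1.0 - gaussian_kernel(x, v, sigma) + 1e-12) ** (-1.0 / (m - 1.0))
             for v in centers]
        total = sum(w)
        rows.append([wi / total for wi in w])
    return rows


def dp_kernel_fcm(data, n_clusters, epsilon, sensitivity, sigma=1.0, m=2.0,
                  n_iter=10, seed=0):
    rng = random.Random(seed)
    # Laplace mechanism: noise scale = sensitivity / epsilon. The budget is
    # split evenly over iterations (sequential composition). `sensitivity`
    # is a caller-supplied assumption, not a derived bound.
    scale = sensitivity / (epsilon / n_iter)
    centers = farthest_point_init(data, n_clusters)
    for _ in range(n_iter):
        u = memberships(data, centers, sigma, m)
        new_centers = []
        for i, v in enumerate(centers):
            # Kernel FCM center update: mean weighted by u_ik^m * K(x_k, v_i).
            w = [u[k][i] ** m * gaussian_kernel(x, v, sigma)
                 for k, x in enumerate(data)]
            total = sum(w)
            if total < 1e-12:  # center has no effective support; keep it
                new_centers.append(v)
                continue
            mean = [sum(wk * x[d] for wk, x in zip(w, data)) / total
                    for d in range(len(v))]
            # Perturb each coordinate of the updated center.
            new_centers.append([c + laplace_noise(scale, rng) for c in mean])
        centers = new_centers
    return centers, memberships(data, centers, sigma, m)
```

Calling `dp_kernel_fcm(data, 2, epsilon=1.0, sensitivity=0.01)` on a list of feature tuples returns the noisy centers and the membership matrix; passing `epsilon=float("inf")` switches the noise off and gives plain kernel FCM, which is convenient for checking how much accuracy a given budget costs.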

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
The impact of Laplace noise with different parameter settings on image information. (A) The original lung CT image. (B) Laplace noise distributions under different parameter settings. (C-F) The original lung cancer image after adding Laplace noise under four different pairs of parameter values [formula images not reproduced].
Fig. 2
Diagram depicting the operational summary of the proposed model.
Fig. 3
Results of the effectiveness analysis of various algorithms based on the experimental data. Figure 3(A, B, C) illustrate the ACC calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(D, E, F) illustrate the PRE calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(G, H, I) illustrate the REC calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(J, K, L) illustrate the F1-score calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(M, N, O) illustrate the ARI calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. The horizontal axis in all figures represents the privacy budget ε. The black line represents the identification results of FCM, the blue line represents the identification results of DPFCM, and the red line represents the identification results of DPFCM_GK.
Fig. 4
Analysis results of clustering iteration times for various algorithms based on the experimental data. Figure 4(A, B, C) illustrate the calculation results of clustering iteration counts for each algorithm based on the UCIML, NLST, and NSCLC datasets, respectively.
Fig. 5
Analysis results of clustering running time (ms) of various algorithms based on the experimental data. Figure 5(A, B, C) illustrate the clustering running time (ms) for each algorithm based on the UCIML, NLST, and NSCLC datasets, respectively.
Fig. 6
Performance evaluation and availability verification results of DPFCM_GK, DPFCM, and FCM using LLCS. (A) Comparison of identified cases for DPFCM_GK, DPFCM, and FCM. The numbers inside the colored blocks represent the count of identified cases corresponding to the horizontal axis. Red blocks indicate the results of DPFCM_GK, blue blocks indicate the results of DPFCM, and gray blocks indicate the results of FCM. M1, M2, and M3 denote DPFCM_GK, DPFCM, and FCM, respectively. The symbol “*” represents a statistically significant difference between two algorithms with a p-value less than 0.05, while “**” represents a p-value less than 0.01. (B, D, F) ROC curves for DPFCM_GK, DPFCM, and FCM, respectively, illustrating their performance under LLCS classification. (C, E, G) PR curves for DPFCM_GK, DPFCM, and FCM, respectively, demonstrating their precision-recall relationships under LLCS classification.
Fig. 7
Performance evaluation of DPFCM_GK and other methods under varying privacy budgets using LLCS. (A) The relationship between the test accuracy (%) of various models and privacy budget values. (B) The relationship between the misclassification rate (%) of various models and privacy budget values. (C) The relationship between the misclassification rate (%) of various models and the number of iterations, with the privacy budget fixed at 5. (D-J) When the privacy budget ε is set to 5, the confusion matrices of the DPFCM, DPFCM_GK, LDP-Fed, LDP-FL, LDP-SGD, MGM-DPFL, and MGM-FedSGD models are evaluated using the LLCS dataset.
