Sci Rep. 2025 May 16;15(1):17094.

doi: 10.1038/s41598-025-01873-8.

Application of the joint clustering algorithm based on Gaussian kernels and differential privacy in lung cancer identification

Hang Yanping¹, Zheng Haixia¹, Yang Minmin¹, Wang Nan¹, Kong Miaomiao¹, Zhao Mingming²

Affiliations

¹ Department of Respiratory and Critical Care Medicine, Affiliated Nanjing Gaochun People's Hospital, Jiangsu University, Nanjing, 210000, Jiangsu, China.
² Department of Respiratory and Critical Care Medicine, Affiliated Nanjing Gaochun People's Hospital, Jiangsu University, Nanjing, 210000, Jiangsu, China. zhaomingming10086@outlook.com.

PMID: 40379735
PMCID: PMC12084312
DOI: 10.1038/s41598-025-01873-8

Hang Yanping et al. Sci Rep. 2025.

Abstract

In the age of big data, privacy, particularly medical data privacy, is becoming increasingly important. Differential privacy (DP) has emerged as a key method for safeguarding privacy during data analysis and publishing. Cancer identification and classification play a vital role in early detection and treatment. This paper introduces a novel algorithm, DPFCM_GK, which combines differential privacy with fuzzy c-means (FCM) clustering using a Gaussian kernel function. The algorithm enhances cancer detection while ensuring data privacy. Three publicly available lung cancer datasets, along with a dataset from our hospital, are used to test and demonstrate the effectiveness of DPFCM_GK. The experimental results show that DPFCM_GK achieves high clustering accuracy and enhanced privacy as the privacy budget (ε) increases. For the UCIML, NLST, and NSCLC datasets, it reaches optimal results at lower ε (1.52, 1.24, and 2.32, respectively) than DPFCM. In the lung cancer dataset, DPFCM_GK outperforms DPFCM within 0.05 ≤ ε ≤ 2.5, with significant differences (χ² = 4.54 to 29.12; P < 0.05), and both methods converge to an accuracy of 94.5% as ε increases. Although differential privacy initially increases iteration counts, DPFCM_GK converges faster and needs fewer iterations than DPFCM, with significant reductions (T = 23.08, 43.47, and 48.93; P < 0.05). For the UCIML dataset, DPFCM_GK significantly reduces runtime compared to other models (DPFCM, LDP-SGD, LDP-Fed, LDP-FedSGD, MGM-DPL, LDP-FL) under the same privacy budget. The runtime reduction is statistically significant (T = 21.08, 316.24, 102.35, 222.37, 162.23, 159.25; P < 0.05). DPFCM_GK still maintains excellent time efficiency when applied to the NLST and NSCLC datasets (P < 0.05). For the LLCS dataset, DPFCM_GK demonstrates significant improvement as the privacy budget increases, especially in low-budget scenarios, where the performance gap is most pronounced (T = 4.20, 8.44, 10.92, 3.95, 7.16, 8.51; P < 0.05). These results confirm DPFCM_GK as a practical solution for medical data analysis, balancing accuracy, privacy, and efficiency.

Keywords: Big data; DPFCM_GK; Differential privacy; Gaussian kernel function; Privacy budget; Privacy-preserving.
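
The abstract does not give the update rules of DPFCM_GK, so the following is only an illustrative sketch of how fuzzy c-means with a Gaussian kernel might be combined with Laplace noise. It uses the standard kernel FCM updates and perturbs the cluster centers at every iteration. The function names, the even split of ε across iterations, and the `sensitivity` parameter are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def gaussian_kernel(X, V, sigma):
    """Return K with K[i, k] = exp(-||x_k - v_i||^2 / (2 * sigma^2))."""
    sq_dist = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def memberships(X, V, m, sigma):
    """Kernel FCM membership update; each column of U sums to 1."""
    kernel_dist = np.maximum(1.0 - gaussian_kernel(X, V, sigma), 1e-12)
    U = kernel_dist ** (-1.0 / (m - 1.0))
    return U / U.sum(axis=0, keepdims=True)

def dp_kernel_fcm(X, c, epsilon, m=2.0, sigma=1.0, n_iter=20,
                  sensitivity=1.0, seed=0):
    """Hypothetical sketch: kernel FCM with Laplace-perturbed centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen data points.
    V = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    # Assumption: the total budget epsilon is split evenly over the
    # iterations, so each iteration uses Laplace scale
    # sensitivity / (epsilon / n_iter). The sensitivity value itself is a
    # caller-supplied assumption, not derived here.
    scale = sensitivity * n_iter / epsilon
    for _ in range(n_iter):
        U = memberships(X, V, m, sigma)
        W = (U ** m) * gaussian_kernel(X, V, sigma)   # shape (c, n)
        V = (W @ X) / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
        V = V + rng.laplace(0.0, scale, size=V.shape)  # perturb centers
    return memberships(X, V, m, sigma), V

# Synthetic two-cluster data, used only to exercise the sketch.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 0.3, (20, 2)),
               data_rng.normal(3.0, 0.3, (20, 2))])
U, V = dp_kernel_fcm(X, c=2, epsilon=2.0, sensitivity=0.01)
labels = U.argmax(axis=0)
```

In this sketch a smaller ε gives a larger noise scale and therefore noisier centers, which is the usual accuracy cost of a tighter privacy budget. A real deployment would need a sensitivity bound derived from the data range and the center update.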

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
The impact of Laplace noise with different parameter settings on image information. (A) The original lung CT image. (B) Laplace noise distributions with different parameter settings. (C, D, E, F) The original lung cancer image after adding Laplace noise, each panel with a different parameter setting.

**Fig. 2**
Diagram depicting the operational summary of the proposed model.

**Fig. 3**
Results of the effectiveness analysis of various algorithms based on the experimental data. Figure 3(A, B, C) illustrate the ACC calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(D, E, F) illustrate the PRE calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(G, H, I) illustrate the REC calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(J, K, L) illustrate the F1-score calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. Figure 3(M, N, O) illustrate the ARI calculation results based on the UCIML, NLST, and NSCLC datasets, respectively. The horizontal axis in all figures represents the privacy budget ε. The black line represents the identification results of FCM, the blue line represents the identification results of DPFCM, and the red line represents the identification results of DPFCM_GK.

**Fig. 4**
Analysis results of clustering iteration times for various algorithms based on the experimental data. Figure 4(A, B, C) illustrate the calculation results of clustering iteration counts for each algorithm based on the UCIML, NLST, and NSCLC datasets, respectively.

**Fig. 5**
Analysis results of clustering running time (ms) of various algorithms based on the experimental data. Figure 5(A, B, C) illustrate the clustering running time (ms) of each algorithm on the UCIML, NLST, and NSCLC datasets, respectively.

**Fig. 6**
Performance evaluation and availability verification results of DPFCM_GK, DPFCM, and FCM using LLCS. (A) Comparison of identified cases for DPFCM_GK, DPFCM, and FCM. The numbers inside the colored blocks represent the count of identified cases corresponding to the horizontal axis. Red blocks indicate the results of DPFCM_GK, blue blocks indicate the results of DPFCM, and gray blocks indicate the results of FCM. M1, M2, and M3 denote DPFCM_GK, DPFCM, and FCM, respectively. The symbol “*” represents a statistically significant difference between two algorithms with a p-value less than 0.05, while “**” represents a p-value less than 0.01. (B, D, F) ROC curves for DPFCM_GK, DPFCM, and FCM, respectively, illustrating their performance under LLCS classification. (C, E, G) PR curves for DPFCM_GK, DPFCM, and FCM, respectively, demonstrating their precision-recall relationships under LLCS classification.

**Fig. 7**
Performance evaluation of DPFCM_GK and other methods under varying privacy budgets using LLCS. (A) The relationship between the test accuracy (%) of various models and privacy budget values. (B) The relationship between the misclassification rate (%) of various models and privacy budget values. (C) The relationship between the misclassification rate (%) of various models and the number of iterations, with the privacy budget fixed at 5. (D-J) Confusion matrices of the DPFCM, DPFCM_GK, LDP-Fed, LDP-FL, LDP-SGD, MGM-DPFL, and MGM-FedSGD models on the LLCS dataset, with the privacy budget ε set to 5.
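
The effect described in Fig. 1, where different Laplace parameter settings degrade an image to different degrees, can be illustrated with a small sketch. The parameter values used for the figure panels are not available in this text, so the location `mu` and the scale values `b` below are hypothetical, and a constant synthetic array stands in for a CT slice.

```python
import numpy as np

def add_laplace_noise(image, mu=0.0, b=1.0, seed=0):
    """Add Laplace(mu, b) noise per pixel and clip to the 8-bit range."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float64) + rng.laplace(mu, b, size=image.shape)
    return np.clip(noisy, 0.0, 255.0)

# Synthetic stand-in for a CT slice (assumption: 8-bit grayscale values).
image = np.full((64, 64), 128.0)

# Mean absolute pixel error for several hypothetical scale values b.
errors = {}
for b in (1.0, 5.0, 20.0):
    noisy = add_laplace_noise(image, b=b)
    errors[b] = float(np.abs(noisy - image).mean())
    print(f"b = {b}: mean absolute error = {errors[b]:.2f}")
```

A Laplace distribution with scale b has variance 2b², so larger b hides more image detail. In the Laplace mechanism the scale is set to the query sensitivity divided by ε, which is why a smaller privacy budget produces a noisier image.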

Grants and funding

YKK20171/Nanjing Health Department Medical Technology Development Foundation
