2022 Dec 30;10(12):e38922.
doi: 10.2196/38922.

A Privacy-Preserving Distributed Medical Data Integration Security System for Accuracy Assessment of Cancer Screening: Development Study of Novel Data Integration System

Atsuko Miyaji et al. JMIR Med Inform.

Abstract

Background: Big data useful for epidemiological research can be obtained by integrating data corresponding to individuals between databases managed by different institutions. Privacy information must be protected while performing efficient, high-level data matching.

Objective: Privacy-preserving distributed data integration (PDDI) enables data matching between multiple databases without moving privacy information; however, its actual implementation requires matching security, accuracy, and performance. Moreover, identifying the optimal data item in the absence of a unique matching key is necessary. We aimed to conduct a basic matching experiment using a model to assess the accuracy of cancer screening.

Methods: To experiment with actual data, we created a data set mimicking the cancer screening and registration data in Japan and conducted a matching experiment using a PDDI system between geographically distant institutions. Errors similar to those found empirically in data sets recorded in Japanese were artificially introduced into the data set. The matching-key error rate of the data common to both data sets was set sufficiently higher than expected in the actual database: 85.0% and 59.0% for the data simulating colorectal and breast cancers, respectively. Various combinations of name, gender, date of birth, and address were used for the matching key. To evaluate the matching accuracy, the matching sensitivity and specificity were calculated based on the number of cancer-screening data points, and the effect of matching accuracy on the sensitivity and specificity of cancer screening was estimated based on the obtained values. To evaluate the performance, we measured central processing unit use, memory use, and network traffic.
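The matching sensitivity and specificity described in the Methods can be illustrated with a minimal sketch. It assumes a simplified, per-record view: each cancer-screening record either truly has a counterpart in the registry data or does not, and the linkage system either links it or does not. The function name `matching_accuracy` and its three arguments are hypothetical and are not taken from the study; the study's exact counting rules may differ.

```python
from typing import Iterable, Set, Tuple


def matching_accuracy(
    screening_ids: Iterable[int],
    truly_in_registry: Set[int],
    linked_by_system: Set[int],
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of record matching.

    Assumption: accuracy is counted per screening record, and a record
    linked to the wrong registry entry is not modeled separately.
    """
    tp = fn = fp = tn = 0
    for record_id in screening_ids:
        has_true_match = record_id in truly_in_registry
        was_linked = record_id in linked_by_system
        if has_true_match and was_linked:
            tp += 1  # correctly linked
        elif has_true_match:
            fn += 1  # true match that the system missed
        elif was_linked:
            fp += 1  # link reported where no true match exists
        else:
            tn += 1  # correctly left unlinked

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Because the denominators are counts of screening records, a data set in which most screened individuals have no registry entry can show very high specificity even when a noticeable number of false links occur.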

Results: For combinations with a specificity ≥99% and high sensitivity, the date of birth and first name were used in the data simulating colorectal cancer, and the matching sensitivity and specificity were 55.00% and 99.85%, respectively. In the data simulating breast cancer, the date of birth and family name were used, and the matching sensitivity and specificity were 88.71% and 99.98%, respectively. Assuming the sensitivity and specificity of cancer screening at 90%, the apparent values decreased to 74.90% and 89.93%, respectively. A trial calculation was performed using a combination with the same data set and 100% specificity. When the matching sensitivity was 82.26%, the apparent screening sensitivity was maintained at 90%, and the screening specificity decreased to 89.89%. For 214 data points, the execution time was 82 minutes and 26 seconds without parallelization and 11 minutes and 38 seconds with parallelization; 19.33% of the calculation time was for the data-holding institutions. Memory use was 3.4 GB for the PDDI server and 2.7 GB for the data-holding institutions.
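The way matching errors propagate into the apparent screening sensitivity and specificity can be sketched with a simple model. This is not the authors' calculation and is not expected to reproduce the figures above: it assumes that matching errors are independent of the screening result, that every linked record is counted as a cancer case and every unlinked record as a non-cancer case, and it takes a hypothetical `prevalence` parameter that the abstract does not report.

```python
from typing import Tuple


def apparent_screening_accuracy(
    se_screen: float,
    sp_screen: float,
    se_match: float,
    sp_match: float,
    prevalence: float,
) -> Tuple[float, float]:
    """Apparent screening (sensitivity, specificity) under imperfect linkage.

    All quantities are fractions of the screened population.
    Assumptions: linkage errors are independent of the screening result,
    and `prevalence` (true cancer fraction) is a hypothetical input.
    """
    cancer_linked = prevalence * se_match                # true cancers that were linked
    cancer_missed = prevalence * (1 - se_match)          # true cancers left unlinked
    healthy_linked = (1 - prevalence) * (1 - sp_match)   # false links
    healthy_unlinked = (1 - prevalence) * sp_match       # correctly unlinked

    # Apparent cancer cases are all linked records, true or false.
    screen_pos_among_linked = (
        cancer_linked * se_screen + healthy_linked * (1 - sp_screen)
    )
    apparent_se = screen_pos_among_linked / (cancer_linked + healthy_linked)

    # Apparent non-cancer cases are all unlinked records.
    screen_neg_among_unlinked = (
        healthy_unlinked * sp_screen + cancer_missed * (1 - se_screen)
    )
    apparent_sp = screen_neg_among_unlinked / (healthy_unlinked + cancer_missed)
    return apparent_se, apparent_sp
```

Under these assumptions, a matching specificity of 100% leaves the apparent screening sensitivity unchanged, because every linked record is a true cancer case, while missed cancers still dilute the apparent specificity. This agrees qualitatively with the trial calculation reported in the Results.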

Conclusions: We demonstrated the rudimentary feasibility of introducing a PDDI system for cancer-screening accuracy assessment. We plan to conduct matching experiments based on actual data and compare them with the existing methods.

Keywords: PDDI; PSI; big data; cancer epidemiology; cancer prevention; data linkage; data security; epidemiological survey; medical informatics; privacy-preserving distributed data integration; privacy-preserving linkage; private set intersection; secure data integration; secure matching privacy-preserving linkage.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Schematic of the privacy-preserving distributed data integration (PDDI) system algorithm. Steps 1 to 4 represent each step of the merging process using the PDDI system described in the main text. The data held by each institution are encrypted and matched by the PDDI server using the data as the matching key. The analysis target data, which are related to the matching key without distinction between institutions, are decrypted only when they are provided to the client, and the matching-key information is never provided to the client.

Figure 2. Number of false positives and false negatives. Each point is placed according to the number of false positives and false negatives obtained under each experimental setting. Part A shows the results for the data simulating colorectal cancer, and part B shows the results for the data simulating breast cancer.

Figure 3. Execution time. The graph shows the relationship between the amount of data and the execution time. The solid line shows the execution time without parallelization, and the dashed line shows the execution time with parallelization.

Figure 4. Changes in central processing unit (CPU) usage. The graphs show the changes in CPU usage of the PDDI server and the data-holding institutions when the process is executed on 214 data points without parallelization. Part A shows the results for the PDDI server, and part B shows the results for the data-holding institution.

Figure 5. Memory usage. The graphs show the relationship between the amount of data and the memory usage of the PDDI server and the data-holding institutions. Part A shows the results for the PDDI server, and part B shows the results for the data-holding institution.
