Int J Popul Data Sci. 2019 May 23;4(1):1095.
doi: 10.23889/ijpds.v4i1.1095

Evaluation of approximate comparison methods on Bloom filters for probabilistic linkage

A P Brown et al. Int J Popul Data Sci.

Abstract

Introduction: The need for increased privacy protection in data linkage has driven the development of privacy-preserving record linkage (PPRL) techniques. Much research in this area has focused on a popular technique based on Bloom filters, with cryptographic analyses, modifications, and hashing variations proposed to optimise privacy. Because Bloom filters have rarely been applied within a probabilistic framework, there is limited information on whether approximate matches between Bloom-filtered fields can improve linkage quality.

Objectives: In this study, we evaluate the effectiveness of three approximate comparison methods for Bloom filters within the context of the Fellegi-Sunter model of record linkage: the Sørensen-Dice coefficient, Jaccard similarity and Hamming distance.
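
As an illustration of the three measures named above, the following minimal sketch computes each one on a pair of Bloom filters. It assumes each filter is stored as a Python integer whose set bits are the filter's 1-positions; the function names, the 8-bit toy filters, and the normalised Hamming similarity are illustrative assumptions, not the authors' implementation.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def dice(a: int, b: int) -> float:
    """Sørensen-Dice coefficient: 2|A and B| / (|A| + |B|)."""
    total = popcount(a) + popcount(b)
    # Assumption: two empty filters are treated as identical.
    return 2 * popcount(a & b) / total if total else 1.0


def jaccard(a: int, b: int) -> float:
    """Jaccard similarity: |A and B| / |A or B|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which the two filters differ."""
    return popcount(a ^ b)


def hamming_similarity(a: int, b: int, length: int) -> float:
    """Hamming distance rescaled to [0, 1] for a filter of `length` bits."""
    return 1.0 - hamming_distance(a, b) / length


# Toy 8-bit filters (hypothetical values, chosen only for illustration).
bf_a = 0b10110100
bf_b = 0b10010110

print(dice(bf_a, bf_b))                   # 3 shared bits, 4 + 4 set bits
print(jaccard(bf_a, bf_b))                # 3 shared bits, 5 bits in the union
print(hamming_distance(bf_a, bf_b))       # 2 differing positions
print(hamming_similarity(bf_a, bf_b, 8))
```

Dice and Jaccard depend only on the set bits, whereas Hamming distance also depends on the filter length once it is normalised, which is one reason the measures can behave differently on the same pair of fields.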

Methods: Using synthetic datasets with introduced errors to simulate a range of data quality, together with a large real-world administrative health dataset, the research estimated partial weight curves for converting similarity scores (for each approximate comparison method) to partial weights at both field and dataset level. Deduplication linkages were run on each dataset using these partial weight curves. The resulting quality was then compared with that of linkages using simple cut-off similarity values and linkages using exact matching only.
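
To make the idea of a partial weight curve concrete, the sketch below maps a similarity score to a weight proportion and then to a partial weight under the Fellegi-Sunter model. It is a sketch under stated assumptions: the curve is represented as hypothetical (similarity, proportion) points joined by linear interpolation, and the partial weight is taken as a linear blend between the standard disagreement and agreement weights. The curve points and the m and u probabilities are invented for illustration and are not values estimated in the study.

```python
import math
from bisect import bisect_right


def agreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter full agreement weight, log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter full disagreement weight, log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


def weight_proportion(similarity: float, curve: list) -> float:
    """Piecewise-linear lookup on a curve of (similarity, proportion) points.

    `curve` must be sorted by similarity; values outside its range are
    clamped to the first or last proportion.
    """
    xs = [x for x, _ in curve]
    ys = [y for _, y in curve]
    if similarity <= xs[0]:
        return ys[0]
    if similarity >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, similarity)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (similarity - x0) / (x1 - x0)


def partial_weight(similarity: float, m: float, u: float, curve: list) -> float:
    """Blend disagreement and agreement weights by the curve's proportion.

    Assumption: proportion 0 gives the full disagreement weight and
    proportion 1 gives the full agreement weight.
    """
    p = weight_proportion(similarity, curve)
    w_agree = agreement_weight(m, u)
    w_disagree = disagreement_weight(m, u)
    return w_disagree + p * (w_agree - w_disagree)


# Hypothetical curve: no credit below 0.6 similarity, then a linear rise.
example_curve = [(0.0, 0.0), (0.6, 0.0), (1.0, 1.0)]

for s in (0.3, 0.8, 1.0):
    print(s, partial_weight(s, m=0.9, u=0.1, curve=example_curve))
```

A simple cut-off comparison corresponds to a step-shaped curve (proportion 0 below the threshold, 1 at or above it), and exact matching corresponds to a curve that is 0 everywhere except at similarity 1.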

Results: Linkages using approximate comparisons produced significantly better quality results than those using exact comparisons only. Field level partial weight curves for a specific dataset produced the best quality results. The Sørensen-Dice coefficient and Jaccard similarity produced the most consistent results across a spectrum of synthetic and real-world datasets.

Conclusion: The use of Bloom filter similarity comparisons for probabilistic record linkage can produce linkage quality results which are comparable to Jaro-Winkler string similarities with unencrypted linkages. Probabilistic linkages using Bloom filters benefit significantly from the use of similarity comparisons, with partial weight curves producing the best results, even when not optimised for that particular dataset.

Conflict of interest statement

Statement on conflicts of interest: All authors declare there are no conflicts of interest.

Figures

Figure 1: Estimated field and dataset weight curves
Weight proportion represents the proportion of a field match comparison weight (0 = full disagreement, 1 = full agreement).
Figure 2: Precision-recall for each comparison (synthetic datasets)
WC = weight curve
Figure 3: Precision-recall for each comparison (NSW Emergency dataset)
WC = weight curve

