Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2014 Dec;26(12):2956-2968.
doi: 10.1109/TKDE.2013.91.

Composite Bloom Filters for Secure Record Linkage

Affiliations

Composite Bloom Filters for Secure Record Linkage

Elizabeth Ashley Durham et al. IEEE Trans Knowl Data Eng. 2014 Dec.

Abstract

The process of record linkage seeks to integrate instances that correspond to the same entity. Record linkage has traditionally been performed through the comparison of identifying field values (e.g., Surname), however, when databases are maintained by disparate organizations, the disclosure of such information can breach the privacy of the corresponding individuals. Various private record linkage (PRL) methods have been developed to obscure such identifiers, but they vary widely in their ability to balance competing goals of accuracy, efficiency and security. The tokenization and hashing of field values into Bloom filters (BF) enables greater linkage accuracy and efficiency than other PRL methods, but the encodings may be compromised through frequency-based cryptanalysis. Our objective is to adapt a BF encoding technique to mitigate such attacks with minimal sacrifices in accuracy and efficiency. To accomplish these goals, we introduce a statistically-informed method to generate BF encodings that integrate bits from multiple fields, the frequencies of which are provably associated with a minimum number of fields. Our method enables a user-specified tradeoff between security and accuracy. We compare our encoding method with other techniques using a public dataset of voter registration records and demonstrate that the increases in security come with only minor losses to accuracy.

Keywords: Bloom filter; data matching; entity resolution; privacy; record linkage; security.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
RBF encoding generation. 1) FBF parameterization & generation. FBFs are sized based on the expected field length. In this example, there are two hash functions per FBF. 2) Eligible bit identification. More secure bits, shown with a heavy outline, are identified for inclusion in the RBF. 3) Field weighting. It is determined that 25% of the bits in the RBF should be drawn from Forename, 50% of the bits in the RBF should be drawn from Surname, and 25% of the bits in the RBF should be drawn from City. 4) RBF parameterization & generation. mRBF is determined and eligible bits are sampled according to the field weights. 5) RBF permutation. Bits in the RBF are randomly permuted. The number of bits set in the RBF shown to the left and right is 11 and 15, respectively. The number of bits in the intersection is 11, yielding a Dice coefficient of 0.84.
Fig. 2
Fig. 2
The effect of FBF sizing on the average (and standard deviation) percent of bits set to 1 for the static and dynamic settings. The percent of bits set is calculated over all of the encoded NCVR records, both clean and corrupt. The variability across fields for static FBFs reveals information about the length of encoded field values, but this is not the case for dynamic FBFs.
Fig. 3
Fig. 3
A heat map representation of the frequencies at which bits in the FBFs are set for select fields.
Fig. 4
Fig. 4
Histograms for the number of bits in a) static and b) dynamic FBFs set to 1 at frequency intervals of 1100.
Fig. 5
Fig. 5
The results of the neighborhood test where each color indicates the neighbor rank of the sample's True Match (black: top rank, dark gray: 2nd through 10th rank, light gray: beyond the 10th rank). The controls include binary field comparison and the JW string comparator. The experimental techniques are the RBFs based on statically or dynamically sized FBFs. The numbers on the bars correspond to how many of the 100 neighborhood tests in which the first rank was the True Match.
Fig. 6
Fig. 6
The results of the neighborhood test for dynamic/weighted RBFs as a function of mRBF (black: top rank, dark gray: 2nd through 10th rank, light gray: beyond the 10th rank).
Fig. 7
Fig. 7
The results of the neighborhood test for dynamic/weighted RBFs and CLK for varying levels of corruption (black: top rank, dark gray: 2nd through 10th rank, light gray: beyond the 10th rank).
Fig. 8
Fig. 8
The number of bits eligible for RBF inclusion, given various values for the security constraint q, for dynamic and static settings.
Fig. 9
Fig. 9
Neighborhood test results show that the effect of the security constraint q on Dynamic/Weighted RBF accuracy, as determined by the Neighborhood test (black: top rank, dark gray: 2nd through 10th rank, light gray: beyond the 10th rank).
Fig. 10
Fig. 10
These neighborhood test results show the effect of Dynamic/Weighted RBF sizing and the security constraint q (black: top rank, dark gray: 2nd through 10th rank, light gray: beyond the 10th rank).
Fig. 11
Fig. 11
A detailed view of the effect of the security constraint (q) on neighborhood order preservation.

References

    1. Elfeky M, Verykios V, Elmagarmid A. Tailor: a record linkage toolbox. Proc. 18th IEEE Int'l Conf. Data Eng. 2002:17–28.
    1. Bhattacharya I, Getoor L. Iterative record linkage for cleaning and integration. Proc. 9th ACM SIGMOD Workshop on Research Issues in Data Mining & Knowledge Discovery. 2004:11–18.
    1. Halevy A, Rajaraman A, Ordille J. Data integration: the teenage years. Proc. 32nd Int'l Conf. on Very Large Data Bases. 2006:9–16.
    1. Vatsalan D, Christen P, Verykios V. A taxonomy of privacy-preserving record linkage techniques. Information Systems. 2013 in press.
    1. Yao AC. Protocols for secure computations. IEEE Annual Symp. on Foundations of Computer Science. 1982:160–164.