J Am Med Inform Assoc. 2012 Jun;19(e1):e157-61.
doi: 10.1136/amiajnl-2011-000329. Epub 2012 Feb 1.

A simple heuristic for blindfolded record linkage

Susan C Weber et al. J Am Med Inform Assoc. 2012 Jun.

Abstract

Objectives: To address the challenge of balancing privacy with the need to create cross-site research registry records on individual patients, while matching the data for a given patient as he or she moves between participating sites. To evaluate the strategy of generating anonymous identifiers based on real identifiers in such a way that the chances of a shared patient being accurately identified were maximized, and the chances of incorrectly joining two records belonging to different people were minimized.

Methods: Our hypothesis was that most variation in names occurs after the first two letters, and that date of birth is highly reliable, so a single match variable consisting of a hashed string built from the first two letters of the patient's first and last names plus their date of birth would have the desired characteristics. We compared the match characteristics (false-positive rate versus false-negative rate) of our chosen variable against those of both Social Security Numbers and full names.
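
The match variable described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name a hash function, a normalization rule, or a date format, so SHA-256, upper-casing, ISO 8601 dates, the optional shared `salt`, and the function name `match_key` are all assumptions made here.

```python
import hashlib
from datetime import date


def match_key(first_name: str, last_name: str, dob: date, salt: str = "") -> str:
    """Build an anonymized match key from name prefixes and date of birth.

    Assumptions (not stated in the abstract): names are trimmed and
    upper-cased, the date is rendered as YYYY-MM-DD, and SHA-256 is the hash.
    The optional salt stands in for a secret shared by participating sites.
    """
    prefix = first_name.strip().upper()[:2] + last_name.strip().upper()[:2]
    raw = f"{prefix}{dob.isoformat()}{salt}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Spelling variation after the second letter does not change the key.
site_a = match_key("Katherine", "Smith", date(1970, 1, 2))
site_b = match_key("Kathy", "Smyth", date(1970, 1, 2))
print(site_a == site_b)  # True: both reduce to "KASM" + "1970-01-02"
```

Only the hex digest would leave each site, so the receiving party can join records on equal keys without seeing names or dates of birth.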

Results: In a data set of 19 000 records, a derived match variable consisting of a 2-character prefix from both first and last names combined with date of birth has a sensitivity of 97%; by contrast, an anonymized identifier based on the patient's full names and date of birth has a sensitivity of only 87%, and SSN has a sensitivity of 86%.
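
Sensitivity here is the fraction of record pairs known to belong to the same person that the match variable actually links. The sketch below shows that calculation on a tiny made-up example; the record identifiers, key values, and the function name `sensitivity` are hypothetical and are not taken from the study data.

```python
def sensitivity(gold_pairs, keys_site_a, keys_site_b):
    """Fraction of known same-person pairs whose match keys agree.

    gold_pairs: list of (record_id_a, record_id_b) verified as the same person.
    keys_site_a / keys_site_b: dicts mapping record id -> anonymized match key.
    """
    if not gold_pairs:
        raise ValueError("gold_pairs must not be empty")
    true_positives = sum(
        1 for a, b in gold_pairs if keys_site_a[a] == keys_site_b[b]
    )
    return true_positives / len(gold_pairs)


# Hypothetical data: four verified pairs, one of which has mismatched keys.
keys_a = {"a1": "k1", "a2": "k2", "a3": "k3", "a4": "k4"}
keys_b = {"b1": "k1", "b2": "k2", "b3": "kX", "b4": "k4"}
gold = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
print(sensitivity(gold, keys_a, keys_b))  # 0.75
```

The mismatched pair counts as a false negative, which is the kind of case the Figure 3 caption describes.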

Conclusion: The approach we describe is most useful in situations where privacy policies preclude the full exchange of the identifiers required by more sophisticated and sensitive linkage algorithms. For data sets of sufficiently high quality, this approach, while producing a lower rate of matching than more complex algorithms, has the merit of being easy to explain to institutional review boards, adheres to the minimum necessary rule of the HIPAA Privacy Rule, and is faster and less cumbersome to implement than a full probabilistic linkage.

Conflict of interest statement

Competing interests: None.

Figures

Figure 1
Research data flow between the two separate Health Insurance Portability and Accountability Act (HIPAA) covered entities participating in the study. Protected Health Information (PHI) remains secure at all times. Informaticians acting as honest brokers hold the electronic code book table that maps hash codes to patient research database identifiers. Anonymized data is extracted, transformed and loaded (ETLed) into a staging database and securely transmitted over a dedicated virtual private network (VPN).
Figure 2
Characteristics of name prefix plus date of birth as an identity preserving anonymized identifier. PAMF, Palo Alto Medical Foundation; SSN, social security number.
Figure 3
Analysis of reasons for false negatives in name prefix/date of birth (DoB) hash. These cases were identified by matching social security numbers. All 59 false negatives were determined by manual review to be the same person.
Figure 4
Reasons for false negatives in social security number (SSN) matching. In 12 cases the SSNs were completely different, despite our verification that the matched records referred to the same individual.
