Int J Epidemiol. 2016 Jun;45(3):954-64.
doi: 10.1093/ije/dyv322. Epub 2015 Dec 20.

Probabilistic record linkage

Adrian Sayers et al. Int J Epidemiol. 2016 Jun.

Abstract

Studies involving the use of probabilistic record linkage are becoming increasingly common. However, the methods underpinning probabilistic record linkage are not widely taught or understood, and therefore these studies can appear to be a 'black box' research tool. In this article, we aim to describe the process of probabilistic record linkage through a simple exemplar. We first introduce the concept of deterministic linkage and contrast this with probabilistic linkage. We illustrate each step of the process using a simple exemplar and describe the data structure required to perform a probabilistic linkage. We describe the process of calculating and interpreting matched weights and how to convert matched weights into posterior probabilities of a match using Bayes theorem. We conclude this article with a brief discussion of some of the computational demands of record linkage, how you might assess the quality of your linkage algorithm, and how epidemiologists can maximize the value of their record-linked research using robust record linkage methods.

Keywords: Record linkage; bias; data linkage; epidemiological methods; medical record linkage.

Figures

Figure 1
Illustration of two distinct files containing data on sex and education qualification. M, male; F, female; edu, education.
Figure 2
Results of merging two files using first and last name.
Figure 3
Results of joining two files and calculating simple agreement patterns. Fname, first name; Lname, last name; Ag pat, agreement pattern.
Figure 4
Results of joining two files and calculating complex agreement patterns using the edit distance between first name fields in the master file and the file of interest. Fname, first name; Lname, last name; Comp ag pat, complex agreement pattern.
Figure 5
Partitioning of two files into matched and unmatched records.
Figure 6
Matrix representation of true match status of two linked files (FileMaster and FileFOI) containing varying frequencies of surname. Mγ=1 indicates a true matched pair of records where linkage fields agree (bold on the diagonal), Mγ=0 indicates a true matched pair of records where linkage fields disagree, Uγ=1 indicates a true unmatched pair of records where linkage fields agree (the off-diagonal elements in the upper left and lower right quadrants), Uγ=0 indicates a true unmatched pair of records where linkage fields disagree (the lower left and upper right quadrants).
Figure 7
Comparison of results from a diagnostic test against the true disease status and a record linkage against the true match status. Dis, disease.
Figure 8
Calculation of simple agreement weights, log2 R(γj), using the Fellegi and Sunter record linkage framework. Fname, first name; Lname, last name; Ag pat, agreement pattern.
Figure 9
Calculation of complex agreement weights, log2 R(γj), using the Fellegi and Sunter record linkage framework. Fname, first name; Lname, last name; Ag pat, agreement pattern.
Figure 10
Probabilities of links based on complex agreement weights, log2 R(γj), calculated using Bayes theorem. Fname, first name; Lname, last name; Comp ag pat, complex agreement pattern.
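
The weight and probability calculations named in the captions for Figures 8 to 10 can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code. It assumes the standard Fellegi and Sunter notation, where m is the probability that a field agrees given a true match and u is the probability that it agrees given a true non-match, and it assumes fields are conditionally independent so that field weights can be summed. The function names, the m and u values, and the prior are all hypothetical.

```python
import math


def field_weight(agree: bool, m: float, u: float) -> float:
    """log2 likelihood ratio for one linkage field.

    m = P(field agrees | true match), u = P(field agrees | true non-match).
    Agreement contributes log2(m / u); disagreement log2((1 - m) / (1 - u)).
    """
    ratio = m / u if agree else (1 - m) / (1 - u)
    return math.log2(ratio)


def total_weight(pattern, m_probs, u_probs) -> float:
    """Sum the field weights for one agreement pattern.

    Summing assumes the fields are conditionally independent.
    """
    return sum(
        field_weight(agree, m, u)
        for agree, m, u in zip(pattern, m_probs, u_probs)
    )


def posterior_match_probability(weight: float, prior: float) -> float:
    """Convert a log2 match weight to P(match | pattern) via Bayes theorem.

    Posterior odds = prior odds * likelihood ratio, where the likelihood
    ratio is recovered from the weight as 2 ** weight.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical m and u values for two fields (first name, last name).
m_probs = [0.9, 0.95]
u_probs = [0.1, 0.05]
prior = 0.1  # assumed proportion of compared pairs that are true matches

for pattern in [(True, True), (True, False), (False, False)]:
    w = total_weight(pattern, m_probs, u_probs)
    p = posterior_match_probability(w, prior)
    print(pattern, round(w, 2), round(p, 4))
```

With these illustrative inputs, a pair agreeing on both fields has a likelihood ratio of 9 × 19 = 171, so prior odds of 1/9 become posterior odds of 19, a posterior match probability of 0.95. In practice m and u would be estimated from the data rather than fixed by hand.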

References

    1. Langan SM, Benchimol EI, Guttmann A et al. Setting the RECORD straight: developing a guideline for the REporting of studies Conducted using Observational Routinely collected Data. Clin Epidemiol 2013;5:29–31.
    2. Nicholls SG, Quach P, von Elm E et al. The REporting of Studies Conducted Using Observational Routinely-Collected Health Data (RECORD) Statement: Methods for Arriving at Consensus and Developing Reporting Guidelines. PloS One 2015;10:e0125620.
    3. von Elm E, Altman DG, Egger M et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology 2007;18:800–04.
    4. Clark DE. Practical introduction to record linkage for injury research. Inj Prev 2004;10:186–91.
    5. Fellegi IP, Sunter AB. A Theory for Record Linkage. J Am Stat Assoc 1969;64:1183.