BMC Med Res Methodol. 2021 Jul 29;21(1):155.
doi: 10.1186/s12874-021-01299-6.

A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers

Yue Jiao et al. BMC Med Res Methodol. 2021.

Abstract

Background: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies: GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on identifying genetic factors that modify cancer risk in BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors.

Methods: To identify as many as possible of the individuals who participate in both studies but are not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named "PRL + ML") pools the candidate matches identified by the two methods. We built the ML model using the gold standard from a first version of the two databases as the training dataset. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review.

Results: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. RF was therefore selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined PRL + ML linkage showed a higher recall (range 0.988-0.992) than either PRL (range 0.916-0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants).
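The pooling step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record pairs, the identifiers, and the helper names `recall` and `precision` are hypothetical and chosen only to show why taking the union of PRL and ML candidate matches cannot lower recall against a gold standard.

```python
def recall(predicted, truth):
    """Fraction of the true matches that appear among the predicted pairs."""
    return len(predicted & truth) / len(truth)


def precision(predicted, truth):
    """Fraction of the predicted pairs that are true matches."""
    return len(predicted & truth) / len(predicted) if predicted else 0.0


# Hypothetical toy data: each pair is (record id in database X, record id in database Y).
gold_standard = {(1, 101), (2, 102), (3, 103), (4, 104)}
prl_candidates = {(1, 101), (2, 102), (3, 103), (9, 109)}  # misses (4, 104)
ml_candidates = {(1, 101), (2, 102), (4, 104)}             # misses (3, 103)

# "PRL + ML": pool the candidate matches proposed by either method.
combined = prl_candidates | ml_candidates

prl_recall = recall(prl_candidates, gold_standard)
ml_recall = recall(ml_candidates, gold_standard)
combined_recall = recall(combined, gold_standard)
combined_precision = precision(combined, gold_standard)
```

Because the union contains every pair that either method proposes, its recall is at least that of the better single method. Its precision can be lower, since false candidates from both methods are pooled as well, which is consistent with the manual review of candidate matches described in the figures.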

Conclusions: Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.

Keywords: Hybrid process; Probabilistic linkage; Record linkage; Supervised machine learning.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Elaboration of the hybrid record linkage process and its main steps. a Assignment of the matching status by PRL followed by manual review, yielding a set of true matches that represents the gold standard. b Selection of the best-performing supervised machine learning algorithm. c Selection of the best-performing method among PRL, ML, and PRL + ML. d Training of a final ML model on a larger subset of the initial datasets. e Application of the optimal linkage method to link the updated databases
Fig. 2
Score distribution of 15,653,232 record pairs in dataset 1. a Whole score distribution. b Zoom on the distribution for the highest scores
Fig. 3
Performance of three linkage methods: PRL (Probabilistic Record Linkage), RF (Random Forest) and PRL + RF. PRL has thresholds varying from 0.6 to 0.8. a Comparison of their recalls. b Comparison of their precisions
Fig. 4
Comparison of candidate matches predicted by the RF and PRL models for the updated databases. a RF and PRL identified 819 and 1268 new candidate matches, respectively; 772 candidate matches were common to both approaches. b After manual review, PRL + RF led to the identification of 738 true matches, of which 727 were identified by PRL used on its own and 715 by RF used on its own. 704 true matches were identified by both approaches, 23 only by PRL, and 11 only by the RF model
Fig. 5
General overview of the hybrid record linkage process. a Probabilistic record linkage (PRL) followed by a stage of manual review is first applied to build a dataset allowing the construction of a supervised machine learning (ML) model. b The PRL + ML combined linkage is then used to classify the updated datasets (record pair comparison from Database X′ and Database Y′). The ML model obtained in (a) is used (dotted arrow) for the prediction in (b)
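The counts reported in the Fig. 4 caption can be cross-checked with inclusion-exclusion. The short Python sketch below uses only the numbers stated in that caption; the helper name `union_size` is hypothetical, and the pooled candidate total is a derived quantity, not a figure reported in the abstract.

```python
def union_size(size_a, size_b, size_both):
    """Size of the union of two sets, given each size and the size of their overlap."""
    return size_a + size_b - size_both


# Candidate matches before manual review (Fig. 4a).
prl_candidates, rf_candidates, common_candidates = 1268, 819, 772
pooled_candidates = union_size(prl_candidates, rf_candidates, common_candidates)

# True matches after manual review (Fig. 4b).
prl_true, rf_true, both_true = 727, 715, 704
pooled_true = union_size(prl_true, rf_true, both_true)

# True matches that only one method found.
prl_only_true = prl_true - both_true
rf_only_true = rf_true - both_true
```

The pooled true matches come to 738, and the method-specific remainders come to 23 (PRL only) and 11 (RF only), in agreement with the caption. This shows that each method contributes true matches the other misses.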
