Medical record linkage in health information systems by approximate string matching and clustering
- PMID: 16219102
- PMCID: PMC1274322
- DOI: 10.1186/1472-6947-5-32
Abstract
Background: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making, by computing a value reflecting identity proximity.
Methods: The proposed method has three steps. The first standardises and indexes elementary identity fields, using blocking variables, in order to speed up information analysis. The second matches similar pairs of records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. The third creates clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods.
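The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (surname, birth year), the blocking key (first letter of the surname plus birth year) and the 0.9 threshold are assumptions chosen for illustration. The similarity is a plain Jaro-Winkler score (the "Porter" component named in the abstract is not reproduced), and the clustering step is simplified to connected components via union-find, standing in for the graph, agglomerative and partitioning methods.

```python
from collections import defaultdict
from itertools import combinations


def standardise(name: str) -> str:
    """Step 1a: upper-case the field and keep letters only."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Step 2: Jaro score boosted by a common prefix of up to 4 characters."""
    score = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


def link(records, threshold=0.9):
    """Return clusters of record indices; threshold is a hypothetical cut-off."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 1b: blocking, so only records sharing a key are compared.
    blocks = defaultdict(list)
    for idx, (surname, birth_year) in enumerate(records):
        blocks[(standardise(surname)[:1], birth_year)].append(idx)

    # Step 2: pairwise similarity inside each block.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            sim = jaro_winkler(standardise(records[a][0]),
                               standardise(records[b][0]))
            if sim >= threshold:
                parent[find(a)] = find(b)

    # Step 3 (simplified): connected components of the match graph.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


records = [("Martha", 1970), ("Marhta", 1970), ("MARTHA ", 1970),
           ("Dupont", 1970), ("Martha", 1985)]
print(link(records))
```

In this toy input the first three records fall in the same block and are merged into one cluster, while the last record is never compared with them because its birth year puts it in a different block. That trade-off (fewer comparisons, but missed matches when the blocking field itself is wrong) is inherent to blocking.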
Results: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 truly unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters of 3 or more records.
Conclusion: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e. real-time) proximity detection when inserting a new identity.
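Proximity detection at insertion time can be sketched as an index lookup followed by scoring of the few candidates in the same block. The class below is a hypothetical illustration, not the system described in the paper: the `IdentityIndex` name, the blocking key and the 0.8 threshold are assumptions, and the similarity is Python's standard `difflib.SequenceMatcher` ratio, swapped in for the paper's similarity value to keep the example short.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class IdentityIndex:
    """Hypothetical in-memory index of identities, grouped by blocking key."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold          # assumed cut-off, not from the paper
        self.blocks = defaultdict(list)     # key -> [(record id, normalised surname)]

    @staticmethod
    def _normalise(surname: str) -> str:
        return surname.strip().upper()

    def _key(self, surname: str, birth_year: int):
        # Assumed blocking key: first letter of the surname plus birth year.
        return (self._normalise(surname)[:1], birth_year)

    def near_matches(self, surname: str, birth_year: int):
        """Score the probe against stored records of the same block only."""
        probe = self._normalise(surname)
        hits = []
        for rec_id, stored in self.blocks.get(self._key(surname, birth_year), []):
            score = SequenceMatcher(None, probe, stored).ratio()
            if score >= self.threshold:
                hits.append((rec_id, score))
        return sorted(hits, key=lambda hit: -hit[1])

    def insert(self, rec_id, surname: str, birth_year: int):
        """Report close existing identities, then store the new record."""
        hits = self.near_matches(surname, birth_year)
        self.blocks[self._key(surname, birth_year)].append(
            (rec_id, self._normalise(surname)))
        return hits


index = IdentityIndex()
first = index.insert(1, "Dupont", 1970)
second = index.insert(2, "Dupond", 1970)
third = index.insert(3, "Durand", 1970)
print(first, second, third)
```

Because only one block is scanned per insertion, the cost of the check depends on block size rather than on the size of the whole database, which is what makes an immediate response plausible. The returned scores could be shown to an operator as the proximity value that supports the human decision.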