BMC Med Inform Decis Mak. 2005 Oct 11;5:32.
doi: 10.1186/1472-6947-5-32.

Medical record linkage in health information systems by approximate string matching and clustering

Erik A Sauleau et al. BMC Med Inform Decis Mak.

Abstract

Background: The multiplication of data sources within heterogeneous healthcare information systems always results in redundant information split among multiple databases. Our objective is to detect exact and approximate duplicates among identity records, in order to improve the quality of information and to permit cross-linkage among stand-alone and clustered databases. We also need to assist human decision making by computing a value that reflects identity proximity.

Methods: The proposed method has three steps. The first standardises and indexes elementary identity fields, using blocking variables to speed up the analysis. The second matches pairs of similar records, relying on a global similarity value derived from the Porter-Jaro-Winkler algorithm. The third creates clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods.
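The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the field names (surname, first_name, birth_year), the blocking key (birth year plus surname initial), the global similarity taken as the plain mean of per-field Jaro-Winkler scores, and the 0.9 threshold. Only the standard Jaro-Winkler measure is shown; whatever the "Porter" component adds in the paper is not reproduced. Clustering is simplified to connected components of the match graph, which stands in for the graph, agglomerative and partitioning methods named above.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def standardise(text):
    """Upper-case, strip accents and drop non-letters (assumed normalisation)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed.upper() if "A" <= c <= "Z")


def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def global_similarity(a, b, fields=("surname", "first_name")):
    """Mean of per-field Jaro-Winkler scores (an assumed aggregation)."""
    return sum(jaro_winkler(a[f], b[f]) for f in fields) / len(fields)


def link(records, threshold=0.9):
    """Block, compare pairs within each block, then group matches into clusters."""
    clean = [{**r, "surname": standardise(r["surname"]),
              "first_name": standardise(r["first_name"])} for r in records]

    # Step 1: index records by an assumed blocking key.
    blocks = defaultdict(list)
    for idx, r in enumerate(clean):
        blocks[(r["birth_year"], r["surname"][:1])].append(idx)

    parent = list(range(len(clean)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 2: compare only pairs that share a block; merge pairs above threshold.
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if global_similarity(clean[i], clean[j]) >= threshold:
                parent[find(i)] = find(j)

    # Step 3: connected components of the match graph (a simplification).
    clusters = defaultdict(list)
    for idx in range(len(clean)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

Calling link on a list of record dictionaries returns lists of record indices, one list per cluster. Records that share no blocking key are never compared, which is where the speed-up comes from, at the cost of missing duplicates whose blocking fields themselves differ.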

Results: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records.

Conclusion: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e. real-time) proximity detection when inserting a new identity.
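Insertion-time detection can be sketched as follows. This is a hypothetical in-memory index, not the system described in the paper: the class name IdentityIndex, the field names, the blocking key and the 0.85 threshold are all assumptions, and the score comes from Python's difflib.SequenceMatcher as a stand-in for the paper's similarity value.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class IdentityIndex:
    """Hypothetical in-memory index keyed by a blocking value."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.blocks = defaultdict(list)

    @staticmethod
    def _key(record):
        # Assumed blocking key: birth year plus surname initial.
        return (record["birth_year"], record["surname"][:1].upper())

    @staticmethod
    def _similarity(a, b):
        # Stand-in score from difflib, not the paper's similarity measure.
        left = f'{a["surname"]} {a["first_name"]}'.upper()
        right = f'{b["surname"]} {b["first_name"]}'.upper()
        return SequenceMatcher(None, left, right).ratio()

    def insert(self, record):
        """Store the record and return (score, existing_record) pairs near it."""
        block = self.blocks[self._key(record)]
        near = [(self._similarity(record, other), other) for other in block]
        near = [pair for pair in near if pair[0] >= self.threshold]
        near.sort(key=lambda pair: pair[0], reverse=True)
        block.append(record)
        return near
```

Each call to insert compares the new identity only with records in the same block, so the cost depends on block size rather than on the size of the whole database. The returned scores are meant to support a human decision, not to merge records automatically.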

Figures

Figure 1. Example of a complete graph (left panel) and of an incomplete graph (right panel).
Figure 2. Percentage of true positive couples by global similarity threshold value.
Figure 3. Summary of the entire linkage procedure.
