Eur Heart J Digit Health. 2025 Feb 25;6(3):417-426.
doi: 10.1093/ehjdh/ztaf011. eCollection 2025 May.

Siamese neural network-enhanced electrocardiography can re-identify anonymized healthcare data

Krzysztof Macierzanka et al. Eur Heart J Digit Health. 2025.

Abstract

Aims: Many research databases with anonymized patient data contain electrocardiograms (ECGs) from which traditional identifiers have been removed. We evaluated the ability of artificial intelligence (AI) methods to determine the similarity between ECGs and assessed whether they have the potential to be misused to re-identify individuals from anonymized datasets.

Methods and results: We utilized a convolutional Siamese neural network (SNN) architecture, which derives a Euclidean distance similarity metric between two input ECGs. A secondary care dataset of 864 283 ECGs (72 455 subjects) was used. Siamese neural network-electrocardiogram (SNN-ECG) achieves an accuracy of 91.68% when classifying between 2 689 124 same-subject pairs and 2 689 124 different-subject pairs. This performance increases to 93.61% and 95.97% in the outpatient and normal ECG subsets, respectively. In a simulated 'motivated intruder' test, SNN-ECG can identify individuals from large datasets. In datasets of 100, 1000, 10 000, and 20 000 ECGs, where only one ECG is also from the reference individual, it achieves success rates of 79.2%, 62.6%, 45.0%, and 40.0%, respectively. With random selection, the success rates would be 1%, 0.1%, 0.01%, and 0.005%, respectively. Additional basic information, such as subject sex or age range, enhances performance further. We also found that, on the subject level, ECG pair similarity is clinically relevant; greater ECG dissimilarity associates with all-cause mortality [hazard ratio, 1.22 (1.21-1.23), P < 0.0001] and is additive to an AI-ECG model trained for mortality prediction.
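
The chance-level figures quoted above follow from a simple argument: if exactly one of n candidate ECGs belongs to the reference individual, a uniformly random pick succeeds with probability 1/n. The short sketch below reproduces those baselines and compares them with the success rates reported in the abstract. The function name `chance_success_percent` is illustrative and not taken from the paper.

```python
# Success rates (%) reported in the abstract, keyed by dataset size.
reported = {100: 79.2, 1000: 62.6, 10_000: 45.0, 20_000: 40.0}


def chance_success_percent(n_candidates: int) -> float:
    """Percent chance of picking the single matching ECG at random."""
    return 100.0 / n_candidates


for n, observed in reported.items():
    chance = chance_success_percent(n)
    # Ratio of reported success to the random-guess baseline.
    print(f"n={n}: chance {chance:g}%, reported {observed}%, "
          f"about {observed / chance:.0f}x chance")
```

The ratio grows with dataset size because the random baseline falls as 1/n while the reported success rate declines much more slowly.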

Conclusion: Anonymized ECGs retain information that may facilitate subject re-identification, raising privacy and data protection concerns. However, SNN-ECG models also have positive uses and can enhance risk prediction of cardiovascular disease.

Keywords: Artificial intelligence; Continuous monitoring; Electrocardiogram; Identification; Siamese neural network.

Conflict of interest statement

Conflict of interest: J.W.W. and D.B.K. were previously on the advisory board for HeartcoR Solutions LLC. J.W.W. has received research support from Anumana. F.S.N. reports speaker fees from GE HealthCare and is on the advisory board for AstraZeneca. The remaining authors have no conflicts to declare.

Figures

Figure 1
The SNN architecture and triplet loss function: the SNN receives three ECG inputs: a reference ECG from a given subject A (anchor), another ECG from the same subject A (positive), and an ECG from a different subject B (negative) and encodes embeddings for these using three identical CNNs. The Euclidean distances between anchor and positive embeddings and between anchor and negative embeddings are calculated, with the triplet loss function updating the weights of the CNN to encode anchor-positive embeddings closer together than anchor-negative embeddings. AN, anchor-negative; AP, anchor-positive; CNN, convolutional neural network; SNN, Siamese neural network.
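
As a rough illustration of the triplet objective described in this caption, the sketch below computes a standard hinge-style triplet loss on toy embedding vectors. It is a minimal sketch under stated assumptions: the margin value, the use of unsquared Euclidean distance, and the function names are hypothetical, since the abstract and caption do not specify them, and the CNN that would produce the embeddings is omitted.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss; `margin` is an assumed hyperparameter.

    The loss is zero once the anchor-positive (AP) distance is smaller
    than the anchor-negative (AN) distance by at least `margin`.
    """
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


# Toy embeddings (invented for illustration, not real ECG embeddings).
anchor = [0.0, 0.0]
same_subject = [0.1, 0.0]
other_subject = [3.0, 4.0]
print(triplet_loss(anchor, same_subject, other_subject))
```

Minimizing this quantity over many triplets pushes same-subject embeddings together and different-subject embeddings apart, which is the behaviour the caption attributes to the weight updates.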
Figure 2
Re-identification of a subject from an anonymized dataset: (A) A visualization of the re-identification process during which Euclidean distances are calculated between a subject’s reference ECG and all other ECGs in an anonymized dataset (where only one also belongs to that subject). The Euclidean distance for the only correct ECG pair (i.e. that belonging to the same subject) is shown in green, and the Euclidean distances for all other incorrect ECG pairs are shown in red. (B) SNN-ECG is only successful at re-identification when the Euclidean distance between the two ECGs belonging to the reference subject is the shortest. (C) In a real-world setting, where the correct ECG from the anonymized dataset is not known, a certainty score is output alongside the model’s top choice to aid interpretation. As the discrepancy between the two ECGs from the anonymized dataset that are closest to the reference ECG increases (i.e. the gap between their Euclidean distances widens), the model is more confident of its top choice. SNN, Siamese neural network.
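
The matching procedure in this caption amounts to a nearest-neighbour search over embeddings, which can be sketched as follows. This is an illustrative sketch only: the caption does not give the formula for the certainty score, so the relative gap `1 - d1/d2` between the two smallest distances is a stand-in chosen here, and the function names and toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def reidentify(reference, candidates):
    """Return (index of the closest candidate, gap score in [0, 1]).

    The gap score is an assumed stand-in for the paper's certainty
    score: it grows as the best match pulls away from the runner-up.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to compute a gap")
    distances = [euclidean(reference, c) for c in candidates]
    order = sorted(range(len(candidates)), key=distances.__getitem__)
    d1, d2 = distances[order[0]], distances[order[1]]
    gap = 0.0 if d2 == 0 else 1.0 - d1 / d2
    return order[0], gap


# Toy embeddings (invented): candidate 1 is closest to the reference.
reference = [0.0, 0.0]
candidates = [[5.0, 5.0], [0.1, 0.0], [2.0, 0.0]]
best, gap = reidentify(reference, candidates)
print(best, round(gap, 3))
```

Re-identification counts as successful only when the returned index is the one candidate that truly belongs to the reference subject, matching panel (B).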
Figure 3
Distributions of AP and AN pair normalized Euclidean distances: these are shown for (A) all hold-out test subjects (n subjects = 21 737) and subsets of (B) outpatient (n = 17 085), (C) normal (n = 3438), (D) LBBB (n = 1438), (E) RBBB (n = 2139), and (F) AF (n = 4206) subjects. The binary thresholds, overall accuracies, and confusion matrices are shown. If discrimination between AP and AN pairs were perfect, there would be no overlap between the two histograms. AF, atrial fibrillation; AN, anchor-negative; AP, anchor-positive; LBBB, left bundle branch block; RBBB, right bundle branch block.
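
The binary threshold, accuracy, and confusion matrix mentioned in this caption can be illustrated with a small sketch. It assumes distances are normalized so that the decision threshold sits at 1, which is what the Figure 4 caption suggests; the function name and the toy distances are invented for illustration and are not study data.

```python
def classify_pairs(ap_distances, an_distances, threshold=1.0):
    """Label a pair 'same subject' when its distance is below threshold.

    AP pairs are truly same-subject, AN pairs are truly different-subject.
    Returns confusion-matrix counts and the overall accuracy.
    """
    tp = sum(d < threshold for d in ap_distances)   # AP called same
    fn = len(ap_distances) - tp                     # AP called different
    tn = sum(d >= threshold for d in an_distances)  # AN called different
    fp = len(an_distances) - tn                     # AN called same
    total = len(ap_distances) + len(an_distances)
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp,
            "accuracy": (tp + tn) / total}


# Toy normalized distances (invented for illustration).
ap = [0.2, 0.8, 1.3]
an = [0.9, 1.5, 2.0]
print(classify_pairs(ap, an))
```

Overlap between the two histograms corresponds to the `fn` and `fp` counts: AP pairs that fall above the threshold and AN pairs that fall below it.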
Figure 4
Evolution of ECG similarity over time: data for two subjects with multiple ECGs, from the hold-out test set, are shown. (A) and (B) are from one subject, and (C) and (D) are from a second subject. (A) and (C) show the normalized Euclidean distances for all ECG pairs. Each point represents a unique ECG pair and is coloured by the normalized Euclidean distance output by SNN-ECG, with smaller distances in green and larger distances in brown. The chronologically successive ECG pairs are plotted along the bottom-left to top-right diagonal. The normalized Euclidean distance generally increases towards the top-left corner with increasing time between the acquisition of ECGs in a given AP pair. (B) and (D) show four ECG median beat traces for each subject. In (B), these correspond to the first recorded ECG, subsequent ECGs from dates i and ii, and the last recorded ECG. In (D), these correspond to the first recorded ECG, a subsequent ECG from date iii, a further ECG also from date iii, and the last recorded ECG. The normalized Euclidean distances and temporal differences (in days) between the four ECGs are shown. A normalized Euclidean distance < 1 indicates that the model correctly predicts these two ECGs to be from the same subject. Regardless of the time between ECGs, the model can account for slight variations within a subject’s ECGs, but begins to fail when ECG traces show gross changes in morphology, as in (D) with evolving QRS morphology, even on the same day.
