J Am Med Inform Assoc. 2020 Aug 1;27(8):1244-1251.
doi: 10.1093/jamia/ocaa096.

Fold-stratified cross-validation for unbiased and privacy-preserving federated learning

Romain Bey et al. J Am Med Inform Assoc. 2020.

Abstract

Objective: We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).

Materials and methods: Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
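The partitioning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy records, and the use of birth year with shared bin edges (assumed to be agreed on by all sites beforehand) are assumptions made for the example.

```python
from bisect import bisect_right

def assign_folds(covariate_values, bin_edges):
    """Map each record to a fold index from its stratifying covariate.

    bin_edges are shared thresholds (hypothetical: agreed on by all sites
    beforehand), so equal covariate values always land in the same fold.
    """
    return [bisect_right(bin_edges, v) for v in covariate_values]

def fold_stratified_splits(folds):
    """Yield (train_idx, val_idx) pairs, holding out one fold at a time."""
    for held_out in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != held_out]
        val = [i for i, f in enumerate(folds) if f == held_out]
        yield train, val

# Toy records (hospital, birth_year); records 1 and 3 duplicate one patient.
records = [("A", 1950), ("A", 1972), ("B", 1988), ("B", 1972), ("B", 1961)]
edges = [1960, 1980]  # three folds: <1960, 1960-1979, >=1980
folds = assign_folds([year for _, year in records], edges)
splits = list(fold_stratified_splits(folds))
```

Because the fold depends only on the covariate value, each hospital can compute it locally, and the two duplicated records are never separated between training and validation.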

Results: In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.
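The leakage mechanism behind the overoptimistic bias can be checked on toy data by counting patients whose records fall in more than one fold. The sketch below is illustrative only: the dataset sizes, the birth-year covariate, and the 15-year bins are assumptions, and no model is trained.

```python
import random

def leaked_duplicates(patient_ids, folds):
    """Count patients whose records are spread over more than one fold."""
    seen = {}
    for pid, fold in zip(patient_ids, folds):
        seen.setdefault(pid, set()).add(fold)
    return sum(1 for fold_set in seen.values() if len(fold_set) > 1)

rng = random.Random(0)
n_folds = 5

# Hypothetical toy data: 200 patients, 50 of whom have a duplicated record.
birth_year = {pid: rng.randint(1930, 2000) for pid in range(200)}
patient_ids = list(range(200)) + list(range(50))

# Random partitioning: each record is assigned to a fold independently.
random_folds = [rng.randrange(n_folds) for _ in patient_ids]

# Stratified partitioning: the fold is a function of birth year only
# (15-year bins starting in 1930, giving fold indices 0 to 4).
stratified_folds = [(birth_year[pid] - 1930) // 15 for pid in patient_ids]

leaks_random = leaked_duplicates(patient_ids, random_folds)
leaks_stratified = leaked_duplicates(patient_ids, stratified_folds)
```

With stratification the count is zero by construction, since duplicates share the covariate; with random assignment some duplicated patients straddle training and validation folds.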

Discussion: Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
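A hedged sketch of the two cases discussed above: deriving a fold from the hash of a shared covariate, and the effect of pseudonymization. The function names, the salted-hash pseudonymization scheme, and the date value are hypothetical and chosen only for illustration.

```python
import hashlib

def fold_from_covariate(value, n_folds):
    """Derive a fold index from the SHA-256 hash of a stratifying covariate."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

def pseudonymize(value, site_salt):
    """Hypothetical site-specific pseudonymization: salted hash of the value."""
    return hashlib.sha256((site_salt + value).encode("utf-8")).hexdigest()

n_folds = 5
date_of_birth = "1972-03-14"  # made-up value shared by two duplicated records

# Same raw covariate at both hospitals: the duplicates share a fold.
fold_a = fold_from_covariate(date_of_birth, n_folds)
fold_b = fold_from_covariate(date_of_birth, n_folds)

# After site-specific pseudonymization the covariate values no longer match,
# so nothing guarantees that the two records end up in the same fold.
pseudo_a = pseudonymize(date_of_birth, "hospital-A")
pseudo_b = pseudonymize(date_of_birth, "hospital-B")
```

The first pair of calls always agrees because the hash is deterministic; the second pair shows why pseudonymized values cannot serve as the stratifying covariate when each site transforms them differently.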

Conclusion: Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.

Keywords: data leakage; duplicated electronic health records; federated learning; privacy; validation.

Figures

Figure 1.
Privacy-preserving federated learning: analysis by a data scientist of medical records (blue and red individuals) distributed in 2 hospitals without extracting personally identifying information. One individual’s record is duplicated in the 2 hospitals (red) (ID4), due for instance to multiple admissions. The performances of a model are estimated through cross-validation, partitioning the datasets in training and validation folds either randomly (left) or through stratification (ie, grouping similar patients in folds) (right). Whereas duplicated records (red) may be simultaneously in training and validation folds when random partitioning is applied, thus causing data leakage, this risk is circumvented by stratification.
Figure 2.
Description of fold-stratified cross-validation.
Figure 3.
Accuracies computed through cross-validation as a function of the number of boosting iterations in the case of synthetic datasets. Symbols and curves correspond to training and validation accuracies, respectively. Green, red, and blue correspond to the unbiased, random, and x_1-stratified fold-partitioning strategies, respectively. The unbiased validation accuracy lies between the overoptimistic random estimate and the pessimistic x_1-stratified estimate. The horizontal black dashed line indicates the theoretical optimal accuracy, Accuracy_opt.
Figure 4.
Violin plots for cross-validation estimates of accuracy adopting either an unbiased (green), random (red), or x_1, x_2, ..., x_10-stratified (blue) fold-partitioning strategy and running 30 simulations in the case of synthetic datasets. Horizontal black and green dashed lines correspond to the optimal accuracy that a model could reach, Accuracy_opt, and to the mean unbiased estimate of the accuracy reached by the model under consideration, Accuracy_unb, respectively. Whereas random fold partitioning leads to overoptimistic estimates of accuracy, x_1, x_2, ..., x_10-stratified estimates feature pessimistic biases of various sizes.
Figure 5.
Ratio of the x_str-stratified estimate of accuracy over the unbiased estimate of accuracy, plotted with respect to the normalized importance of the stratifying covariate x_str (see text) in the case of synthetic datasets. A total of 100 datasets are generated corresponding to different {Σ, a}, and for each dataset, each covariate is taken successively as the stratifying covariate. The Pearson correlation coefficient is r = -0.77.
Figure 6.
Violin plots for cross-validation estimates of accuracy adopting either an unbiased (green), a random (red), or a stratified (blue) fold-partitioning strategy and running 30 simulations in the case of MIMIC-III (Medical Information Mart for Intensive Care-III)-based datasets. The horizontal green dashed line corresponds to the mean unbiased estimate of the accuracy reached by the model under consideration, Accuracy_unb. Whereas random fold partitioning leads to overoptimistic estimates of accuracy, stratified estimates feature pessimistic biases of various sizes when the age at admission (age), the weight at admission (wei), the lowest creatinine value from the first 24 hours after admission (cre), the lowest blood urea nitrogen value from the first 24 hours after admission (bun), or the highest hemoglobin value from the first 24 hours after admission (hem) is used as the stratifying covariate (see text). The inset shows the ratio of the stratified estimate of accuracy over the unbiased estimate of accuracy, plotted with respect to the normalized importance of the stratifying covariate (see text). The Pearson correlation coefficient is r = -0.79. AUC: area under the curve.
