J Am Med Inform Assoc. 2020 Aug 1;27(8):1244-1251.

doi: 10.1093/jamia/ocaa096.

Fold-stratified cross-validation for unbiased and privacy-preserving federated learning

Romain Bey¹, Romain Goussault², François Grolleau¹, Mehdi Benchoufi¹, Raphaël Porcher¹

Affiliations

¹ Centre of Research in Epidemiology and Statistics (CRESS), Université de Paris, French Institute of Health and Medical Research (INSERM), National Institute of Agricultural Research (INRA), Paris, France.
² CIC 1413, Center for Research in Cancerology and Immunology Nantes-Angers (CRCINA), Dermatology Department, Centre Hospitalier Universitaire Nantes, Nantes University, Nantes, France.

PMID: 32620945
PMCID: PMC7647321
DOI: 10.1093/jamia/ocaa096

Romain Bey et al. J Am Med Inform Assoc. 2020.

Abstract

Objective: We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).

Materials and methods: Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs into folds containing patients with similar characteristics, thus ensuring that all duplicates of a record fall together in either the training folds or the validation fold. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
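The stratification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy ages, and the assumption that all sites agree on the fold boundaries beforehand are hypothetical choices made for illustration. The point it shows is that, because fold membership depends only on the stratifying covariate, two copies of the same record held in different hospitals land in the same fold without any record being exchanged.

```python
import numpy as np

def assign_folds(covariate, edges):
    """Map each record to a fold using only its stratifying covariate.

    `edges` are fold boundaries shared by all sites (an assumption of
    this sketch), so every site can assign folds locally.
    """
    return np.searchsorted(edges, covariate, side="right")

def cv_splits(folds, n_folds):
    """Yield (train_idx, val_idx) pairs, holding out one fold at a time."""
    for k in range(n_folds):
        yield np.flatnonzero(folds != k), np.flatnonzero(folds == k)

# Hypothetical toy data: the patient aged 63.0 is recorded in both hospitals.
ages_a = np.array([34.0, 63.0, 71.0, 25.0])
ages_b = np.array([63.0, 48.0, 80.0, 55.0])
edges = np.array([40.0, 60.0])  # three folds: <40, [40, 60), >=60

folds_a = assign_folds(ages_a, edges)
folds_b = assign_folds(ages_b, edges)

# Both copies of the duplicated record share a fold, so they are never
# split between training and validation.
same_fold = folds_a[1] == folds_b[0]
```

Each site would then run `cv_splits` on its own fold labels, so that fold `k` is held out everywhere at the same time.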

Results: In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.

Discussion: Although fold-stratified cross-validation has low computational overhead, to be effective it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. In contrast, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
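As a hedged illustration of the hashed-identifier option, the sketch below derives a fold index from a personal identifier. The function name, the use of SHA-256, the optional shared salt, and the example identifier are all assumptions made for this example and are not taken from the paper.

```python
import hashlib

def fold_from_identifier(identifier: str, n_folds: int, salt: str = "") -> int:
    """Derive a fold index from a hashed identifier (illustrative only).

    The same identifier and salt always give the same fold, so duplicates
    held at different sites are assigned identically.
    """
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

# Hypothetical identifier, hashed independently at two hospitals.
fold_site_a = fold_from_identifier("patient-0004", n_folds=5)
fold_site_b = fold_from_identifier("patient-0004", n_folds=5)
```

If each site instead replaced the identifier with its own pseudonym before hashing, the two copies would generally no longer map to the same fold, which is the interference with pseudonymization noted above.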

Conclusion: Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.

Keywords: data leakage; duplicated electronic health records; federated learning; privacy; validation.

Figures

**Figure 1.**
Privacy-preserving federated learning: analysis by a data scientist of medical records (blue and red individuals) distributed across 2 hospitals without extracting personally identifying information. One individual's record (ID4, red) is duplicated in the 2 hospitals, due for instance to multiple admissions. The performance of a model is estimated through cross-validation, partitioning the datasets into training and validation folds either randomly (left) or through stratification (ie, grouping similar patients in folds) (right). Whereas duplicated records (red) may be simultaneously in training and validation folds when random partitioning is applied, thus causing data leakage, this risk is circumvented by stratification.

**Figure 2.**
Description of fold-stratified cross-validation.

**Figure 3.**
Accuracies computed through cross-validation as a function of the number of boosting iterations in the case of synthetic datasets. Symbols and curves correspond to training accuracies and validation accuracies, respectively. Green, red, and blue correspond to the unbiased, random, and $x_{1}$-stratified fold-partitioning strategies, respectively. The unbiased validation accuracy lies between the overoptimistic random estimate and the pessimistic $x_{1}$-stratified estimate. The horizontal black dashed line indicates the theoretical optimal accuracy $\mathrm{Accuracy}_{opt}$.

**Figure 4.**
Violin plots for cross-validation estimates of accuracy adopting either an unbiased (green), random (red), or $x_{1}, x_{2}, \dots, x_{10}$-stratified (blue) fold-partitioning strategy and running 30 simulations in the case of synthetic datasets. Horizontal black and green dashed lines correspond to the optimal accuracy that a model could reach, $\mathrm{Accuracy}_{opt}$, and to the mean unbiased estimate of the accuracy reached by the model under consideration, $\mathrm{Accuracy}_{unb}$, respectively. Whereas random fold partitioning leads to overoptimistic estimates of accuracy, $x_{1}, x_{2}, \dots, x_{10}$-stratified estimates feature pessimistic biases of various sizes.

**Figure 5.**
Ratio of the $x_{str}$-stratified estimate of accuracy over the unbiased estimate of accuracy, plotted with respect to the normalized importance of the stratifying covariate $x_{str}$ (see text) in the case of synthetic datasets. A total of 100 datasets are generated corresponding to different $\{\Sigma, a\}$, and for each dataset, each covariate is taken successively as the stratifying covariate. The Pearson correlation coefficient is $r = -0.77$.

**Figure 6.**
Violin plots for cross-validation estimates of accuracy adopting either an unbiased (green), a random (red), or a stratified (blue) fold-partitioning strategy and running 30 simulations in the case of MIMIC-III (Medical Information Mart for Intensive Care-III)-based datasets. The horizontal green dashed line corresponds to the mean unbiased estimate of the accuracy $\mathrm{Accuracy}_{unb}$ reached by the model under consideration. Whereas random fold partitioning leads to overoptimistic estimates of accuracy, stratified estimates feature pessimistic biases of various sizes when the age at admission (age), the weight at admission (wei), the lowest creatinine value from the first 24 hours after admission (cre), the lowest blood urea nitrogen value from the first 24 hours after admission (bun), or the highest hemoglobin value from the first 24 hours after admission (hem) is used as the stratifying covariate (see text). The inset shows the ratio of the stratified estimate of accuracy over the unbiased estimate of accuracy, plotted with respect to the normalized importance of the stratifying covariate (see text). The Pearson correlation coefficient is $r = -0.79$. AUC: area under the curve.
