Review

Preventing dataset shift from breaking machine-learning biomarkers

Jérôme Dockès et al. Gigascience. 2021 Sep 28;10(9):giab055. doi: 10.1093/gigascience/giab055.

Abstract

Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals. Dataset shifts are frequent in biomedical research, e.g., because of recruitment biases. When a dataset shift occurs, standard machine-learning techniques do not suffice to extract and validate biomarkers. This article provides an overview of when and how dataset shifts break machine-learning-extracted biomarkers, as well as detection and correction strategies.

Keywords: biomarker; dataset shift; generalization; machine learning.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1:
Classification with dataset shift—regressing out a correlate of the shift does not help generalization. The task is to classify patients (orange) from healthy controls (blue), using 2D features. Age, indicated by the shade of gray, influences both the features and the probability of disease. Left: Generative process for the simulated data. Age influences both the target Y and the features X, and Y also has an effect on X. Between the source and target datasets, the distribution of age changes. The 2 arrows point towards increasing age and represent the Healthy and Diseased populations, corresponding to the blue and orange clouds of points in the right panel. The grayscale gradient in the arrows represents the increasing age of the individuals (older individuals correspond to a darker shade). Throughout their life, individuals can jump from the Healthy trajectory to the Diseased trajectory, which is slightly offset in this 2D feature space. As age increases, the prevalence of the disease increases, hence the Healthy trajectory contains more individuals of young ages (its wide end) and fewer at older ages (its narrow end)—and vice versa for the Diseased trajectory. Right: Predictive models. In the target data (bottom row), the age distribution is shifted: individuals tend to be older. Elderly individuals are indeed often less likely to participate in clinical studies [24]. First column: No correction is applied. As the situation is close to a covariate shift (see Section “Covariate shift”), a powerful learner (RBF-SVM) generalizes well to the target data. An over-constrained model (Linear-SVM) generalizes poorly. Second column: Wrong approach. To remove associations with age, features are replaced by the residuals after regressing them on age. This destroys the signal and results in poor performance for both models and both datasets. Third column: Samples are weighted to give more importance to those more likely under the target distribution. Small circles indicate younger individuals, with less influence on the classifier estimation. This reweighting improves prediction for the linear model on the older population. AUC: area under the curve.
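As an illustration of the reweighting strategy in the third column, importance weights can be estimated as the ratio of target to source age densities and passed to the classifier as sample weights. The sketch below uses scikit-learn and entirely simulated data (cohort sizes, age distributions, and the disease model are all invented for illustration; this is not the authors' code):

    import numpy as np
    from scipy.stats import gaussian_kde
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    # Simulated source cohort (younger) and target population (older).
    age_src = rng.normal(50, 10, size=500)
    age_tgt = rng.normal(65, 10, size=500)
    # 2D features influenced by age; disease probability increases with age.
    X_src = np.column_stack([age_src + rng.normal(0, 5, size=500),
                             rng.normal(0, 1, size=500)])
    y_src = (rng.random(500) < 1 / (1 + np.exp(-(age_src - 55) / 5))).astype(int)

    # Importance weights: estimated p_target(age) / p_source(age), so that
    # older individuals, more likely in the target data, count more.
    weights = gaussian_kde(age_tgt)(age_src) / gaussian_kde(age_src)(age_src)

    clf = LinearSVC().fit(X_src, y_src, sample_weight=weights)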
Figure 2:
Predicting the smoking status of UKBiobank participants. Different predictive models are trained on 90,000 UKBiobank participants and tested on 9,000 participants with a possibly shifted age distribution. “Young → old” means the training set was drawn from a younger sample than the testing set. Models perform better when trained on a sample drawn from the same population as the testing set. Reweighting examples that are more likely in the test distribution (the “+ reweighting” strategy, known as Importance Weighting; see Section “Importance Weighting”) alleviates the issue for the simple linear model but is detrimental for the gradient boosting model. Regressing out age (the “+ regress-out” strategy) is a bad idea and degrades prediction performance in all configurations. The boxes represent the first, second, and third quartiles of scores across cross-validation folds. Whiskers represent the rest of the distribution, except for outliers, defined as points beyond 1.5 times the IQR past the low and high quartiles and shown as diamond fliers.
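For reference, the “+ regress-out” strategy amounts to replacing each feature by its residual after a regression on age. A minimal sketch (illustrative only; as the figure shows, this strategy degrades prediction in all configurations):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def regress_out(X, age):
        """Replace each feature with its residual after regressing on age."""
        age = np.asarray(age, dtype=float).reshape(-1, 1)
        return X - LinearRegression().fit(age, X).predict(age)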
Figure 3:
Sample selection bias: three examples. On the right are graphs giving conditional independence relations [40]. Y is the lesion volume to be predicted (i.e., the output). M are the imaging parameters, e.g., contrast agent dosage. X is the image, which depends on both Y and M (in this toy example X is computed as X = Y · M + ϵ, where ϵ is additive noise). S indicates whether a sample is selected to enter the source dataset (orange points) or not (blue points). The symbol ⫫ means independence between variables. Preferentially selecting samples results in a dataset shift (middle and bottom rows). Depending on whether Y ⫫ S | X holds, the conditional distribution P(Y | X) (here, lesion volume given the image) estimated on the selected data may or may not be biased.
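The toy generative process described in the caption can be written out directly. All numeric parameters below are invented, and the multiplicative form X = Y · M + ϵ is assumed from the caption's description:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    Y = rng.uniform(1, 10, size=n)           # lesion volume (the output)
    M = rng.uniform(0.5, 2.0, size=n)        # imaging parameters (e.g., dose)
    X = Y * M + rng.normal(0, 0.5, size=n)   # image depends on Y and M

    # A selection rule, e.g., preferentially keeping high-dose acquisitions.
    # Whether P(Y | X) estimated on the selected points is biased depends on
    # whether Y ⫫ S | X holds, as the caption states.
    S = M > 1.0
    X_selected, Y_selected = X[S], Y[S]      # the source dataset (orange)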
Figure 4:
Dataset shifts that may or may not be compensated for by reweighting. Left: The distribution of sex can be balanced by downweighting men and upweighting women. Right: Women are completely missing from the source data; this dataset shift cannot be fixed by importance weighting.
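The left panel's correction is simple arithmetic: each group is weighted by the ratio of its target to source frequency. A sketch with invented proportions:

    import numpy as np

    sex = np.array(["M"] * 70 + ["F"] * 30)   # source: 70% men, 30% women
    p_target = {"M": 0.5, "F": 0.5}           # target: balanced
    p_source = {s: np.mean(sex == s) for s in ("M", "F")}
    weights = np.array([p_target[s] / p_source[s] for s in sex])
    # Men get weight 0.5 / 0.7 ≈ 0.71, women 0.5 / 0.3 ≈ 1.67. If women were
    # absent from the source (right panel), p_source["F"] would be 0 and the
    # weight undefined: no reweighting can repair that shift.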
Figure 5:
Covariate shift: P(Y | X) stays the same, but the feature space is sampled differently in the source and target datasets. A powerful learner may generalize well, as P(Y | X) is correctly captured [27]. Thus the polynomial fit of degree 4 performs well on the new dataset. However, an over-constrained learner such as the linear fit can benefit from reweighting training examples to give more importance to the most relevant region of the feature space.
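One common way to estimate the importance weights p_target(x) / p_source(x) under covariate shift, not necessarily the one used in the paper, is to train a probabilistic classifier to distinguish source from target samples and convert its output into a density ratio:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def covariate_shift_weights(X_source, X_target):
        """Estimate p_target(x) / p_source(x) for each source sample."""
        X = np.vstack([X_source, X_target])
        domain = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]
        clf = LogisticRegression(max_iter=1000).fit(X, domain)
        p = clf.predict_proba(X_source)[:, 1]   # P(target | x)
        # The odds P(target|x) / P(source|x), rescaled by the sample-size
        # ratio, approximate the density ratio.
        return p / (1 - p) * (len(X_source) / len(X_target))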
Figure 6:
Prior probability shift: P(Y) changes but P(X | Y) stays the same. This can happen, for example, when participants are selected on the basis of Y, possibly to obtain a dataset with a balanced number of patients and healthy participants: S ⫫ X | Y. When the prior probability (the marginal distribution of Y) in the target population is known, this is easily corrected by applying Bayes’ rule. The output Y is typically low-dimensional and discrete (often it is a single binary value), so P(Y) can often be estimated precisely from few examples.
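The Bayes’ rule correction mentioned here rescales the model's posteriors by the ratio of target to source priors, since P(X | Y) is unchanged. A minimal sketch (function name and inputs are hypothetical):

    import numpy as np

    def correct_prior_shift(proba, prior_source, prior_target):
        """Adjust posteriors P(Y | X) learned under prior_source to a target
        population with known prior_target, then renormalize."""
        ratio = np.asarray(prior_target) / np.asarray(prior_source)
        adjusted = np.asarray(proba) * ratio
        return adjusted / adjusted.sum(axis=1, keepdims=True)

    # E.g., balanced training set but 10% disease prevalence in the target:
    # correct_prior_shift(model.predict_proba(X_new), [0.5, 0.5], [0.9, 0.1])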

References

    1. Strimbu K, Tavel JA. What are biomarkers? Curr Opin HIV AIDS. 2010;5(6):463. - PMC - PubMed
    2. Andreu-Perez J, Poon CC, Merrifield RD, et al. Big data for health. IEEE J Biomed Health Inform. 2015;19(4):1193–208. - PubMed
    3. Faust O, Hagiwara Y, Hong TJ, et al. Deep learning for healthcare applications based on physiological signals: A review. Comput Methods Programs Biomed. 2018;161:1–13. - PubMed
    4. Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920–30. - PMC - PubMed
    5. FDA. FDA report on “Mammoscreen”. 2020. https://fda.report/PMN/K192854, accessed 10 August 2021.
