Multi-Source Conformal Inference Under Distribution Shift
- PMID: 39193374
- PMCID: PMC11345809
Abstract
Recent years have seen growing use of complex machine learning models across multiple sources of data to inform more generalizable decision-making. However, distribution shifts across data sources and privacy concerns related to sharing individual-level data, coupled with a lack of uncertainty quantification for machine learning predictions, make it challenging to achieve valid inferences in multi-source environments. In this paper, we consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources. We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations, and show that one can incorporate machine learning prediction algorithms in the estimation of nuisance functions while still achieving parametric rates of convergence to nominal coverage probabilities. Moreover, when conditional outcome invariance is violated, we propose a data-adaptive strategy that upweights informative data sources for efficiency gain and downweights non-informative data sources for bias reduction. We highlight the robustness and efficiency of our proposals for a variety of conformal scores and data-generating mechanisms via extensive synthetic experiments. Hospital length-of-stay prediction intervals for pediatric patients undergoing a high-risk cardiac surgical procedure between 2016 and 2022 in the U.S. illustrate the utility of our methodology.
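For intuition about how a distribution-free interval can be adjusted for distribution shift, the following is a minimal sketch of weighted split conformal prediction, a generic technique from the conformal literature. It is not the influence-function-based estimator described in the abstract. The function name, the inputs, and the assumption that the weights are already-estimated target-to-source density ratios are all hypothetical choices made for illustration.

```python
import numpy as np


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Level-(1 - alpha) quantile of a weighted conformal score distribution.

    scores      : calibration scores, e.g. |y - mu_hat(x)| on held-out data
    weights     : nonnegative weights for the calibration points (assumed
                  here to be estimated density ratios; hypothetical input)
    test_weight : weight of the test point, which is treated as a point
                  mass at +infinity
    alpha       : miscoverage level, e.g. 0.1 for a 90% interval
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort the scores and carry the weights along.
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]

    # Normalized cumulative weight; the test point's mass sits at +infinity.
    total = weights.sum() + test_weight
    cum = np.cumsum(weights) / total

    # First score whose cumulative weight reaches 1 - alpha.
    idx = np.searchsorted(cum, 1.0 - alpha, side="left")
    if idx >= len(scores):
        return np.inf  # not enough calibration mass: the interval is unbounded
    return scores[idx]


def prediction_interval(point_prediction, q):
    """Symmetric interval for an absolute-residual score."""
    return point_prediction - q, point_prediction + q
```

With all weights equal to one, the quantile reduces to the usual split conformal rule, the ceil((n + 1)(1 - alpha))-th smallest calibration score. Giving more weight to calibration points that resemble the target population shifts the quantile toward their scores.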