When no answer is better than a wrong answer: A causal perspective on batch effects
- PMID: 40800974
- PMCID: PMC12319767
- DOI: 10.1162/imag_a_00458
Abstract
Batch effects, undesirable sources of variability across multiple experiments, present significant challenges for scientific and clinical discoveries. Batch effects can (i) produce spurious signals and/or (ii) obscure genuine signals, contributing to the ongoing reproducibility crisis. Because batch effects are typically modeled as classical statistical effects, existing methods often cannot distinguish variability due to batch from variability due to confounding biases, which may lead them to erroneously conclude that batch effects are present (or absent). We formalize batch effects as causal effects and introduce algorithms leveraging causal machinery to address these concerns. Simulations illustrate that when non-causal methods provide the wrong answer, our methods either produce more accurate answers or "no answer," meaning they assert that the data are inadequate to draw a confident conclusion about the presence of a batch effect. Applying our causal methods to 27 neuroimaging datasets yields qualitatively similar results: in situations where it is unclear whether batch effects are present, non-causal methods confidently identify (or fail to identify) batch effects, whereas our causal methods assert that it is unclear whether there are batch effects or not. In instances where batch effects should be discernible, our techniques produce different results from prior art, each of which produces results more qualitatively similar to not applying any batch effect correction to the data at all. This work, therefore, provides a causal framework for understanding the potential capabilities and limitations of analysis of multi-site data.
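The distinction between a wrong answer and "no answer" can be illustrated with a toy simulation. The sketch below is not the authors' algorithm and does not use the causalBatch package; the function names, the single covariate (age), the linear age effect, and the `min_per_batch` threshold are all assumptions made for illustration. Two batches have no true batch effect but different age ranges, so a naive comparison of means reports a large difference. Restricting the comparison to the shared age range removes that spurious difference, and when the age ranges do not overlap the function declines to answer.

```python
import random
import statistics


def simulate_batch(rng, n, age_lo, age_hi, batch_shift=0.0):
    """Draw (age, measurement) pairs for one hypothetical batch.

    The measurement depends linearly on age, plus an optional true batch
    shift and Gaussian noise. All constants are illustrative assumptions.
    """
    rows = []
    for _ in range(n):
        age = rng.uniform(age_lo, age_hi)
        measurement = 0.1 * age + batch_shift + rng.gauss(0.0, 0.5)
        rows.append((age, measurement))
    return rows


def naive_batch_difference(batch_a, batch_b):
    """Difference in mean measurement (b minus a), ignoring the covariate."""
    mean_a = statistics.fmean([y for _, y in batch_a])
    mean_b = statistics.fmean([y for _, y in batch_b])
    return mean_b - mean_a


def overlap_batch_difference(batch_a, batch_b, min_per_batch=20):
    """Difference in means restricted to the shared age range.

    Returns None ("no answer") when the batches share no age range or
    when too few samples fall inside it to support a comparison.
    """
    lo = max(min(age for age, _ in batch_a), min(age for age, _ in batch_b))
    hi = min(max(age for age, _ in batch_a), max(age for age, _ in batch_b))
    if lo >= hi:
        return None  # no covariate overlap: decline to answer
    ys_a = [y for age, y in batch_a if lo <= age <= hi]
    ys_b = [y for age, y in batch_b if lo <= age <= hi]
    if len(ys_a) < min_per_batch or len(ys_b) < min_per_batch:
        return None  # too little overlap: decline to answer
    return statistics.fmean(ys_b) - statistics.fmean(ys_a)


if __name__ == "__main__":
    rng = random.Random(0)
    # No true batch effect (batch_shift = 0), but the age ranges differ.
    site_a = simulate_batch(rng, 600, 20, 50)
    site_b = simulate_batch(rng, 600, 40, 70)
    print("naive difference:", naive_batch_difference(site_a, site_b))
    print("overlap-restricted difference:", overlap_batch_difference(site_a, site_b))

    # Disjoint age ranges: batch and age cannot be separated.
    site_c = simulate_batch(rng, 600, 20, 40)
    site_d = simulate_batch(rng, 600, 50, 70)
    print("disjoint ranges:", overlap_batch_difference(site_c, site_d))
```

Restricting to the shared range works here only because ages are drawn uniformly, so the two batches have matching age distributions inside the overlap; real data would generally need matching or adjustment within that region as well.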
Keywords: batch effects; causal; connectomics; harmonization; mega-analysis; mega-study.
© 2025 The Authors. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Conflict of interest statement
None of the authors have any known financial or non-financial competing interests to declare in relation to this work. The methodological tools developed have been made openly available through the causalBatch R package on CRAN, and no proprietary or commercial claims have been made on these methods.