Imaging Neurosci (Camb). 2025 Jan 29;3:imag_a_00458.
doi: 10.1162/imag_a_00458. eCollection 2025.

When no answer is better than a wrong answer: A causal perspective on batch effects



Eric W Bridgeford et al. Imaging Neurosci (Camb).

Abstract

Batch effects, undesirable sources of variability across multiple experiments, present significant challenges for scientific and clinical discoveries. Batch effects can (i) produce spurious signals and/or (ii) obscure genuine signals, contributing to the ongoing reproducibility crisis. Because batch effects are typically modeled as classical statistical effects, existing methods often cannot differentiate batch-related variability from variability due to confounding biases, which may lead them to erroneously conclude that batch effects are present (or absent). We formalize batch effects as causal effects and introduce algorithms leveraging causal machinery to address these concerns. Simulations illustrate that when non-causal methods provide the wrong answer, our methods either produce more accurate answers or "no answer," meaning they assert that the data are inadequate to confidently conclude whether a batch effect is present. Applying our causal methods to 27 neuroimaging datasets yields qualitatively similar results: in situations where it is unclear whether batch effects are present, non-causal methods confidently identify (or fail to identify) batch effects, whereas our causal methods assert that it is unclear whether batch effects are present. In instances where batch effects should be discernible, our techniques produce results that differ from those of prior methods, each of which yields results qualitatively more similar to applying no batch effect correction at all. This work therefore provides a causal framework for understanding the potential capabilities and limitations of multi-site data analysis.

Keywords: batch effects; causal; connectomics; harmonization; mega-analysis; mega-study.


Conflict of interest statement

None of the authors have any known financial or non-financial competing interests to declare in relation to this work. The methodological tools developed have been made openly available through the causalBatch R package on CRAN, and no proprietary or commercial claims have been made on these methods.

Figures

Fig. 1.
Non-causal batch effect mitigation procedures are subject to both over- and undercorrection and cannot rectify “confounding.” Our causally enriched methods address these issues. (I) shows the observed data (points), where color indicates batch. The orange and blue lines indicate the expected outcome per batch, and the “batch effect” is the observed difference (red band) between the expected outcomes. Ideally, after batch effect correction, the blue and orange lines should overlap. The orange batch tends to oversample younger people, and the blue batch tends to oversample older people. (A) A scenario where the covariate distributions are moderately confounded and partially overlap, and the orange batch tends to see higher outcomes than the blue batch. (B) A scenario where the covariate distributions are moderately confounded and partially overlap, and the orange batch tends to see lower outcomes than the blue batch. (II) and (III) show the data after correction via non-causal and causal methods, respectively. If the batch effect is removed, the orange and blue lines should be approximately equal. Non-causal methods attempt to adjust for the batch effect over the entire covariate range and, in so doing, are subject to strong confounding biases. Supposed “batch effect correction” instead introduces spurious artifacts (A) or fails to mitigate batch effects (B). Causal methods instead restrict attention to a reduced covariate range (gray box), finding points between the two datasets that are “similar,” and are not subject to these biases. Simulation settings are described in Supplementary Material D.2.
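To make the contrast in Fig. 1 concrete, the following minimal Python sketch (an illustration, not the authors' method) assumes a single covariate (age), an outcome that rises linearly with age in both batches, and a true batch effect of +3. The helpers `naive_difference` and `overlap_difference` are hypothetical names. The naive comparison absorbs the age imbalance into the "batch effect," whereas restricting both batches to their shared age range (a crude stand-in for the gray box) recovers the true shift, and returns `None` when the ranges do not overlap at all.

```python
from statistics import mean

# Illustrative sketch only; function names and data are hypothetical.

def naive_difference(ages0, y0, ages1, y1):
    """Difference in mean outcome between batches, ignoring the covariate."""
    return mean(y1) - mean(y0)

def overlap_difference(ages0, y0, ages1, y1):
    """Difference in mean outcome using only samples inside the age range
    shared by both batches; returns None when there is no shared range."""
    lo = max(min(ages0), min(ages1))
    hi = min(max(ages0), max(ages1))
    if lo > hi:
        return None  # no common support: decline to answer
    kept0 = [y for a, y in zip(ages0, y0) if lo <= a <= hi]
    kept1 = [y for a, y in zip(ages1, y1) if lo <= a <= hi]
    return mean(kept1) - mean(kept0)

# Hypothetical data: batch 0 samples ages 20-50, batch 1 samples ages 40-70.
# The outcome equals age in both batches; batch 1 adds a true shift of +3.
ages0 = list(range(20, 51))
ages1 = list(range(40, 71))
y0 = [a for a in ages0]
y1 = [a + 3 for a in ages1]

print(naive_difference(ages0, y0, ages1, y1))    # 23: age imbalance inflates the estimate
print(overlap_difference(ages0, y0, ages1, y1))  # 3: the true batch shift
```

Restricting to the shared range only works this cleanly because the retained ages are identical in both batches; in general the samples inside the range must still be matched or weighted.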
Fig. 2.
Causal Graph of Study Covariates. Causal graphs illustrating the underlying assumptions under which various procedures to detect or correct for batch effects are reasonable. Boxes represent variables of interest, and arrows indicate “cause and effect” relationships between variables. The causal estimand is the impact of the exposure (the batch) on the outcome (the participant’s measurement), and is a cumulative effect of effect modifiers (black variables), both known and unknown, that yield batch-specific differences. The relationship between the exposure and the outcome is confounded if there are open backdoor paths (Pearl, 2009b). (A) Associational procedures and (B) conditional procedures are reasonable when there is no confounding. (C) Adjusted causal procedures are reasonable when backdoor paths can be blocked by measured covariates (Pearl, 1995, 2009b). (D) Crossover procedures are reasonable under many forms of potential confounding, measured or unmeasured, so long as participant states are not changing or are randomized. If participant states change and are not randomized, the states must be measured (red arrow) to avoid aliasing batch effects with mediation effects due to participant state. Note that some participant traits (e.g., intelligence or mental health) may themselves be caused by the connectome. Because the measurement does not reflect the underlying connectome free of measurement errors in the form of batch effects, this introduces the potential for bidirectional arrows between such traits and the measurement in the causal graph, a point revisited in Section 4.
Fig. 3.
The demographic balancing procedure uses causal approaches to align the demographics of poorly balanced datasets. (A) The unadjusted datasets have imbalanced covariate distributions. The reference dataset is indicated. (B) Propensity trimming (shaded boxes) provides general alignment of demographics, so that no dataset includes demographics unrepresented in the other datasets. (C.I) Samples from other datasets are matched to samples from the reference dataset, and (D.I) samples without matches are discarded from subsequent analysis. The adjusted data after matching have nearly identical covariate distributions. (C.II) Samples are weighted according to their inverse propensities, so that samples that look unrepresentative of other datasets are down-weighted. (D.II) Re-weighting can also yield somewhat similar covariate distributions. (D.III) The trimmed data generally feature overlapping covariate distributions, but these may not be identical across datasets. (E) Downstream batch effect correction or detection is applied to the adjusted data (and potentially the propensity weights) via Matching cComBat, AIPW cComBat, or Causal cDcorr.
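The balancing steps in Fig. 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the causalBatch implementation: it uses one covariate, two batches (batch 0 as the hypothetical reference), a logistic propensity model fitted by Newton-Raphson, a common-support rule for trimming, greedy 1:1 nearest-neighbor matching on the propensity with an assumed caliper of 0.05, and plain inverse propensity weights. All function names are hypothetical.

```python
import math
from statistics import mean

# Illustrative sketch only; not the causalBatch API. Names are hypothetical.

def fit_propensity(x, t, iters=25):
    """Fit P(t=1 | x) = sigmoid(b0 + b1 * (x - mean(x))) by Newton-Raphson and
    return the fitted propensity of every sample. Assumes one covariate and
    overlapping batches (no perfect separation)."""
    m = mean(x)
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            z = xi - m
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * z)))
            w = p * (1.0 - p)
            g0 += ti - p          # gradient of the log-likelihood
            g1 += (ti - p) * z
            h00 += w              # Fisher information matrix entries
            h01 += w * z
            h11 += w * z * z
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: beta += I^{-1} g
        b1 += (h00 * g1 - h01 * g0) / det
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * (xi - m)))) for xi in x]

def common_support(ps, t):
    """Indices whose propensity lies in the range observed in BOTH batches."""
    p0 = [p for p, ti in zip(ps, t) if ti == 0]
    p1 = [p for p, ti in zip(ps, t) if ti == 1]
    lo, hi = max(min(p0), min(p1)), min(max(p0), max(p1))
    return [i for i, p in enumerate(ps) if lo <= p <= hi]

def match_to_reference(keep, ps, t, ref=0, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity, without
    replacement. Reference samples with no partner within the caliper are
    left unmatched (and would be discarded downstream)."""
    refs = [i for i in keep if t[i] == ref]
    pool = [i for i in keep if t[i] != ref]
    pairs = []
    for i in refs:
        if not pool:
            break
        j = min(pool, key=lambda k: abs(ps[k] - ps[i]))
        if abs(ps[j] - ps[i]) <= caliper:
            pairs.append((i, j))
            pool.remove(j)
    return pairs

def ipw_weights(ps, t):
    """Inverse propensity weights: 1/p for batch 1, 1/(1 - p) for batch 0."""
    return [1.0 / p if ti == 1 else 1.0 / (1.0 - p) for p, ti in zip(ps, t)]

# Hypothetical demographics: batch 0 ages 20-60, batch 1 ages 30-70.
ages = list(range(20, 61)) + list(range(30, 71))
batch = [0] * 41 + [1] * 41

ps = fit_propensity(ages, batch)
keep = common_support(ps, batch)          # trimming: ages 30-60 survive in both
pairs = match_to_reference(keep, ps, batch)
weights = ipw_weights(ps, batch)
print(len(keep), len(pairs))              # 62 31
```

Because the fitted propensity here is monotone in age, trimming on the propensity is equivalent to trimming on age; with several covariates the two no longer coincide, which is the reason to trim on the propensity.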
Fig. 4.
Simulation regimes illustrate that non-causal procedures are subject to strong biases without covariate matching. (A) The relationship between the relative expected outcome and the covariate value for each batch (color), across (I) linear, (II) non-linear, and (III) non-monotone regimes. The conditional average treatment effect (red box) highlights the batch effect at each covariate value. The average treatment effect (ATE) is the average signed width of this box, and the average absolute treatment effect (AATE) is the average absolute width of this box. In these simulations, the AATE before correction is 1. (B) The effectiveness of the techniques at removing the batch effect. High-performing techniques have a mean AATE after correction at or near 0 (the batch effect was eliminated). (C) The effectiveness of different batch effect correction techniques at preserving the underlying true signal. High-performing techniques have higher correlations with the underlying true signal. Simulation settings are described in Supplementary Material D.3.
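The two summaries in Fig. 4 can be written as ATE = E_X[CATE(X)] and AATE = E_X[|CATE(X)|], where CATE(x) is the difference in expected outcome between the two batches at covariate value x. A minimal Python sketch, assuming the two batch-specific outcome curves are known and the covariate is averaged over an evenly weighted grid (both assumptions for illustration; `ate_and_aate` is a hypothetical name), shows why the absolute version matters: in a non-monotone regime, positive and negative differences can cancel in the ATE while the AATE still reports a batch effect.

```python
# Illustrative sketch only; function name and curves are hypothetical.

def ate_and_aate(f0, f1, xs):
    """Average (signed) and average absolute treatment effect of batch 1
    versus batch 0 over the covariate grid xs, weighting each point equally."""
    cate = [f1(x) - f0(x) for x in xs]           # conditional effect at each x
    ate = sum(cate) / len(cate)
    aate = sum(abs(c) for c in cate) / len(cate)
    return ate, aate

grid = [-2, -1, 1, 2]

# Linear regime: batch 1 is shifted up by 1 everywhere.
print(ate_and_aate(lambda x: 2 * x, lambda x: 2 * x + 1, grid))       # (1.0, 1.0)

# Non-monotone regime: the shift is +1 for x < 0 and -1 for x > 0.
print(ate_and_aate(lambda x: 0, lambda x: 1 if x < 0 else -1, grid))  # (0.0, 1.0)
```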
Fig. 5.
Demographic data for the 27 studies from the CoRR mega-study. (A) Each point represents the age of a participant corresponding to a single measurement. Rows are studies, boxes are continents, and color indicates sex. A total of n = 3,597 samples are shown, comprising those that featured age, sex, and continent information and were successfully processed into connectomes. (B) Even with only three observed covariates (sex, age, and continent of measurement), the CoRR studies often show extremely limited covariate overlap (Pastore & Calcagnì, 2019). This makes inference regarding batch effects difficult.
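Fig. 5 (B) summarizes covariate overlap between studies. Pastore & Calcagnì (2019) define the overlapping index as the area shared by two densities; the Python sketch below approximates that idea with a discrete analogue, the sum over shared bins of the smaller of the two relative frequencies, treating each sample as a (sex, continent, age) tuple with age binned at an assumed 10-year width. The binning, the data, and the function name `binned_overlap` are illustrative assumptions, not the procedure used to produce the figure. Studies on different continents share no bins, so their overlap is exactly 0.

```python
from collections import Counter

# Illustrative sketch only; a histogram analogue of an overlap index.

def binned_overlap(study_a, study_b, age_width=10):
    """Discrete overlap index: sum over bins of min(p_a(bin), p_b(bin)).
    Each sample is a (sex, continent, age) tuple; age is binned by age_width.
    Returns 1.0 for identical binned distributions and 0 for disjoint ones."""
    def binned(study):
        counts = Counter((sex, cont, int(age // age_width)) for sex, cont, age in study)
        return {k: c / len(study) for k, c in counts.items()}
    pa, pb = binned(study_a), binned(study_b)
    # Bins present in only one study contribute min(p, 0) = 0, so skip them.
    return sum(min(pa[k], pb[k]) for k in pa.keys() & pb.keys())

study_a = [("F", "Asia", 22), ("F", "Asia", 41)]
study_b = [("F", "Asia", 24), ("M", "Asia", 45)]
study_c = [("F", "Europe", 22), ("F", "Europe", 41)]
print(binned_overlap(study_a, study_b))  # 0.5: only the (F, Asia, 20s) bin is shared
print(binned_overlap(study_a, study_c))  # 0: different continents share no bins
```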
Fig. 6.
Comparison of types of effects between datasets from the CoRR study. (A) Heatmap of different types of effects (conditional and adjusted causal procedures) that can be used to detect differences between each pair of datasets in the CoRR study. Whereas most non-confounded conditional effects are not significant (25.8% significant), most (66.1%) non-confounded adjusted effects are significant. (B.I) How one possible source of batch effects, scanner model, affects significance rates of batch effects. Almost all pairs of studies conducted on different scanners with high covariate overlap (> .05) have discernible batch effects. The rate of detected effects is lower when the scanner model is the same. (B.II) and (B.III) When the estimated covariate overlap is lower (< .05) or zero (different continents), conditional effects never detect a difference across datasets. By contrast, adjusted causal procedures report that the data are too confounded for subsequent inference and decline to run entirely.
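The three-way outcome described in Fig. 6 (and in the abstract) can be sketched as a gate that runs a test only when the data permit it. In the minimal Python sketch below, the overlap threshold of 0.05 mirrors the cut-off used to group study pairs in panel (B) but is an assumption about how such a gate might be set, not the paper's rule; `adjusted_verdict`, `run_test`, and `fake_test` are hypothetical names, and `run_test` stands in for any batch effect test that returns a p-value.

```python
from typing import Callable

# Illustrative sketch only; thresholds and names are hypothetical.

def adjusted_verdict(overlap: float, run_test: Callable[[], float],
                     min_overlap: float = 0.05, alpha: float = 0.05) -> str:
    """Return one of three answers. When covariate overlap is too low, the
    test is never run and the procedure declines to answer."""
    if overlap < min_overlap:
        return "no answer: data too confounded"
    p_value = run_test()
    return "batch effect" if p_value < alpha else "no detectable batch effect"

calls = []

def fake_test() -> float:
    """Stand-in test that records each call and reports a small p-value."""
    calls.append("ran")
    return 0.01

print(adjusted_verdict(0.0, fake_test))   # no answer: data too confounded
print(adjusted_verdict(0.3, fake_test))   # batch effect
print(calls)                              # ['ran']: the first call never ran the test
```

A conditional procedure, by contrast, corresponds to calling `run_test` unconditionally, which is how it can return a confident yes or no even when overlap is zero.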
Fig. 7.
Significant Edges Before and After Batch Effect Removal. (A) The procedure for reducing the data to the approximately propensity-balanced subset of the CoRR study. Non-causal methods learn and apply batch effect correction to the American Clique, which is then reduced to the approximately propensity-balanced individuals. Causal methods learn batch effect corrections from the full matched data and then apply the learned corrections to the approximately propensity-balanced subset. (B) The presence of a sex effect (conditional on individual age) is investigated for each edge in the connectome. Significant edges are shown in rank order from largest (rank = 1) to smallest sex effect (α = .05, Benjamini-Hochberg correction (Benjamini & Hochberg, 1995)). (C) The Dice overlap of the top n = 100 edges, by effect size, between all pairs in (B).
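The two summaries in Fig. 7 rest on standard computations: the Benjamini-Hochberg step-up procedure for false discovery rate control across edges, and the Dice coefficient between the sets of top-ranked edges, 2|A ∩ B| / (|A| + |B|). The Python sketch below assumes per-edge p-values and effect sizes are already available (the per-edge sex-effect test itself is not shown), ranks edges by absolute effect size as an assumption, and uses hypothetical function names and toy data.

```python
# Illustrative sketch only; names and data are hypothetical.

def benjamini_hochberg(pvals, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure: find the
    largest rank k with p_(k) <= k * alpha / m and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return set(order[:k])

def dice_top_n(effects_a, effects_b, n=100):
    """Dice overlap of the n edges with the largest absolute effect in each
    analysis; effects_* map an edge identifier to its effect size."""
    top_a = set(sorted(effects_a, key=lambda e: abs(effects_a[e]), reverse=True)[:n])
    top_b = set(sorted(effects_b, key=lambda e: abs(effects_b[e]), reverse=True)[:n])
    return 2 * len(top_a & top_b) / (len(top_a) + len(top_b))

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.2]))   # {0}
print(benjamini_hochberg([0.04, 0.04]))              # {0, 1}: step-up rejects both

a = {"e1": 3.0, "e2": -2.5, "e3": 0.1}
b = {"e1": 1.0, "e2": 0.2, "e3": -4.0}
print(dice_top_n(a, b, n=2))                         # 0.5: only e1 is in both top-2 sets
```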

References

Abadie, A., & Imbens, G. W. (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1), 1–11. doi: 10.1198/jbes.2009.07333
Akey, J. M., Biswas, S., Leek, J. T., & Storey, J. D. (2007). On the design and analysis of gene expression studies in human populations. Nature Genetics, 39, 807–808. doi: 10.1038/ng0707-807
Arjovsky, M. (2021). Out of distribution generalization in machine learning. arXiv. doi: 10.48550/arXiv.2103.02667
Bareinboim, E., & Pearl, J. (2012). Controlling selection bias in causal inference. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (Vol. 22, pp. 100–108). PMLR. https://proceedings.mlr.press/v22/bareinboim12.html
Bareinboim, E., & Pearl, J. (2016). Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences of the United States of America, 113(27), 7345–7352. doi: 10.1073/pnas.1510507113
