Biostatistics. 2023 Jul 14;24(3):635-652.
doi: 10.1093/biostatistics/kxab039.

Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference

Tenglong Li et al. Biostatistics.

Abstract

Nonignorable technical variation is commonly observed across data from multiple experimental runs, platforms, or studies. These so-called batch effects can lead to difficulty in merging data from multiple sources, as they can severely bias the outcome of the analysis. Many groups have developed approaches for removing batch effects from data, usually by accommodating batch variables into the analysis (one-step correction) or by preprocessing the data prior to the formal or final analysis (two-step correction). One-step correction is often desirable due to its simplicity, but its flexibility is limited and it can be difficult to include batch variables uniformly when an analysis has multiple stages. Two-step correction allows for richer models of batch mean and variance. However, prior investigation has indicated that two-step correction can lead to incorrect statistical inference in downstream analysis. Generally speaking, two-step approaches introduce a correlation structure in the corrected data, which, if ignored, may lead to either exaggerated or diminished significance in downstream applications such as differential expression analysis. Here, we provide more intuitive and more formal evaluations of the impacts of two-step batch correction compared to existing literature. We demonstrate that the undesired impacts of two-step correction (exaggerated or diminished significance) depend on both the nature of the study design and the batch effects. We also provide strategies for overcoming these negative impacts in downstream analyses using the estimated correlation matrix of the corrected data. We compare the results of our proposed workflow with the results from other published one-step and two-step methods and show that our methods lead to more consistent false discovery controls and power of detection across a variety of batch effect scenarios.
Software for our method is available through GitHub (https://github.com/jtleek/sva-devel) and will be available in future versions of the $\texttt{sva}$ R package in the Bioconductor project (https://bioconductor.org/packages/release/bioc/html/sva.html).
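
The claim that two-step correction introduces a correlation structure can be illustrated with a toy simulation. The sketch below is not the ComBat or ComBat+Cor procedure from the paper; it assumes the simplest possible "correction" (subtracting each batch's sample mean, with no variance adjustment and no group variable), and all function names and parameter values are hypothetical choices made for illustration. Under these assumptions, samples that were independent before correction become negatively correlated within a batch, with a theoretical correlation of -1/(n-1) for a batch of n samples, while samples from different batches stay uncorrelated.

```python
import random


def center_by_batch(values, batches):
    """Toy two-step 'correction': subtract each batch's sample mean."""
    sums, counts = {}, {}
    for v, b in zip(values, batches):
        sums[b] = sums.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    return [v - sums[b] / counts[b] for v, b in zip(values, batches)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_genes=20000, batch_size=4, seed=0):
    """Estimate sample-to-sample correlation after batch mean-centering.

    Each 'gene' is a row of independent N(0, 1) values across two batches
    of `batch_size` samples. Correlations are computed across genes.
    """
    rng = random.Random(seed)
    batches = [0] * batch_size + [1] * batch_size
    first, same_batch, other_batch = [], [], []
    for _ in range(n_genes):
        raw = [rng.gauss(0.0, 1.0) for _ in batches]
        adjusted = center_by_batch(raw, batches)
        first.append(adjusted[0])                 # sample 0, batch 0
        same_batch.append(adjusted[1])            # sample 1, batch 0
        other_batch.append(adjusted[batch_size])  # first sample of batch 1
    return pearson(first, same_batch), pearson(first, other_batch)


within, between = simulate()
print(f"within-batch correlation:  {within:.3f}")
print(f"between-batch correlation: {between:.3f}")
```

A downstream test that treats the corrected samples as independent ignores this induced dependence, which is the situation the abstract describes; an analysis that uses an estimate of the sample correlation matrix (for example, generalized least squares, as listed in the keywords) is one way to account for it.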

Keywords: Batch effect; ComBat; Generalized least squares; Sample correlation adjustment; Two-step batch adjustment.

Figures

Fig. 1.
Three figures are used to illustrate that ComBat+Cor reduces the exaggerated significance seen when ComBat is applied based on simulated data that mimics the bladderbatch experimental design. Note that the original bladderbatch data has an unbalanced group-batch design and small (mean and variance) batch effects. The benchmark approach refers to the approach that applies ordinary differential expression analysis to data without any batch effects. (a) QQ plot of p-values using ComBat and the p-values using the benchmark approach. The line falls above the identity line, suggesting that p-values generated by ComBat concentrate at smaller values than those generated on the data without batch effect. (b) QQ plot of p-values using ComBat+Cor (formula image) and p-values using the benchmark approach. (c) Line chart comparing the distributions of p-values using ComBat, ComBat+Cor, and the benchmark approach.
Fig. 2.
Plot of TPR for different choices of formula image for ComBat+Cor. The results were simulated based on the unbalanced/balanced group-batch design for the bladderbatch study.
Fig. 3.
Plot of FPR for different choices of formula image for ComBat+Cor. The results were simulated based on the unbalanced/balanced group-batch design for the bladderbatch study.
Fig. 4.
Simulation results for examples 2, 3, and 4. In each plot, we illustrate the distributions of the p-values for the benchmark approach, ComBat, and ComBat+Cor. (a) Simulation results based on Towfic and others (2014). (b) Simulation results based on Johnson and others (2007). (c) Simulation results based on the TB data for comparing progressors versus nonprogressors.
Fig. 5.
Plot of TPR for different choices of formula image for ComBat+Cor. The results were simulated based on the original TB data set.
Fig. 6.
Guidance about the choice of ComBat and ComBat+Cor for addressing the exaggerated significance problem in batch correction.

References

    1. Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology 11, R106.
    1. Cheng, S. H. and Higham, N. J. (1998). A modified Cholesky algorithm based on a symmetric indefinite factorization. SIAM Journal on Matrix Analysis and Applications 19, 1097–1110.
    1. Dyrskjøt, L., Kruhøffer, M., Thykjaer, T., Marcussen, N., Jensen, J. L., Møller, K. and Ørntoft, T. F. (2004). Gene expression in the urinary bladder: a common carcinoma in situ gene expression signature exists disregarding histopathological classification. Cancer Research 64, 4040–4048.
    1. Gagnon-Bartsch, J. A. and Speed, T. P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–552.
    1. Johnson, W. E., Li, C. and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127.