Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference
- PMID: 34893807
- PMCID: PMC10449015
- DOI: 10.1093/biostatistics/kxab039
Abstract
Nonignorable technical variation is commonly observed across data from multiple experimental runs, platforms, or studies. These so-called batch effects can make it difficult to merge data from multiple sources, as they can severely bias the outcome of the analysis. Many groups have developed approaches for removing batch effects from data, usually by accommodating batch variables into the analysis (one-step correction) or by preprocessing the data prior to the formal or final analysis (two-step correction). One-step correction is often desirable due to its simplicity, but its flexibility is limited and it can be difficult to include batch variables uniformly when an analysis has multiple stages. Two-step correction allows for richer models of batch mean and variance. However, prior investigation has indicated that two-step correction can lead to incorrect statistical inference in downstream analysis. Generally speaking, two-step approaches introduce a correlation structure in the corrected data, which, if ignored, may lead to either exaggerated or diminished significance in downstream applications such as differential expression analysis. Here, we provide evaluations of the impacts of two-step batch correction that are both more intuitive and more formal than those in the existing literature. We demonstrate that the undesired impacts of two-step correction (exaggerated or diminished significance) depend on both the nature of the study design and the batch effects. We also provide strategies for overcoming these negative impacts in downstream analyses using the estimated correlation matrix of the corrected data. We compare the results of our proposed workflow with the results from other published one-step and two-step methods and show that our methods lead to more consistent false discovery control and power of detection across a variety of batch effect scenarios.
Software for our method is available through GitHub (https://github.com/jtleek/sva-devel) and will be available in future versions of the $\texttt{sva}$ R package in the Bioconductor project (https://bioconductor.org/packages/release/bioc/html/sva.html).
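The claim that two-step correction induces a correlation structure can be illustrated with a small numerical sketch. It is not the authors' implementation and does not call the sva package. It assumes a deliberately simplified setting: one gene, two batches, a location-only batch adjustment estimated by ordinary least squares on a design that includes the group variable, and unit error variance. The function name `variance_factors` and the example designs are hypothetical. Under these assumptions the corrected data equal `A @ Y` for a fixed matrix `A`, so their covariance is proportional to `A @ A.T` rather than the identity. The sketch compares the variance a naive downstream analysis would report for the group coefficient with its actual variance, and with the variance from generalized least squares using a pseudo-inverse of that covariance (a stand-in here for the regularized correlation matrix the abstract refers to).

```python
import numpy as np


def variance_factors(group, batch):
    """Variance factors (in units of sigma^2) for the group coefficient.

    group, batch: 0/1 vectors of equal length (hypothetical toy design).
    Returns (naive, actual, gls, one_step).
    """
    group = np.asarray(group, dtype=float)
    batch = np.asarray(batch, dtype=float)
    n = group.size

    X = np.column_stack([np.ones(n), group])  # intercept + group indicator
    B = batch.reshape(-1, 1)                  # indicator of the second batch
    D = np.hstack([X, B])                     # full one-step design

    # Step 1 of a two-step correction: estimate the batch shift by OLS on
    # the full design, then subtract it. Corrected data are A @ Y.
    H = np.linalg.solve(D.T @ D, D.T)         # (D'D)^{-1} D'
    M = H[X.shape[1]:, :]                     # rows giving the batch estimate
    A = np.eye(n) - B @ M
    V = A @ A.T                               # Cov(corrected) / sigma^2

    # Step 2, naive: OLS on corrected data, treating samples as independent.
    XtX_inv = np.linalg.inv(X.T @ X)
    naive = XtX_inv[1, 1]

    # Actual variance of that same OLS estimator, accounting for V.
    S = XtX_inv @ X.T
    actual = (S @ V @ S.T)[1, 1]

    # Step 2, GLS: weight by a pseudo-inverse of V (V is singular because
    # the correction removes one degree of freedom per estimated shift).
    W = np.linalg.pinv(V, rcond=1e-10)
    gls = np.linalg.inv(X.T @ W @ X)[1, 1]

    # Reference: one-step model with batch included as a covariate.
    one_step = np.linalg.inv(D.T @ D)[1, 1]
    return naive, actual, gls, one_step


if __name__ == "__main__":
    batch = [0] * 10 + [1] * 10
    unbalanced = [0] * 8 + [1] * 2 + [0] * 2 + [1] * 8  # group confounded with batch
    balanced = [0] * 5 + [1] * 5 + [0] * 5 + [1] * 5    # group balanced across batches
    for name, grp in [("unbalanced", unbalanced), ("balanced", balanced)]:
        print(name, variance_factors(grp, batch))
```

In this toy setting the naive variance understates the actual variance when group and batch are unbalanced, which corresponds to exaggerated significance, and the two agree when the design is balanced. The GLS variance matches the one-step reference in both cases. The sketch covers only mean shifts; the diminished-significance case described in the abstract is not reproduced here.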
Keywords: Batch effect; ComBat; Generalized least squares; Sample correlation adjustment; Two-step batch adjustment.
© The Author 2021. Published by Oxford University Press.
