Bioinformatics. 2021 Jul 12;37(11):1521-1527.

doi: 10.1093/bioinformatics/btaa986.

Robustifying genomic classifiers to batch effects via ensemble learning

Yuqing Zhang¹, Prasad Patil², W Evan Johnson²,³, Giovanni Parmigiani⁴,⁵

Affiliations

¹ Clinical Bioinformatics, Gilead Sciences, Inc., Foster City, CA 94404, USA.
² Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
³ Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA 02118, USA.
⁴ Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA.
⁵ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA.

PMID: 33245114
PMCID: PMC8485848
DOI: 10.1093/bioinformatics/btaa986

Yuqing Zhang et al. Bioinformatics. 2021.

Abstract

Motivation: Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in the data caused by discrepancies across batches. Such 'batch effects' often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merges the data from different batches, then estimates batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.
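The two-step idea described above (fit one model per batch, then combine their predictions with ensemble weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy nearest-centroid learner, the function names (`fit_centroid_model`, `batch_size_weights`, `ensemble_predict`) and the small example data are all assumptions made for illustration. Only the batch-size weighting scheme is shown, since it is named in the figure captions and needs no further detail to state.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Batch = Tuple[List[Sample], List[int]]  # (feature rows, binary labels)


def fit_centroid_model(X: List[Sample], y: List[int]) -> Callable[[Sample], float]:
    """Toy stand-in learner (assumption): score a sample by how much closer
    it is to the class-1 centroid than to the class-0 centroid.
    Assumes both classes are present in the batch."""
    def centroid(label: int) -> List[float]:
        rows = [x for x, lab in zip(X, y) if lab == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def predict_proba(x: Sample) -> float:
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        z = max(-50.0, min(50.0, d0 - d1))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    return predict_proba


def batch_size_weights(batches: List[Batch]) -> List[float]:
    """Weight each batch-specific model by its share of the training samples."""
    sizes = [len(y) for _, y in batches]
    total = sum(sizes)
    return [s / total for s in sizes]


def ensemble_predict(models, weights, x: Sample) -> float:
    """Weighted average of the per-batch predicted probabilities."""
    return sum(w * m(x) for w, m in zip(weights, models))


# Hypothetical data: the second batch carries an additive shift of 0.5.
batches: List[Batch] = [
    ([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]], [0, 0, 1, 1]),
    ([[0.5, 0.6], [0.7, 0.5], [0.6, 0.4], [1.5, 1.6], [1.4, 1.5], [1.6, 1.4]],
     [0, 0, 0, 1, 1, 1]),
]

models = [fit_centroid_model(X, y) for X, y in batches]  # one model per batch
weights = batch_size_weights(batches)
prob = ensemble_predict(models, weights, [1.0, 1.0])
print(weights, prob)
```

Because each model is fit within a single batch, no batch-effect estimate is needed at training time; the batches interact only through the weights applied to the predictions.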

Results: We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treated as simulated batches. We then used the two methods to develop a genomic classifier for the binary indicator of disease status, and evaluated the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low levels of heterogeneity, our ensemble learning strategy achieves more robust performance, especially at high severity of batch effects. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.

Availability and implementation: The data underlying this article are available in the article and in its online supplementary material. Processed data are available, together with the implementation code, in the GitHub repository at https://github.com/zhangyuqing/bea_ensemble.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Comparison between ensembling and merging when using Random Forests. Three out of our five choices of ensembling weights are displayed: batch size weights, cross-study weights and stacking regression weights (see Section 2 for details)

**Fig. 2.**
Application of ensemble learning to predicting active TB against latent infection. We iteratively selected one of the studies in Table 1 as the independent test study. The remaining studies are viewed as 'batches' in the training set. We trained LASSO, Random Forest and SVM, then aggregated predictions from all three algorithms to construct the ensemble. The figure shows average prediction performance over 100 bootstrap samples of the test data, with error bars showing 95% confidence intervals. Above the bars we note the percentage of bootstrap experiments in which each method achieves the lowest mean cross-entropy loss. When the four homogeneous studies are used, the average performance of the three ensemble strategies is better than that of the merging strategy, which is consistent with observations from the simulation study at high severity of batch effects. Different ensemble methods can be the best on different test sets (the optimal study and ensemble combinations are D: batch-size weights, E: stacking regression weights, G: cross-study weights). For study F, the three ensemble methods are roughly equal, each winning 33% of the time
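The evaluation described in this caption (mean cross-entropy loss over bootstrap samples of the test data, with a 95% interval) can be sketched as follows. This is an illustrative sketch only: the percentile interval, the function names (`cross_entropy`, `bootstrap_loss`), the seed and the example labels and probabilities are assumptions, and the caption does not state how the authors computed their confidence intervals.

```python
import math
import random
from typing import List, Sequence, Tuple


def cross_entropy(y_true: Sequence[int], p_pred: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def bootstrap_loss(y_true: Sequence[int], p_pred: Sequence[float],
                   n_boot: int = 100, seed: int = 0) -> Tuple[float, float, float]:
    """Resample test samples with replacement and return the mean loss plus
    a percentile-based 95% interval (the interval method is an assumption)."""
    rng = random.Random(seed)
    n = len(y_true)
    losses: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        losses.append(cross_entropy([y_true[i] for i in idx],
                                    [p_pred[i] for i in idx]))
    losses.sort()
    mean = sum(losses) / n_boot
    lower = losses[int(0.025 * (n_boot - 1))]
    upper = losses[int(0.975 * (n_boot - 1))]
    return mean, lower, upper


# Hypothetical test labels and ensemble probabilities.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
p_test = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]
mean_loss, ci_low, ci_high = bootstrap_loss(y_test, p_test)
print(mean_loss, ci_low, ci_high)
```

Running the same resampling indices for each competing method and counting how often each has the lowest loss would give the "percentage of wins" reported above the bars.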

Cited by

Hierarchical resampling for bagging in multistudy prediction with applications to human neurochemical sensing.
Loewinger G, Patil P, Kishida KT, Parmigiani G. Ann Appl Stat. 2022 Dec;16(4):2145-2165. doi: 10.1214/21-aoas1574. Epub 2022 Sep 26. PMID: 36274786.
An immuno-score signature of tumor immune microenvironment predicts clinical outcomes in locally advanced rectal cancer.
Xue Z, Yang S, Luo Y, He M, Qiao H, Peng W, Tong S, Hong G, Guo Y. Front Oncol. 2022 Sep 29;12:993726. doi: 10.3389/fonc.2022.993726. eCollection 2022. PMID: 36248969.
Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies.
Gao Y, Sun F. PLoS Comput Biol. 2023 Oct 16;19(10):e1010608. doi: 10.1371/journal.pcbi.1010608. eCollection 2023 Oct. PMID: 37844077.
Evaluation of normalization methods for predicting quantitative phenotypes in metagenomic data analysis.
Wang B, Luan Y. Front Genet. 2024 Jun 5;15:1369628. doi: 10.3389/fgene.2024.1369628. eCollection 2024. PMID: 38903761.
A multimodal approach for visualization and identification of electrophysiological cell types in vivo.
Lee EK, Gül AE, Heller G, Lakunina A, Yu H, Shelton A, Olsen S, Steinmetz NA, Hurwitz C, Jaramillo S, Przytycki PF, Chandrasekaran C. bioRxiv [Preprint]. 2025 Jul 31:2025.07.24.666654. doi: 10.1101/2025.07.24.666654. PMID: 40766549.


