Robustifying genomic classifiers to batch effects via ensemble learning
- PMID: 33245114
- PMCID: PMC8485848
- DOI: 10.1093/bioinformatics/btaa986
Abstract
Motivation: Genomic data are often produced in batches due to practical restrictions, which may introduce unwanted variation caused by discrepancies across batches. Such 'batch effects' often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software that merges the data from different batches, then estimates the batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning: we first develop prediction models within each batch, then integrate them through ensemble weighting methods.
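The ensemble strategy described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article's implementation: the nearest-centroid learner, the toy two-feature data, the function names (`fit_centroid_model`, `ensemble_predict`, `make_batch`), and the two weighting schemes (equal weights and batch-size weights). It shows only the general pattern of fitting one model per batch and combining their scores with a weighted average.

```python
import math
import random

def fit_centroid_model(X, y):
    """Fit a nearest-centroid scorer on one batch (illustrative learner only)."""
    n_features = len(X[0])

    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]

    c0, c1 = centroid(0), centroid(1)

    def predict(x):
        # Logistic transform of the distance difference: a score in (0, 1),
        # above 0.5 when x lies closer to the class-1 centroid.
        diff = math.dist(x, c1) - math.dist(x, c0)
        return 1.0 / (1.0 + math.exp(diff))

    return predict

def ensemble_predict(models, weights, x):
    """Weighted average of the per-batch model scores for one sample."""
    total = sum(weights)
    return sum(w * m(x) for m, w in zip(models, weights)) / total

def make_batch(rng, n_per_class, shift):
    """Toy batch: two classes in two features, plus a batch-wide additive shift."""
    X, y = [], []
    for label, center in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append([center + shift + rng.gauss(0, 0.5) for _ in range(2)])
            y.append(label)
    return X, y

rng = random.Random(0)
batches = [make_batch(rng, 20, 0.0), make_batch(rng, 30, 1.0)]

# Step 1: train one model inside each batch; batches are never merged.
models = [fit_centroid_model(X, y) for X, y in batches]

# Step 2: combine the models with ensemble weights (two hypothetical schemes).
equal_weights = [1.0] * len(models)
size_weights = [float(len(y)) for _, y in batches]

new_sample = [3.5, 3.5]
score_equal = ensemble_predict(models, equal_weights, new_sample)
score_sized = ensemble_predict(models, size_weights, new_sample)
print(score_equal, score_sized)
```

Because each model is trained within a single batch, it never has to separate batch-to-batch shifts from the class signal in its training data; any learner that returns a score could be substituted for the centroid model used here.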
Results: We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treated as simulated batches. We then used the two strategies to develop a genomic classifier for the binary indicator of disease status, and evaluated its prediction accuracy in another, independent study targeting a different population cohort. In this independent validation, merging followed by batch adjustment provided better discrimination at low levels of heterogeneity, whereas our ensemble learning strategy achieved more robust performance, especially when batch effects were severe. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.
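The experimental design above (random subsets treated as batches, increasing heterogeneity, discrimination measured on independent data) can be sketched as follows. This is an illustration under stated assumptions, not the article's simulation: the additive per-batch mean shift, the `severity` parameter, and the function names `simulate_batches` and `auc` are all hypothetical, and the actual study may have altered the data differently.

```python
import random

def simulate_batches(X, n_batches, severity, seed=0):
    """Randomly assign samples to batches and add a batch-specific shift.

    Assumed model: each batch gets one random additive offset per feature,
    scaled by `severity`; severity = 0 leaves the data unchanged.
    """
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)

    # Deal shuffled samples round-robin so batch sizes differ by at most one.
    batch_of = [0] * len(X)
    for position, sample_index in enumerate(order):
        batch_of[sample_index] = position % n_batches

    n_features = len(X[0])
    shifts = [[severity * rng.gauss(0, 1) for _ in range(n_features)]
              for _ in range(n_batches)]
    X_sim = [[value + shift for value, shift in zip(row, shifts[b])]
             for row, b in zip(X, batch_of)]
    return X_sim, batch_of

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 10 samples with 2 features each.
X = [[float(i), float(2 * i)] for i in range(10)]
for severity in (0.0, 1.0, 3.0):
    X_sim, batch_of = simulate_batches(X, n_batches=3, severity=severity)
    print(severity, X_sim[0], batch_of)
```

In a full comparison, each severity level would be followed by training both strategies on `X_sim` and scoring them with `auc` on a separate validation set.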
Availability and implementation: The data underlying this article are available in the article and in its online supplementary material. Processed data are available, together with the implementation code, in the GitHub repository at https://github.com/zhangyuqing/bea_ensemble.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.