2021 Jul 12;37(11):1521-1527.
doi: 10.1093/bioinformatics/btaa986.

Robustifying genomic classifiers to batch effects via ensemble learning

Yuqing Zhang et al. Bioinformatics.

Abstract

Motivation: Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in the data caused by discrepancies across batches. Such 'batch effects' often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merges the data from different batches, then estimates the batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.
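
The two-step strategy described above (fit one model per batch, then combine the per-batch predictions with ensemble weights) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the nearest-centroid learner stands in for whatever classifier is trained within a batch, the toy data are invented, and the function names are hypothetical. Only the batch-size weighting scheme is named in the article (see Fig. 1), and the exact form used there may differ.

```python
import math


def fit_centroid_model(X, y):
    """Toy per-batch learner (an illustrative stand-in, not the article's
    classifiers): store the centroid of each class within one batch."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return centroid(pos), centroid(neg)


def predict_proba(model, x):
    """Probability of class 1, from squared distances to the two centroids."""
    pos, neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
    return 1.0 / (1.0 + math.exp(d_pos - d_neg))


def batch_size_weights(batches):
    """Weight each batch's model by its share of the training samples
    (assumed form of 'batch size weights')."""
    sizes = [len(y) for _, y in batches]
    total = sum(sizes)
    return [n / total for n in sizes]


def ensemble_predict(models, weights, x):
    """Weighted average of the per-batch predicted probabilities."""
    return sum(w * predict_proba(m, x) for m, w in zip(models, weights))


# Invented toy data: two batches, the second shifted relative to the first.
batches = [
    ([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]], [0, 0, 1, 1]),
    ([[0.5, 0.6], [1.5, 1.4]], [0, 1]),
]

# Step 1: one model per batch; the batches are never merged.
models = [fit_centroid_model(X, y) for X, y in batches]
# Step 2: integrate the per-batch predictions with ensemble weights.
weights = batch_size_weights(batches)
p = ensemble_predict(models, weights, [1.0, 1.0])
```

Because each model only ever sees data from its own batch, no cross-batch adjustment is needed at training time; the batch structure enters only through the weights.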

Results: We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then used the two methods to develop a genomic classifier for the binary indicator of disease status, and evaluated the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low levels of heterogeneity, our ensemble learning strategy achieves more robust performance, especially at high severity of batch effects. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.

Availability and implementation: The data underlying this article are available in the article and in its online supplementary material. Processed data are available, together with the implementation code, in the GitHub repository at https://github.com/zhangyuqing/bea_ensemble.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1. Comparison between ensembling and merging when using Random Forests. Three out of our five choices of ensembling weights are displayed: batch size weights, cross-study weights and stacking regression weights (see Section 2 for details).
Fig. 2. Application of ensemble learning to predicting active TB against latent infection. We iteratively selected one of the studies in Table 1 as the independent test study. The remaining studies are viewed as ‘batches’ in the training set. We trained LASSO, Random Forest and SVM, then aggregated predictions from all three algorithms to construct the ensemble. The figure shows average prediction performance over 100 bootstrap samples of the test data, with error bars showing 95% confidence intervals. Above the bars we note the percentage of bootstrap experiments in which each method achieves the lowest mean cross-entropy loss. When the four homogeneous studies are used, the average performance of the three ensemble strategies is better than that of the merging strategy, which is consistent with observations from the simulation study at high severity of batch effects. Different ensemble methods can be the best on different test sets (the optimal study and ensemble combinations are D: batch-size weights, E: stacking regression weights, G: cross-study weights). For study F, the three ensemble methods are roughly equal, each winning about 33% of the time.
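
The evaluation described in the Fig. 2 caption (bootstrap resampling of the test data, mean cross-entropy loss, 95% intervals, and the share of bootstrap rounds each method wins) can be sketched as follows. This is only an illustration under stated assumptions: the percentile interval is one common way to form a 95% interval and may not be the one used in the article, the labels and predicted probabilities are invented, and the names `method_a` and `method_b` are hypothetical placeholders rather than results from the study.

```python
import math
import random


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def bootstrap_compare(y_true, preds_by_method, n_boot=100, seed=0):
    """Resample the test set with replacement; every method is scored on the
    same resampled indices so that per-round losses are comparable."""
    rng = random.Random(seed)
    n = len(y_true)
    losses = {name: [] for name in preds_by_method}
    wins = {name: 0 for name in preds_by_method}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y_b = [y_true[i] for i in idx]
        round_loss = {
            name: cross_entropy(y_b, [p[i] for i in idx])
            for name, p in preds_by_method.items()
        }
        for name, loss in round_loss.items():
            losses[name].append(loss)
        # The method with the lowest loss wins this bootstrap round.
        wins[min(round_loss, key=round_loss.get)] += 1
    win_rate = {name: w / n_boot for name, w in wins.items()}
    return losses, win_rate


def percentile_interval(values, lower=0.025, upper=0.975):
    """Percentile interval (an assumed choice of 95% interval)."""
    s = sorted(values)
    lo = s[int(math.floor(lower * (len(s) - 1)))]
    hi = s[int(math.ceil(upper * (len(s) - 1)))]
    return lo, hi


# Invented labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
preds = {
    "method_a": [0.8, 0.7, 0.9, 0.2, 0.3, 0.1, 0.6, 0.4],
    "method_b": [0.6, 0.5, 0.7, 0.4, 0.5, 0.3, 0.5, 0.5],
}

losses, win_rate = bootstrap_compare(y_true, preds, n_boot=100)
mean_a = sum(losses["method_a"]) / len(losses["method_a"])
ci_a = percentile_interval(losses["method_a"])
```

Scoring all methods on identical resamples is a design choice that makes the win percentage meaningful: differences between methods within a round then reflect the predictions, not the luck of the draw.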
