Pac Symp Biocomput. 2020;25:451-462.

Tree-Weighting for Multi-Study Ensemble Learners


Maya Ramchandran et al. Pac Symp Biocomput. 2020.

Abstract

Multi-study learning uses multiple training studies, separately trains classifiers on each, and forms an ensemble with weights rewarding members with better cross-study prediction ability. This article considers novel weighting approaches for constructing tree-based ensemble learners in this setting. Using Random Forests as a single-study learner, we compare weighting each forest to form the ensemble, to extracting the individual trees trained by each Random Forest and weighting them directly. We find that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor. Furthermore, we explore how ensembling weights correspond to tree structure, to shed light on the features that determine whether weighting trees directly is advantageous. Finally, we apply our approach to genomic datasets and show that weighting trees improves upon the basic multi-study learning paradigm. Code and supplementary material are available at https://github.com/m-ramchandran/tree-weighting.
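The multi-study approach described above can be sketched in code. The following is a minimal illustrative example, not the authors' implementation (see their GitHub repository for that): it trains one Random Forest per simulated study, pools the individual trees, and learns non-negative stacking weights for the trees by regressing the studies' outcomes on the pooled tree predictions. Assumptions to note: the data-generating model, the use of `nnls` as the stacking regression, and the inclusion of in-study predictions in the stacking matrix are all simplifications of the paper's cross-study weighting scheme.

```python
# Hedged sketch of tree-weighted multi-study ensembling (not the paper's code).
import numpy as np
from scipy.optimize import nnls
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulate K heterogeneous studies: shared linear signal plus
# study-specific perturbations of the coefficients.
K, n, p = 4, 100, 5
def make_study(shift=0.5):
    X = rng.normal(size=(n, p))
    beta = np.ones(p) + shift * rng.normal(size=p)  # study-specific effects
    y = X @ beta + rng.normal(size=n)
    return X, y

studies = [make_study() for _ in range(K)]

# One Random Forest per study, then pool the individual trees.
forests = [RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
           for X, y in studies]
trees = [t for f in forests for t in f.estimators_]

# Stacking: regress the stacked outcomes on the trees' predictions across
# all studies, with non-negative least squares as a simple stand-in for
# the paper's stacked-regression weighting.
X_stack = np.vstack([np.column_stack([t.predict(X) for t in trees])
                     for X, _ in studies])
y_stack = np.concatenate([y for _, y in studies])
w, _ = nnls(X_stack, y_stack)
w = w / w.sum()  # normalize to tree-level ensemble weights

# Ensemble prediction on a new study: weighted average of tree predictions.
X_test, y_test = make_study()
pred = np.column_stack([t.predict(X_test) for t in trees]) @ w
rmse = np.sqrt(np.mean((pred - y_test) ** 2))
```

In contrast, "Weighting Forests" would learn one weight per forest (here, 4 weights instead of 40), and the Merged learner would train a single forest on the concatenated studies.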


Figures

Fig. 1:
Baseline analyses. (A) Percent variation in outcome explained by interactions, as a function of the magnitude of interaction coefficients in the generating model. (B)-(C) Percent change in average RMSE of each ensembling approach compared to the Merged learner, as a function of between-study heterogeneity level. In (B) all training sets have identical feature distributions, while in (C) the TCGA study is randomly split into 5 sub-datasets at every iteration for training and testing. Weighting Trees and Weighting Forests significantly improve upon Merged, with the difference in performance decreasing as heterogeneity increases. Smoothing is applied to reduce simulation noise.
Fig. 2:
Average RMSEs of ensembling approaches (color labeled) across different data-generating scenarios, as a function of increasing interaction strength or heterogeneity. (A) 2 datasets with interaction terms between features in the outcome-generating mechanism are included in the training set, and 2 are included in the testing set. (B) 6 datasets with interactions are included in the training set, 2 in the testing set. (C) No datasets with interaction terms are included in either training or testing, and performance is evaluated for increasing feature effect heterogeneity.
Fig. 3:
Distribution of tree-level weights using the Weighting Trees or Weighting Forests methods, as well as their difference. 2 training sets contain interaction terms. Tree-level weights for Weighting Forests are obtained by dividing the forest-level weight returned by the stacking algorithm by the number of trees per forest. Correspondingly, each point in Weighting Forests represents the value of the weight given to 10 trees. The dashed red line at y = .01 represents the weight given to every tree within the Merged and Unweighted ensembles.
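The caption's conversion from forest-level to tree-level weights is simple arithmetic; a small hedged sketch (with made-up example weights, 3 forests of 10 trees each) illustrates it:

```python
# Hypothetical forest-level stacking weights for 3 forests (not real results).
forest_weights = [0.3, 0.5, 0.2]
trees_per_forest = 10

# Spread each forest's weight uniformly across its trees, as described
# in the Fig. 3 caption.
tree_weights = [w / trees_per_forest
                for w in forest_weights
                for _ in range(trees_per_forest)]

# An unweighted ensemble would instead give every tree an equal weight
# of 1 / (number of trees), here 1/30.
```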
Fig. 4:
Performance of ensembling approaches on the breast cancer datasets in the multi-study setting, with associated 95% confidence intervals. (A.1) Average percent change in prediction Log Loss from the Merged for each of the ensembling approaches on the binary outcome variable, Overall Survival (OS). Confidence intervals were obtained by training each of the ensembling approaches 100 times, with differences in performance across iterations induced by the randomization within the Random Forest algorithm. (A.2) A view of panel A.1, without the Merged learner to improve scaling, so differences between the ensembles can be clearly visualized. (B) Average percent change in RMSE from the Merged when predicting expression levels for each of the top 500 variable genes given the rest of the gene expression data. The standard errors were therefore computed over 500 samples, as opposed to the 100 in panel A.2.
