Bioinformatics. 2024 Jun 28;40(Suppl 1):i208-i217.
doi: 10.1093/bioinformatics/btae255.

A machine-learning-based alternative to phylogenetic bootstrap

Noa Ecker et al. Bioinformatics.

Abstract

Motivation: Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance.

Results: Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets.
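The labeling step described above, in which each bipartition of an inferred tree is marked as present in or absent from the true tree, can be sketched as follows. This is a minimal illustration, not the authors' code: the nested-tuple tree representation and the function names `leaves`, `bipartitions`, and `label_branches` are assumptions made for this example.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree (hypothetical format)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree.

    Each bipartition is stored as the leaf set on the side that does NOT
    contain the alphabetically first leaf, so that the same split always
    gets the same representation regardless of rooting.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Keep only splits with at least two leaves on each side.
        if 1 < len(clade) < len(all_leaves) - 1:
            side = all_leaves - clade if anchor in clade else clade
            splits.add(side)
        return clade

    visit(tree)
    return splits


def label_branches(inferred_tree, true_tree):
    """Map each bipartition of the inferred tree to True if the true tree has it."""
    true_splits = bipartitions(true_tree)
    return {split: split in true_splits for split in bipartitions(inferred_tree)}


# Toy example: the inferred tree recovers the (A,B) split but not (C,D).
true_tree = (("A", "B"), ("C", "D"), "E")
inferred_tree = (("A", "B"), ("C", "E"), "D")
labels = label_branches(inferred_tree, true_tree)
```

Binary labels of this kind are what a classifier would be trained to predict from per-branch features; the features themselves are not described in the abstract and are not sketched here.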

Availability and implementation: The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.

Conflict of interest statement

None declared.

Figures

Figure 1.
ROC curves on various test data. Each panel displays the ROC curve obtained with the branch score predictions generated using the trained machine-learning procedure compared to existing scores obtained with the respective tree search software. The top, middle, and bottom panels represent the scores obtained with trees reconstructed using IQTREE, RAxML-NG, and FastTree, respectively, on the test data. The dotted diagonal line is the y = x line. The remaining curves represent the performance of our machine-learning model along with support values provided by the other programs.
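As a reminder of what the area under a ROC curve measures for branch support scores, the following sketch computes AUC as the probability that a correct branch receives a higher score than an incorrect one, with ties counted as one half. The function name `roc_auc` and the toy data are assumptions for illustration; the abstract does not state which implementation the authors used.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (correct, incorrect) branch pairs ranked correctly.

    labels: 1 if the branch is in the true tree, 0 otherwise.
    scores: support values, where higher means more confident.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one correct and one incorrect branch")
    # Pairwise comparison: O(len(pos) * len(neg)), fine for small examples.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: three of the four (correct, incorrect) pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

An AUC of 0.5 corresponds to the dotted diagonal line in the figure, that is, a score that ranks correct and incorrect branches no better than chance.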
Figure 2.
Calibration plots on the test data. Each panel displays the calibration curve obtained with the branch score predictions generated using the trained machine-learning procedure compared to existing scores obtained with the respective tree search software. The top, middle, and bottom panels represent the scores obtained with trees reconstructed using IQTREE, RAxML-NG, and FastTree, respectively, on the test data. The dotted diagonal line is the x = y line. The remaining curves show the performance of our machine-learning model compared to the support values provided by the other programs.
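A calibration curve of this kind can be approximated by grouping branches by predicted support and comparing the mean prediction in each group with the observed fraction of correct branches. The sketch below uses equal-width bins; the bin count, the function name `calibration_curve`, and the toy data are assumptions, since the abstract does not specify how the curves were computed.

```python
def calibration_curve(labels, probs, n_bins=10):
    """Return (mean predicted support, observed fraction correct) per non-empty bin.

    labels: 1 if the branch is in the true tree, 0 otherwise.
    probs: predicted support values in [0, 1].
    """
    # Per bin: [sum of predictions, number of correct branches, number of branches]
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[index][0] += p
        bins[index][1] += int(bool(y))
        bins[index][2] += 1
    return [(total / n, correct / n) for total, correct, n in bins if n]


# Toy data: low-support branches are all wrong, high-support branches all right.
curve = calibration_curve([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9], n_bins=2)
```

Points close to the x = y line indicate that a support value can be read as a probability: among branches scored around 0.8, roughly 80% would be present in the true tree.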
Figure 3.
Influence of various factors on prediction accuracy in the FastTree, IQTREE, and RAxML-NG models: (A) AUC as a function of the number of sequences; (B) AUC as a function of the number of MSA positions; (C) AUC as a function of the MSA difficulty score; (D) AUC as a function of the number of sequences in the smaller part of the bipartition; and (E) logarithmic loss as a function of the number of MSAs used for training. In panels A–D, the x-axis denotes the median value of each bin after dividing the numerical column into 30 quantile-based bins.
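The two quantities behind this figure's axes, quantile-based bin medians (panels A–D) and logarithmic loss (panel E), can be sketched as below. This is an illustrative reconstruction under stated assumptions: the function names, the way bin boundaries are chosen, and the clipping constant `eps` are not taken from the paper.

```python
import math
from statistics import median


def quantile_bin_medians(values, n_bins=30):
    """Split sorted values into n_bins roughly equal-sized groups; return each median."""
    ordered = sorted(values)
    n = len(ordered)
    n_bins = min(n_bins, n)  # avoid empty bins when there are few values
    medians = []
    for i in range(n_bins):
        start = i * n // n_bins
        stop = (i + 1) * n // n_bins
        medians.append(median(ordered[start:stop]))
    return medians


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(labels)


# Toy data: six values in three bins, and an uninformative 0.5 prediction.
bin_medians = quantile_bin_medians([1, 2, 3, 4, 5, 6], n_bins=3)
loss = log_loss([1, 0], [0.5, 0.5])
```

Lower logarithmic loss indicates predictions that are both more accurate and better calibrated; a constant prediction of 0.5 gives a loss of ln 2.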
Figure 4.
Comparison of machine-learning-based support values to aLRT and aBayes support values for the rpl16b gene using IQTREE. The x-axis represents the machine-learning score and the y-axis represents the scores of the other methods. Dots labeled as “N1” correspond to the lineage within stony corals leading to the following species: Agaricia, Galaxea, Porites, Montastraea, and Favia. Dots labeled as “N2” indicate support for sponge paraphyly. Dots labeled as “N3” represent the grouping of two box-jelly genera, Carybdea and Tripedalia.
