Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Mar 31;12(1):1983.
doi: 10.1038/s41467-021-22073-8.

Harnessing machine learning to guide phylogenetic-tree search algorithms

Affiliations

Harnessing machine learning to guide phylogenetic-tree search algorithms

Dana Azouri et al. Nat Commun. .

Abstract

Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. The trees defined by an SPR move.
For each data sample, two trees and four subtrees were considered: a an example of a starting tree; b, c the two subtrees induced by the pruned branch of a tree a (in red), where b is the pruned subtree and c the remaining subtree; c1c2 the regrafted branch (in blue) also induces two subtrees, from both sides of the regrafted branch of subtree c; d the resulting tree following the SPR move.
Fig. 2
Fig. 2. Performance evaluation scores on empirical datasets.
A histogram of the three performance scores of the learning algorithm evaluated on: a the 4200 starting trees, using cross-validation; b the 1000 validation set starting trees. On the Y axis of both panels a and b are the percentages of empirical datasets in each accuracy score bin: (1) accuracy is computed as the Spearman correlation coefficient between the predicted ranking and the true ranking of neighboring trees; (2) accuracy is the rank (in percentile) of the empirically best neighbor within the predicted ranking; (3) accuracy is the rank (in percentile) of the predicted best neighbor within the empirical ranking.
Fig. 3
Fig. 3. Example of an iterative chain of moves.
Evaluation metrics for the convergence behavior of an iterative chain of trees towards the maximum-likelihood tree. The chain was initiated with the Neighbor-Joining tree reconstructed for the protein-coding algae dataset we used as an example throughout the manuscript (iteration #0). The line plots represent the Robinson-Foulds distance from the maximum-likelihood tree (in blue; left Y axis), and the log-likelihood of the tree obtained in each iteration (in orange; right Y axis).
Fig. 4
Fig. 4. Feature selection following a backward stepwise elimination procedure.
The mean Spearman correlation coefficient obtained when using a decreasing number of features for training the algorithm on the 4200 starting trees. The box shows the quartiles of the dataset (thus the center is the median) while the whiskers extend to show the 1.5 × IQR past the low and high quartiles. The table at the bottom elaborates the feature composition within each set of features, as determined by the backward stepwise elimination procedure.

References

    1. Thorne JL. Models of protein sequence evolution and their applications. Curr. Opin. Genet. Dev. 2000;10:602–605. doi: 10.1016/S0959-437X(00)00142-8. - DOI - PubMed
    1. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 1981;17:368–376. doi: 10.1007/BF01734359. - DOI - PubMed
    1. Chor B, Tuller T. Maximum likelihood of evolutionary trees: Hardness and approximation. Bioinformatics. 2005;21:i97–i106. doi: 10.1093/bioinformatics/bti1027. - DOI - PubMed
    1. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987;4:406–425. - PubMed
    1. Ogden TH, Rosenberg MS. Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. 2006;55:314–328. doi: 10.1080/10635150500541730. - DOI - PubMed

Publication types

LinkOut - more resources