Harnessing machine learning to guide phylogenetic-tree search algorithms

Dana Azouri^{1

2}, Shiran Abadi¹, Yishay Mansour³, Itay Mayrose⁴, Tal Pupko⁵

Affiliations

¹ School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
² The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
³ Balvatnik School of Computer Science, Tel-Aviv University, Ramat Aviv, Tel-Aviv, Israel.
⁴ School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel. itaymay@tauex.tau.ac.il.
⁵ The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel. talp@tauex.tau.ac.il.

PMID: 33790270
PMCID: PMC8012635
DOI: 10.1038/s41467-021-22073-8

Harnessing machine learning to guide phylogenetic-tree search algorithms

Dana Azouri et al. Nat Commun. 2021.

. 2021 Mar 31;12(1):1983.

doi: 10.1038/s41467-021-22073-8.

Authors

Dana Azouri^{1

2}, Shiran Abadi¹, Yishay Mansour³, Itay Mayrose⁴, Tal Pupko⁵

Affiliations

¹ School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
² The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
³ Balvatnik School of Computer Science, Tel-Aviv University, Ramat Aviv, Tel-Aviv, Israel.
⁴ School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel. itaymay@tauex.tau.ac.il.
⁵ The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel. talp@tauex.tau.ac.il.

PMID: 33790270
PMCID: PMC8012635
DOI: 10.1038/s41467-021-22073-8

Abstract

Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. The trees defined by an SPR move.**
For each data sample, two trees and four subtrees were considered: a an example of a starting tree; b, c the two subtrees induced by the pruned branch of a tree a (in red), where b is the pruned subtree and c the remaining subtree; c1, c2 the regrafted branch (in blue) also induces two subtrees, from both sides of the regrafted branch of subtree c; d the resulting tree following the SPR move.

**Fig. 2. Performance evaluation scores on empirical datasets.**
A histogram of the three performance scores of the learning algorithm evaluated on: a the 4200 starting trees, using cross-validation; b the 1000 validation set starting trees. On the Y axis of both panels a and b are the percentages of empirical datasets in each accuracy score bin: (1) accuracy is computed as the Spearman correlation coefficient between the predicted ranking and the true ranking of neighboring trees; (2) accuracy is the rank (in percentile) of the empirically best neighbor within the predicted ranking; (3) accuracy is the rank (in percentile) of the predicted best neighbor within the empirical ranking.

**Fig. 3. Example of an iterative chain of moves.**
Evaluation metrics for the convergence behavior of an iterative chain of trees towards the maximum-likelihood tree. The chain was initiated with the Neighbor-Joining tree reconstructed for the protein-coding algae dataset we used as an example throughout the manuscript (iteration #0). The line plots represent the Robinson-Foulds distance from the maximum-likelihood tree (in blue; left Y axis), and the log-likelihood of the tree obtained in each iteration (in orange; right Y axis).

**Fig. 4. Feature selection following a backward stepwise elimination procedure.**
The mean Spearman correlation coefficient obtained when using a decreasing number of features for training the algorithm on the 4200 starting trees. The box shows the quartiles of the dataset (thus the center is the median) while the whiskers extend to show the 1.5 × IQR past the low and high quartiles. The table at the bottom elaborates the feature composition within each set of features, as determined by the backward stepwise elimination procedure.

See this image and copyright information in PMC

References

1. Thorne JL. Models of protein sequence evolution and their applications. Curr. Opin. Genet. Dev. 2000;10:602–605. doi: 10.1016/S0959-437X(00)00142-8. - DOI - PubMed
1. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 1981;17:368–376. doi: 10.1007/BF01734359. - DOI - PubMed
1. Chor B, Tuller T. Maximum likelihood of evolutionary trees: Hardness and approximation. Bioinformatics. 2005;21:i97–i106. doi: 10.1093/bioinformatics/bti1027. - DOI - PubMed
1. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987;4:406–425. - PubMed
1. Ogden TH, Rosenberg MS. Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. 2006;55:314–328. doi: 10.1080/10635150500541730. - DOI - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Harnessing machine learning to guide phylogenetic-tree search algorithms

Affiliations

Harnessing machine learning to guide phylogenetic-tree search algorithms

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources