Mol Biol Evol. 2020 May 1;37(5):1495-1507.
doi: 10.1093/molbev/msz307.

Deep Residual Neural Networks Resolve Quartet Molecular Phylogenies

Zhengting Zou et al. Mol Biol Evol.

Abstract

Phylogenetic inference is of fundamental importance to evolutionary as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification or insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex nonlinear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. Furthermore, when combined with the quartet puzzling algorithm, residual network predictors can be used to reconstruct trees with more than four taxa. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl; last accessed January 3, 2020).

Keywords: deep learning; heterotachy; long-branch attraction; phylogenetic inference; protein sequence evolution; residual neural network.
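
As a small illustration of the inference target described in the abstract: four taxa admit exactly three unrooted, fully resolved topologies, so resolving a quartet can be viewed as choosing one of three candidates. The sketch below only enumerates those candidates in Newick-like form. The function name `quartet_topologies` and the taxon labels are hypothetical and are not taken from PhyDL.

```python
def quartet_topologies(taxa):
    """Return the three unrooted topologies for four taxa.

    Each topology is a pair of "cherries": the two taxa grouped on one
    side of the internal branch and the two grouped on the other side.
    """
    if len(set(taxa)) != 4:
        raise ValueError("exactly four distinct taxa are required")
    first, rest = taxa[0], list(taxa[1:])
    topologies = []
    for partner in rest:
        # Pair the first taxon with each other taxon in turn;
        # the two remaining taxa form the opposite cherry.
        left = (first, partner)
        right = tuple(t for t in rest if t != partner)
        topologies.append((left, right))
    return topologies


# Hypothetical taxon labels, for illustration only.
trees = quartet_topologies(["A", "B", "C", "D"])
for left, right in trees:
    print(f"(({left[0]},{left[1]}),({right[0]},{right[1]}));")
```

Fixing the first taxon and varying its partner avoids listing the same split twice, since ((A,B),(C,D)) and ((C,D),(A,B)) describe the same unrooted tree.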

Figures

Fig. 1.
Residual networks gain predictive power in resolving quartet trees through training. (a) Cross-entropy loss (a measure of error; see Materials and Methods) of the predictors on corresponding training and validation data sets after each training epoch. Blue arrows indicate the predictors used in subsequent analyses, because these predictors have the lowest validation cross-entropy losses. (b) Performances of residual networks at sampled epochs on test data sets with normal trees and LBA trees, respectively. A dashed line indicates the best performance among the existing methods examined on normal trees (green) or LBA trees (purple), with the best performing method indicated below the dashed line.
Fig. 2.
Residual network predictors generally outperform existing methods on quartet trees with diverse properties. Numbers of correct inferences out of 1,000 trees are shown for different test data sets with (a) different ranges of branch lengths, (b) different ranges of amino acid sequence lengths, and (c) different levels of heterotachy.
Fig. 3.
Residual network predictors generally outperform existing methods on LBA trees with heterotachy. (a) A schematic quartet tree showing branch length notations. (b) A 3-D surface of mean DNN3 accuracies (also indicated by color) across all c/b levels in each subplot of panel (c), in the space of b levels and a/b levels. Circled numbers correspond to those in (c). (c) Proportions of 100 quartet trees correctly inferred by our predictors (shown by different colors) and the existing methods (shown by different gray symbols) under various combinations of the parameters b, a/b, and c/b. For each c/b level indicated on the X-axis, if a residual network predictor performs better than all existing methods, a pentagon of the corresponding color is drawn on the top of the panel.
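
The Fig. 1 caption refers to cross-entropy loss as a measure of error. The paper's exact definition is in its Materials and Methods, which is not reproduced here, so the following is only a sketch of the standard categorical cross-entropy, assuming a predictor that outputs one probability per candidate quartet topology. The function names and the example probabilities are hypothetical.

```python
import math


def cross_entropy(predicted_probs, true_index):
    """Cross-entropy for one example: -log of the probability
    the predictor assigned to the true topology."""
    p = predicted_probs[true_index]
    if p <= 0.0:
        raise ValueError("probability of the true class must be positive")
    return -math.log(p)


def mean_cross_entropy(batch_probs, true_indices):
    """Average the per-example losses over a batch of quartets."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, true_indices)]
    return sum(losses) / len(losses)


# Hypothetical predictions over three topologies for two quartets.
batch = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
labels = [0, 1]
print(mean_cross_entropy(batch, labels))
```

Under this definition, a predictor that assigns equal probability to all three topologies scores log 3 per example, and a predictor that is fully confident in the correct topology scores 0.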
