Mol Biol Evol. 2025 Jun 4;42(6):msaf124.
doi: 10.1093/molbev/msaf124.

A General Substitution Matrix for Structural Phylogenetics

Sriram G Garg et al. Mol Biol Evol.

Abstract

Sequence-based maximum likelihood phylogenetics is a widely used method for inferring evolutionary relationships and has illuminated the evolutionary histories of proteins and the organisms that harbor them. However, modern implementations with sophisticated models of sequence evolution struggle to resolve deep evolutionary relationships, which can be obscured by excessive sequence divergence and substitution saturation. Structural phylogenetics has emerged as a promising alternative because protein structure evolves much more slowly than protein sequence. Recent developments in protein structure prediction using AI have made it possible to predict protein structures for entire protein families and then to translate these structures into a sequence representation, the 3Di structural alphabet, that can in theory be fed directly into existing sequence-based phylogenetic software. Unlocking the full potential of this idea, however, requires the inference of a general substitution matrix for structural phylogenetics, which has so far been missing. Here, we infer this matrix from large datasets of protein structures and show that it results in a better fit to empirical datasets than previous approaches. We then use this matrix to revisit the question of the root of the tree of life. Using structural phylogenies of universal paralogs, we provide the first unambiguous evidence for a root between archaea and bacteria. Finally, we discuss some practical and conceptual limitations of structural phylogenetics. Our 3Di substitution matrix provides a starting point for revisiting many deep phylogenetic problems that have so far been extremely difficult to solve.

Keywords: evolution; maximum likelihood; phylogenetics; structural phylogenetics; substitution models.

Figures

Fig. 1.
a) Overview of the pipeline employed in the manuscript. Briefly, AlphaFold structures or AA protein sequences were translated into 3Di characters using FoldSeek or the bilingual ProtT5 model, respectively. These 3Di characters were aligned with MAFFT using the 3Di scoring matrix and then used to estimate the general substitution Q-matrix with QMaker; the resulting Q-matrix was subsequently used to estimate 3Di-ML trees with IQ-Tree. b) The lower triangular portion represents the Q-matrix estimated from 1,660 AF clusters, while the upper triangular portion represents the Q-matrix estimated from 6,653 PFAM clusters translated to the 3Di alphabet using the ProtT5 bilingual language model. In both cases, values higher than 2 are colored orange. c) Ratio of exchangeabilities between the Q.3Di.AF and Q.3Di.LLM matrices. Each square represents the value (m1_ij - m2_ij)/(m1_ij + m2_ij), where m1 and m2 represent Q.3Di.AF and Q.3Di.LLM, respectively. d) Pearson’s correlation between the exchangeabilities of the two matrices indicates very little difference between them.
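The comparison in panels (c) and (d) can be illustrated with a minimal sketch. It assumes that the per-entry statistic in (c) is the relative difference (m1_ij - m2_ij)/(m1_ij + m2_ij) and that the correlation in (d) is computed over the off-diagonal exchangeabilities. The function names and the 3x3 toy matrices are hypothetical and are not taken from the paper; the real matrices are 20x20.

```python
from math import sqrt


def lower_triangle(m):
    """Flatten the strictly lower triangle (i > j) of a square matrix."""
    return [m[i][j] for i in range(len(m)) for j in range(i)]


def relative_difference(m1, m2):
    """Return {(i, j): (m1_ij - m2_ij) / (m1_ij + m2_ij)} for all i > j.

    Assumed form of the panel (c) statistic; entries whose sum is zero
    are reported as 0.0 to avoid division by zero.
    """
    out = {}
    for i in range(len(m1)):
        for j in range(i):
            total = m1[i][j] + m2[i][j]
            out[(i, j)] = (m1[i][j] - m2[i][j]) / total if total else 0.0
    return out


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy symmetric exchangeability matrices (made-up values for illustration).
toy_af = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 4.0],
          [2.0, 4.0, 0.0]]
toy_llm = [[0.0, 1.0, 1.0],
           [1.0, 0.0, 4.0],
           [1.0, 4.0, 0.0]]

rel = relative_difference(toy_af, toy_llm)
r = pearson(lower_triangle(toy_af), lower_triangle(toy_llm))
```

Under this assumed definition, values of `rel` near 0 mark exchangeabilities on which the two matrices agree, and values near +1 or -1 mark entries that are much larger in one matrix than in the other.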
Fig. 2.
a) A schematic representation of paralog rooting. Three possible root positions are shown, with the “true” root depicted by a green star and two other possible roots by circles. In the scenario where paralogous rooting is successful, both paralog subtrees reciprocally root each other (right). Other possible scenarios, in which the paralog subtrees are ambiguously rooted, are also shown (left). b) AA ML tree containing 1,076 EF-Tu and EF-G homologs from eukaryotes, bacteria, and archaea. The mitochondrial and plastid-encoded copies are not included. Note that the branch separating EF-Tu and EF-G is broken for illustration. c) 3Di structural ML tree estimated from 3Di sequences, derived from the predicted AlphaFold structures of 1,069 EF-Tu and EF-G homologs, using the Q.3Di.AF model. In both cases, blue, red, and gray clades represent bacteria, archaea, and eukaryotes, respectively. Numbers in red and black indicate branch lengths and ultrafast bootstrap supports, respectively.
Fig. 3.
a) A schematic representation of the bacterial and archaeal ATPase highlighting the subunits under investigation. The subunits are shown in the same colors in the phylogenetic trees. b) AA ML tree, reproduced from Mahendrarajah et al. (2023), of 1,520 sequences across the tree of life (ToL) from the catalytic and noncatalytic subunits of bacterial, archaeal, and eukaryotic rotary ATPase. The early branching transfer from bacteria and archaea in the noncatalytic V1 clade is highlighted in white with a black outline. The corresponding clade in the V1 catalytic clade branches deep inside the archaeal sequences and is highlighted similarly. c) 3Di structural tree estimated using the Q.3Di.AF model. Sequences assigned to the early transfer from the archaeal clade to bacteria are highlighted as in (b), but now this transfer is inferred for both the catalytic and noncatalytic subunits. Numbers in red and black indicate branch lengths and ultrafast bootstrap supports, respectively. In both cases, gray clades represent eukaryotes. The green circles and orange squares indicate cyanobacterial and proteobacterial contributions in eukaryotes, representing the plastid and mitochondrial ATPases, respectively.
Fig. 4.
a) AA ML tree of 321 RC1 protein sequences. Note that long branches are broken as indicated for illustration. b) 3Di structural ML tree of 297 3Di sequences from AlphaFold structures using the Q.3Di.AF model. Numbers in red and black indicate branch lengths and ultrafast bootstrap supports, respectively.
Fig. 5.
a) 3Di structural ML tree constructed from KaiB proteins modeled in the ground state. b) 3Di structural ML tree constructed from approximately 50% of the KaiB proteins modeled in the ground state (blue) and the other 50% modeled in the fold-switched state (green). c) 3Di structural ML tree constructed from RPC10 proteins modeled in the IN conformation. d) 3Di structural ML tree constructed from approximately 50% of the RPC10 proteins modeled in the OUT conformation (blue) and the other 50% modeled in the IN conformation (green). In both (b) and (d), the distinct conformations form monophyletic groups, in contrast to their placements in (a) and (c), respectively.
