Newly Developed Structure-Based Methods Do Not Outperform Standard Sequence-Based Methods for Large-Scale Phylogenomics

Giacomo Mutti^{1,2}, Eduard Ocaña-Pallarès^{1,2,3}, Toni Gabaldón^{1,2,4,5}

Affiliations

¹ Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain.
² Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain.
³ Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Barcelona, Spain.
⁴ Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain.
⁵ CIBER de Enfermedades Infecciosas, Instituto de Salud Carlos III, Madrid, Spain.

PMID: 40580945
PMCID: PMC12290511
DOI: 10.1093/molbev/msaf149

Mol Biol Evol. 2025 Jul 1;42(7):msaf149.

Abstract

Recent developments in protein structure prediction have allowed the use of this previously limited source of information at genome-wide scales. It has been proposed that the use of structural information may offer advantages over sequences in phylogenetic reconstruction, due to their slower rate of evolution and direct correlation to function. Here, we examined how recently developed methods for structure-based homology search and tree reconstruction compare with current state-of-the-art sequence-based methods in reconstructing genome-wide collections of gene phylogenies (i.e. phylomes). While structure-based methods can be useful in specific scenarios, we found that their current performance does not justify using the newly developed structure-based methods as a default choice in large-scale phylogenetic studies. On the one hand, the best performing sequence-based tree reconstruction methods still outperform structure-based methods for this task. On the other hand, structure-based homology detection methods provide larger lists of candidate homologs, as previously reported. However, this comes at the expense of missing hits identified by sequence-based methods, as well as providing sets of homolog candidates with higher fractions of false positives. These insights help to guide the use of structural data in comparative genomics and highlight the need to continue improving structure-based approaches. Our pipeline is fully reproducible and has been implemented in a Snakemake workflow. This will facilitate a continuous assessment of future improvements of structure-based tools in the AlphaFold era.

Keywords: homology; orthology; phylogenetics; phylome; structural phylogenetics.

Figures

**Fig. 1.**
Schematic representation of the pipeline. a) Primary amino acid sequences and 3Di-recoded structures of *Homo sapiens* proteins (seeds) are aligned against a dataset of 18 eukaryotic species with BlastP (Bp) and Foldseek (Fs), respectively. b) Before entering the phylogenetic pipeline, Bp and Fs results are divided into four target sets per seed, as shown in the Venn diagram. The number of query-target pairs in each target set before and after filtering is shown below. c) The four target sets of each seed are submitted to eight tree reconstruction methods. For computational reasons, step c was restricted to 1,000 randomly selected seeds. *Among the randomly selected seeds, only those with at least four common hits entered step c.
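The split into target sets described in panel b can be illustrated with plain set operations. This is a minimal sketch, not the authors' Snakemake workflow: the caption does not name the four sets, so the labels used here (BlastP hits, Foldseek hits, their intersection, and their union) are an assumption, and `partition_targets` and `enters_tree_step` are hypothetical helper names. Only the threshold of at least four common hits comes from the caption.

```python
def partition_targets(blast_hits, foldseek_hits):
    """Group the hits of one seed into four target sets.

    The set names are assumed for illustration; the caption only says
    that BlastP and Foldseek results are divided into four sets.
    """
    bp = set(blast_hits)
    fs = set(foldseek_hits)
    return {
        "blast": bp,          # everything BlastP reported
        "foldseek": fs,       # everything Foldseek reported
        "common": bp & fs,    # hits found by both searches
        "union": bp | fs,     # hits found by either search
    }


def enters_tree_step(target_sets, min_common=4):
    """Apply the caption's rule: at least four common hits are required."""
    return len(target_sets["common"]) >= min_common
```

Under these assumptions, a seed whose two searches share only three targets would be skipped before tree reconstruction.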

**Fig. 2.**
Distributions and correlation of a) percentage identity, b) −Log10(E-value) and c) query coverage for unfiltered BlastP and Foldseek results. The marginal distributions are color-coded according to the target set. Distribution of d) Local Difference Distance Test (LDDT), e) Template Modeling (TM) score and f) mean target predicted LDDT between different target sets. g) Cumulative distribution of mean Jaccard index per query for all levels of CATH annotation, including protein class (C), architecture (A), topology (T), and homologous superfamily (H) (indicated by different transparency levels; see legend). See supplementary methods, Supplementary Material online for details on this analysis.
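The mean Jaccard index in panel g can be sketched as follows. The exact procedure is given in the supplementary methods, which are not reproduced here, so this is one plausible reading and not the authors' implementation: it assumes each protein carries a set of four-field CATH codes, that a code is truncated to its first one to four fields for levels C, A, T and H, and that the per-query value is the mean over that query's targets. The function names and the example codes in the comments are illustrative only.

```python
# Number of dot-separated CATH fields kept at each annotation level.
LEVELS = {"C": 1, "A": 2, "T": 3, "H": 4}


def truncate(code, level):
    """Cut a CATH code such as '3.40.50.300' down to the given level."""
    return ".".join(code.split(".")[:LEVELS[level]])


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_jaccard(query_codes, targets_codes, level):
    """Mean Jaccard index between one query and each of its targets."""
    q = {truncate(c, level) for c in query_codes}
    scores = [
        jaccard(q, {truncate(c, level) for c in target})
        for target in targets_codes
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

With this reading, a target that shares the query's topology but not its homologous superfamily contributes 1 at level T and 0 at level H, so the mean can only stay equal or drop as the level becomes more specific.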

**Fig. 3.**
Boxplot distribution grouped by target set and tree reconstruction method of a) normalised Robinson–Foulds (RF) distance of decomposed single copy gene trees to the species tree, b) First Quartet Frequency support values (this measure indicates how many times the nodes in the species trees are observed in the gene trees), and c) number of gene duplications and losses inferred by gene tree-species tree reconciliation normalised by number of tips. Significant differences to LG are annotated as asterisks over each respective boxplot. See the “Performance assessment of tree reconstruction methods” section of supplementary Methods, Supplementary Material online for details on how P-values were computed.
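For readers unfamiliar with the metric in panel a, the normalised Robinson–Foulds distance between two trees can be sketched in a few lines. This is a generic illustration, not the authors' code: it assumes trees are given as nested tuples of leaf names, that both trees contain exactly the same leaves (a species tree would first need pruning to the gene tree's species), and that the distance is normalised by the total number of non-trivial bipartitions in the two trees. The function names are hypothetical.

```python
def leaf_set(tree):
    """Return all leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions (splits) of the unrooted tree."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # A split is informative only if both sides hold at least two leaves.
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return splits


def normalised_rf(tree_a, tree_b):
    """Symmetric-difference count of splits divided by the total split count."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    a, b = bipartitions(tree_a), bipartitions(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0
```

A value of 0 means the two topologies share every split and 1 means they share none; for example, `normalised_rf((("A", "B"), ("C", ("D", "E"))), (("A", "C"), ("B", ("D", "E"))))` evaluates to 0.5 because the trees agree on one of their two splits.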

Cited by

The evolutionary history and modern diversity of triterpenoid cyclases.
McShea HS, Viens RA, Olagunju BO, Giner JL, Welander PV. bioRxiv [Preprint]. 2025 Aug 2:2024.10.28.620730. doi: 10.1101/2024.10.28.620730. Update in: Mol Biol Evol. 2025 Aug 19:msaf203. doi: 10.1093/molbev/msaf203. PMID: 40766600. Preprint.
Protein Structural Phylogenetics.
Puente-Lelievre C, Malik A, Douglas J. Genome Biol Evol. 2025 Jul 30;17(8):evaf139. doi: 10.1093/gbe/evaf139. PMID: 40839422. Review.
