Mol Biol Evol. 2025 Jul 1;42(7):msaf149.
doi: 10.1093/molbev/msaf149.

Newly Developed Structure-Based Methods Do Not Outperform Standard Sequence-Based Methods for Large-Scale Phylogenomics

Giacomo Mutti et al. Mol Biol Evol.

Abstract

Recent developments in protein structure prediction have allowed the use of this previously limited source of information at genome-wide scales. It has been proposed that the use of structural information may offer advantages over sequences in phylogenetic reconstruction, due to their slower rate of evolution and direct correlation to function. Here, we examined how recently developed methods for structure-based homology search and tree reconstruction compare with current state-of-the-art sequence-based methods in reconstructing genome-wide collections of gene phylogenies (i.e. phylomes). While structure-based methods can be useful in specific scenarios, we found that their current performance does not justify using the newly developed structure-based methods as a default choice in large-scale phylogenetic studies. On the one hand, the best performing sequence-based tree reconstruction methods still outperform structure-based methods for this task. On the other hand, structure-based homology detection methods provide larger lists of candidate homologs, as previously reported. However, this comes at the expense of missing hits identified by sequence-based methods, as well as providing sets of homolog candidates with higher fractions of false positives. These insights help to guide the use of structural data in comparative genomics and highlight the need to continue improving structure-based approaches. Our pipeline is fully reproducible and has been implemented in a Snakemake workflow. This will facilitate a continuous assessment of future improvements of structure-based tools in the AlphaFold era.

Keywords: homology; orthology; phylogenetics; phylome; structural phylogenetics.

Figures

Fig. 1.
Schematic representation of the pipeline. a) Primary amino acid sequences and 3Di-recoded structures from Homo sapiens proteins (seeds) are aligned against a dataset of 18 eukaryotic species with BlastP (Bp) and Foldseek (Fs), respectively. b) Before entering the phylogenetic pipeline, Bp and Fs results are divided into four target sets per seed as shown in the Venn diagram. The number of query-target pairs in each target set before and after filtering is shown below. c) The four target sets of each seed are submitted to eight tree reconstruction methods. (For computational reasons, we restricted step c to 1,000 randomly selected seeds.) *Among the randomly selected seeds, only those with at least four common hits entered step c.
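The division of hits into per-seed target sets in panel b can be sketched as below. This is only an illustration: the caption does not name the four sets, so the labels `blast`, `foldseek`, `common` and `union` are assumptions, as are the function names and the toy identifiers. The only value taken from the caption is the threshold of at least four common hits.

```python
from collections import defaultdict

MIN_COMMON_HITS = 4  # threshold stated in the Fig. 1 caption


def group_hits(pairs):
    """Collect (seed, target) pairs into a mapping seed -> set of targets."""
    hits = defaultdict(set)
    for seed, target in pairs:
        hits[seed].add(target)
    return hits


def target_sets(blast_pairs, foldseek_pairs):
    """Build four target sets per seed (set names are assumed, not from the paper)."""
    bp = group_hits(blast_pairs)
    fs = group_hits(foldseek_pairs)
    result = {}
    for seed in sorted(set(bp) | set(fs)):
        b = bp.get(seed, set())
        f = fs.get(seed, set())
        common = b & f
        # Seeds with fewer than four shared hits do not enter tree reconstruction.
        if len(common) < MIN_COMMON_HITS:
            continue
        result[seed] = {"blast": b, "foldseek": f, "common": common, "union": b | f}
    return result


# Toy data with hypothetical identifiers.
blast = [("seedA", t) for t in ("t1", "t2", "t3", "t4", "t5")] + [("seedB", "t1")]
foldseek = [("seedA", t) for t in ("t2", "t3", "t4", "t5", "t6", "t7")] + [("seedB", "t9")]
sets = target_sets(blast, foldseek)
print(sorted(sets["seedA"]["common"]))
```

In this toy input, `seedB` is dropped because its BlastP and Foldseek hits do not overlap.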
Fig. 2.
Distributions and correlation of a) percentage identity, b) −Log10(E-value) and c) query coverage for unfiltered BlastP and Foldseek results. The marginal distributions are color-coded according to the target set. Distribution of d) Local Difference Distance Test (LDDT), e) Template Modeling (TM) score and f) mean target predicted LDDT between different target sets. g) Cumulative distribution of mean Jaccard index per query for all levels of CATH annotation, including protein class (C), architecture (A), topology (T), and homologous superfamily (H) (indicated with different transparency levels, see legend). See supplementary methods, Supplementary Material online for details on this analysis.
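As a rough illustration of the mean Jaccard index per query in panel g, the sketch below compares the CATH annotations of a query with those of each of its targets, truncated to a chosen level of the hierarchy. The exact procedure is described in the supplementary methods and is not reproduced here. Comparing query and target annotation sets pairwise, returning 0 for two empty sets, and the function names are all assumptions. The CATH identifiers are used only as example input.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; 0.0 for two empty sets (a convention assumed here)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def truncate(cath_id, level):
    """Keep the first `level` fields of a C.A.T.H identifier (1 = class, 4 = superfamily)."""
    return ".".join(cath_id.split(".")[:level])


def mean_jaccard(query_ids, targets_ids, level):
    """Average the Jaccard index between one query and each of its targets."""
    query = {truncate(c, level) for c in query_ids}
    scores = [jaccard(query, {truncate(c, level) for c in ids}) for ids in targets_ids]
    return sum(scores) / len(scores) if scores else 0.0


# One query and two targets that share a topology but not a superfamily.
query = {"3.40.50.300"}
targets = [{"3.40.50.300"}, {"3.40.50.720"}]
print(mean_jaccard(query, targets, level=4))
print(mean_jaccard(query, targets, level=3))
```

Repeating the calculation at levels 1 to 4 for every query gives the per-level values from which a cumulative distribution can be drawn.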
Fig. 3.
Boxplot distribution grouped by target set and tree reconstruction method of a) normalised Robinson–Foulds (RF) distance of decomposed single copy gene trees to the species tree, b) First Quartet Frequency support values (this measure indicates how many times the nodes in the species trees are observed in the gene trees), and c) number of gene duplications and losses inferred by gene tree-species tree reconciliation normalised by number of tips. Significant differences to LG are annotated as asterisks over each respective boxplot. See the “Performance assessment of tree reconstruction methods” section of supplementary Methods, Supplementary Material online for details on how P-values were computed.
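The normalised Robinson–Foulds distance in panel a can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes trees are given as nested tuples of leaf names, that both trees carry exactly the same leaves (a gene tree and a species tree pruned to the same species), and that the distance is normalised by the total number of non-trivial splits in the two trees, which equals 2(n − 3) for fully resolved unrooted trees with n leaves.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree`, appending every internal clade's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions of the unrooted tree, each stored as the side lacking a reference leaf."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)
    result = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (a single leaf against the rest, or the empty root split).
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def normalised_rf(tree_a, tree_b):
    """Symmetric difference of the split sets divided by the total number of splits."""
    if leaf_sets(tree_a, []) != leaf_sets(tree_b, []):
        raise ValueError("trees must have identical leaf sets")
    sa, sb = splits(tree_a), splits(tree_b)
    max_rf = len(sa) + len(sb)
    return len(sa ^ sb) / max_rf if max_rf else 0.0


# Toy species tree and a gene tree that swaps the positions of B and C.
species = ((("A", "B"), "C"), ("D", "E"))
gene = ((("A", "C"), "B"), ("D", "E"))
print(normalised_rf(species, species))
print(normalised_rf(species, gene))
```

A value of 0 means the two topologies agree on every split, and 1 means they share none.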
