Protein Sci. 2025 Jun;34(6):e70169.
doi: 10.1002/pro.70169.

3-D substructure search by transitive closure in AlphaFold database

Hao Liu et al. Protein Sci. 2025 Jun.

Abstract

Identifying structural relationships between proteins is crucial for understanding their functions and evolutionary histories. We present ISS_ProtSci, a Python package designed for structural similarity searches within the AlphaFold Database v2 (AFDB2). ISS_ProtSci incorporates DaliLite to identify geometrically similar structures and uses a transitive closure algorithm to iteratively explore neighboring shells of proteins. The precomputed all-against-all comparisons generated by Foldseek, chosen for its speed, are validated by DaliLite for precision. Search results are annotated with metadata from UniProtKB and Pfam protein family classifications, using hmmsearch to identify protein domains. Outputs, including Dali pairwise alignment data, are provided in TSV format for easy filtering and analysis. Our method offers a significant improvement in recall over existing tools like Foldseek, especially in detecting more distantly related proteins. This is particularly valuable in structurally diverse protein families where traditional sequence-based or fast structural methods struggle. ISS_ProtSci delivers practical runtimes and flexibility, allowing users to input a PDB file, define the minimum size of the common core, and evaluate results using Pfam clans. In evaluating our method across 12 test cases based on Pfam clans, we achieved over 99% recall of relevant proteins, even in challenging cases where Foldseek's recall dropped below 50%. ISS_ProtSci not only identifies closely related proteins but also uncovers previously unrecognized structural relationships, contributing to more accurate protein family classifications. The software can be downloaded from http://ekhidna2.biocenter.helsinki.fi/ISS_ProtSci/.
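The shell-by-shell search described above can be illustrated with a minimal sketch. This is not the ISS_ProtSci code: the function name `transitive_closure_search`, the `neighbors` and `validate` callables, the toy neighbor graph, the toy Z-scores, and the cutoff of 8.0 are all hypothetical. The sketch also assumes that each candidate is validated against the query and that only validated structures are expanded into the next shell.

```python
from __future__ import annotations

from typing import Callable, Iterable


def transitive_closure_search(
    query: str,
    neighbors: Callable[[str], Iterable[str]],
    validate: Callable[[str, str], bool],
) -> list[set[str]]:
    """Return the validated structures, grouped by shell.

    neighbors(x): fast precomputed neighbor lookup (stand-in for Foldseek hits).
    validate(query, x): slower, precise check (stand-in for a DaliLite comparison).
    """
    tested = {query}    # everything already validated or rejected
    frontier = {query}  # structures whose neighbors form the next shell
    shells: list[set[str]] = []
    while frontier:
        # Collect untested neighbors of the current shell.
        candidates: set[str] = set()
        for member in frontier:
            candidates.update(neighbors(member))
        candidates -= tested
        tested |= candidates
        # Only candidates that pass validation are expanded further.
        frontier = {c for c in candidates if validate(query, c)}
        if frontier:
            shells.append(frontier)
    return shells


# Hypothetical toy data: neighbor lists and Z-scores against the query "Q".
TOY_NEIGHBORS = {"Q": ["A", "B"], "A": ["C"], "B": ["D"], "C": ["E"]}
TOY_Z = {"A": 20.0, "B": 3.0, "C": 9.0, "D": 15.0, "E": 1.0}

shells = transitive_closure_search(
    "Q",
    lambda s: TOY_NEIGHBORS.get(s, []),
    lambda q, c: TOY_Z[c] >= 8.0,  # hypothetical Z-score cutoff
)
print(shells)
```

In this toy graph, C is reached through A even though it is not a direct neighbor of the query, while D is never examined because its only link runs through B, which failed validation. The loop stops once a shell yields no validated candidates.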

Keywords: Dali; Foldseek; Pfam; protein space; superfamily.

Figures

FIGURE 1
Schematic representation of the transitive closure search method. The goal is to efficiently identify all protein structures with significant geometrical similarity to the Query structure, as validated by DaliLite. Our method iteratively explores neighboring shells in protein space, capturing distant structural relationships that might be missed by single‐step searches. Dashed blue arrows represent neighbor relationships predicted by Foldseek. Solid arrows are color‐coded based on the validation results: green indicates pairs that pass the filtering criteria (Dali Z‐score and alignment length), while red indicates pairs that fail. The search terminates when no new candidates in a shell meet the validation criteria.
FIGURE 2
Cartoon representations of the query structures for the 12 test cases of Table 1.
FIGURE 3
Z‐score distributions for the P_2 set of cases 1–12 for transitive closure (red), Foldseek first shell (green), and Pfam clan (blue).
FIGURE 4
(a) Precision and recall values of Supplementary Table 3 for the end point of transitive closure search (blue squares), and for several e‐value cutoff points for direct Foldseek (fs) search (open circles). (b) F1‐scores of the P_2 sets by transitive closure and the maximum F1‐score of Foldseek at any e‐value cutoff.
FIGURE 5
Precision‐recall curves for transitive closure results ranked by Z‐score (blue curve), Foldseek direct hits up to e‐value 1 ranked by Z‐score (orange curve), and Foldseek direct hits up to e‐value 10 ranked by e‐value (green curve). Pairs assigned to the same SCOPe fold were considered correct. The inset shows the Z‐score cutoff as a function of recall for transitive closure (blue curve), Foldseek ranked by Z‐score (orange curve), and the SCOPe reference (red curve).
FIGURE 6
Visualization of large result sets. (a) The search results for benchmark case #4 revealed an outlier family (PF06788, magenta) and a newly identified addition (PF06219, red triangles) to Pfam clan CL0395 (filled circles). (b) Stacked alignment for test set case #4, colored by secondary structure (red: beta strands; blue: alpha helix; green: loop; gray: unaligned). The data is seriated by maximizing the similarity of secondary structure assignments between adjacent rows. Rendered with msaviewer (Yachdav et al., 2016). (c) Pfam family representatives u383A (apparent false negative), amseA (query structure), and g8jA (apparent false positive) were visualized using PyMOL (Schrödinger & DeLano, 2020). Structures are colored from blue at the N‐terminus to red at the C‐terminus in a rainbow gradient.
FIGURE 7
System overview: most computations are done locally on the user's computer, while a huge knowledge base is stored on a remote data server.
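
The precision, recall, and F1-scores reported in the figures follow the standard set-based definitions. A minimal sketch, assuming the search result and the reference set (for example, the members of a Pfam clan) are available as sets of identifiers. The function name and the identifiers `p1` to `p6` are hypothetical and do not come from the paper.

```python
from __future__ import annotations


def precision_recall_f1(
    retrieved: set[str], relevant: set[str]
) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for a retrieved set against a reference set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical identifiers: four hits returned, five proteins in the reference set.
retrieved = {"p1", "p2", "p3", "p4"}
relevant = {"p1", "p2", "p3", "p5", "p6"}
precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here three of the four hits are in the reference set (precision 0.75) and three of the five reference proteins are found (recall 0.60), giving an F1-score of about 0.67.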
