[Preprint]. 2023 Jul 11:2023.07.10.548308.
doi: 10.1101/2023.07.10.548308.

Petascale Homology Search for Structure Prediction


Sewon Lee et al. bioRxiv. 2023.

Update in

  • Petabase-Scale Homology Search for Structure Prediction.
    Lee S, Kim G, Karin EL, Mirdita M, Park S, Chikhi R, Babaian A, Kryshtafovych A, Steinegger M. Cold Spring Harb Perspect Biol. 2024 May 2;16(5):a041465. doi: 10.1101/cshperspect.a041465. PMID: 38316555. Review.

Abstract

The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), which yielded gigabytes of aligned homologs for CASP15 targets. These were merged with the default MSAs produced by ColabFold-search and provided to ColabFold-predict. With SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, compared with only 52% when using the default ColabFold-search MSAs alone. Next, we tested the effect of deep homology search and of ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs contributed most to improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.


Figures

Figure 1. MSA enrichment using SRA and other strategies to improve protein structure prediction.
Workflow of the different strategies examined in this study ①~⑦. All strategies construct an MSA (but differ in the homology DBs they utilize) and provide it to CF-predict (but differ in the way they tune its parameters). The size of each homology DB is denoted close to it. The baseline MSA (cfdb MSA, ①) is constructed by CF-search. The SRA-detected homologs are aligned to create ② using MMseqs2. The sra_cfdb MSA (③) is constructed by combining ① and ②. The hh_sra_cfdb MSA (④) is constructed by querying ③ against UniRef30 and BFD using HHblits. Strategies ⑤, ⑥, ⑦ refer to the following CF-predict options: use of templates, multimer (homo-oligomer) modeling, and 12 recycles (instead of the default 3). Before being provided to CF-predict, each MSA is filtered based on the sequence identity between its members and the query.
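The identity-based filtering step mentioned at the end of this caption can be illustrated with a minimal sketch. It assumes the MSA is a list of equal-length aligned strings with the query first and `-` as the gap character. The function names and the identity bounds are hypothetical and are not taken from the study, which does not state its thresholds in this section.

```python
from typing import List


def identity_to_query(query: str, member: str) -> float:
    """Fraction of non-gap query columns where the member has the same residue."""
    if len(query) != len(member):
        raise ValueError("aligned sequences must have equal length")
    # Only columns where the query has a residue are counted.
    columns = [(q, m) for q, m in zip(query, member) if q != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for q, m in columns if q == m)
    return matches / len(columns)


def filter_msa(msa: List[str], min_identity: float = 0.0,
               max_identity: float = 1.0) -> List[str]:
    """Keep the query (first row) plus members whose identity lies in the bounds.

    The bounds are illustrative assumptions, not values from the study.
    """
    query = msa[0]
    kept = [query]
    for member in msa[1:]:
        ident = identity_to_query(query, member)
        if min_identity <= ident <= max_identity:
            kept.append(member)
    return kept


example_msa = ["ACDE", "ACDE", "AC--", "WWWW"]
filtered = filter_msa(example_msa, min_identity=0.5)
```

In this toy alignment the third row matches the query at two of four columns and is kept at a 0.5 lower bound, while the fourth row shares no residues with the query and is dropped.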
Figure 2. Effect of ColabFold parameters on structure prediction accuracy.
(A) Comparison of homology search of 109 domains of 77 CASP15 targets. Each mark denotes the number of hits found for each target domain using either CF-search against CFDB (triangle) or MMseqs2 against SRA-mined and assembled proteins (circle) before the MSA filtering step. (B) Neff scores of the different MSAs computed for each domain. The MSA with the most homologs and the highest Neff is indicated with a filled mark in panels A and B, respectively. (C) Structure prediction of 62 target domains in the categories: FM, FM/TBM, and TBM-hard was evaluated based on GDT_TS scores of three prediction strategies: cfdb MSA, sra_cfdb MSA and sra_cfdb_recyc. The best-scoring strategy for each target domain is indicated with filled marks. (D) Prediction performance comparison between server groups in CASP15. The x-axis refers to the Sum Z (> 0.0) in Table 3. The score of this study is from the Model1 in Table 3. Here, ColabFold refers to the performance of the server group submitted in CASP15.
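Panel B reports Neff (number of effective sequences) per MSA. This section does not state how Neff was computed, so the sketch below uses one common convention as an assumption: each sequence is weighted by the inverse of the number of MSA members at or above a pairwise identity threshold (0.8 here), and Neff is the sum of those weights. The function names, the threshold, and the gap handling are all hypothetical choices, and other tools apply additional normalization.

```python
from typing import List


def pairwise_identity(a: str, b: str) -> float:
    """Identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


def neff(msa: List[str], threshold: float = 0.8) -> float:
    """Sum of inverse cluster sizes at the given identity threshold (assumed definition)."""
    total = 0.0
    for i, a in enumerate(msa):
        # Each sequence counts itself, so the neighbor count is at least 1.
        neighbors = 1 + sum(
            1 for j, b in enumerate(msa)
            if j != i and pairwise_identity(a, b) >= threshold
        )
        total += 1.0 / neighbors
    return total


toy_neff = neff(["ACDE", "ACDE", "WWWW"])
```

Under this convention two identical rows count as a single effective sequence, so the toy alignment of three rows has a Neff of 2, which is why a deeper MSA does not necessarily have a higher Neff.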
