This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Jul 11:2023.07.10.548308.

doi: 10.1101/2023.07.10.548308.

Petascale Homology Search for Structure Prediction

Sewon Lee¹, Gyuri Kim¹, Eli Levy Karin², Milot Mirdita¹, Sukhwan Park³, Rayan Chikhi⁴, Artem Babaian⁵,⁶, Andriy Kryshtafovych⁷, Martin Steinegger¹,³,⁸,⁹

Affiliations

¹ School of Biological Sciences, Seoul National University, Seoul 08826, South Korea.
² ELKMO, Copenhagen 2720, Denmark.
³ Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea.
⁴ Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, 75015 Paris, France.
⁵ Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 1A8, Canada.
⁶ Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada.
⁷ Genome Center, University of California, Davis, California 95616, USA.
⁸ Artificial Intelligence Institute, Seoul National University, Seoul 08826, South Korea.
⁹ Institute of Molecular Biology and Genetics, Seoul National University, Seoul 08826, South Korea.

PMID: 37503235
PMCID: PMC10369885
DOI: 10.1101/2023.07.10.548308

Sewon Lee et al. bioRxiv. 2023.

Update in

Petabase-Scale Homology Search for Structure Prediction.
Lee S, Kim G, Karin EL, Mirdita M, Park S, Chikhi R, Babaian A, Kryshtafovych A, Steinegger M. Cold Spring Harb Perspect Biol. 2024 May 2;16(5):a041465. doi: 10.1101/cshperspect.a041465. PMID: 38316555. Review.

Abstract

The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with the default MSAs produced by ColabFold-search and provided to ColabFold-predict. With the SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas the default ColabFold-search MSAs alone reached this accuracy for only 52%. Next, we tested the effect of deep homology search and of ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs contributed most to improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.
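To make the GDT_TS > 70 criterion concrete, the following is a minimal sketch of how such a score and the "fraction of highly accurate targets" could be computed. It is a simplification and not the CASP evaluation procedure: it assumes the model and reference Cα coordinates are already superimposed and matched residue by residue, whereas the official GDT_TS searches over many superpositions. The function names `gdt_ts` and `fraction_highly_accurate` are hypothetical.

```python
import math


def gdt_ts(model_ca, ref_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for pre-superimposed C-alpha coordinates.

    Assumption: both lists hold (x, y, z) tuples in Angstrom for the same
    residues in the same order; no superposition search is performed.
    """
    if len(model_ca) != len(ref_ca) or not ref_ca:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    dists = [math.dist(m, r) for m, r in zip(model_ca, ref_ca)]
    # Fraction of residues within each distance cutoff, averaged over cutoffs.
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


def fraction_highly_accurate(scores, threshold=70.0):
    """Share of targets whose score strictly exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)
```

Applied to one score per target domain, `fraction_highly_accurate` yields the kind of percentage reported above for each MSA strategy.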

Figures

**Figure 1. MSA enrichment using SRA and other strategies to improve protein structure prediction.**
Workflow of the different strategies examined in this study (① to ⑦). All strategies construct an MSA (but differ in the homology DBs they utilize) and provide it to CF-predict (but differ in the way they tune its parameters). The size of each homology DB is shown next to it. The baseline MSA (*cfdb* MSA, ①) is constructed by CF-search. The SRA-detected homologs are aligned using MMseqs2 to create ②. The *sra_cfdb* MSA (③) is constructed by combining ① and ②. The *hh_sra_cfdb* MSA (④) is constructed by querying ③ against UniRef30 and BFD using HHblits. Strategies ⑤, ⑥, and ⑦ refer to the following CF-predict options: use of templates, multimer (homo-oligomer) modeling, and 12 recycles (instead of the default 3). Before being provided to CF-predict, each MSA is filtered based on the sequence identity between its members and the query.
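The combine-then-filter step described in the caption (① + ② → ③, followed by identity-based filtering) could look roughly like the sketch below. This is an illustration under stated assumptions, not the pipeline used in the study: MSAs are represented as lists of `(name, aligned_row)` pairs that share the same gap-free query as their first row and have no insertion columns, and the `min_identity` threshold and its direction are hypothetical because the caption does not give them.

```python
def identity_to_query(query, member):
    """Fraction of query columns where the member carries the same residue."""
    if len(query) != len(member):
        raise ValueError("rows must have the same aligned length")
    # Gap characters in the member never count as a match.
    matches = sum(q == m and q != "-" for q, m in zip(query, member))
    return matches / len(query)


def merge_msas(*msas):
    """Concatenate MSAs that share a query, dropping duplicate aligned rows."""
    merged, seen = [], set()
    for msa in msas:
        for name, row in msa:
            if row not in seen:
                seen.add(row)
                merged.append((name, row))
    return merged


def filter_by_identity(msa, min_identity=0.2):
    """Keep the query plus members at or above a (hypothetical) identity cutoff."""
    query = msa[0][1]
    kept = [(n, r) for n, r in msa[1:] if identity_to_query(query, r) >= min_identity]
    return [msa[0]] + kept
```

For example, `filter_by_identity(merge_msas(cfdb_msa, sra_msa))` would mimic building a combined MSA and filtering it before prediction, with `cfdb_msa` and `sra_msa` standing in for the two inputs.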

**Figure 2. Effect of ColabFold parameters on structure prediction accuracy.**
**(A)** Comparison of homology search results for 109 domains of 77 CASP15 targets. Each mark denotes the number of hits found for a target domain using either CF-search against CFDB (triangle) or MMseqs2 against SRA-mined and assembled proteins (circle), before the MSA filtering step. **(B)** N_eff scores of the different MSAs computed for each domain. The MSA with the most homologs and the MSA with the highest N_eff are indicated with filled marks in panels A and B, respectively. **(C)** Structure prediction of 62 target domains in the categories FM, FM/TBM, and TBM-hard, evaluated based on the GDT_TS scores of three prediction strategies: *cfdb* MSA, *sra_cfdb* MSA, and *sra_cfdb_recyc*. The best-scoring strategy for each target domain is indicated with filled marks. **(D)** Prediction performance comparison between server groups in CASP15. The x-axis refers to the Sum Z (> 0.0) in Table 3. The score of this study is taken from Model1 in Table 3. Here, ColabFold refers to the performance of the server group submitted in CASP15.
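Panel B reports N_eff, a measure of the effective number of sequences in an MSA. The exact definition used in the study is not given in this section; the sketch below assumes one common formulation, in which each sequence is down-weighted by the number of sequences (itself included) that share at least 80% identity with it. The 0.8 threshold, the identity definition, and the function names are assumptions made for illustration.

```python
def pairwise_identity(a, b):
    """Identity over columns where at least one of the two rows has a residue."""
    cols = [(x, y) for x, y in zip(a, b) if x != "-" or y != "-"]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def n_eff(rows, threshold=0.8):
    """Sum of per-sequence weights 1 / (size of the sequence's identity cluster).

    Assumption: `rows` are equal-length aligned sequences. The loop is
    quadratic in the number of rows, so it only suits small MSAs.
    """
    total = 0.0
    for i, a in enumerate(rows):
        # Each sequence always counts itself as one neighbour.
        neighbours = 1 + sum(
            1
            for j, b in enumerate(rows)
            if j != i and pairwise_identity(a, b) >= threshold
        )
        total += 1.0 / neighbours
    return total
```

Under this definition, many near-identical hits add little to N_eff, which is why panel B can rank MSAs differently from the raw hit counts in panel A.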

Grants and funding

R01 GM100482/GM/NIGMS NIH HHS/United States
