Nat Methods. 2024 Aug;21(8):1514-1524.

doi: 10.1038/s41592-024-02272-z. Epub 2024 May 14.

OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

Gustaf Ahdritz^#¹ ², Nazim Bouatta^#³, Christina Floristean¹, Sachin Kadyan¹, Qinghui Xia¹, William Gerecke⁴, Timothy J O'Donnell⁵, Daniel Berenberg⁶, Ian Fisk⁷, Niccolò Zanichelli⁸, Bo Zhang⁹, Arkadiusz Nowaczynski¹⁰, Bei Wang¹⁰, Marta M Stepniewska-Dziubinska¹⁰, Shang Zhang¹⁰, Adegoke Ojewole¹⁰, Murat Efe Guney¹⁰, Stella Biderman¹¹ ¹², Andrew M Watkins¹³, Stephen Ra¹³, Pablo Ribalta Lorenzo¹⁰, Lucas Nivon¹⁴, Brian Weitzner¹⁵, Yih-En Andrew Ban¹⁶, Shiyang Chen¹⁷, Minjia Zhang¹⁸, Conglong Li¹⁹, Shuaiwen Leon Song¹⁹, Yuxiong He¹⁹, Peter K Sorger⁴, Emad Mostaque²⁰, Zhao Zhang¹⁷, Richard Bonneau¹³, Mohammed AlQuraishi²¹

Affiliations

¹ Department of Systems Biology, Columbia University, New York, NY, USA.
² Harvard University, Cambridge, MA, USA.
³ Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA. nbouatta@gmail.com.
⁴ Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA.
⁵ Icahn School of Medicine at Mount Sinai, New York, NY, USA.
⁶ Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
⁷ Flatiron Institute, New York, NY, USA.
⁸ OpenBioML, Cambridge, MA, USA.
⁹ Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA.
¹⁰ NVIDIA, Santa Clara, CA, USA.
¹¹ EleutherAI, New York, NY, USA.
¹² Booz Allen Hamilton, McLean, VA, USA.
¹³ Prescient Design, Genentech, New York, NY, USA.
¹⁴ Cyrus Bio, Seattle, WA, USA.
¹⁵ Outpace Bio, Seattle, WA, USA.
¹⁶ Arzeda, Seattle, WA, USA.
¹⁷ Rutgers University, New Brunswick, NJ, USA.
¹⁸ University of Illinois at Urbana-Champaign, Champaign, IL, USA.
¹⁹ Microsoft, Redmond, WA, USA.
²⁰ Stability AI, Los Altos, CA, USA.
²¹ Department of Systems Biology, Columbia University, New York, NY, USA. m.alquraishi@columbia.edu.

^# Contributed equally.

PMID: 38744917
PMCID: PMC11645889
DOI: 10.1038/s41592-024-02272-z

Gustaf Ahdritz et al. Nat Methods. 2024 Aug.

Abstract

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set are deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.

Figures

**Extended Data Fig. 1 ∣. OpenFold matches the accuracy of AlphaFold2 on CASP15 targets.**
Scatter plot of GDT-TS values of AlphaFold and OpenFold ‘Model 1’ predictions against all currently available ‘all groups’ CASP15 domains (n = 90). OpenFold’s mean accuracy (95% confidence interval = 68.6-78.8) is on par with AlphaFold’s (95% confidence interval = 69.7-79.2) and OpenFold does at least as well as the latter on exactly 50% of targets. Confidence intervals of each mean are estimated from 10,000 bootstrap samples.
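
The bootstrap estimate mentioned in this caption can be illustrated with a minimal sketch. It assumes a simple percentile bootstrap over per-target scores; the function name `bootstrap_mean_ci` and the example scores are hypothetical and are not taken from the paper, whose exact procedure is not given here.

```python
import random

def bootstrap_mean_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        # Resample n values with replacement and record the resample mean.
        resample = rng.choices(values, k=n)
        means.append(sum(resample) / n)
    means.sort()
    # Read off the lower and upper percentiles of the resampled means.
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical GDT-TS scores for illustration only (not data from the paper).
scores = [82.1, 45.3, 91.0, 67.5, 73.2, 88.4, 59.9, 76.0, 94.2, 38.7]
low, high = bootstrap_mean_ci(scores)
print(f"95% CI for the mean: {low:.1f}-{high:.1f}")
```

Fixing the seed makes the interval reproducible; with real data one would pass the per-domain GDT-TS values of each method separately.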

**Extended Data Fig. 2 ∣. OpenFold learns decoy ranking slowly.**
Decoy ranking results (mean Spearman correlation between pLDDT and decoy TM-score) using intermediate checkpoints of OpenFold on 28 randomly chosen proteins from the Rosetta decoy ranking dataset. See Supplementary Information section B.1 for more details.
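
As a rough illustration of the metric in this caption, the sketch below computes a Spearman rank correlation between per-decoy confidence and per-decoy TM-score for one protein; the caption's quantity would be the mean of this value over proteins. The helper names and the numbers are hypothetical, and the code assumes neither input list is constant.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical decoys of one protein: mean pLDDT and TM-score per decoy.
plddt = [55.0, 62.0, 71.0, 80.0, 90.0]
tm_score = [0.31, 0.45, 0.40, 0.72, 0.88]
rho = spearman(plddt, tm_score)
```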

**Extended Data Fig. 3 ∣. Fine-tuning does not materially improve prediction accuracy on long proteins.**
Mean lDDT-Cα over validation proteins with at least 500 residues as a function of fine-tuning step.
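
For readers unfamiliar with lDDT-Cα, the following is a simplified sketch of the score. It assumes the commonly used 15 Å inclusion radius and 0.5/1/2/4 Å tolerances and pools all residue pairs globally rather than averaging per residue; the settings actually used in the paper are not stated in this section, and the function name `lddt_ca` is hypothetical.

```python
import math

def lddt_ca(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT over C-alpha coordinates, in the range 0 to 1.

    For every residue pair closer than `cutoff` in the reference, count how
    many tolerance thresholds the predicted distance satisfies.
    """
    n = len(ref)
    preserved = 0
    considered = 0
    for i in range(n):
        for j in range(i + 1, n):
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= cutoff:
                continue  # pair is outside the local neighborhood
            diff = abs(math.dist(pred[i], pred[j]) - d_ref)
            considered += len(thresholds)
            preserved += sum(diff < t for t in thresholds)
    return preserved / considered if considered else 0.0

# Toy example: a straight 10-residue chain and a copy stretched twofold.
ref = [(3.8 * i, 0.0, 0.0) for i in range(10)]
stretched = [(7.6 * i, 0.0, 0.0) for i in range(10)]
print(lddt_ca(ref, ref), lddt_ca(stretched, ref))
```

Because the score depends only on interatomic distances, it needs no superposition of the predicted and reference structures.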

**Extended Data Fig. 4 ∣. The ‘Mostly alpha’ CATH class contains some beta sheets, and vice versa.**
Counts for alpha helices and beta sheets in the mostly alpha and mostly beta CATH class-stratified training sets from Fig. 2, based on 1,000 random samples. Counts are binned by size, defined as the number of residues for alpha helices and number of strands for beta sheets.

**Extended Data Fig. 5 ∣. Reduced dataset diversity disproportionately affects global structure.**
Mean GDT-TS and lDDT-Cα of non-overlapping protein fragments from the CAMEO validation set as a function of the percentage of CATH clusters in elided training sets. Data for both topology and architecture elisions are included. The fragmenting procedure is the same as that described in Fig. 5a.

**Extended Data Fig. 6 ∣. Early predictions crudely approximate lower-dimensional PCA projections.**
(A) Mean dRMSD, as a function of training step, between low-dimensional PCA projections of predicted structures and the final 3D prediction at step 5,000 (denoted by *). Averages are computed over the CAMEO validation set. Insets show idealized behavior corresponding to unstaggered, simultaneous growth in all dimensions and perfectly staggered growth. Empirical training behavior more closely resembles the staggered scenario. (B) Low-dimensional projections as in (A) compared to projections of the final predicted structures at step 5,000. (C) Mean displacement, as a function of training step, of Cα atoms along the directions of their final structure’s PCA eigenvectors. Results are shown for two individual proteins (PDB accession codes 7DQ9_A ref. and 7RDT_A ref. 67). Shaded regions correspond loosely to ‘1D,’ ‘2D,’ and ‘3D’ phases of dimensionality.

**Extended Data Fig. 7 ∣. Radius of gyration as an order parameter for learning protein phase structure.**
Radii of gyration for proteins in the CAMEO validation set (orange) as a function of sequence length over training time, plotted on a log-log scale against experimental structures (blue). Legends show equations of best fit curves, computed using non-linear least squares. The training steps chosen correspond loosely to four phases of dimensional growth. See Supplementary Information section B.3 for extended discussion.
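
A minimal sketch of the quantities behind this figure follows. It assumes an unweighted radius of gyration over Cα coordinates and a power law of the form Rg = a·N^ν. Note that it substitutes ordinary least squares in log-log space for the non-linear least squares named in the caption, purely to keep the example dependency-free; both function names are hypothetical.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration: RMS distance of points to their centroid."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(c, centroid) ** 2 for c in coords) / n)

def fit_power_law(lengths, radii):
    """Fit Rg = a * N**nu by linear regression of log(Rg) on log(N)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(r) for r in radii]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    nu = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - nu * mx)
    return a, nu

# Synthetic data following an exact power law, for illustration only.
lengths = [50, 100, 200, 400]
radii = [2.0 * n ** 0.4 for n in lengths]
a, nu = fit_power_law(lengths, radii)
print(f"Rg ~ {a:.2f} * N^{nu:.2f}")
```

On noisy data the log-log fit and a direct non-linear fit can give somewhat different coefficients, since they weight residuals differently.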

**Extended Data Fig. 8 ∣. Contact prediction for beta sheets at different ranges.**
Binned contact F1 scores (8 Å threshold) for beta sheets of various widths as a function of training step at different residue-residue separation ranges (SMLR ≥ 6 residues apart; LR ≥ 24 residues apart). Sheet widths are weighted averages of sheet thread counts within each bin, as in Fig. 5b.
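
The contact F1 score in this caption can be sketched as below. The example assumes contacts are defined on Cα-Cα distances (the atom choice is not stated in this section) and scores all residue pairs rather than only those within beta sheets; `contact_set` and `contact_f1` are hypothetical names.

```python
import math

def contact_set(coords, threshold=8.0, min_sep=6):
    """Residue pairs (i, j), j - i >= min_sep, closer than `threshold` angstroms."""
    n = len(coords)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + min_sep, n)
        if math.dist(coords[i], coords[j]) < threshold
    }

def contact_f1(pred, ref, threshold=8.0, min_sep=6):
    """F1 of predicted versus reference contacts; 0.0 if neither has contacts."""
    p = contact_set(pred, threshold, min_sep)
    r = contact_set(ref, threshold, min_sep)
    true_pos = len(p & r)
    denom = len(p) + len(r)
    # F1 = 2TP / (2TP + FP + FN), which equals 2TP / (|P| + |R|).
    return 2 * true_pos / denom if denom else 0.0

# min_sep=6 corresponds to the SMLR range in the caption, min_sep=24 to LR.
chain = [(3.8 * i, 0.0, 0.0) for i in range(30)]
print(contact_f1(chain, chain, min_sep=6))  # extended chain: no contacts
```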

**Fig. 1 ∣. OpenFold matches the accuracy of AlphaFold2.**
a, Scatterplot of lDDT-Cα values of AlphaFold and OpenFold predictions on the CAMEO validation set. b, Average pLDDT versus lDDT-Cα values of OpenFold predictions on the CAMEO set during the early stage of training. OpenFold is initially overconfident but quickly becomes underconfident, gradually converging to accurate confidence estimation. c, Predictions by OpenFold and AlphaFold2 overlaid with an experimental structure of *Streptomyces tokunonensis* TokK protein (PDB accession code 7KDX_A). d, Average lDDT-Cα for OpenFold computed over the training set during the course of training. The template-free branch is shown in green, the template-using one is in orange, and the boundary between initial training and fine-tuning is in gray. Template-free accuracy is initially poor because the exponential moving average of the weights used for validation was being reinitialized.

**Fig. 2 ∣. OpenFold generalization capacity on elided training sets.**
a, Validation set lDDT-Cα values as a function of training step for models trained on elided training sets (10,000 random split repeated 3× demonstrates inter-run variance). b, Same as a but for CATH-stratified dataset elisions. Validation sets vary across stratifications and are not directly comparable. c, Experimental structures (orange) and mainly α-trained (yellow) and mainly β-trained (red) predictions of largely helical Lsi1 (top) and β-sheet-heavy TMED1 (bottom).

**Fig. 3 ∣. Model improvements.**
a, OpenFold trains more stably than AlphaFold2. lDDT-Cα and dRMSD-Cα (distance root mean squared deviation with respect to the alpha carbon) on the CAMEO validation set as a function of training step for five independent training runs with (orange) and without (blue) the new FAPE clamping protocol. Runs using the old protocol exhibit substantial instability with two rapidly converging runs, two late converging runs and one non-converging run. By contrast, all 15 independent runs using the new protocol converge rapidly. Runs using the new protocol also reach high accuracy faster. b, OpenFold is consistently three to four times faster than AlphaFold2 and can be run on longer sequences. Prediction runtimes in seconds on a single NVIDIA A100 GPU for OpenFold and AlphaFold2 with proteins of varying length.
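
The dRMSD-Cα metric named in this caption can be sketched as follows, assuming the usual definition: the root mean square difference between corresponding pairwise Cα distances in two structures, averaged over all residue pairs. The function name `drmsd` is hypothetical, and the paper's exact normalization is not given in this section.

```python
import math

def drmsd(pred, ref):
    """Distance RMSD: RMS difference of all pairwise distances (i < j)."""
    n = len(ref)
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            delta = math.dist(pred[i], pred[j]) - math.dist(ref[i], ref[j])
            total += delta * delta
            pairs += 1
    return math.sqrt(total / pairs) if pairs else 0.0

# Toy example: a 3-4-5 triangle and the same triangle scaled twofold.
ref = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
print(drmsd(pred, ref))
```

Unlike coordinate RMSD, this measure compares internal distances only, so it requires no superposition and is unchanged by rotating, translating or mirroring either structure.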

**Fig. 4 ∣. Secondary structure categories are learned in succession.**
a, F1 scores for secondary structure categories over time. The corner pane depicts the same data using a simplified three-state assignment (details are in the Supplementary Discussion (section B.5)). GDT-TS and final values are also provided. b, Corresponding counts of individual secondary structure assignments. c, Contiguous fractions of individual helices recovered early in training.

**Fig. 5 ∣. Learning proceeds at multiple scales.**
a, Mean GDT-TS and dRMSD-Cα validation scores as a function of training step for non-overlapping protein fragments of varying lengths (color bars indicate fragment length). b, Average contact F1 score (threshold of 8 Å) and dRMSD for predicted α-helices and β-sheets of varying lengths and number of strands, respectively, as a function of training step. Color bars indicate the weighted average of the lengths and widths of helices and sheets in each bin, respectively.

Grants and funding

R35 GM150546/GM/NIGMS NIH HHS/United States
