[Preprint]. 2025 May 30:2025.05.29.656875.
doi: 10.1101/2025.05.29.656875.

Assessment of Protein Complex Predictions in CASP16: Are we making progress?

Jing Zhang et al. bioRxiv.

Abstract

The assessment of oligomer targets in the Critical Assessment of Structure Prediction Round 16 (CASP16) suggests that complex structure prediction remains an unsolved challenge. More than 30% of targets, particularly antibody-antigen targets, were highly challenging, with each group correctly predicting structures for only about a quarter of such targets. Most CASP16 groups relied on AlphaFold-Multimer (AFM) or AlphaFold3 (AF3) as their core modeling engines. By optimizing input MSAs, refining modeling constructs (using partial rather than full sequences), and employing massive model sampling and selection, top-performing groups were able to significantly outperform the default AFM/AF3 predictions. CASP16 also introduced two additional challenges: Phase 0, which required predictions without stoichiometry information, and Phase 2, which provided participants with thousands of models generated by MassiveFold (MF) to enable large-scale sampling for resource-limited groups. Across all phases, the MULTICOM series and Kiharalab emerged as top performers based on the quality of their best models per target. However, these groups did not have a strong advantage in model ranking, and thus their lead over other teams, such as Yang-Multimer and kozakovvajda, was less pronounced when evaluating only the first submitted models. Compared to CASP15, CASP16 showed moderate overall improvement, likely driven by the release of AF3 and the extensive model sampling employed by top groups. Several notable trends highlight key frontiers for future development. First, the kozakovvajda group significantly outperformed others on antibody-antigen targets, achieving over a 60% success rate without relying on AFM or AF3 as their primary modeling framework, suggesting that alternative approaches may offer promising solutions for these difficult targets. Second, model ranking and selection continue to be major bottlenecks. The PEZYFoldings group demonstrated a notable advantage in selecting their best models as first models, suggesting that their pipeline for model ranking may offer important insights for the field. Finally, the Phase 0 experiment indicated reasonable success in stoichiometry prediction; however, stoichiometry prediction remains challenging for high-order assemblies and targets that differ from available homologous templates. Overall, CASP16 demonstrated steady progress in multimer prediction while emphasizing the urgent need for more effective model ranking strategies, improved stoichiometry prediction, and the development of new modeling methods that extend beyond the current AF-based paradigm.

Keywords: AlphaFold2; AlphaFold3; CASP16; antigen-antibody interaction; model sampling; oligomer prediction; stoichiometry.


Conflict of interest statement

Declaration of interests: The authors have no competing interests to declare.

Figures

Figure 1. Overview of targets and participating groups.
A) A heatmap of the performance of participating groups over Phase 1 oligomer targets. The value in each cell is the highest DockQ value of one group for one target (Grey indicates that such a score is not available). The annotations on the x-axis and y-axis represent the different features of groups and targets, respectively. Only groups that submitted more than 25% of the models are included. B) Pipelines we used to evaluate performance for different categories of targets. Left: a standard pipeline for most Phase 1 and Phase 2 targets, where targets share the same stoichiometry as the models. Middle: a special pipeline for Phase 0 and special targets in other phases, where models have different stoichiometry from the targets. Right: a pipeline for AA targets focusing on antibody–antigen interfaces.
Figure 2. Difficult targets and target properties that affect model quality.
A) Identification of difficult oligomer targets in CASP16 using the 95% quantile of DockQ and TM-score. Targets inside the green box are considered easy. B–I) Structures of identified difficult targets. Each chain is indicated by a different color. J–N) Target features affecting model quality: J) Neff of input MSAs for homo-dimeric interfaces (each dot represents an interface); K) Neff of input MSAs for hetero-dimeric interfaces (each dot represents an interface); L) interface size (the number of inter-chain contacts with distance below 5 Å) for homo-dimeric interfaces (each dot represents an interface); M) interface size (the number of inter-chain contacts with distance below 5 Å) for hetero-dimeric interfaces (each dot represents an interface); N) total number of residues in the entire complex (each dot represents a target).
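The interface-size metric in panels L and M can be computed directly from atomic coordinates. A minimal sketch, assuming chains are given as lists of 3D coordinates and counting all inter-chain point pairs within 5 Å (the assessors' exact atom selection per residue is not specified here, so the contact definition below is an illustrative assumption):

```python
from math import dist

def interface_size(chain_a, chain_b, cutoff=5.0):
    """Count inter-chain contacts: pairs of coordinates (one from each
    chain) closer than `cutoff` angstroms. Chains are lists of (x, y, z)
    tuples; which atoms represent a residue is an assumption of this sketch."""
    return sum(1 for a in chain_a for b in chain_b if dist(a, b) < cutoff)

# Toy coordinates: only the first pair of points lies within 5 A
chain_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
chain_b = [(3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
size = interface_size(chain_a, chain_b)  # 1 contact
```

For real complexes the all-pairs loop would typically be replaced by a spatial index (e.g. a KD-tree), but the definition of the metric is unchanged.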
Figure 3. Moderate progress in CASP16 over CASP15 after accounting for differences in target difficulty.
A) Average DockQ for top 55 groups over normal targets (excluding AA and fiber targets) in CASP15 (top) and CASP16 (bottom), respectively. B) Gaussian kernel density estimation of the DockQ distribution of the first model of ColabFold for CASP15 (top) and CASP16 (bottom) targets. C) The quality of the best models and the ColabFold models, measured by DockQ. D) One-to-one mapping between CASP15 and CASP16 targets by the Hungarian algorithm to sample targets from CASP15 to match the target difficulty level of CASP16 targets. E) Difference in DockQ between the best models and the ColabFold models for CASP16 targets and CASP15 targets sampled by the Hungarian algorithm, respectively. F) Performance comparisons between CASP16 targets and weighted bootstrap samples of CASP15 targets. These bootstrap samples of CASP15 targets show similar difficulty levels as CASP16 targets (by ColabFold DockQ, x-axis) and improvement over the ColabFold models by the best CASP15 or CASP16 models (y-axis).
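The one-to-one mapping in panel D pairs each CASP16 target with the CASP15 target of most similar difficulty while minimizing the total mismatch, which is an optimal-assignment problem. A minimal sketch with made-up difficulty scores (brute-force enumeration stands in for the Hungarian algorithm here; `scipy.optimize.linear_sum_assignment` is the standard choice at real target counts):

```python
from itertools import permutations

def match_targets(d15, d16):
    """One-to-one mapping of CASP15 targets to CASP16 targets minimizing
    the total |difficulty difference|. Brute force over partial
    permutations; equivalent to the Hungarian-algorithm optimum for
    small inputs. Returns best[j] = index of the CASP15 target
    matched to CASP16 target j."""
    best = min(
        permutations(range(len(d15)), len(d16)),
        key=lambda p: sum(abs(d15[i] - d16[j]) for j, i in enumerate(p)),
    )
    return list(best)

# Hypothetical ColabFold-DockQ difficulty scores (not real CASP data)
casp15 = [0.10, 0.80, 0.45, 0.30]
casp16 = [0.78, 0.12, 0.40]
mapping = match_targets(casp15, casp16)  # [1, 0, 2]
```

With these toy scores, CASP16 target 0 (0.78) pairs with CASP15 target 1 (0.80), and so on; the unmatched CASP15 target is simply dropped from the comparison.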
Figure 4. Performance evaluation on AA targets.
A) Average DockQ scores of first models for AA targets submitted in CASP16 (bottom) and CASP15 (top) by the top 66 groups. Dashed orange lines indicate the DockQ of ColabFold in both CASPs. We added AF3_1k, which uses the best model of each target selected by ipTM of the antibody–antigen interfaces among 5,000 models we generated using 1,000 random seeds with AF3. B) Left: average DockQ of first models from selected groups in CASP15. Right: number of targets with good models falling into different quality tiers based on DockQ for selected CASP15 groups. C) Left: average DockQ of first models from selected CASP16 groups. Right: number of targets with good models falling into different quality tiers based on DockQ for selected CASP16 groups. D) Ranking of groups based on their cumulative z-scores (z-score for each target is the average of 4 component scores, ICS, IPS, QSbest, and DockQ) for AA targets in all phases. E) Pairwise head-to-head bootstrap comparisons between groups for AA targets in all phases. Each cell shows the percentage of targets where the row group outperforms the column group based on cumulative z-scores. F) The sum of differences between each group’s first model DockQ and the mean DockQ of all other models per target. G) Per-target DockQ of models from representative CASP16 groups or generated by us using a standalone AF3 across CASP16 AA targets (labeled on top). AF3 local single: the average of the first models generated by AF3 using 1,000 random seeds; AF3 local 1000: best model selected by ipTM of antibody–antigen interfaces from 5,000 AF3 models generated using 1,000 random seeds. H–O) Structural superimpositions of targets and the best models selected by cumulative DockQ of antibody–antigen interfaces for AA targets. Targets without high-quality models are in orange (N: H1225) and red (O: H1244) boxes. Orange: antigen chains in targets; yellow: antigen chains in models; green: antibody chains in targets; cyan: antibody chains in models.
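The cumulative z-score used for ranking in panel D can be reproduced from per-group metric tables: for each target, each of the four component metrics (ICS, IPS, QSbest, DockQ) is z-scored across groups, the four z-scores are averaged, and a group's final score is the sum over targets. A minimal sketch with hypothetical scores (the data layout and any score flooring used by the assessors are assumptions of this sketch):

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z-score of each value against the distribution across groups."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

def cumulative_z(scores):
    """scores[group][target] = (ICS, IPS, QSbest, DockQ) tuple.
    Per target: z-score each metric across groups, average the four
    component z-scores; a group's final score is the sum over targets."""
    groups = list(scores)
    n_targets = len(next(iter(scores.values())))
    total = {g: 0.0 for g in groups}
    for t in range(n_targets):
        for m in range(4):  # ICS, IPS, QSbest, DockQ
            zs = z_scores([scores[g][t][m] for g in groups])
            for g, z in zip(groups, zs):
                total[g] += z / 4  # average of the 4 component z-scores
    return total

# Hypothetical two-group, one-target example (not real CASP data)
scores = {"A": [(0.8, 0.7, 0.6, 0.9)], "B": [(0.4, 0.3, 0.2, 0.5)]}
ranking = cumulative_z(scores)  # A above B
```

With only two groups every metric z-scores to ±1, so the example collapses to +1 for A and -1 for B; with a full field of groups the z-scores reflect each group's distance from the mean in units of the spread.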
Figure 5. Ranking based on cumulative z-scores and head-to-head comparisons between groups using bootstrap samples.
A) Ranking by best models for Phase 1 targets. B) Head-to-head comparisons by the best models for Phase 1 targets. C) Ranking by best models for all targets in three phases. D) Head-to-head comparisons by the best models for all targets in three phases. E) Ranking by first models for all targets in three phases. F) Head-to-head comparisons by first models for all targets in three phases.
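A head-to-head bootstrap cell compares two groups by resampling targets with replacement and recording how often one group's summed per-target score beats the other's. A minimal sketch with made-up per-target z-scores (the bootstrap size and tie handling are assumptions; the assessors' exact protocol may differ):

```python
import random

def head_to_head(za, zb, n_boot=1000, seed=0):
    """Fraction of bootstrap samples (targets resampled with replacement)
    in which group A's summed per-target score exceeds group B's.
    za, zb: per-target scores for the two groups, index-aligned."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    idx = range(len(za))
    wins = 0
    for _ in range(n_boot):
        sample = [rng.choice(idx) for _ in idx]
        if sum(za[i] for i in sample) > sum(zb[i] for i in sample):
            wins += 1
    return wins / n_boot

# Hypothetical per-target z-scores: A ahead on four of five targets
za = [1.2, 0.8, -0.1, 0.9, 0.3]
zb = [0.5, 0.2, 0.4, -0.3, 0.1]
frac = head_to_head(za, zb)  # close to 1.0 for this toy data
```

Reading a full matrix of such fractions row against column gives the head-to-head heatmaps in panels B, D, and F.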
Figure 6. Performance differences across phases and model 6 evaluation.
A) Stoichiometry prediction accuracy for each group across Phase 0 targets; red: correct, cyan: incorrect, grey: no value. B) Average DockQ improvement for each group from Phase 0 to Phase 1. C) Average DockQ improvement for each group from Phase 1 to Phase 2. In both B and C, we used the RBM pipeline to compute DockQ scores, making scores comparable between phases. D) Left: per-target DockQ difference between the first submitted model and model 6 (where model 6 was generated using ColabFold MSAs). Right: sum of DockQ differences for each group. E) Left: per-target DockQ difference between model 6 and the first model from ColabFold. Right: sum of DockQ differences for each group.
Figure 7. Evaluation of model ranking ability and the community’s improvement over MF.
A) Evaluation of whether the first model selected by a group is the best model for Phase 1 targets and top-performing groups (in overall ranking over all phases). Green: the first model is the best model; purple: the first model is not the best model. B) The number of correctly picked first models (when the first model is the best for a target) by each group shown in A. C) Values of the highest DockQ among MF models or different sets of models submitted by CASP16 groups. Targets are ordered by the highest DockQ among models submitted by CASP16 groups from left to right. D) An example, T1218o, where CASP16 groups using docking-based methods outperformed AF2/3-based predictions. AF-based predictions were poor, although a homologous template (right) is available for this target. E) An example, H1258, where the Yang lab achieved significantly higher interface quality by including only the segment of a protein (green) that mediates its interaction with others in the model. Top: the target; bottom: a model from the Yang lab.
Figure 8. Evaluation of hybrid targets.
A-F) Ranking and head-to-head bootstrap comparisons of prediction accuracy for full models (A and B), protein-protein interfaces (C and D), and protein-NA interfaces (E and F), respectively. G) Comparison of AF3 ICS scores with the 95th percentile ICS scores among CASP16 submissions for different target interface types. Blue: protein-only complexes; green: protein-protein interfaces in hybrid targets; red: protein-NA interfaces in hybrid targets. Density plots along the axes show the distribution of ICS scores for each interface group. H) Structures of target M1209 along with the best model from participants and the best AF3_server model. I-K) Targets (M1282, M1287, and M1211) and their corresponding best models. These are difficult targets lacking predictions of acceptable quality (best ICS for protein-NA interface < 0.2).
