Proteins. 2026 Jan;94(1):154-166.
doi: 10.1002/prot.70044. Epub 2025 Aug 25.

AlphaFold3 at CASP16

Arne Elofsson. Proteins. 2026 Jan.

Abstract

The CASP16 experiment provided the first opportunity to benchmark AlphaFold3. In contrast to AlphaFold2, AlphaFold3 can predict the structure of non-protein molecules. According to the benchmark presented by its developers, it is expected to perform slightly better than AlphaFold2 for proteins. In this study, we assess the performance of AlphaFold3 using both automatic server submissions (AF3-server) and manual predictions from the Elofsson group (Elofsson). All predictions were generated via the AlphaFold3 web server, with manual interventions applied to large targets and ligands. Compared to AlphaFold2-based methods, we found that AlphaFold3 performs slightly better for protein complexes. However, when massive sampling is applied to AlphaFold2, the difference disappears. According to the official CASP ranking, AF3-server performs better than AlphaFold2 for easier targets, but not for harder targets. Furthermore, the performance of AF3-server is comparable to the best methods when considering the top-ranked predictions, but slightly behind when examining the best among the five submitted models. For some targets, the top-ranked AF3-server model is worse than lower-ranked models, indicating that an avenue for progress could be to develop better strategies for identifying the best of the generated models. When AF3-server is used to predict the stoichiometry of larger protein complexes, the accuracy is limited, especially for heteromeric targets. For predictions that include nucleic acids, the accuracy is in general relatively low. However, the AF3-server performance was not far behind that of the top-ranked method. In summary, AF3-server offers a user-friendly tool that provides predictions comparable to state-of-the-art methods in all categories of CASP.
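
The distinction the abstract draws between the top-ranked prediction and the best of the five submitted models can be illustrated with a minimal Python sketch. The function name `top_vs_best`, the target names, and all scores below are hypothetical and invented for illustration; they are not data from the study, and the sketch assumes each target's scores are listed in the predictor's own ranking order (model 1 first).

```python
from typing import Dict, List


def top_vs_best(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """For each target, report the score of the first-ranked model and the
    best score among all submitted models (up to five)."""
    summary: Dict[str, Dict[str, float]] = {}
    for target, model_scores in scores.items():
        if not model_scores:
            continue  # skip targets without any submitted model
        top = model_scores[0]      # assumption: index 0 is the predictor's model 1
        best = max(model_scores)   # best model in hindsight, given the true structure
        summary[target] = {"top": top, "best": best, "gap": best - top}
    return summary


# Hypothetical DockQ-like scores, invented purely for illustration.
example_scores = {
    "target_a": [0.62, 0.60, 0.58, 0.55, 0.40],  # ranking picked the best model
    "target_b": [0.20, 0.71, 0.25, 0.22, 0.18],  # a lower-ranked model is better
}
summary = top_vs_best(example_scores)
```

A large `gap` for a target marks a case where better model selection, rather than better sampling, would have improved the submitted first model.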

Keywords: AlphaFold; CASP; RNA structure prediction; protein structure predictions.

Conflict of interest statement

The author declares no conflicts of interest.

Figures

FIGURE 1
Modeling a large target, H1227, where the left domain (colored in red) could not be modeled directly by AF3‐server as the total complex was more than 5000 residues. Instead, it was modeled separately, along with the terminal part of the larger complex. The overlapping domain was superimposed, and the red domain was finally added to the model before submission.
FIGURE 2
Comparison of AF3‐server and Elofsson predictions using GDT_TS and DockQ. The top row (A + B) compares the highest‐ranked prediction, while the second row (C + D) compares the best prediction out of the five submitted models. On the left (A + C), the comparisons are made at the domain level using GDT_TS, and on the right (B + D), comparisons of complexes using DockQ are shown. Any target with a difference greater than 10% of the maximum value for one of the methods is annotated with the target number.
FIGURE 3
Comparison of AF3‐server with MassiveFold and colabfold_baseline predictions using GDT_TS and DockQ. The top row (A–D) displays the highest‐ranked prediction, while the second row (E–H) highlights the best prediction out of five. The comparison is conducted at the domain level using GDT_TS in the left two columns (A, B, E, and F), and the comparison for complexes is made using DockQ in the right two columns (C, D, G, and H). The comparison against colabfold_baseline is shown in A, C, E, and G, while the comparison against MassiveFold appears in the other panels. Any target with a difference greater than 10% of the maximum value for one of the methods is annotated with the target number.
FIGURE 4
Comparison of AF3‐server and top‐performing predictors, YangServer for domains (A and C) and KiharaLab for complexes (B and D). The top row (A, B) illustrates the highest‐ranked predictions, while the second row (C, D) highlights the best prediction among the five submitted models. On the left, the comparison is conducted at the domain level using GDT_TS, and on the right, the comparison of complexes utilizes DockQ. Any target with a difference greater than 10% of the maximum value for one of the methods is marked with the target number. A rolling average over 10 points is displayed in purple.
FIGURE 5
Comparison of the top‐ranked predictions versus the best predictions for each target from the AF3‐server. (A) illustrates domains using GDT_TS, and (B) depicts complexes using DockQ. The target number is annotated for all targets that differ by more than 10%.
FIGURE 6
Example of an “overlapping” RNA prediction for R1250 (hexamer). Each chain is highlighted in its own color, and all chains are clearly superposed on top of each other.
FIGURE 7
Comparison of RNA predictions. (A) and (B) compare the AF3‐server with Elofsson predictions, while (C) and (D) compare the AF3‐server with the top‐ranked method, Vfold. In (A) and (C), the comparison is made at the single RNA level using TMalign, while in (B) and (D), the comparison of complexes employs DockQ. The mixed (M) and pure RNA targets are marked in different colors. Any target with a difference greater than 10% of the maximum value for one of the methods is annotated with the target number. A rolling average over 10 points is shown in purple in (B) and (D).
FIGURE 8
Evaluation of quality estimations. (A) pTM versus TMscore for protein domains and single‐stranded RNA. (B) ipTM versus DockQ for entire complexes. Each complex type (homomeric targets [T], heteromeric targets [H], mixed targets [M], and nucleic acid targets [R]) is represented in different colors.
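
Several captions above mention two plotting conventions: annotating targets whose scores differ by more than 10% of the maximum value, and drawing a rolling average over 10 points. A minimal Python sketch of both follows. It rests on assumptions not stated in the captions: that "maximum value" means the maximum of the metric's scale (for example 1.0 for DockQ or 100 for GDT_TS), and that the rolling average is a simple unweighted mean over consecutive points. The function names and example scores are hypothetical.

```python
from typing import Dict, List


def targets_to_annotate(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    max_value: float,
    fraction: float = 0.10,
) -> List[str]:
    """Return targets where the two methods differ by more than
    `fraction` of `max_value` (assumed to be the top of the metric's scale)."""
    threshold = fraction * max_value
    shared = scores_a.keys() & scores_b.keys()  # only targets scored by both methods
    return sorted(t for t in shared if abs(scores_a[t] - scores_b[t]) > threshold)


def rolling_average(values: List[float], window: int = 10) -> List[float]:
    """Unweighted mean over each run of `window` consecutive values.
    Returns an empty list when there are fewer than `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical DockQ-like scores (scale maximum 1.0), invented for illustration.
method_a = {"x": 0.50, "y": 0.90}
method_b = {"x": 0.55, "y": 0.30}
labelled = targets_to_annotate(method_a, method_b, max_value=1.0)
smoothed = rolling_average([1.0, 2.0, 3.0, 4.0], window=2)
```

For a smoothed trend line, the values would typically be sorted by the x-axis score before calling `rolling_average`; that ordering is also an assumption here.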
