Comparative Study
Sci Rep. 2018 Jul 2;8(1):9939.
doi: 10.1038/s41598-018-26812-8.

An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12

Chen Keasar, Liam J McGuffin, Björn Wallner, Gaurav Chopra, Badri Adhikari, Debswapna Bhattacharya, Lauren Blake, Leandro Oliveira Bortot, Renzhi Cao, B K Dhanasekaran, Itzhel Dimas, Rodrigo Antonio Faccioli, Eshel Faraggi, Robert Ganzynkowicz, Sambit Ghosh, Soma Ghosh, Artur Giełdoń, Lukasz Golon, Yi He, Lim Heo, Jie Hou, Main Khan, Firas Khatib, George A Khoury, Chris Kieslich, David E Kim, Pawel Krupa, Gyu Rie Lee, Hongbo Li, Jilong Li, Agnieszka Lipska, Adam Liwo, Ali Hassan A Maghrabi, Milot Mirdita, Shokoufeh Mirzaei, Magdalena A Mozolewska, Melis Onel, Sergey Ovchinnikov, Anand Shah, Utkarsh Shah, Tomer Sidi, Adam K Sieradzan, Magdalena Ślusarz, Rafal Ślusarz, James Smadbeck, Phanourios Tamamis, Nicholas Trieber, Tomasz Wirecki, Yanping Yin, Yang Zhang, Jaume Bacardit, Maciej Baranowski, Nicholas Chapman, Seth Cooper, Alexandre Defelicibus, Jeff Flatten, Brian Koepnick, Zoran Popović, Bartlomiej Zaborowski, David Baker, Jianlin Cheng, Cezary Czaplewski, Alexandre Cláudio Botazzo Delbem, Christodoulos Floudas, Andrzej Kloczkowski, Stanislaw Ołdziej, Michael Levitt, Harold Scheraga, Chaok Seok, Johannes Söding, Saraswathi Vishveshwara, Dong Xu, Foldit Players consortium, Silvia N Crivelli

Chen Keasar et al. Sci Rep. 2018.

Abstract

Every two years, groups worldwide participate in the Critical Assessment of Protein Structure Prediction (CASP) experiment to blindly test the strengths and weaknesses of their computational methods. CASP has significantly advanced the field, but many hurdles remain, which may require new ideas and collaborations. In 2012, a web-based effort called WeFold was initiated to promote collaboration within the CASP community and to attract researchers from other fields to contribute new ideas to CASP. Members of the WeFold coopetition (cooperation and competition) participated in CASP as individual teams, but also shared components of their methods to create hybrid pipelines and actively contributed to this effort. We assert that the scale and diversity of the integrative prediction pipelines could not have been achieved by any individual lab, or even by any collaboration among a few partners. The models contributed by the participating groups and generated by the pipelines are publicly available at the WeFold website, providing a wealth of data that remains to be tapped. Here, we analyze the results of the 2014 and 2016 pipelines, showing improvements according to the CASP assessment as well as areas that require further adjustments and research.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
A schematic depiction of the multi-step and multi-path information flow of protein structure prediction. Rounded rectangles represent information and plain rectangles represent basic tasks, each of which is an open computational problem. A prediction process starts with a protein sequence, passes at least once through a set of decoys (structural models of proteins), and ends with a short list, ideally one, of high score decoys. The paths in this graph are not mutually exclusive.
Figure 2
An illustration of the WeFold pipeline concept. The figure presents a schematic depiction of 5 WeFold3 pipelines, which share their first components and differ in the final stages. Graph representation and colors are based on Fig. 1. A complete list of all the WeFold2 and WeFold3 pipelines is presented in Table 1 and in the main text.
Figure 3
Aggregated best models: WeFold vs. all CASP groups. In each panel, targets are sorted in descending order of the quality of the best decoy submitted (blue line). The best WeFold decoy for each target is marked by a red dot or, when it coincides with the overall best, by a red asterisk. The inset histograms depict the distributions of quality differences (Δ) between the best decoys and their corresponding best WeFold decoys. (A and B): CASP11; (C and D): CASP12; (A and C): best out of five; (B and D): first model.
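The comparison plotted in Figure 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `best_decoy_gaps`, the target labels, and all scores are hypothetical, and the quality measure is assumed to be a single number per decoy (for example GDT_TS).

```python
def best_decoy_gaps(all_best, wefold_best):
    """Return one row per target: (target, overall best, WeFold best, delta, coincides).

    all_best / wefold_best map a target name to the best score of any decoy
    submitted by all CASP groups / by WeFold pipelines (hypothetical inputs).
    """
    rows = []
    for target, wf_score in wefold_best.items():
        overall = all_best[target]
        delta = overall - wf_score          # the quantity shown in the inset histograms
        coincides = wf_score == overall     # would be drawn as an asterisk
        rows.append((target, overall, wf_score, delta, coincides))
    # Sort targets by the overall best decoy, descending (the blue line).
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical scores, for illustration only.
all_best = {"T1": 80.0, "T2": 55.5, "T3": 91.0}
wefold_best = {"T1": 72.5, "T2": 55.5, "T3": 60.0}
rows = best_decoy_gaps(all_best, wefold_best)
for target, overall, wf_score, delta, coincides in rows:
    print(target, overall, wf_score, delta, "*" if coincides else ".")
```

Running the same function once on each group's best-of-five scores and once on first-model scores would give the two kinds of panels described in the caption.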
Figure 4
Average z-scores (>−2.0) of the 20 top CASP12 groups. WeFold pipelines are marked with asterisks (Black = wfAll-Cheng; Red = wfMESHI-TIGRESS; Orange = wfMESHI-Seok; Light green = wfRstta-PQ2-seder; Dark green = wfRstta-PQ-ModF6; Light blue = wfRosetta-MUFOLD; Dark blue = wfRstta-PQ-MESHI-MSC; Purple = wfRosetta-PQ-MESHI). The results of MESHI and BAKER-ROSETTASERVER are marked by a black circle and a triangle, respectively. Only groups that submitted models for at least half of the targets are considered. The chart on the left shows the top 20 groups/servers when considering the best model submitted by each group for each target. The chart on the right shows the top 20 groups/servers when considering Model 1 only. CASP assessors used GDT_HA + ASE only for TBM targets, hence that category is depicted twice. Source: http://www.predictioncenter.org/casp12/zscores_final.cgi.
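A minimal sketch of how such a ranking could be computed is given below. It assumes that "(>−2.0)" means z-scores below −2.0 are raised to −2.0 before averaging; the caption does not spell this out, and excluding those values instead is another possible reading. The function name, group labels, and z-scores are all hypothetical.

```python
def average_floored_z(z_by_group, n_targets, floor=-2.0, min_fraction=0.5):
    """Rank groups by mean z-score after flooring low values.

    z_by_group maps a group name to {target: z-score}. Groups with models
    for fewer than min_fraction of the n_targets targets are skipped.
    Flooring at `floor` is an assumption about the "(>-2.0)" in the caption.
    """
    averages = {}
    for group, z_by_target in z_by_group.items():
        if len(z_by_target) < min_fraction * n_targets:
            continue  # submitted for fewer than half of the targets
        floored = [max(z, floor) for z in z_by_target.values()]
        averages[group] = sum(floored) / len(floored)
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical z-scores for three groups over four targets.
z_by_group = {
    "A": {"T1": 1.0, "T2": -3.0, "T3": 0.5},
    "B": {"T1": 2.0, "T2": 0.0},
    "C": {"T1": 3.0},  # only one of four targets, so it is excluded
}
ranking = average_floored_z(z_by_group, n_targets=4)
print(ranking)
```

Taking the first 20 entries of the returned list would correspond to the "top 20" selection in each chart.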
Figure 5
Pairwise comparison of WeFold and related (underlined) CASP11 groups. Each cell represents a comparison between the row and column groups, based on the subset of targets they both predicted. Cell colors depict the difference in average z-scores (GDT_TS). Blue indicates better performance of the row group. Asterisks indicate statistical significance (p < 0.05; Wilcoxon two-sided paired test). Dots indicate that the two groups shared no more than ten targets. Rows are ordered by decreasing number of significant cells, and then by number of blue cells. Source: http://www.predictioncenter.org/casp12/zscores_final.cgi.
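One cell of such a pairwise matrix could be computed roughly as follows. This is a sketch, not the authors' code: it uses an exact two-sided Wilcoxon signed-rank test written with the standard library only (zero differences dropped, tied ranks averaged), and the function names, target labels, and z-scores are hypothetical.

```python
def wilcoxon_signed_rank_p(diffs):
    """Exact two-sided p-value of the Wilcoxon signed-rank test."""
    d = [x for x in diffs if x != 0]          # drop zero differences
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    doubled = [0] * n                          # 2 * average rank keeps ties integral
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled[order[k]] = i + j + 2      # twice the mean of ranks i+1 .. j+1
        i = j + 1
    w_plus = sum(r for r, x in zip(doubled, d) if x > 0)
    # Null distribution: each rank carries a + or - sign with probability 1/2.
    counts = {0: 1}
    for r in doubled:
        nxt = {}
        for s, c in counts.items():
            nxt[s] = nxt.get(s, 0) + c
            nxt[s + r] = nxt.get(s + r, 0) + c
        counts = nxt
    total = 2 ** n
    lower = sum(c for s, c in counts.items() if s <= w_plus) / total
    upper = sum(c for s, c in counts.items() if s >= w_plus) / total
    return min(1.0, 2 * min(lower, upper))


def compare_groups(z_row, z_col, few_shared_limit=10):
    """Compare two groups on the targets both predicted ({target: z-score})."""
    shared = sorted(set(z_row) & set(z_col))
    if not shared:
        return None
    diffs = [z_row[t] - z_col[t] for t in shared]
    return {
        "n_shared": len(shared),
        "mean_diff": sum(diffs) / len(diffs),   # > 0 means the row group did better
        "p_value": wilcoxon_signed_rank_p(diffs),
        "few_shared": len(shared) <= few_shared_limit,  # would be drawn as a dot
    }


# Hypothetical z-scores; T9 and T7 are not shared and are ignored.
row = {"T1": 1.5, "T2": 0.8, "T3": 2.0, "T4": 0.3, "T5": 1.1, "T6": 0.9, "T9": 0.0}
col = {"T1": 1.0, "T2": 0.1, "T3": 1.9, "T4": -0.5, "T5": 0.2, "T6": 0.6, "T7": 2.0}
cell = compare_groups(row, col)
print(cell)
```

The exact enumeration is practical for the number of targets in a CASP round; a library routine with a normal approximation would be a reasonable substitute for much larger samples.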
Figure 6
Box-and-whisker plots represent the steps in the Keasar-Foldit-based pipelines for target T0822-D1. The first column represents the 20 models created by the servers at stage 1. The second column represents the 151 server models that are made available by the CASP organizers (stage 2). Keasar selects a subset of 10 server models using MESHI. These models are marked as dots in the third column. Then Khatib selects 5 of those models (marked with triangles). Khatib’s selected models (starting points) are given to the Foldit players. The Foldit players created a wide range of models, some of which were substantially better than the starting points, as shown in column 4. However, column 5 shows that the clustering and filtering algorithm did not select those best models. Column 6 shows the clusters after refinement by Seok’s lab. Columns 7–13 represent the final selection by different WeFold groups, which selected either exclusively from the clusters in column 6, or from a combination of these and Zhang’s clusters, or from a combination of all the models shared by various WeFold groups and servers. The green line is the best model submitted to CASP11 for that target, considering all the CASP11 groups. Note that the tick labels along the x-axis also show the number of models in each step of the pipeline. Box-and-whisker plots for all the other targets attempted by the Keasar-Foldit pipelines and Zhang pipelines are in the Supplementary Materials.
Figure 7
Comparison of GDT_HA differences between top model in each step of the refinement pipeline and the original model provided by the CASP11 organizers for each target. The steps are identified by color bars representing the difference between the GDT_HA of the starting model and the GDT_HA of (1) the best model among those generated by Foldit players (Foldit-All), (2) the best model among the clusters (Foldit-Cluster), (3) the best model among the clusters refined by KoBaMIN (Foldit-Koba), (4) the best selection by McGuffin (K-McG), (5) the best selection by Wallner/ProQ2 (BW-Kb-BW), (6) the best selection by SVLab of KoBaMIN-refined clusters (Koba-SVlab), and (7) the best selection by SVLab based on unrefined clusters (Clusters-SVLab).
Figure 8
Chart comparing the percentage of models in each step of the refinement pipeline that improved the GDT_HA of the original model provided by CASP organizers. The steps are identified as follows: (1) models generated by Foldit players (Foldit-All), (2) clusters (Foldit Clusters), (3) clusters refined by KoBaMIN (Foldit Koba), (4) selection by McGuffin (K-McG), (5) selection by Wallner/ProQ2 (BW-Kb-BW), (6) selection by SVLab of KoBaMIN-refined clusters (SVLab-Koba), and (7) selection by SVLab based on unrefined clusters (SVLab-Clusters).
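The two per-step quantities described in the captions of Figures 7 and 8 (the GDT_HA gain of the best model over the starting model, and the percentage of models that improve on it) can be sketched as below. The step labels are taken from the captions; the function name and all scores are hypothetical.

```python
def summarize_step(start_gdt_ha, model_gdt_ha):
    """Return (best gain over the starting model, percent of models improved).

    start_gdt_ha is the GDT_HA of the model provided by the CASP organizers;
    model_gdt_ha lists the GDT_HA of every model present at one pipeline step.
    """
    best_delta = max(model_gdt_ha) - start_gdt_ha
    n_improved = sum(1 for score in model_gdt_ha if score > start_gdt_ha)
    return best_delta, 100.0 * n_improved / len(model_gdt_ha)


# Hypothetical GDT_HA values for one target.
start = 50.0
steps = {
    "Foldit-All": [48.0, 52.0, 55.0, 49.0],
    "Foldit-Cluster": [48.0, 52.0],
    "K-McG": [49.0],
}
summary = {name: summarize_step(start, scores) for name, scores in steps.items()}
for name, (best_delta, pct_improved) in summary.items():
    print(f"{name}: best delta {best_delta:+.1f}, improved {pct_improved:.0f}%")
```

In this made-up example the best Foldit model (55.0) is lost at the clustering step, which is the kind of selection loss the two figures are designed to expose.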
Figure 9
Pairwise comparison of WeFold and related (underlined) CASP12 groups. Each cell represents a comparison between the row and column groups, based on the subset of targets they both predicted. Cell colors depict the difference in average z-scores (GDT_TS). Blue indicates better performance of the row group. Asterisks indicate statistical significance (p < 0.05; Wilcoxon two-sided paired test). Rows are ordered by decreasing number of significant cells, and then by number of blue cells. Source: http://www.predictioncenter.org/casp12/zscores_final.cgi.
Figure 10
Bar plots show the down-selection process across the Rosetta-based pipelines for 6 targets using GDT_HA and GDT_MM. In each row, red bars represent the best GDT_HA and blue bars represent the best GDT_MM. GDT_MM is a Baker-lab-specific metric in which the MAMMOTH alignment algorithm (MM = MAMMOTH) is used for the superposition (slight variations with respect to GDT_TS are due to the alignment). The top row shows the best GDT_HA (or GDT_MM) among the hundreds of thousands of models generated by Rosetta for that target. The next row shows the best GDT_HA (GDT_MM) among the best 5 selected by the BAKER-ROSETTASERVER; the next row shows the best GDT_HA (GDT_MM) among the one thousand models selected by ProQ2; the remaining rows show the best GDT_HA (GDT_MM) among the best 5 selected by the Rosetta-based WeFold groups (one set of bars each).
