[Preprint]. 2023 Mar 12:2023.03.08.531814.
doi: 10.1101/2023.03.08.531814.

Combining pairwise structural similarity and deep learning interface contact prediction to estimate protein complex model accuracy in CASP15


Raj S Roy et al. bioRxiv.

Abstract

Estimating the accuracy of quaternary structural models of protein complexes and assemblies (EMA) is important for predicting quaternary structures and applying them to the study of protein function and interaction. The pairwise similarity between structural models has proven useful for estimating the quality of protein tertiary structural models, but it has rarely been applied to predicting the quality of quaternary structural models. Moreover, the pairwise similarity approach often fails when many structural models are of low quality and similar to each other. To address this gap, we developed a hybrid method (MULTICOM_qa) for estimating protein complex model accuracy that combines a pairwise similarity score (PSS) with an interface contact probability score (ICPS) based on deep learning inter-chain contact prediction. MULTICOM_qa participated blindly in the 15th Critical Assessment of Techniques for Protein Structure Prediction (CASP15) in 2022 and ranked first out of 24 predictors in estimating the global accuracy of assembly models. The average per-target correlation coefficient between the model quality scores predicted by MULTICOM_qa and the true quality scores of the models of CASP15 assembly targets is 0.66. The average per-target ranking loss when using the predicted quality scores to rank the models is 0.14. MULTICOM_qa was able to select good models for most targets. Moreover, several key factors for EMA (i.e., target difficulty, model sampling difficulty, skewness of model quality, and similarity between good/bad models) are identified and analyzed. The results demonstrate that combining the multi-model method (PSS) with the complementary single-model method (ICPS) is a promising approach to EMA. The source code of MULTICOM_qa is available at https://github.com/BioinfoMachineLearning/MULTICOM_qa.

Figures

Figure 1.
A simplified illustration of the ICPS calculation. (i) A dimer model with chain A in red and chain B in blue. (ii) All the inter-chain contacts from a dimeric interface of the model are identified, which are coloured in red and blue. (iii) The green lines highlight some inter-chain contacts present in the interface. (iv) The predicted probability scores for the inter-chain contacts are extracted from the inter-chain contact map predicted by a deep learning predictor (CDPred). (v) The probability scores of the contacts in the interface are averaged as ICPS for the model.
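Steps (iv) and (v) of the caption can be sketched in a few lines of Python. This is a minimal illustration, not the MULTICOM_qa implementation: the function name `icps`, the list-of-lists probability map, and the choice to return 0.0 for a model without interface contacts are assumptions made for the example. The interface contacts are taken as given, since the caption does not state the distance cutoff used to define them.

```python
from statistics import fmean


def icps(interface_contacts, contact_prob):
    """Average the predicted probabilities of a model's inter-chain contacts.

    interface_contacts: (i, j) pairs, residue i in chain A and residue j in chain B.
    contact_prob: contact_prob[i][j] is the predicted contact probability
                  (hypothetical stand-in for a predicted inter-chain contact map).
    """
    probs = [contact_prob[i][j] for i, j in interface_contacts]
    if not probs:
        return 0.0  # assumption: no interface contacts means a score of 0
    return fmean(probs)


# Toy predicted map for a 2-residue chain A and a 3-residue chain B.
prob_map = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.3],
]
# Contacts observed in the interface of one hypothetical dimer model.
model_contacts = [(0, 0), (1, 1), (1, 2)]
score = icps(model_contacts, prob_map)
print(round(score, 3))
```

A model whose interface contacts coincide with high-probability cells of the predicted map receives a high score, which is what makes the measure usable on a single model without comparing it to others.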
Figure 2.
The pipeline of MULTICOM_qa for estimating CASP15 assembly model accuracy. The input is a pool of assembly models for a target and other related information such as the stoichiometry of the target, protein sequences, predicted tertiary structures for each unit in the target, and the number of CPU cores. Multiple CPU cores can be used to speed up the calculation of the pairwise similarity scores between assembly models. PSS and ICPS for each model are computed and then averaged as the predicted global fold accuracy score for the model. The ICPS score is also used as the predicted interface score for each model.
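The scoring step described in the caption can be sketched as follows. The averaging of PSS and ICPS comes from the caption; everything else is an assumption for illustration: PSS is taken here to be the mean similarity of a model to all other models in the pool (a common consensus definition that the caption does not spell out), and the similarity matrix and ICPS values are toy numbers rather than output of a structural alignment tool.

```python
from statistics import fmean


def pairwise_similarity_scores(sim):
    """Return one PSS per model: the mean similarity to every other model.

    sim: square matrix (list of lists) of pairwise similarity scores.
    Assumption: the pool-average definition; the source does not give the formula.
    """
    n = len(sim)
    if n < 2:
        return [0.0] * n
    return [fmean([sim[i][j] for j in range(n) if j != i]) for i in range(n)]


def combine_scores(pss, icps_scores):
    """Average PSS and ICPS per model to get the predicted global score."""
    return [(p + c) / 2 for p, c in zip(pss, icps_scores)]


# Toy pool of three models.
sim = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
icps_scores = [0.4, 0.6, 0.1]  # hypothetical per-model ICPS values

pss = pairwise_similarity_scores(sim)
combined = combine_scores(pss, icps_scores)
best_index = max(range(len(combined)), key=combined.__getitem__)
print(best_index, [round(s, 3) for s in combined])
```

Because each row of the similarity matrix is independent of the others, the pairwise comparisons parallelize naturally, which is consistent with the caption's note about using multiple CPU cores.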
Figure 3.
The overall performance of MULTICOM_qa on the models of 36 CASP15 multimer targets. (A) Distribution of the per-target correlation coefficient between MULTICOM_qa predicted and true global quality scores (PTCC). The green area contains targets with PTCC > 0.75. Several targets with low/moderate correlation coefficients are identified. The average PTCC on the 36 targets is 0.6626. (B) Distribution of the per-target ranking loss (PTRL). The green area contains targets with PTRL < 0.1. Several targets with high ranking loss are labelled. The red color highlights nine targets with both high ranking loss and low correlation. The average PTRL on the 36 targets is 0.142. There is a strong negative Pearson’s correlation (−0.8358) between PTCC and PTRL.
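The two per-target metrics can be illustrated with a short sketch. It assumes that PTCC is a Pearson correlation between predicted and true scores, and that ranking loss is the true score of the best model in the pool minus the true score of the model ranked first by the predictor. Both are customary definitions in model accuracy estimation, but neither formula is stated in this text, so treat them as assumptions. The scores are toy values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranking_loss(predicted, true):
    """True score of the best model minus true score of the top-ranked model."""
    top = max(range(len(predicted)), key=predicted.__getitem__)
    return max(true) - true[top]


# One hypothetical target with three models.
predicted_scores = [0.45, 0.575, 0.175]
true_scores = [0.9, 0.7, 0.3]

ptcc = pearson(predicted_scores, true_scores)
ptrl = ranking_loss(predicted_scores, true_scores)
print(round(ptcc, 3), round(ptrl, 3))
```

The example also shows why the two metrics can disagree: the predicted scores correlate positively with the true ones, yet the top-ranked model is not the best one, so the ranking loss is nonzero.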
Figure 4.
PTCC is plotted against PTRL on the 36 multimer targets. The general trend is that a higher PTCC corresponds to a lower PTRL (correlation between them = −0.8358). There are three pronounced exceptions (T1174, T1173 and H1135) with both high correlation and high ranking loss.
Figure 5.
The relationship between PTCC / PTRL of MULTICOM_qa and three factors. A-C: Plot of PTCC against (A) per-target average true TM-score of the models, (B) proportion of good models with TM-score >= 0.8, and (C) the skewness of the distribution of the true TM-scores of the models on the CASP15 targets. Each blue dot denotes one target. D-F: Plot of PTRL against (D) per-target average true TM-score of the models, (E) proportion of good models with TM-score >= 0.8, and (F) the skewness of the distribution of the true TM-scores of the models on the CASP15 targets.
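The three per-target factors on the x-axes can be computed from the true TM-scores of a model pool, as in the sketch below. The 0.8 cutoff for good models comes from the caption. The skewness estimator is an assumption: the population (moment-based) Fisher-Pearson coefficient is used here because the caption does not say which estimator was applied. The TM-scores are toy values.

```python
def mean_score(tm_scores):
    """Per-target average true TM-score of the models."""
    return sum(tm_scores) / len(tm_scores)


def good_fraction(tm_scores, cutoff=0.8):
    """Proportion of models with TM-score >= cutoff."""
    return sum(s >= cutoff for s in tm_scores) / len(tm_scores)


def skewness(values):
    """Population Fisher-Pearson skewness: m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Hypothetical pool: mostly poor models with a few good ones.
tm_scores = [0.2, 0.25, 0.3, 0.85, 0.9]

avg = mean_score(tm_scores)
frac_good = good_fraction(tm_scores)
skew = skewness(tm_scores)
print(round(avg, 3), frac_good, round(skew, 3))
```

A positive skewness, as in this toy pool, means most models sit at the low end of the TM-score range with a tail of better ones.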
Figure 6.
The average PTCC and PTRL of 23 CASP15 EMA predictors. Blue bars denote the average PTCC and orange bars denote the average PTRL. The methods are ordered by average PTCC.
Figure 7.
Good and bad examples of applying MULTICOM_qa to estimate the accuracy of the models of CASP15 multimer targets. In each example, the distribution of the true TM-scores of the models for the target is visualized as a histogram. The native structure, true top-1 (best) model, selected top-1 model, and the TM-scores of the latter two are presented. The bin in the histogram from which the top-1 model was selected by MULTICOM_qa is highlighted in red. (A) H1111: stoichiometry = A9B9C9, PTCC = 0.62, PTRL = 0.03054; (B) T1123: stoichiometry = A2, PTCC = 0.98, PTRL = 0.01671; (C) H1137: stoichiometry = A1B1C1D1E1F1G2H1I1, PTCC = 0.96, PTRL = 0.05916; (D) T1179: stoichiometry = A2, PTCC = 0.97, PTRL = 0.06577; (E) T1121: stoichiometry = A2, PTCC = −0.39, PTRL = 0.34683. For T1121, the correct interfaces are circled; the top-1 model selected by MULTICOM_qa contains only one of the two correct interfaces, resulting in the high ranking loss of 0.34683.
Figure 8.
The model similarity graphs for H1111 (top) and T1121 (bottom). Each node denotes a model. An edge connects two models (nodes) if their structural similarity score (TM-score) is greater than a threshold. The threshold is determined such that only the top 30% of model pairs are connected by edges. The weight of each edge is the TM-score between the two models calculated by MMalign, which is normalized by the total sequence length of the larger model if the two models have different sizes. The color of each node corresponds to the true TM-score of the model. For H1111, both the best model (true top-1 model) and the selected top-1 model come from the largest subgraph with the highest quality. For T1121, the selected top-1 model is from the largest subgraph with mediocre quality, but the best model is in the third largest subgraph.
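The graph construction in this caption can be sketched as below. The 30% rule comes from the caption. The rest is assumed for illustration: the similarity matrix holds toy values instead of MMalign TM-scores, the top 30% is taken by sorting pairs and truncating (ties at the threshold are not treated specially), and subgraphs are found with a simple union-find.

```python
from itertools import combinations


def similarity_edges(sim, top_fraction=0.3):
    """Keep the top `top_fraction` of model pairs, ranked by similarity."""
    n = len(sim)
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: sim[p[0]][p[1]], reverse=True)
    keep = int(len(pairs) * top_fraction)
    return [(i, j, sim[i][j]) for i, j in pairs[:keep]]


def connected_components(n, edges):
    """Group nodes into subgraphs with union-find, largest subgraph first."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _weight in edges:
        parent[find(i)] = find(j)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values(), key=len, reverse=True)


# Toy pool of five models: models 0-2 resemble each other, 3 and 4 do not.
sim = [
    [1.00, 0.90, 0.85, 0.20, 0.10],
    [0.90, 1.00, 0.80, 0.30, 0.20],
    [0.85, 0.80, 1.00, 0.25, 0.15],
    [0.20, 0.30, 0.25, 1.00, 0.40],
    [0.10, 0.20, 0.15, 0.40, 1.00],
]
edges = similarity_edges(sim)
subgraphs = connected_components(len(sim), edges)
print(subgraphs)
```

The T1121 case in the caption illustrates the limitation of reading quality from such a graph: the largest subgraph reflects consensus among models, which need not contain the best model.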
Figure 9.
(A) The distribution of TM-scores of the models of H1114 as well as its native structure, top-1 model selected by PSS, top-1 model selected by ICPS, top-1 model selected by MULTICOM_qa, and the best model in the model pool. Both ICPS and MULTICOM_qa selected a model that is much better than the model selected by PSS. The four A chains in the good models form a cube in the center (red), which is a key feature of the structure of H1114. (B) A homo-dimeric interface between two A chains of the top-1 model selected by ICPS for target H1114. The model is H1114TS119_2. The red lines represent the true inter-chain contacts in the interface. The numbers are the probabilities for the true contacts predicted by CDPred. The true contacts have high predicted probabilities, leading to a high ICPS of 0.44 for the interface.
