Bioinformatics. 2020 Feb 15;36(4):1091-1098.
doi: 10.1093/bioinformatics/btz679.

Analysis of several key factors influencing deep learning-based inter-residue contact prediction

Tianqi Wu et al. Bioinformatics. 2020.

Abstract

Motivation: Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated.

Results: We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. The CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns composed of multiple correlated contacts, and so predicted contacts more accurately than the local coevolution-based methods, increasing precision by a substantial 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed that deeper sequence alignments and the integration of domain-based contact prediction with full-length contact prediction both improved contact prediction performance. Moreover, we demonstrated that domain-based contact prediction built on a novel ab initio approach, which parses domains from MSAs alone without using known protein structures, was a simple and fast way to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances over multiple distance intervals could capture more structural information and improve binary contact prediction.

Availability and implementation: https://github.com/multicom-toolbox/DNCON2/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Contact prediction performance of MULTICOM-NOVEL and CCMpred. (a) ROC curves for the long-range predicted contacts of 43 CASP13 FM and FM/TBM targets, with CCMpred shown in green and MULTICOM-NOVEL in red. The deep learning-based method, MULTICOM-NOVEL, greatly improves the AUC score from 0.61 to 0.84. (b) Plot of the average distance of false positive contact predictions made by MULTICOM-NOVEL versus CCMpred for each CASP13 FM and FM/TBM target (each target is denoted by a dot in the plot). The average distance of false positive contacts over all the targets is 14.1 Å for MULTICOM-NOVEL, smaller than that for CCMpred (17.8 Å).
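The false-positive distance statistic in panel (b) can be illustrated with a minimal sketch. It assumes a contact is a residue pair closer than 8.0 Å in the native structure (the threshold that appears in the Fig. 5 caption) and that `distances` holds a native distance for every predicted pair. The function name and the data layout are hypothetical and are not taken from the paper.

```python
def mean_false_positive_distance(predicted_pairs, distances, threshold=8.0):
    """Average native distance (in angstroms) of predicted pairs that are not contacts.

    predicted_pairs: iterable of (i, j) residue index pairs predicted to be in contact.
    distances: dict mapping each (i, j) pair to its native distance (assumed layout).
    threshold: pairs at or beyond this distance count as false positives (assumption).
    """
    false_positive_distances = [
        distances[pair] for pair in predicted_pairs if distances[pair] >= threshold
    ]
    if not false_positive_distances:
        return None  # no false positives, so the average is undefined
    return sum(false_positive_distances) / len(false_positive_distances)


# Toy data: one true contact (6.0) and two false positives (12.0 and 16.0).
toy_distances = {(0, 30): 6.0, (1, 40): 12.0, (2, 50): 16.0}
toy_mean = mean_false_positive_distance(list(toy_distances), toy_distances)
```

A smaller average means that the wrong predictions still land near true contacts, which is the sense in which 14.1 Å is better than 17.8 Å.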
Fig. 2.
Contact prediction results of CCMpred and MULTICOM-NOVEL for target T0953s2. (a) Top 2L long-range contacts predicted by the two methods (red) versus true contacts (blue). (b) ROC curves of the two methods (red: MULTICOM-NOVEL, AUC = 0.95; green: CCMpred, AUC = 0.81). (c) The coverage (i.e. 100 * TP/N, where TP is the number of true positive contacts and N is the number of native contacts) of the top 5 to top 2L long-range contacts predicted by the two methods. (d) Plot of the precision of the predicted top 5, top L/10, top L/5, top L/2, top L and top 2L long-range contacts of the two methods. (Color version of this figure is available at Bioinformatics online.)
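The coverage formula in panel (c) and the top-k precision in panel (d) can be sketched as follows. The caption does not define "long-range", so the sketch assumes the common convention of a sequence separation of at least 24 residues. It also assumes N counts long-range native contacts only and that every pair is stored as (i, j) with i < j. All names are hypothetical.

```python
def top_long_range(pred, k, min_sep=24):
    """Return the k highest-scoring predicted pairs with |i - j| >= min_sep.

    pred maps (i, j) residue index pairs to a predicted contact score.
    min_sep=24 is an assumed definition of long-range, not stated in the caption.
    """
    long_range = [(pair, score) for pair, score in pred.items()
                  if abs(pair[0] - pair[1]) >= min_sep]
    long_range.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in long_range[:k]]


def precision_and_coverage(pred, native, k, min_sep=24):
    """Precision = 100 * TP / (pairs selected); coverage = 100 * TP / N."""
    selected = top_long_range(pred, k, min_sep)
    native_long_range = {p for p in native if abs(p[0] - p[1]) >= min_sep}
    tp = sum(1 for p in selected if p in native_long_range)
    precision = 100.0 * tp / len(selected) if selected else 0.0
    coverage = 100.0 * tp / len(native_long_range) if native_long_range else 0.0
    return precision, coverage


# Toy data: (2, 5) scores highest but is short-range, so it is ignored.
toy_pred = {(0, 30): 0.9, (1, 40): 0.8, (2, 5): 0.95, (3, 50): 0.1}
toy_native = {(0, 30), (3, 50), (2, 5)}
toy_result = precision_and_coverage(toy_pred, toy_native, k=2)
```

For a protein of length L, the cut-offs in panel (d) correspond to k = 5, L/10, L/5, L/2, L and 2L.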
Fig. 3.
(a) Plot of contact prediction precision against Neff of multiple sequence alignments for 108 CASP13 domains for MULTICOM-NOVEL. Dots with different colors represent precisions of different numbers of long-range contact predictions (top L/5, top L/2 and top L). The curve is the LOESS line fitting the dots. The plot in Neff range [1, 2500] is zoomed in. (b) Scatterplot of the precision of top L long-range contact predictions versus log (Neff) with the marginal histograms of the precision and log (Neff) shown on the top and on the right, respectively. The curve is the LOESS line fitting the dots. (Color version of this figure is available at Bioinformatics online.)
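The caption uses Neff without defining it. One common definition, shown here as an assumption rather than as the authors' exact procedure, weights each sequence by the inverse of the number of alignment members that share at least a given fraction of identical positions with it, then sums the weights. The 0.8 identity cut-off and the function names are hypothetical, and the sketch assumes non-empty aligned sequences of equal length.

```python
def identity(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def neff(msa, identity_threshold=0.8):
    """Effective number of sequences under an identity-based reweighting.

    msa: list of aligned sequences (strings of equal length).
    identity_threshold: assumed cut-off; published tools use different values.
    """
    total = 0.0
    for seq in msa:
        # Every sequence is its own neighbour, so the count is at least 1.
        neighbours = sum(1 for other in msa
                         if identity(seq, other) >= identity_threshold)
        total += 1.0 / neighbours
    return total


# Two identical sequences share one unit of weight; the third is unrelated.
toy_msa = ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY"]
toy_neff = neff(toy_msa)
```

Under this definition, an alignment with many near-duplicate sequences has a much lower Neff than its raw sequence count, which is why the figure plots precision against Neff rather than against alignment size.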
Fig. 4.
Domain parsing and domain-based contact prediction of target T0989. (a) Plot of the number of sequences in the MSA of T0989 against residue position, together with the true domain boundaries and the boundaries predicted by the ab initio domain parsing method. (b) The contact prediction precision for the second domain of T0989 by MULTICOM-CLUSTER with and without domain parsing and the integration of domain-based contact prediction.
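The profile plotted in panel (a), the number of MSA sequences covering each residue position, can be computed with a short sketch. The abstract and caption do not say how domain boundaries are derived from this profile, so the code stops at the profile itself. The gap characters and the function name are assumptions.

```python
def sequences_per_position(msa, gap_chars="-."):
    """Count, for each alignment column, the sequences with a non-gap residue.

    msa: list of aligned sequences of equal length (assumed input format).
    gap_chars: characters treated as gaps (assumption; formats differ).
    """
    length = len(msa[0])
    return [
        sum(1 for seq in msa if seq[column] not in gap_chars)
        for column in range(length)
    ]


# Toy alignment: coverage drops toward the C-terminal end.
toy_alignment = ["ACD--", "AC---", "ACDEF"]
toy_profile = sequences_per_position(toy_alignment)
```

A sharp change in this count along the sequence suggests that different regions are covered by different sets of homologs, which is the kind of signal that a parser working from MSAs alone has available.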
Fig. 5.
Top L/2 long-range predicted contacts for T0963-D3 at Stage 1 without the inter-residue distance distribution as input and at Stage 2 with the inter-residue distance distribution as input. (a) Top L/2 long-range predicted contacts (red) versus true contacts (blue) for T0963-D3 at Stage 1 at distance thresholds of 6, 7.5, 8.5 and 10 Å. (b) Top L/2 long-range contacts versus true contacts at the distance threshold of 8.0 Å at Stage 1 and Stage 2. (c) The predicted top L/5 long-range contacts at the distance threshold of 8.0 Å at Stage 1 and Stage 2 are visualized on the native structure of target T0963-D3. The red lines in the structure are the false positive contacts and the black lines are true positive contacts. (Color version of this figure is available at Bioinformatics online.)
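The distance thresholds named in this caption can be made concrete with a small sketch that derives one set of true contacts per threshold from native distances. It only shows how the reference contact sets at 6, 7.5, 8.0, 8.5 and 10 Å relate to one another. It does not reproduce the two-stage network, and the names and data layout are hypothetical.

```python
# Thresholds in angstroms, taken from the Fig. 5 caption.
THRESHOLDS = (6.0, 7.5, 8.0, 8.5, 10.0)


def contacts_at_thresholds(distances, thresholds=THRESHOLDS):
    """Map each threshold to the set of residue pairs closer than it.

    distances: dict mapping (i, j) residue pairs to native distances (assumed layout).
    """
    return {
        t: {pair for pair, d in distances.items() if d < t}
        for t in thresholds
    }


toy_native_distances = {(0, 30): 5.5, (1, 40): 7.8, (2, 50): 9.0, (3, 60): 12.0}
toy_maps = contacts_at_thresholds(toy_native_distances)
```

Each looser threshold yields a superset of the contacts at the tighter one, so the set of maps carries coarse distance information that a single 8.0 Å map discards.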
