Proteins. 2018 Mar;86 Suppl 1(Suppl 1):84-96.
doi: 10.1002/prot.25405. Epub 2017 Oct 31.

Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning

Badri Adhikari et al. Proteins. 2018 Mar.

Abstract

In this study, we evaluate the residue-residue contacts predicted by our three methods in the CASP12 experiment, focusing on the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignments, from which coevolution-based features are derived and integrated by a neural network to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions, where L is the sequence length. A comparison of the three methods shows that the quality of predicted contacts is driven by the quality and effective depth of the multiple sequence alignments, by coevolution-based features, and by the machine learning integration of coevolution-based and traditional features. On the full CASP12 dataset, when the top L/5 predicted long-range contacts are evaluated, coevolution-based features alone improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises it to 56.3%. The correlation between the precision of contact prediction and the logarithm of the number of effective sequences in the alignments is 0.66.

Keywords: CASP; coevolution; deep learning; machine learning; multiple sequence alignment; protein contact prediction.
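
The abstract reports precision for the top L/5 long-range contact predictions. As a rough illustration of how such a score can be computed, the sketch below ranks residue pairs by predicted probability, keeps the top L/5 long-range pairs, and counts how many are true contacts. It is not the authors' evaluation code. The function name `top_long_range_precision`, the sequence-separation threshold of 24 residues, and the 8 Å distance cutoff are assumptions based on the common CASP convention and are not stated in the abstract.

```python
def top_long_range_precision(pred, dist, frac=5, min_sep=24, cutoff=8.0):
    """Precision of the top L/frac long-range predicted contacts.

    pred: L x L nested list of predicted contact probabilities.
    dist: L x L nested list of native inter-residue distances (angstroms).
    min_sep and cutoff are assumed values (common CASP convention).
    """
    L = len(pred)
    # Collect only pairs whose sequence separation is at least min_sep.
    pairs = [(pred[i][j], i, j)
             for i in range(L)
             for j in range(i + min_sep, L)]
    if not pairs:
        return float("nan")  # chain too short to have long-range pairs
    # Rank candidate pairs by predicted probability, highest first.
    pairs.sort(key=lambda t: -t[0])
    top = pairs[:max(1, L // frac)]
    # A predicted pair is correct if the native distance is under the cutoff.
    hits = sum(1 for _, i, j in top if dist[i][j] < cutoff)
    return hits / len(top)


# Hypothetical toy input: a 30-residue chain, so L/5 = 6 pairs are scored.
L = 30
pred = [[0.0] * L for _ in range(L)]
dist = [[20.0] * L for _ in range(L)]
scores = {(0, 24): 0.9, (0, 25): 0.8, (1, 26): 0.7,
          (2, 27): 0.6, (3, 28): 0.5, (4, 29): 0.4}
for (i, j), p in scores.items():
    pred[i][j] = p
for i, j in [(0, 24), (1, 26), (3, 28)]:
    dist[i][j] = 5.0  # three of the six ranked pairs are native contacts
pred[5][6], dist[5][6] = 0.99, 4.0  # short-range pair, ignored by the metric

precision = top_long_range_precision(pred, dist)
print(precision)
```

In this toy case three of the six top-ranked long-range pairs are native contacts, so the precision is 0.5. The confident short-range pair does not affect the result because it falls below the separation threshold.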

Conflict of interest statement

The authors declare there is no conflict of interest.

Figures

Figure 1
Contact map visualization of top L contacts predicted by MULTICOM-CONSTRUCT (A), PSICOV (B), FreeContact (C), and CCMpred (D) for the target domain T0868-D1. Green dots in upper triangles represent contacts in the native structure and red dots in lower triangles denote the contact predictions.
Figure 2
The precision of top L/5 long-range contacts predicted by MULTICOM-CONSTRUCT is plotted against the logarithm of the number of sequences (N) in the alignments generated for the whole targets (left) and the logarithm of the number of effective sequences (Meff) calculated for the domains (right) on the CASP12 dataset. The Pearson’s correlation coefficients of the precision with log(N) and log(Meff) are 0.47 and 0.66, respectively.
Figure 3
Visualization of the top L contacts predicted by MULTICOM-CONSTRUCT and of the reconstructed model for the domain T0900-D1. Chord diagrams show the long-range contacts in the native structure (A) and the top L contacts predicted by MULTICOM-CONSTRUCT (B). The contacts predicted by MULTICOM-CONSTRUCT are highlighted in the native structure, with actual distances between the residues shown in black (C), and the reconstructed structure (orange) is shown superimposed on the native structure (green) (D).
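
Figure 2 distinguishes the raw number of sequences (N) in an alignment from the number of effective sequences (Meff). The sketch below shows one widely used way to compute such a quantity: each sequence is down-weighted by the number of sequences in the alignment that are highly similar to it, and the weights are summed. This is an illustration only. The 80% identity threshold, the function names, and the toy alignment are assumptions, and the authors' exact definition of Meff may differ.

```python
import math


def identity(a, b):
    """Fraction of aligned columns at which two equal-length rows match."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / len(a)


def effective_sequences(alignment, threshold=0.8):
    """Sum of per-sequence weights, each 1 / (number of similar sequences).

    A sequence always counts itself as similar, so every weight lies in
    (0, 1]. The threshold of 0.8 is an assumed value.
    """
    meff = 0.0
    for s in alignment:
        neighbours = sum(1 for t in alignment if identity(s, t) >= threshold)
        meff += 1.0 / neighbours
    return meff


# Hypothetical toy alignment: three near-identical rows and one divergent row.
alignment = [
    "ACDEFGHIKL",
    "ACDEFGHIKL",
    "ACDEFGHIKV",
    "MNPQRSTVWY",
]

n = len(alignment)
meff = effective_sequences(alignment)
print(n, meff, math.log10(n), math.log10(meff))
```

Here the three similar rows each receive a weight of 1/3 and the divergent row a weight of 1, so N is 4 while Meff is 2. Redundant sequences inflate N without adding much coevolution signal, which is consistent with the caption's report that precision correlates more strongly with log(Meff) than with log(N).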
