PLoS One. 2014 Mar 24;9(3):e92721.
doi: 10.1371/journal.pone.0092721. eCollection 2014.

Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners

Carlo Baldassi et al. PLoS One. 2014.

Abstract

In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to that achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partners in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code.
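The reason the Gaussian problem is exactly solvable can be sketched in a few lines: for a multivariate Gaussian, the couplings follow directly from the inverse of the covariance matrix. The sketch below is not the authors' implementation. The function name `gaussian_coupling_scores`, the `shrink` parameter, and the shrinkage toward a scaled identity matrix (used here in place of whatever prior or pseudo-count the paper uses, which the abstract does not specify) are assumptions made for illustration, and no sequence reweighting is done.

```python
import numpy as np

def gaussian_coupling_scores(X, L, q, shrink=0.5):
    """Score position pairs from a binarized alignment X of shape (M, L*q).

    Assumption: each of the L positions occupies q consecutive columns of X.
    Assumption: `shrink` blends the empirical covariance with a scaled
    identity so that the matrix is invertible (a stand-in regularizer).
    """
    X = np.asarray(X, dtype=float)
    n = L * q
    if X.shape[1] != n:
        raise ValueError("X must have L*q columns")
    # Empirical covariance of the encoded sequences (columns are variables).
    C_emp = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    scale = np.trace(C_emp) / n + 1e-8
    C = (1.0 - shrink) * C_emp + shrink * scale * np.eye(n)
    # Gaussian couplings: minus the inverse covariance (precision) matrix.
    J = -np.linalg.inv(C)
    # Frobenius norm of each q-by-q coupling block gives one score per pair.
    blocks = J.reshape(L, q, L, q)
    F = np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(F, 0.0)
    return F
```

The only expensive step is a single matrix inversion, which is polynomial in the protein length, in contrast to the exponential cost of exact inference with discrete variables.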

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. True positive rate plotted against number of predicted pairs.
Results are shown for four different scoring techniques: Frobenius norm (as described in , pseudo-count set to formula image, blue); Gaussian direct information (as described in the text, APC-corrected, pseudo-count set to formula image, red); mean-field direct information (as described in , pseudo-count set to formula image, orange) and APC-corrected mutual information (as described in , green). The true positive rate is an arithmetic mean over 50 Pfam families (see Table 2 for the list); thin lines represent standard deviations.
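As a rough illustration of how a curve like the one in Figure 1 can be computed, the sketch below applies the average product correction (APC) to a symmetric score matrix and then measures, for each n, the fraction of the n top-ranked pairs that are true contacts. The function names and the `min_sep` parameter (minimum sequence separation between paired positions) are hypothetical, and the caption does not state which pairs were excluded from the ranking.

```python
import numpy as np

def apc(S):
    """Average product correction: S_ij - mean_i * mean_j / overall mean.

    Means are taken over off-diagonal entries only. Assumes the overall
    mean is non-zero.
    """
    S = np.asarray(S, dtype=float)
    L = S.shape[0]
    off_diag = ~np.eye(L, dtype=bool)
    row_mean = np.where(off_diag, S, 0.0).sum(axis=1) / (L - 1)
    total_mean = row_mean.mean()
    corrected = S - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected

def tpr_curve(scores, contacts, min_sep=1):
    """Fraction of true contacts among the top-n pairs, for n = 1, 2, ...

    `contacts` is a boolean L-by-L matrix of true contacts; only pairs
    with j - i >= min_sep are ranked (min_sep is an assumed parameter).
    """
    scores = np.asarray(scores, dtype=float)
    contacts = np.asarray(contacts, dtype=bool)
    i, j = np.triu_indices(scores.shape[0], k=min_sep)
    order = np.argsort(-scores[i, j], kind="stable")
    hits = contacts[i[order], j[order]].astype(float)
    return np.cumsum(hits) / np.arange(1, hits.size + 1)
```

Averaging such curves over several families, as the caption describes for 50 Pfam families, would then be an element-wise mean over curves truncated to a common length.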
Figure 2. True positive rate plotted against number of predicted pairs.
Data for plmDCA (green) and PSICOV version 1.11 (red) were obtained using the code provided by the authors with standard parameters as found in the distributed code, except that PSICOV was run with the -o flag to override the check against insufficient effective number of sequences. The true positive rate is an arithmetic mean over 50 Pfam families (see Table 2 for the list); thin lines represent standard deviations.
Figure 3. First predicted contacts for the PF00069 family (Protein Kinase domain) with Gaussian DCA, using the same settings as for Fig. 2.
The left panel shows the predicted contacts overlaid on the PDB structure 3fz1 (figure produced using the PyMOL software [51]); the right panel shows the predicted pairs overlaid on the contact map (true contacts as obtained by setting the threshold at 8 Å are shown in black). In both panels, the color code is the following: the first formula image predicted contacts are depicted in green, the next formula image contacts in yellow, the last formula image contacts in grey; the only false positive contact (occurring as the 24th predicted pair) is shown in red.
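The contact map in the right panel rests on a distance threshold of 8 Å. A minimal sketch of that step follows, assuming one representative 3D coordinate per residue; the caption does not say which atoms the distances are measured between, so that choice and the function name `contact_map` are assumptions.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Boolean contact map from per-residue coordinates in angstroms.

    Assumption: `coords` has shape (L, 3), one representative point per
    residue. Two residues are in contact if their distance is below
    `threshold`; a residue is never in contact with itself.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contacts = dist < threshold
    np.fill_diagonal(contacts, False)
    return contacts
```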
Figure 4. DI-ranking-induced mean true positive rate for predicting inter-protein contacts in the SK/RR complex, for both mean-field DCA (blue curve) and multivariate Gaussian DCA (red curve).
Figure 5. Partner prediction for Caulobacter crescentus orphan two-component proteins by the conditional probability method.
Experimentally known interaction partners , are shown in red. Green dots correspond to partner predictions suggested in . As for , the overall performance of the algorithm is good, except for the prediction on CenK-CenR interaction.
Figure 6. Partner prediction for Bacillus subtilis orphan two-component proteins.
All 5 orphan kinases, KinA-E, are known to phosphorylate Spo0F, which is displayed in red and is always the maximally scoring protein in the RR set.
Figure 7. Illustration of the encoding of a sequence from FASTA format to its intermediate numeric representation (matrix ) to its final binarized representation (matrix ).
For clarity, we restrict the alphabet to formula image amino-acids, formula image, plus the gap. The alternation of white and gray cell backgrounds helps to track the transformation (e.g. formula image). Typically, MSAs of protein families are such that in every column (i.e. residue position) there appears a number of distinct residues smaller than or equal to formula image. Here, we did not consider a restriction of the alphabet to the residues actually occurring, and we used instead the same encoding for all residues.
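The two-step encoding described in this caption can be sketched as follows, using a toy three-letter alphabet. The alphabet, the function names, and the convention that the gap maps to 0 in the numeric matrix and to an all-zero block in the binarized matrix are assumptions made for illustration; the caption's own symbols and values are not reproduced here.

```python
import numpy as np

TOY_ALPHABET = "ACD"  # hypothetical restricted alphabet, gap handled separately

def to_numeric(seqs, alphabet=TOY_ALPHABET):
    """Map aligned sequences to integers: residues to 1..q, anything else to 0.

    Assumption: the gap symbol (and any unknown character) is coded as 0.
    """
    index = {aa: k for k, aa in enumerate(alphabet, start=1)}
    return np.array([[index.get(ch, 0) for ch in s] for s in seqs], dtype=int)

def binarize(Z, q):
    """Expand an (M, L) integer matrix into an (M, L*q) 0/1 matrix.

    Each position gets q columns; residue a sets column a-1 of its block
    to 1, and a gap leaves the whole block at zero.
    """
    Z = np.asarray(Z, dtype=int)
    M, L = Z.shape
    X = np.zeros((M, L * q), dtype=np.uint8)
    rows, cols = np.nonzero(Z)
    X[rows, cols * q + Z[rows, cols] - 1] = 1
    return X

Z = to_numeric(["AC-", "DDA"])
X = binarize(Z, q=len(TOY_ALPHABET))
```

With this layout every residue position uses the same number of columns, in line with the caption's remark that the same encoding is used for all residues.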

References

    1. Altschuh D, Lesk A, Bloomer A, Klug A (1987) Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. Journal of Molecular Biology 193: 693–707.
    2. Gobel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins: Structure, Function and Genetics 18: 309–317.
    3. Neher E (1994) How frequent are correlated changes in families of protein sequences? Proceedings of the National Academy of Sciences 91: 98–102.
    4. Shindyalov I, Kolchanov N, Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Engineering 7: 349–358.
    5. Lockless SW, Ranganathan R (1999) Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286: 295–299.
