Bioinformatics. 2009 May 1;25(9):1125-31.
doi: 10.1093/bioinformatics/btp135. Epub 2009 Mar 10.

Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information

Cristina Marino Buslje et al. Bioinformatics. 2009.

Abstract

Motivation: Mutual information (MI) theory is often applied to predict positional correlations in a multiple sequence alignment (MSA), enabling analysis of positions that are structurally or functionally important in a given fold or protein family. Accurate identification of coevolving positions in protein sequences is difficult because of the high background signal imposed by phylogeny and noise. Several MI-based methods have been proposed to identify coevolving amino acids in protein families.
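As an illustration of the quantity discussed above, the following is a minimal sketch of plain (uncorrected) MI between two alignment columns, using the standard information-theoretic definition. The function name `mutual_information` and the representation of columns as equal-length strings are assumptions made for this example; no weighting or low-count correction is applied.

```python
from collections import Counter
from math import log2


def mutual_information(col_a, col_b):
    """Return the MI (in bits) between two alignment columns.

    Columns are equal-length sequences of residue symbols, one symbol
    per aligned sequence. Frequencies are raw counts (no corrections).
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same length")
    n = len(col_a)
    if n == 0:
        raise ValueError("columns must not be empty")
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        mi += p_ab * log2(c * n / (count_a[a] * count_b[b]))
    return mi
```

Two perfectly covarying columns such as `"AACC"` and `"GGTT"` give 1 bit, whereas independent columns such as `"ACAC"` and `"GGTT"` give 0 bits.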

Results: After evaluating two current methods, we demonstrate that sequence-weighting techniques, which reduce sequence redundancy, and low-count corrections, which account for the small number of observations in sequence families of limited size, can significantly improve the predictive performance of MI. The evaluation is made on large sets of both in silico-generated alignments and biological sequence data. The methods included in the analysis are APC (average product correction) and RCW (row-column weighting). The best-performing method was APC combined with sequence weighting and low-count corrections. Using sequence permutations to calculate an MI rescaling is shown to significantly improve prediction accuracy and allows direct comparison of information values across protein families. Finally, we demonstrate that an MSA needs at least 400 sequences that are <62% identical to achieve meaningful predictive performance. These contributions yield a noteworthy improvement over current procedures for determining coevolution and residue contacts, and we believe this will have potential impact on the understanding of protein structure, function and folding.
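To make the APC step concrete, here is a minimal sketch of the average product correction as it is commonly defined: for positions i and j, the product of their mean MI values divided by the overall mean MI is subtracted from MI(i, j). The function name `apc_correct` and the nested-list matrix layout are assumptions for this example, and the sequence weighting and low-count corrections described in the abstract are not included.

```python
def apc_correct(mi):
    """Apply the average product correction to a symmetric MI matrix.

    mi is a square list of lists; diagonal entries are ignored.
    Returns a new matrix with MI(i, j) - APC(i, j) off the diagonal
    and 0.0 on the diagonal.
    """
    n = len(mi)
    if n < 2:
        raise ValueError("need at least two positions")
    # Mean MI of each position with every other position.
    pos_mean = [
        sum(mi[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Mean MI over all off-diagonal pairs.
    overall = sum(
        mi[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Guard against an all-zero matrix (assumption: no correction then).
            apc = pos_mean[i] * pos_mean[j] / overall if overall else 0.0
            corrected[i][j] = mi[i][j] - apc
    return corrected
```

When every pair carries the same MI, the correction removes it entirely, which reflects the intent of treating a uniform background as non-informative.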

Figures

Fig. 1. Average predicted positive ratio as a function of predictions per residue, shown in a semi-log plot for the 85 MSAs in the Pfam dataset. C, clustering; Z, Z-score sequence-based permutation.
Fig. 2. Average AUC values as a function of the number (#) of clusters or sequences in the MSA from the Pfam dataset. For clusters, # refers to the clusters defined by the Hobohm 1 algorithm using 62% sequence identity. For sequences, # refers to individual sequences from protein families defined in the Pfam database. Performance values are calculated using sequence-based Z-score permutations.
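The clustering mentioned in the Fig. 2 caption can be sketched as a greedy, Hobohm 1 style pass over the aligned sequences at a 62% identity threshold. The function names, the identity measure (fraction of identical columns, gaps counted like any other symbol), and the weight of 1/cluster size per sequence are assumptions made for this illustration and may differ from the exact procedure used in the paper.

```python
def identity(a, b):
    """Fraction of identical columns between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def hobohm1_clusters(seqs, threshold=0.62):
    """Greedy clustering: each sequence joins the first cluster whose
    representative (its first member) it matches at >= threshold,
    otherwise it starts a new cluster. Returns lists of indices."""
    clusters = []
    for idx, seq in enumerate(seqs):
        for cluster in clusters:
            if identity(seqs[cluster[0]], seq) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def sequence_weights(seqs, threshold=0.62):
    """Weight each sequence by 1 / (size of its cluster)."""
    weights = [0.0] * len(seqs)
    for cluster in hobohm1_clusters(seqs, threshold):
        for idx in cluster:
            weights[idx] = 1.0 / len(cluster)
    return weights
```

Under these assumptions, the number of clusters equals the sum of the weights, so it can serve as an effective count of non-redundant sequences in the alignment.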
Fig. 3. (A) Ribbon representation of 2TRX. Fourteen of the 20 residue pairs with the highest Z-score transformed MI are shown as gold spheres. Orange spheres: catalytic C32 and C35. Molecular graphics images were generated using the UCSF Chimera package (University of California; Meng et al., 2006). (B) Schematic representation of the 14 pairs of interactions with the highest Z-score transformed MI values. Red lines are high MI scoring residue pairs. Full lines denote physical contact (Cβ distance <8 Å).
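A Z-score transformation of MI by permutation can be sketched as follows: the observed MI for a pair of columns is compared with the MI obtained after repeatedly shuffling one column, and expressed in standard deviations above the shuffled mean. This is a simplified illustration; the function names, the choice to shuffle a single column, the number of permutations and the fixed seed are all assumptions, and the paper's exact permutation scheme is not given in this excerpt.

```python
import random
from collections import Counter
from math import log2
from statistics import mean, pstdev


def mutual_information(col_a, col_b):
    """Plain MI (in bits) between two equal-length alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )


def permutation_z_score(col_a, col_b, n_perm=200, seed=0):
    """Z-score of the observed MI against a shuffled-column null."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    observed = mutual_information(col_a, col_b)
    shuffled = list(col_b)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the pairing, keeps composition
        null.append(mutual_information(col_a, shuffled))
    sd = pstdev(null)
    if sd == 0:
        # No variation in the null (e.g. a fully conserved column).
        return 0.0
    return (observed - mean(null)) / sd
```

Because the score is expressed relative to each pair's own null distribution, values become comparable across alignments of different depth and composition, which is the motivation stated in the abstract for the rescaling.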

References

    1. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
    2. Byung-Chul L, et al. Analysis of the residue-residue coevolution network and the functionally important residues in proteins. Protein Struct. Funct. Bioinform. 2008;72:863–872.
    3. Chiu DKY, Kolodziejczak T. Inferring consensus structure from nucleic acid sequences. Comput. Appl. Biosci. 1991;7:347–352.
    4. Cover TM, Thomas JA. Elements of information theory. Wiley; 1991.
    5. DePristo MA, et al. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat. Rev. Genet. 2005;6:678–687.
