Bioinformatics. 2009 May 1;25(9):1125-31.

doi: 10.1093/bioinformatics/btp135. Epub 2009 Mar 10.

Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information

Cristina Marino Buslje¹, Javier Santos, Jose Maria Delfino, Morten Nielsen

Affiliation

¹ Department of Biological Chemistry and Institute of Biochemistry and Biophysics (IQUIFIB), School of Pharmacy and Biochemistry, University of Buenos Aires, Junín 956, 1113 Buenos Aires, Argentina. cmb@qb.ffyb.uba.ar

PMID: 19276150
PMCID: PMC2672635
DOI: 10.1093/bioinformatics/btp135

Cristina Marino Buslje et al. Bioinformatics. 2009.

Abstract

Motivation: Mutual information (MI) theory is often applied to predict positional correlations in a multiple sequence alignment (MSA), making it possible to analyze the positions that are structurally or functionally important in a given fold or protein family. Accurate identification of coevolving positions in protein sequences is difficult because of the high background signal imposed by phylogeny and noise. Several MI-based methods have been proposed to identify coevolving amino acids in protein families.
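As a point of reference for the MI-based approaches mentioned above, the following is a minimal sketch of plain (uncorrected) mutual information between two MSA columns. The function name `mutual_information` and the representation of a column as a string with one residue per sequence are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Plain MI (in bits) between two alignment columns.

    Each column is a sequence of residues, one per aligned sequence.
    No phylogeny, redundancy, or low-count correction is applied here.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same sequences")
    n = len(col_a)
    pair_counts = Counter(zip(col_a, col_b))
    counts_a = Counter(col_a)
    counts_b = Counter(col_b)
    mi = 0.0
    for (a, b), n_ab in pair_counts.items():
        # p_ab / (p_a * p_b) written in counts to avoid rounding noise
        ratio = (n_ab * n) / (counts_a[a] * counts_b[b])
        mi += (n_ab / n) * math.log2(ratio)
    return mi
```

Two columns that always change together carry the maximum information allowed by their entropy, while a fully conserved column carries none, which is why conservation alone cannot produce an MI signal.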

Results: After evaluating two current methods, we demonstrate that sequence-weighting techniques, which reduce sequence redundancy, and low-count corrections, which account for the small number of observations in sequence families of limited size, can significantly improve the predictive performance of MI. The evaluation is made on large sets of both in silico-generated alignments and biological sequence data. The methods included in the analysis are APC (average product correction) and RCW (row-column weighting). The best-performing method was APC combined with sequence weighting and low-count corrections. Using sequence permutations to calculate an MI rescaling is shown to significantly improve prediction accuracy and allows direct comparison of information values across protein families. Finally, we demonstrate that a minimum of 400 sequences <62% identical is needed in an MSA to achieve meaningful predictive performance. Our contribution achieves a noteworthy improvement over current procedures for determining coevolution and residue contacts, and we believe it will have an impact on the understanding of protein structure, function and folding.
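The combination described in the Results (sequence weights, a low-count correction, and APC) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' implementation: the low-count correction is modeled as a uniform pseudocount added to every amino acid pair, the default value `0.05` is a hypothetical choice, and APC is written in its commonly used form, where the product of the two per-position mean MI values divided by the overall mean MI is subtracted from each pair. The names `weighted_mi` and `apc_correct` are hypothetical.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def weighted_mi(col_a, col_b, weights, pseudocount=0.05, alphabet=AMINO_ACIDS):
    """MI (bits) from weighted pair counts plus a uniform pseudocount.

    `weights` holds one weight per sequence (for example 1/cluster size).
    The pseudocount value is an assumption made for this sketch.
    """
    counts = {pair: pseudocount for pair in product(alphabet, repeat=2)}
    for a, b, w in zip(col_a, col_b, weights):
        if a in alphabet and b in alphabet:  # skip gaps and unknown symbols
            counts[(a, b)] += w
    total = sum(counts.values())
    if total == 0:
        return 0.0
    p_a = {x: 0.0 for x in alphabet}
    p_b = {x: 0.0 for x in alphabet}
    for (a, b), c in counts.items():
        p_a[a] += c / total
        p_b[b] += c / total
    mi = 0.0
    for (a, b), c in counts.items():
        p_ab = c / total
        if p_ab > 0:
            mi += p_ab * math.log2(p_ab / (p_a[a] * p_b[b]))
    return mi


def apc_correct(mi):
    """Subtract the average product term from a symmetric MI matrix.

    corrected[i][j] = mi[i][j] - mean_i * mean_j / overall_mean,
    with means taken over off-diagonal entries only.
    """
    n = len(mi)
    if n < 2:
        raise ValueError("need at least two positions")
    pos_mean = [sum(mi[i][j] for j in range(n) if j != i) / (n - 1)
                for i in range(n)]
    overall = sum(pos_mean) / n
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                apc = pos_mean[i] * pos_mean[j] / overall if overall else 0.0
                corrected[i][j] = mi[i][j] - apc
    return corrected
```

In this form, a matrix in which every pair has the same MI is corrected to zero everywhere, so only pairs that stand out from the per-position background keep a positive score.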

Figures

**Fig. 1.**
Average of predicted positive ratio as a function of predictions per residue in a semi-log plot for the 85 MSAs in the Pfam dataset. C, clustering; Z, Z-score from sequence-based permutation.
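The Z-score rescaling referred to in this caption can be illustrated with a small permutation sketch. It is one plausible reading, not the paper's exact procedure: a single column is shuffled across sequences to build a null distribution of MI values, and the observed MI is expressed in standard deviations above the null mean. The names `permutation_z_score`, `n_shuffles` and `seed` are hypothetical, and plain MI stands in for whichever corrected score is actually being rescaled.

```python
import math
import random
import statistics
from collections import Counter


def mutual_information(col_a, col_b):
    """Plain MI (bits) between two equal-length columns."""
    n = len(col_a)
    pair_counts = Counter(zip(col_a, col_b))
    counts_a = Counter(col_a)
    counts_b = Counter(col_b)
    return sum((n_ab / n) * math.log2((n_ab * n) / (counts_a[a] * counts_b[b]))
               for (a, b), n_ab in pair_counts.items())


def permutation_z_score(col_a, col_b, n_shuffles=200, seed=0):
    """Z-score of the observed MI against MI values from shuffled columns."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = mutual_information(col_a, col_b)
    shuffled = list(col_b)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks the pairing, keeps column composition
        null.append(mutual_information(col_a, shuffled))
    spread = statistics.pstdev(null)
    if spread == 0:
        return 0.0  # no variation in the null, so the score is undefined
    return (observed - statistics.mean(null)) / spread
```

Because the score is expressed relative to a null distribution built from the same alignment, values from different protein families end up on a comparable scale.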

**Fig. 2.**
Average AUC values as a function of the number (#) of clusters or sequences in the MSA from the Pfam dataset. For clusters, the # refers to the definition by the Hobohm 1 algorithm using 62% sequence identity. For sequences, the # refers to individual sequences from protein families defined in the Pfam database. Performance values are calculated using sequence-based Z-score permutations.
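The clustering mentioned in this caption can be sketched as a greedy, single-pass procedure in the spirit of Hobohm 1: each sequence joins the first cluster whose representative it matches at or above the identity threshold, and otherwise starts a new cluster. Several details here are assumptions made for illustration: identity is computed over aligned positions where neither sequence has a gap, sequences are processed in input order, and each sequence is weighted by the inverse of its cluster size. The function names are hypothetical.

```python
def fraction_identical(seq_a, seq_b):
    """Fraction of aligned, gap-free positions carrying the same residue."""
    compared = same = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        same += a == b
    return same / compared if compared else 0.0


def hobohm1_clusters(aligned, threshold=0.62):
    """Greedy clustering: the first member of each cluster is its representative."""
    clusters = []
    for idx, seq in enumerate(aligned):
        for members in clusters:
            representative = aligned[members[0]]
            if fraction_identical(seq, representative) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])  # no representative was close enough
    return clusters


def cluster_weights(aligned, threshold=0.62):
    """Weight each sequence by 1 / (size of its cluster)."""
    weights = [0.0] * len(aligned)
    for members in hobohm1_clusters(aligned, threshold):
        for idx in members:
            weights[idx] = 1.0 / len(members)
    return weights
```

With this weighting, the weights sum to the number of clusters, so the cluster count acts as the effective number of independent sequences in the alignment.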

**Fig. 3.**
(A) Ribbon representation of 2TRX. Represented as gold spheres, 14 of the 20 highest Z-score transformed MI scoring pairs of residues are shown. Orange spheres: catalytic C32 and C35. Molecular graphic images were generated using the UCSF Chimera package (University of California; Meng *et al.*, 2006). (B) Schematic representation of the 14 pairs of interactions scoring highest in Z-score transformed MI values. Red lines are high MI scoring residue pairs. Full lines denote physical contact (Cβ distance <8 Å).

Cited by

Structural and functional roles of coevolved sites in proteins.
Chakrabarti S, Panchenko AR. PLoS One. 2010 Jan 6;5(1):e8591. doi: 10.1371/journal.pone.0008591. PMID: 20066038. Free PMC article.
MitImpact 3: modeling the residue interaction network of the Respiratory Chain subunits.
Castellana S, Biagini T, Petrizzelli F, Parca L, Panzironi N, Caputo V, Vescovi AL, Carella M, Mazza T. Nucleic Acids Res. 2021 Jan 8;49(D1):D1282-D1288. doi: 10.1093/nar/gkaa1032. PMID: 33300029. Free PMC article.
I-COMS: Interprotein-COrrelated Mutations Server.
Iserte J, Simonetti FL, Zea DJ, Teppa E, Marino-Buslje C. Nucleic Acids Res. 2015 Jul 1;43(W1):W320-5. doi: 10.1093/nar/gkv572. Epub 2015 Jun 1. PMID: 26032772. Free PMC article.
Chasing coevolutionary signals in intrinsically disordered proteins complexes.
Iserte JA, Lazar T, Tosatto SCE, Tompa P, Marino-Buslje C. Sci Rep. 2020 Oct 21;10(1):17962. doi: 10.1038/s41598-020-74791-6. PMID: 33087759. Free PMC article.
RNAconTest: comparing tools for noncoding RNA multiple sequence alignment based on structural consistency.
Wright ES. RNA. 2020 May;26(5):531-540. doi: 10.1261/rna.073015.119. Epub 2020 Jan 31. PMID: 32005745. Free PMC article.

Grants and funding

HHSN266200400025C/PHS HHS/United States
