Ann Appl Stat. 2021 Jun;15(2):902-924.

doi: 10.1214/20-aoas1431. Epub 2021 Jul 12.

LARGE-SCALE MULTIPLE INFERENCE OF COLLECTIVE DEPENDENCE WITH APPLICATIONS TO PROTEIN FUNCTION

Robert Jernigan¹, Kejue Jia¹, Zhao Ren², Wen Zhou³

Affiliations

¹ Department of Biochemistry, Biophysics, and Molecular Biology, Program of Bioinformatics and Computational Biology, Iowa State University.
² Department of Statistics, University of Pittsburgh.
³ Department of Statistics, Colorado State University.

PMID: 35910493
PMCID: PMC9337751
DOI: 10.1214/20-aoas1431

Robert Jernigan et al. Ann Appl Stat. 2021 Jun.

Abstract

Measuring the dependence among k ≥ 3 random variables and drawing inference from such higher-order dependence are scientifically important yet challenging. Motivated by protein coevolution with multivariate categorical features, we consider an information-theoretic measure of higher-order dependence. The proposed collective dependence is a symmetrization of differential interaction information, which generalizes the mutual information of a pair of random variables. We show that the collective dependence can be easily estimated and facilitates a test of the dependence among k ≥ 3 random variables. After carefully exploring the null space of collective dependence, we devise CALL-DECODE, a Classification-Assisted Large scaLe inference procedure to DEtect significant k-COllective DEpendence among d ≥ k random variables with the false discovery rate (FDR) controlled. The finite-sample performance of our method is examined via simulations. We apply the method to protein multiple sequence alignment (MSA) data to study residue (position) coevolution in two protein families, the elongation factor P (EF-P) family and the zinc knuckle family. We identify novel functional triplets of amino acid residues, whose contributions to protein function are further investigated. These findings confirm that collective dependence yields additional information, beyond pairwise measures, that is important for understanding protein coevolution.

Keywords: Collective dependence; false discovery rate; information theoretic measure; multiple testing; protein coevolution; structural biology.
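
As a rough illustration of why a third-order measure can reveal dependence that pairwise mutual information misses, the sketch below computes plug-in (empirical frequency) estimates of mutual information and of the standard three-way interaction information for categorical columns. It is not the authors' collective dependence, whose exact definition is not given in the abstract; the function names, the sign convention, and the toy XOR data are assumptions made for illustration only.

```python
from collections import Counter
from math import log


def entropy(columns):
    """Plug-in Shannon entropy (in nats) of the joint distribution of the given columns."""
    rows = list(zip(*columns))  # one tuple per observation
    n = len(rows)
    counts = Counter(rows)
    return -sum((c / n) * log(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy([x]) + entropy([y]) - entropy([x, y])


def interaction_information(x, y, z):
    """Three-way interaction information, I(X;Y) - I(X;Y|Z).

    Written as an inclusion-exclusion sum of entropies. The opposite sign
    convention also appears in the literature.
    """
    return (entropy([x]) + entropy([y]) + entropy([z])
            - entropy([x, y]) - entropy([x, z]) - entropy([y, z])
            + entropy([x, y, z]))


# Toy data (hypothetical): z is the XOR of x and y, so every pair of columns
# is independent while the triplet is fully dependent.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]

pairwise = [mutual_information(x, y),
            mutual_information(x, z),
            mutual_information(y, z)]
triple = interaction_information(x, y, z)
print(pairwise, triple)
```

In this toy example all three pairwise mutual information values are zero (up to floating-point error), while the interaction information equals -log 2, so the dependence is visible only at the triplet level. In an MSA setting the columns would hold amino acid symbols at three alignment positions rather than bits.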

Figures

**Fig. 1.**
The empirical powers for detecting triplets with significant third order collective dependence by the proposed method under the TCD setting for models C1 (standard normal) and C2 (standardized exponential), covariances S1 and S2 and T_K with K = 3, 4. The nominal FDR is 0.2, and results are based on 100 repetitions.

**Fig. 2.**
The empirical powers for detecting triplets with significant third order collective dependence by the proposed method under the MD setting for Designs I and II with random and fixed r (D1_r and D1_u; D2_r and D2_u, resp.), where M = 2, 5, 10. The nominal FDR is 0.2, and results are based on 100 repetitions.

**Fig. 3.**
A snapshot of the (raw) MSA data for analysis.

**Fig. 4.**
(a) The structure of EF-P (from Thermus thermophilus). Three domains are colored separately. (b) The coverage of coevolved positions on the MSA identified as significant by our method and MI. The x-axis is the position index of the MSA, and the y-axis is its appearance frequency based on the corresponding method. A position is included if it has been identified as significant by either method. Hubs are in red.

**Fig. 5.**
(a) The hub residues of the EF-P C-terminal domain. Residues in red are identified by both CALL-DECODE and MI. Residues uniquely selected by CALL-DECODE are in purple. The mapping between residue index and position index is displayed in the legend (generated from PfamScan). The position index refers to the column number in the MSA. The residue index is the unique identifier (residue name plus residue number) for a residue on the protein structure defined in the PDB file. (b) The structures of the EF-P monomer and tRNA in the surface representation. Domains I, II and III of EF-P are in red, green and blue, respectively.

**Fig. 6.**
The coverage of coevolved positions on the MSA identified as significant by our method and MI. The x-axis is the position index of the MSA, and the y-axis is its appearance frequency based on the corresponding method. A position is included if it has been identified as significant by either method. Hubs are in red.

**Fig. 7.**
(a) The interaction between HIV-1 NC protein and RNA. Hub residues identified by collective dependence are in red, while the interacting nucleotides are in purple. The mapping between residue index and position index is displayed in the legend (generated from PfamScan). (b) The triplet residue contact formed by Asn27, Gln9 and Phe6 on the HIV-1 NC protein (in purple).

Cited by

General strategies for using amino acid sequence data to guide biochemical investigation of protein function.
Kennedy EN, Foster CA, Barr SA, Bourret RB. Biochem Soc Trans. 2022 Dec 16;50(6):1847-1858. doi: 10.1042/BST20220849. PMID: 36416676. Review.
