Ann Appl Stat. 2021 Jun;15(2):902-924.
doi: 10.1214/20-aoas1431. Epub 2021 Jul 12.

LARGE-SCALE MULTIPLE INFERENCE OF COLLECTIVE DEPENDENCE WITH APPLICATIONS TO PROTEIN FUNCTION

Robert Jernigan et al. Ann Appl Stat. 2021 Jun.

Abstract

Measuring the dependence of k ≥ 3 random variables and drawing inference from such higher-order dependences are scientifically important yet challenging. Motivated here by protein coevolution with multivariate categorical features, we consider an information theoretic measure of higher-order dependence. The proposed collective dependence is a symmetrization of differential interaction information, which generalizes the mutual information of a pair of random variables. We show that the collective dependence can be easily estimated and facilitates a test on the dependence of k ≥ 3 random variables. Upon carefully exploring the null space of collective dependence, we devise a Classification-Assisted Large scaLe inference procedure to DEtect significant k-COllective DEpendence (CALL-DECODE) among d ≥ k random variables, with the false discovery rate controlled. Finite sample performance of our method is examined via simulations. We apply this method to protein multiple sequence alignment (MSA) data to study residue (position) coevolution in two protein families: the elongation factor P (EF-P) family and the zinc knuckle family. We identify novel functional triplets of amino acid residues, whose contributions to protein function are further investigated. These results confirm that collective dependence yields additional information, beyond pairwise measures, that is important for understanding protein coevolution.
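
To make the idea of higher-order dependence concrete, the following is a minimal Python sketch of plug-in (empirical frequency) estimates of mutual information and of the classical three-way interaction information for categorical variables. It is not the paper's collective dependence measure, whose exact definition the abstract does not give; the function names, the sign convention I(X;Y|Z) - I(X;Y), and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def entropy(*columns):
    """Plug-in joint Shannon entropy (in nats) of one or more categorical columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))  # empirical counts of each joint category
    return -sum((c / n) * log(c / n) for c in counts.values())


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def interaction_information(x, y, z):
    """Plug-in estimate of I(X;Y|Z) - I(X;Y) (one common sign convention).

    Expanded in entropies:
    H(XY) + H(XZ) + H(YZ) - H(X) - H(Y) - H(Z) - H(XYZ).
    """
    pairwise = entropy(x, y) + entropy(x, z) + entropy(y, z)
    single = entropy(x) + entropy(y) + entropy(z)
    return pairwise - single - entropy(x, y, z)


# Toy data (hypothetical): z is the XOR of x and y, so every pair of
# variables is independent while the triple is jointly dependent.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]

print("I(X;Y) =", mutual_information(x, y))
print("I(X;Z) =", mutual_information(x, z))
print("three-way interaction =", interaction_information(x, y, z))
```

In this toy example every pairwise mutual information is zero up to rounding, while the three-way term equals log 2. This is the kind of dependence that a purely pairwise measure cannot detect. For alignment data, each argument would be one alignment column of residue symbols rather than a list of bits.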

Keywords: Collective dependence; false discovery rate; information theoretic measure; multiple testing; protein coevolution; structural biology.

Figures

Fig. 1.
The empirical powers for detecting triplets with significant third order collective dependence by the proposed method under the TCD setting for models C1 (standard normal) and C2 (standardized exponential), covariances S1 and S2 and TK with K = 3, 4. The nominal FDR is 0.2, and results are based on 100 repetitions.
Fig. 2.
The empirical powers for detecting triplets with significant third order collective dependence by the proposed method under the MD setting for Designs I and II with random and fixed r (D1r and D1u; D2r and D2u, respectively), where M = 2, 5, 10. The nominal FDR is 0.2, and results are based on 100 repetitions.
Fig. 3.
A snapshot of the (raw) MSA data for analysis.
Fig. 4.
(a) The structure of EF-P (from Thermus thermophilus). Three domains are colored separately. (b) The coverage of coevolved positions on the MSA identified as significant by our method and MI. The x-axis is the position index of the MSA, and the y-axis is its appearance frequency based on the corresponding method. A position is included if it has been identified as significant by either method. Hubs are in red.
Fig. 5.
(a) The hub residues of EF-P C-terminal Domain. Residues in red are commonly identified by both CALL-DECODE and MI. Residues uniquely selected by CALL-DECODE are in purple. The mapping between residue index and position index is displayed in the legend (generated from PfamScan). The position index refers to the column number in the MSA. The residue index is the unique identifier (residue name plus residue number) for a residue on the protein structure defined in the PDB file. (b) The structures of EF-P monomer and tRNA in the surface representation. Domain I, II, III in EF-P are in red, green and blue, respectively.
Fig. 6.
The coverage of coevolved positions on the MSA identified as significant by our method and MI. The x-axis is the position index of the MSA, and the y-axis is its appearance frequency based on the corresponding method. A position is included if it has been identified as significant by either method. Hubs are in red.
Fig. 7.
(a) The interaction between HIV-1 NC protein and RNA. Hub residues identified by collective dependence are in red, while the interacting nucleotides are in purple. The mapping between residue index and position index is displayed in the legend (generated from PfamScan). (b) The triplet residue contact formed by Asn27, Gln9 and Phe6 on the HIV-1 NC protein (in purple).
