Review
2022 Dec 16;50(6):1847-1858.
doi: 10.1042/BST20220849.

General strategies for using amino acid sequence data to guide biochemical investigation of protein function

Emily N Kennedy et al. Biochem Soc Trans.

Abstract

The rapid increase of '-omics' data warrants the reconsideration of experimental strategies to investigate general protein function. Studying individual members of a protein family is likely insufficient to provide a complete mechanistic understanding of family functions, especially for diverse families with thousands of known members. Strategies that exploit large amounts of available amino acid sequence data can inspire and guide biochemical experiments, generating broadly applicable insights into a given family. Here we review several methods that utilize abundant sequence data to focus experimental efforts and identify features truly representative of a protein family or domain. First, coevolutionary relationships between residues within primary sequences can be successfully exploited to identify structurally and/or functionally important positions for experimental investigation. Second, functionally important variable residue positions typically occupy a limited sequence space, a property useful for guiding biochemical characterization of the effects of the most physiologically and evolutionarily relevant amino acids. Third, amino acid sequence variation within domains shared between different protein families can be used to sort a particular domain into multiple subtypes, inspiring further experimental designs. Although generally applicable to any kind of protein domain because they depend solely on amino acid sequences, the second and third approaches are reviewed in detail because they appear to have been used infrequently and offer immediate opportunities for new advances. Finally, we speculate that future technologies capable of analyzing and manipulating conserved and variable aspects of the three-dimensional structures of a protein family could lead to broad insights not attainable by current methods.
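The second approach rests on the observation that variable positions of functional interest tend to use only a small part of the possible sequence space. A minimal sketch of how such a tally might be made is shown below, assuming a pre-computed multiple sequence alignment held as equal-length strings. The function name `combination_frequencies`, the toy alignment, and the column indices are hypothetical and chosen only for illustration; they are not the authors' pipeline or data.

```python
from collections import Counter


def combination_frequencies(aligned_seqs, columns):
    """Rank residue combinations found at the given 0-based alignment columns.

    Sequences with a gap ('-') at any selected column are skipped.
    Returns (combination, fraction) pairs, most abundant first.
    """
    counts = Counter()
    for seq in aligned_seqs:
        combo = "".join(seq[c] for c in columns)
        if "-" in combo:
            continue  # position missing in this sequence
        counts[combo] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(combo, n / total) for combo, n in counts.most_common()]


# Hypothetical toy alignment (not real receiver domain sequences).
toy_alignment = [
    "AMDAKTPF",
    "AMDARTPF",
    "AMDAKTPF",
    "ANDAETPF",
    "A-DAKTPF",
]
# Hypothetical indices of the variable positions of interest.
variable_columns = [1, 4, 6, 7]

ranked = combination_frequencies(toy_alignment, variable_columns)
for combo, fraction in ranked:
    print(f"{combo}\t{fraction:.2f}")
```

In practice, the highest-ranked combinations from a real family alignment would be the natural candidates to characterize first, since they represent the largest share of family members.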

Keywords: SimpLogo; amino acid sequences; coevolution; covariation; protein domains.

Figures

Figure 1. Receiver domain structure
The five conserved residues that catalyze receiver domain autophosphorylation and autodephosphorylation reactions are shown in green. D is the site of phosphorylation. DD coordinate (orange dashed lines) the divalent metal ion, shown in yellow. K, T, and the metal ion each bind (red dashed lines) one of the phosphoryl group oxygen atoms. A stable, non-covalently bound BeF₃ mimic of the PO₃²⁻ phosphoryl group is shown in cyan and yellow. Five variable residues known to affect reaction kinetics are shown in blue, with positions named in relation to the conserved residues. The black arrow indicates the required path of attack by the phosphodonor or water molecule, in line with the P–O bond to be formed or broken, respectively [100]. Based on the 1FQW structure of E. coli CheY [101].
Figure 2. CheY autophosphorylation and autodephosphorylation rate constants as a function of substitution position
Rate constants of E. coli CheY substitution mutants are plotted for autophosphorylation with phosphoramidate (kphos/KS) versus autodephosphorylation with water (kdephos). Note the logarithmic scales on both axes. Red square is wild-type CheY, with NAEPF (single letter amino acid codes, N- to C-terminal) composition for the five variable residues in Figure 1. Intersection of dashed lines indicates rate constants supported by the most abundant (~11%) combinations of five variable residues (MAKPF, MARPF, shown in Panel B) in prokaryotic receiver domains spliced onto the CheY backbone. (A) Substitutions at T+1 (aqua triangles). (B) Substitutions at D+2 (black diamonds), T+2 (brown squares), or both (blue triangles). (C) Substitutions at K+1 (black circles), K+2 (blue diamonds), or both (green triangles). Data from [, –76].
Figure 3. Major architectures of proteins containing CheW-like domains
The Classes (designated by numbers from 1 to 6) of CheW-like domains are shown as a function of Architecture and Context as defined in the text. Approximately 95% of CheW-like domains occur in 16 Architectures from three protein lineages [93]. Major Architectures are shown schematically from N- to C-terminus. CheW proteins contain only CheW-like domains. Based on the Cluster analysis described in the text, the single Context of CheW proteins consists of three Types, each of which ultimately belongs to a different Class, as indicated by the asterisk. CheV proteins contain CheW-like and Receiver domains. CheA proteins are the most architecturally diverse. The basic CheA architecture of Hpt, Dimer, CA, and CheW-like domains is most commonly supplemented with up to 10 N-terminal Hpt domains or an additional CheW-like domain. Relationships between Classes 1 to 6 of CheW-like domains and architectural Contexts are shown [93]. The three CheA architectures shown also often feature a C-terminal Receiver domain, indicated in brackets. Domains and Pfam designations are CheW-like (PF01584), Receiver (Response_reg, PF00072), Hpt (histidine phosphotransfer, PF01627), Dimer (H-kinase_dim, PF02895), and CA (catalytic & ATP binding, HATPase_c, PF02518).

References

    1. Andreeva A, Kulesha E, Gough J and Murzin AG (2020) The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res 48, D376–D382 10.1093/nar/gkz1064
    2. Sillitoe I, Bordin N, Dawson N, Waman VP, Ashford P, Scholes HM, Pang CSM, Woodridge L, Rauer C, Sen N, Abbasian M, Le Cornu S, Lam SD, Berka K, Varekova IH, Svobodova R, Lees J and Orengo CA (2021) CATH: increased structural coverage of functional space. Nucleic Acids Res 49, D266–D273 10.1093/nar/gkaa1079
    3. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD and Bateman A (2020) Pfam: The protein families database in 2021. Nucleic Acids Res 49, D412–D419 10.1093/nar/gkaa913
    4. Letunic I, Khedkar S and Bork P (2021) SMART: recent updates, new developments and status in 2020. Nucleic Acids Res 49, D458–D460 10.1093/nar/gkaa937
    5. Clifton BE, Kozome D and Laurino P (2022) Efficient exploration of sequence space by sequence-guided protein engineering and design. Biochemistry 10.1021/acs.biochem.1c00757