Protein Sci. 2006 Jul;15(7):1723-34.
doi: 10.1110/ps.062109706.

A limited universe of membrane protein families and folds

Amit Oberai et al. Protein Sci. 2006 Jul.

Abstract

One of the goals of structural genomics is to obtain a structural representative of almost every fold in nature. A recent estimate suggests that 70%-80% of soluble protein domains identified in the first 1000 genome sequences should be covered by about 25,000 structures, a reasonably achievable goal. Because no current estimates exist for the number of membrane protein families, however, it is not possible to know whether family coverage is a realistic goal for membrane proteins. Here we find that virtually all polytopic helical membrane protein families are present in the already known sequences, so we can estimate the total number of families. We find that only approximately 700 polytopic membrane protein families account for 80% of structured residues and approximately 1700 cover 90% of structured residues. Although this is apparently a finite and reachable goal, we estimate that it will likely take more than three decades to obtain the structures needed for 90% residue coverage if current trends continue.

Figures

Figure 1.
Distribution of family sizes. The average size of blocks of 20 families is shown, ordered from the largest to the smallest, for the first 500 families. There are a few very large families and many very small families.
Figure 2.
The family boundaries mapped onto some known structures. The regions of the structures that are included in the families are in red and the regions not included in the families are in green.
(A) Fumarate reductase (PDB code 1l0v) comprises four polypeptide chains (Iverson et al. 1999, 2002): two chains (FrdA and FrdB) correspond to soluble domains and two chains (FrdC and FrdD) are membrane-spanning with three transmembrane helices each. In the crystal structure, two fumarate reductase complexes are related by twofold symmetry, with a total of 12 TM helices. However, there is no evidence that a dimer is physiologically relevant. The membrane-bound region of chain C contains residues 26–127 and that of chain D contains residues 16–116. Our family region covers residues 8–129 of chain C and residues 1–119 of chain D, including all TM regions with a few extra residues at both the N and C termini.
(B) MsbA (PDB code 1pf4) is an ATP binding cassette transporter. The protein is a homodimer of two identical chains, each containing six TM helices and a soluble ATP binding domain at the C terminus, extending from residues 316 to 582 (Chang and Roth 2001; Chang 2003). The structure-defined TM boundary is between residues 31 and 293, while the family region is between residues 16 and 331, which is longer by 15 residues at the N terminus and by 38 residues at the C terminus.
(C) MscS (PDB code 1mxm) is a heptamer of identical subunits, each with three TM helices and a cytoplasmic domain, extending from residues 113 to 280 (Bass et al. 2002). The structure-defined TM helices range between residues 28 and 103. The family region extends from residues 24 to 167, which goes beyond our definition of the membrane by four residues at the N terminus and 64 residues at the C terminus.
(D) The MthK (PDB code 1lnq) calcium-gated potassium channel is a tetramer (Jiang et al. 2002a,b). Each monomer comprises a membrane-spanning pore domain and an intracellular C-terminal RCK domain. The pore domain contains two transmembrane helices. The structure-defined transmembrane region ranges from residues 19 to 98, while the family region contains residues 26–105.
(E) The Kv1.2 (PDB code 2a79) shaker voltage-gated potassium channel (Long et al. 2005). The structure for each subunit of the tetramer consists of Kv1.2 containing a transmembranous ion-conducting pore, a voltage sensor, a soluble T1 domain, and an associated soluble β-subunit. The Kv1.2 channel contains six transmembrane helices (S1–S6) that are connected to the soluble T1 domain by the T1–S1 linker. The T1–S1 linker, S1 helix, and S3 helix were modeled as 33, 19, and 21 amino acid poly-alanine sequences, respectively. The structure-defined TM boundary is 165–402. The family boundaries are residues 150–411. Thus, we add 15 residues at the N terminus and five residues at the C terminus.
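The comparison between structure-defined TM boundaries and family boundaries in this caption is simple residue arithmetic, which can be sketched as below. This is only an illustration: the function name `boundary_extension` and the tuple representation are assumptions made here, not part of the paper's method. The residue numbers are the ones quoted for panels B and C.

```python
def boundary_extension(structure_tm, family_region):
    """Residues the family region adds beyond the structure-defined TM span.

    Both arguments are (first_residue, last_residue) tuples.
    Returns (extra residues at the N terminus, extra residues at the C terminus).
    """
    tm_start, tm_end = structure_tm
    fam_start, fam_end = family_region
    return tm_start - fam_start, fam_end - tm_end


# Residue ranges quoted in the caption for panels B and C.
print(boundary_extension((31, 293), (16, 331)))  # MsbA -> (15, 38)
print(boundary_extension((28, 103), (24, 167)))  # MscS -> (4, 64)
```

A negative value would mean that the family region starts or ends inside the structure-defined span, as at the N terminus of MthK in panel D (residue 26 versus residue 19).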
Figure 3.
Coverage of structured membrane protein sequence space by the largest families. A few of the largest families account for the majority of membrane protein residues.
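The kind of coverage calculation summarized in this figure can be sketched as follows. This is a minimal illustration, assuming each family is reduced to a single count of structured residues; the function name `families_needed` and the toy counts are hypothetical and are not data from the paper.

```python
from itertools import accumulate


def families_needed(residue_counts, target):
    """Return how many of the largest families are needed to reach
    `target` (a fraction between 0 and 1) of all residues."""
    ordered = sorted(residue_counts, reverse=True)  # largest families first
    total = sum(ordered)
    for n, running in enumerate(accumulate(ordered), start=1):
        if running >= target * total:
            return n
    return len(ordered)


# Toy residue counts (hypothetical): a few large families, many small ones.
toy_counts = [500, 300, 100, 50, 20, 10, 10, 5, 3, 2]
print(families_needed(toy_counts, 0.75))
print(families_needed(toy_counts, 0.85))
```

Sorting from largest to smallest mirrors the ordering used in the figures, where a few very large families dominate the cumulative total.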
Figure 4.
Growth in membrane protein families as the number of sequences increases. The number of membrane protein families required to cover 80% and 90% of structured sequence space. The number of families grows for 80% and 90% sequence space coverage as the sequence database size increases from 13,467 to 28,456 sequences. A plateau is evident beyond 28,456 sequences, for both 80% and 90% structured sequence space coverage, indicating that the vast majority of membrane protein family space has been sampled.
Figure 5.
Growth in soluble protein families as the number of sequences increases. Here we analyzed how the number of PfamB families increases with the growth of the PfamB database. PfamB families of soluble proteins show much greater diversity than families based on membrane proteins. Unlike for membrane protein families, there is no significant decline in the slope as the size of the sequence database increases. The number of families with at least two members is shown for PfamB v6.0, v7.0, and v17.0.
Figure 6.
The growth of membrane protein structural coverage. (A) The number of structures determined each year since 1985. The curve is a fit to an exponential equation as described in the text. (B) Ideal (solid line) and weighted-random (dotted line) estimation for coverage of sequence space in families by representatives of known structures, as a function of year, based on the observed exponential growth of structure determination seen in panel A.
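The exponential equation used for panel A is given in the full text and is not reproduced here. As a rough sketch of how such a fit might be done, the code below fits a generic form, count ≈ a · exp(b · (year − first year)), by a straight-line least-squares fit to the logarithm of the counts. The functional form, the function name `fit_exponential`, and the synthetic data are all assumptions for illustration; none of them come from the paper.

```python
import math


def fit_exponential(years, counts):
    """Fit counts ~ a * exp(b * (year - years[0])) by ordinary least squares
    on log(counts). All counts must be positive."""
    t = [y - years[0] for y in years]
    logs = [math.log(c) for c in counts]
    n = len(t)
    mean_t = sum(t) / n
    mean_l = sum(logs) / n
    sxx = sum((ti - mean_t) ** 2 for ti in t)
    sxy = sum((ti - mean_t) * (li - mean_l) for ti, li in zip(t, logs))
    b = sxy / sxx                      # growth rate per year
    a = math.exp(mean_l - b * mean_t)  # value at the first year
    return a, b


# Synthetic data (not from the paper), generated with a = 2.0 and b = 0.2.
years = list(range(1985, 2006))
counts = [2.0 * math.exp(0.2 * (y - 1985)) for y in years]
a, b = fit_exponential(years, counts)
print(round(a, 3), round(b, 3))
```

Fitting in log space is a simple choice that weights early, small counts as heavily as later ones; a nonlinear fit on the raw counts could give somewhat different parameters.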
Figure 7.
A schematic representation of the overall process in building polytopic membrane protein families.

References

    Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
    Bass R.B., Strop P., Barclay M., Rees D.C. 2002. Crystal structure of Escherichia coli MscS, a voltage-modulated and mechanosensitive channel. Science 298: 1582–1587.
    Bateman A., Coin L., Durbin R., Finn R.D., Hollich V., Griffiths-Jones S., Khanna A., Marshall M., Moxon S., Sonnhammer E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138–D141.
    Berry E.A., Guergova-Kuras M., Huang L.S., Crofts A.R. 2000. Structure and function of cytochrome bc complexes. Annu. Rev. Biochem. 69: 1005–1075.
    Bowie J.U. 1997a. Helix packing angle preferences. Nat. Struct. Biol. 4: 915–917.
