Protein Sci. 2006 Aug;15(8):1987-2001.
doi: 10.1110/ps.062286306.

An amino acid "transmembrane tendency" scale that approaches the theoretical limit to accuracy for prediction of transmembrane helices: relationship to biological hydrophobicity


Gang Zhao et al. Protein Sci. 2006 Aug.

Abstract

Hydrophobicity analyses applied to databases of soluble and transmembrane (TM) proteins of known structure were used to resolve total genomic hydrophobicity profiles into (helical) TM sequences and mainly "subhydrophobic" soluble components. This information was used to define a refined "hydrophobicity"-type TM sequence prediction scale that should approach the theoretical limit of accuracy. The refinement procedure involved adjusting scale values to eliminate differences between the average amino acid composition of populations of TM and soluble sequences of equal hydrophobicity, a required property of a scale having maximum accuracy. Application of this procedure to different hydrophobicity scales caused them to collapse to essentially a single TM tendency scale. As expected, when different scales were compared, the TM tendency scale was the most accurate at predicting TM sequences. It was especially highly correlated (r = 0.95) to the biological hydrophobicity scale, derived experimentally from the percent TM conformation formed by artificial sequences passing through the translocon. It was also found that resolution of total genomic sequence data into TM and soluble components could be used to define the percent probability that a sequence with a specific hydrophobicity value forms a TM segment. Application of the TM tendency scale to whole genomic data revealed an overlap of TM and soluble sequences in the "semihydrophobic" range. This raises the possibility that a significant number of proteins have sequences that can switch between TM and non-TM states. Such proteins may exist in moonlighting forms having properties very different from those of the predominant conformation.
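The percent TM probability described in the abstract can be illustrated with a minimal sketch. It assumes the probability in each hydrophobicity bin is the normalized TM abundance divided by the sum of the normalized TM and soluble abundances. The function name and the abundance values are hypothetical and are not taken from the paper.

```python
def percent_tm_probability(tm_abundance, soluble_abundance):
    """Return the percent TM probability for each hydrophobicity bin.

    Assumption: both inputs are already normalized to the genomic profile
    and share the same bins. Empty bins yield None.
    """
    result = []
    for tm, sol in zip(tm_abundance, soluble_abundance):
        total = tm + sol
        result.append(100.0 * tm / total if total > 0 else None)
    return result


# Hypothetical normalized abundances, ordered from low to high hydrophobicity.
tm = [0.0, 1.0, 3.0, 6.0]
soluble = [8.0, 3.0, 1.0, 0.0]
probabilities = percent_tm_probability(tm, soluble)
print(probabilities)  # [0.0, 25.0, 75.0, 100.0]
```

Bins where both populations are present correspond to the overlap region in which a sequence cannot be assigned with certainty from hydrophobicity alone.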


Figures

Figure 1.
Abundance of sequences as a function of hydrophobicity for the sum of the E. coli and S. cerevisiae genomes as estimated by the Kyte-Doolittle and TM tendency scales. The abundance of sequences within a given hydropathy/hydrophobicity range is shown for the combined E. coli and S. cerevisiae genomes (triangles) as assessed by the Kyte-Doolittle scale (A) or the TM tendency scale (B). The abundances at each hydrophobicity value correspond to an average calculated from the sum of abundance values for E. coli and S. cerevisiae divided by two. A sliding window [N] size of 19 and a merge factor of 4 were used. Each point represents sequences with a hydrophobicity value within the range that is ≤0.1 units greater than the hydrophobicity value shown on the X-axis. To resolve curves into soluble and TM sequences, hydrophobicity distributions were calculated for soluble proteins with known structures from a soluble protein database (squares) and for TM segments from a database (TMPDB) (Ikeda et al. 2003) cataloging TM proteins (circles). The abundance values for the soluble and TM sequence databases were normalized in terms of height to fit the genomic data (see text for details). The sum of the normalized soluble and TM segment components is shown as a solid line. If not noted otherwise, in this and later calculations signal sequences predicted by the Phobius program (Kall et al. 2004) were removed from the genomic sequences. (Inset) Dependence of the percent probability that a sequence is TM upon KD hydropathy as determined from the ratio of sequence abundance in the normalized soluble database to that in the sum of the normalized soluble database and normalized TM database at each specific hydropathy value.
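The sliding-window profile behind Figure 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' program: it assumes each window of 19 residues is scored by its mean Kyte-Doolittle hydropathy, it bins scores in steps of 0.1 by their lower edge, and it does not implement the merge factor, which is not defined in this record. The function names are hypothetical.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(seq, scale=KD, window=19):
    """Mean scale value for every full window along the sequence."""
    if len(seq) < window:
        return []
    values = [scale[aa] for aa in seq]
    total = sum(values[:window])
    scores = [total / window]
    # Slide by one residue: add the entering value, drop the leaving one.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        scores.append(total / window)
    return scores


def abundance_histogram(scores, bin_width=0.1):
    """Count scores per bin, keyed by the lower edge of each bin."""
    # Rounding before floor guards against float noise at bin edges.
    counts = Counter(math.floor(round(s / bin_width, 9)) for s in scores)
    return {round(k * bin_width, 10): n for k, n in sorted(counts.items())}


# Hypothetical 20-residue sequence: 19 leucines followed by one aspartate.
scores = window_scores("L" * 19 + "D")
histogram = abundance_histogram(scores)
```

Summing such histograms over all proteins in a genome would give an abundance-versus-hydrophobicity curve of the kind plotted in the figure, before any resolution into soluble and TM components.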
Figure 2.
Correlation between various hydrophobicity scales. (A) Correlation between the CW and GH scales. GH scale values for I, L, and V residues (asterisks) were based on the CW scale by extrapolating from the linear correlation curve (dashed line). (B) Correlation between the EC and GH scales. (C) Correlation between the TM tendency and biological hydrophobicity scales. (D) Correlation between the refined EC and TM tendency scales. The positions of amino acid residues are designated by their one-letter codes. Correlation coefficients (r) are shown in each panel.
Figure 3.
Schematic illustration of the definition of prediction accuracy as judged from the degree of overlap between hydrophobicity of soluble and TM sequences. (A) Definition of populations used for estimation of prediction errors. Soluble (solid) and TMPDB (dashed) curves schematically represent the hydrophobicity profiles of sequences in the soluble protein and TMP databases, respectively, after normalization to genomic data. A1, the population of sequences falsely predicted to be TM, is the area to the right of the 50% TM probability line (vertical dotted line) and below the soluble protein profile (solid line). A2, the population of TM sequences that are not assigned as being TM (i.e., missing TM sequences), is the area to the left of the 50% TM probability line and below the TMP database profile (dashed line). A1 and A2 were used to calculate the misidentification levels for hydrophobicity scales (see legend of Table 1). (B) Effect of the number of soluble sequences (relative to the number of TM sequences) upon percent TM probability versus hydrophobicity. Notice that when the relative number of soluble sequences increases (from dash-dot-dash curve to solid curve), the hydrophobicity value at which there is a 50% probability of a sequence being a TM sequence (vertical dotted line) increases.
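The two error populations in panel A can be sketched as sums over binned profiles. This is a minimal illustration with several assumptions: the profiles are already normalized, they cross only once, and the 50% line is placed at the first bin where TM abundance is at least equal to soluble abundance. The function name and the data are hypothetical, and the conversion of A1 and A2 into a misidentification level (described in the Table 1 legend, which is not part of this record) is not attempted.

```python
def overlap_areas(hydro, soluble, tm):
    """Return (threshold, A1, A2) for binned soluble and TM profiles.

    threshold: first hydrophobicity bin where TM probability reaches 50%.
    A1: soluble abundance at or above the threshold (false TM predictions).
    A2: TM abundance below the threshold (missed TM sequences).
    """
    threshold_index = None
    for i, (s, t) in enumerate(zip(soluble, tm)):
        if s + t > 0 and t >= s:
            threshold_index = i
            break
    if threshold_index is None:
        raise ValueError("TM abundance never reaches 50% of the total")
    a1 = sum(soluble[threshold_index:])
    a2 = sum(tm[:threshold_index])
    return hydro[threshold_index], a1, a2


# Hypothetical binned abundances (not data from the paper).
hydro = [0.0, 0.5, 1.0, 1.5, 2.0]
soluble = [10, 6, 2, 1, 0]
tm = [0, 1, 2, 5, 8]

print(overlap_areas(hydro, soluble, tm))  # (1.0, 3, 1)

# Panel B: tripling the soluble population moves the 50% line to the right.
more_soluble = [3 * s for s in soluble]
print(overlap_areas(hydro, more_soluble, tm)[0])  # 1.5
```

The second call reproduces the trend noted for panel B: with relatively more soluble sequences, a higher hydrophobicity is needed before a sequence is more likely TM than soluble.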
Figure 4.
Comparison of the abundance of sequences as a function of “hydrophobicity” in various individual genomes as estimated by the GH and TM tendency scales. The total genomic abundance of sequences versus hydrophobicity is shown for the GH scale (open diamonds) and the TM tendency scale (closed squares) for Nanoarchaeum equitans, E. coli, Y. pestis, S. cerevisiae, Schizosaccharomyces pombe, and Drosophila melanogaster. A sliding window size of 19 and a merge factor of 4 were used. Each point represents sequences with a hydrophobicity value from equal to the value shown on the X-axis up to <0.05 units greater than it. Insets show expanded portions of selected graphs.
Figure 5.
Hydrophobicity distributions of sequences that are the most hydrophobic in a protein (maxH) or not most hydrophobic (non-maxH) for known soluble proteins and known TM sequences and comparison to whole genomic data. (A) Abundance of the most hydrophobic (maxH) and non-maxH sequences for known soluble proteins (squares) and for known TM sequences (circles) as a function of TM tendency for the sum of the E. coli and S. cerevisiae genomes divided by two. (Filled symbols) maxH sequences; (open symbols) non-maxH sequences. (B) Percent probability of a soluble (squares) or TM (triangles) sequence being the most hydrophobic sequence (maxH sequence) in a soluble protein or the most hydrophobic TM sequence in a TM protein, respectively, as a function of TM tendency. (C,D) Percent probability of a sequence in S. cerevisiae (C) or E. coli (D) being the most hydrophobic in a protein as a function of TM tendency (squares). Curves are for soluble sequences in soluble proteins (solid line) and TM sequences in TM proteins (dashed line) fit to genomic data by slight adjustment of the X-axis position of curves in panel B. (This adjusts for factors such as a difference in average molecular weight for proteins from databases of proteins with known structure versus proteins in a whole genome.)
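The maxH versus non-maxH split used in Figure 5 can be illustrated with a short sketch. It assumes each protein has already been reduced to a list of per-sequence scores on some scale; how the paper delimits those sequences (window size, merge factor) is not reproduced here. The function name, protein names, and scores are hypothetical.

```python
def split_max_h(scores):
    """Separate the single highest score (maxH) from the rest (non-maxH)."""
    if not scores:
        return None, []
    max_index = max(range(len(scores)), key=scores.__getitem__)
    non_max = [s for i, s in enumerate(scores) if i != max_index]
    return scores[max_index], non_max


# Hypothetical per-protein scores (not data from the paper).
proteins = {
    "protA": [0.2, 1.4, 0.9],
    "protB": [-0.5, 0.1],
}

max_h = []
non_max_h = []
for name, scores in proteins.items():
    top, rest = split_max_h(scores)
    max_h.append(top)
    non_max_h.extend(rest)

print(max_h)      # [1.4, 0.1]
print(non_max_h)  # [0.2, 0.9, -0.5]
```

Binning the two pooled lists separately would give the filled-symbol and open-symbol distributions described for panel A.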


