Protein Sci. 2006 Aug;15(8):1987-2001.

doi: 10.1110/ps.062286306.

An amino acid "transmembrane tendency" scale that approaches the theoretical limit to accuracy for prediction of transmembrane helices: relationship to biological hydrophobicity

Gang Zhao¹, Erwin London

PMID: 16877712
PMCID: PMC2242586
DOI: 10.1110/ps.062286306

Gang Zhao et al. Protein Sci. 2006 Aug.

Affiliation

¹ Department of Biochemistry and Cell Biology, Stony Brook University, New York 11794-5215, USA.

Abstract

Hydrophobicity analyses applied to databases of soluble and transmembrane (TM) proteins of known structure were used to resolve total genomic hydrophobicity profiles into (helical) TM sequences and mainly "subhydrophobic" soluble components. This information was used to define a refined "hydrophobicity"-type TM sequence prediction scale that should approach the theoretical limit of accuracy. The refinement procedure involved adjusting scale values to eliminate differences between the average amino acid composition of populations of TM and soluble sequences of equal hydrophobicity, a required property of a scale having maximum accuracy. Application of this procedure to different hydrophobicity scales caused them to collapse to essentially a single TM tendency scale. As expected, when different scales were compared, the TM tendency scale was the most accurate at predicting TM sequences. It was especially highly correlated (r = 0.95) to the biological hydrophobicity scale, derived experimentally from the percent TM conformation formed by artificial sequences passing through the translocon. It was also found that resolution of total genomic sequence data into TM and soluble components could be used to define the percent probability that a sequence with a specific hydrophobicity value forms a TM segment. Application of the TM tendency scale to whole genomic data revealed an overlap of TM and soluble sequences in the "semihydrophobic" range. This raises the possibility that a significant number of proteins have sequences that can switch between TM and non-TM states. Such proteins may exist in moonlighting forms having properties very different from those of the predominant conformation.

Figures

**Figure 1.**
Abundance of sequences as a function of hydrophobicity for the sum of the *E. coli* and *S. cerevisiae* genomes as estimated by the Kyte-Doolittle and TM tendency scales. The abundance of sequences within a given hydropathy/hydrophobicity range is shown for the combined *E. coli* and *S. cerevisiae* genomes (triangles) as assessed by the Kyte-Doolittle scale (A) or the TM tendency scale (B). The abundance at each hydrophobicity value is the average of the *E. coli* and *S. cerevisiae* abundance values (their sum divided by two). A sliding window size (N) of 19 and a merge factor of 4 were used. Each point represents sequences with a hydrophobicity value that is equal to, or up to 0.1 units greater than, the hydrophobicity value shown on the X-axis. To resolve curves into soluble and TM sequences, hydrophobicity distributions were calculated for soluble proteins with known structures from a soluble protein database (squares) and for TM segments from a database (TMPDB) (Ikeda et al. 2003) cataloging TM proteins (circles). The abundance values for the soluble and TM sequence databases were normalized in terms of height to fit the genomic data (see text for details). The sum of the normalized soluble and TM segment components is shown as a solid line. If not noted otherwise, in this and later calculations signal sequences predicted by the Phobius program (Kall et al. 2004) were removed from the genomic sequences. (*Inset*) Dependence of the percent probability that a sequence is TM upon KD hydropathy as determined from the ratio of sequence abundance in the normalized TM database to that in the sum of the normalized soluble database and normalized TM database at each specific hydropathy value.
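
The sliding-window calculation and the percent TM probability described in this legend can be illustrated with a minimal Python sketch. It makes several assumptions that the legend does not state: the window score is taken as the mean of the per-residue values, the Kyte-Doolittle values are the standard published ones, and the merge factor is not modeled at all. The function names `window_hydropathy` and `percent_tm_probability` are hypothetical and are not taken from the paper.

```python
# Standard Kyte-Doolittle hydropathy values (one-letter amino acid codes).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19, scale=KD):
    """Return the mean scale value of every full-length window in seq.

    Assumption: the window score is the mean of the residue values;
    the paper may use a different convention (for example a sum).
    """
    values = [scale[aa] for aa in seq.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    means = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


def percent_tm_probability(tm_abundance, soluble_abundance):
    """Percent probability that a sequence in one hydropathy bin is TM.

    Both abundances are assumed to be already normalized to the
    genomic data, as described in the legend.
    """
    total = tm_abundance + soluble_abundance
    if total == 0:
        raise ValueError("no sequences in this hydropathy bin")
    return 100.0 * tm_abundance / total
```

For instance, a bin holding one normalized TM sequence for every three normalized soluble sequences gives `percent_tm_probability(1.0, 3.0)`, which is 25.0.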

**Figure 2.**
Correlation between various hydrophobicity scales. (A) Correlation between the CW and GH scales. GH scale values for I, L, and V residues (asterisks) were based on the CW scale by extrapolating from the linear correlation curve (dashed line). (B) Correlation between the EC and GH scales. (C) Correlation between the TM tendency and biological hydrophobicity scales. (D) Correlation between the refined EC and TM tendency scales. The positions of amino acid residues are designated by their one-letter codes. Correlation coefficients (r) are shown in each panel.

**Figure 3.**
Schematic illustration of the definition of prediction accuracy as judged from the degree of overlap between hydrophobicity of soluble and TM sequences. (A) Definition of populations used for estimation of prediction errors. Soluble (solid) and TMPDB (dashed) curves schematically represent the hydrophobicity profiles of sequences in the soluble protein and TMP databases, respectively, after normalization to genomic data. A1, the population of sequences falsely predicted to be TM, is the area to the *right* of the 50% TM probability line (vertical dotted line) and *below* the soluble protein profile (solid line). A2, the population of TM sequences that are not assigned as being TM (i.e., missing TM sequences), is the area to the *left* of the 50% TM probability line and *below* the TMP database profile (dashed line). A1 and A2 were used to calculate the misidentification levels for hydrophobicity scales (see legend of Table 1). (B) Effect of the number of soluble sequences (relative to the number of TM sequences) upon percent TM probability versus hydrophobicity. Notice that when the relative number of soluble sequences increases (from dash-dot-dash curve to solid curve), the hydrophobicity value at which there is a 50% probability of a sequence being a TM sequence (vertical dotted line) increases.
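
The populations A1 and A2 defined in this legend can be sketched in Python for binned data. This is an illustration under stated assumptions, not the paper's procedure: the profiles are given as per-bin abundances already normalized to genomic data, the TM probability is assumed to rise with hydrophobicity so that the first bin reaching 50% marks the threshold, and the conversion of A1 and A2 into misidentification levels (described in the legend of Table 1) is not reproduced. The function names and the toy numbers are hypothetical.

```python
def fifty_percent_threshold(bins, soluble, tm):
    """Return the lowest bin value where TM probability reaches 50%.

    A bin reaches 50% when its TM abundance is at least its soluble
    abundance. Returns None if no non-empty bin qualifies.
    """
    for x, s, t in zip(bins, soluble, tm):
        if s + t > 0 and t >= s:
            return x
    return None


def misassigned_areas(bins, soluble, tm, threshold):
    """Return (A1, A2) for a given 50% threshold.

    A1: soluble abundance at or above the threshold (false TM calls).
    A2: TM abundance below the threshold (missed TM sequences).
    """
    a1 = sum(s for x, s in zip(bins, soluble) if x >= threshold)
    a2 = sum(t for x, t in zip(bins, tm) if x < threshold)
    return a1, a2


# Toy profiles (made-up numbers, for illustration only).
bins = [0, 1, 2, 3, 4]
soluble = [10, 8, 4, 1, 0]
tm = [0, 1, 2, 3, 4]

threshold = fifty_percent_threshold(bins, soluble, tm)
a1, a2 = misassigned_areas(bins, soluble, tm, threshold)

# Panel B: more soluble sequences push the 50% point to higher values.
shifted = fifty_percent_threshold(bins, [4 * s for s in soluble], tm)
```

With these toy profiles the threshold falls at bin 3, and quadrupling the soluble abundances moves it to bin 4, which mirrors the shift described for panel B.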

**Figure 4.**
Comparison of the abundance of sequences as a function of “hydrophobicity” in various individual genomes as estimated by the GH and TM tendency scales. The total genomic abundance of sequences versus hydrophobicity is shown for the GH scale (open diamonds) and the TM tendency scale (closed squares) for *Nanoarchaeum equitans*, *E. coli*, *Y. pestis*, *S. cerevisiae*, *Schizosaccharomyces pombe*, and *Drosophila melanogaster*. A sliding window size of 19 and a merge factor of 4 were used. Each point represents sequences with a hydrophobicity value that is equal to, or up to (but less than) 0.05 units greater than, the hydrophobicity value shown on the X-axis. Insets show expanded portions of selected graphs.

**Figure 5.**
Hydrophobicity distributions of sequences that are the most hydrophobic in a protein (maxH) or not the most hydrophobic (non-maxH) for known soluble proteins and known TM sequences, and comparison to whole genomic data. (A) Abundance of the most hydrophobic (maxH) and non-maxH sequences for known soluble proteins (squares) and for known TM sequences (circles) as a function of TM tendency for the sum of the *E. coli* and *S. cerevisiae* genomes divided by two. (Filled symbols) maxH sequences; (open symbols) non-maxH sequences. (B) Percent probability of a soluble (squares) or TM (triangles) sequence being the most hydrophobic sequence (maxH sequence) in a soluble protein or most hydrophobic TM sequence in a TM protein, respectively, as a function of TM tendency. (C,D) Percent probability of a sequence in *S. cerevisiae* (C) or *E. coli* (D) being the most hydrophobic in a protein as a function of TM tendency (squares). Curves are for soluble sequences in soluble proteins (solid line) and TM sequences in TM proteins (dashed line) fit to genomic data by slight adjustment of the X-axis position of curves in panel B. (This adjusts for factors such as a difference in average molecular weight for proteins from databases of proteins with known structure versus proteins in a whole genome.)

Cited by

Massively parallel interrogation of protein fragment secretability using SECRiFY reveals features influencing secretory system transit.
Boone M, Ramasamy P, Zuallaert J, Bouwmeester R, Van Moer B, Maddelein D, Turan D, Hulstaert N, Eeckhaut H, Vandermarliere E, Martens L, Degroeve S, De Neve W, Vranken W, Callewaert N. Nat Commun. 2021 Nov 5;12(1):6414. doi: 10.1038/s41467-021-26720-y. PMID: 34741024.
Towards an experimental classification system for membrane active peptides.
Brand GD, Ramada MHS, Genaro-Mattos TC, Bloch C Jr. Sci Rep. 2018 Jan 19;8(1):1194. doi: 10.1038/s41598-018-19566-w. PMID: 29352252.
The two transmembrane regions of Candida albicans Dfi1 contribute to its biogenesis.
Herwald SE, Zucchi PC, Tan S, Kumamoto CA. Biochem Biophys Res Commun. 2017 Jun 17;488(1):153-158. doi: 10.1016/j.bbrc.2017.04.158. Epub 2017 May 5. PMID: 28483525.
Comparison of amino acids physico-chemical properties and usage of late embryogenesis abundant proteins, hydrophilins and WHy domain.
Jaspard E, Hunault G. PLoS One. 2014 Oct 8;9(10):e109570. doi: 10.1371/journal.pone.0109570. eCollection 2014. PMID: 25296175.
Sequence-based features that are determinant for tail-anchored membrane protein sorting in eukaryotes.
Fry MY, Saladi SM, Cunha A, Clemons WM Jr. Traffic. 2021 Sep;22(9):306-318. doi: 10.1111/tra.12809. Epub 2021 Aug 3. PMID: 34288289.

References

1. Bechinger B. 1996. Towards membrane protein design: pH-sensitive topology of histidine-containing polypeptides. J. Mol. Biol. 263: 768–775.
2. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
3. Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V., Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F. et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453–1474.
4. Boyd D., Schierle C., Beckwith J. 1998. How many membrane proteins are there? Protein Sci. 7: 201–205.
5. Caputo G.A. and London E. 2003. Cumulative effects of amino acid substitutions and hydrophobic mismatch upon the transmembrane stability and conformation of hydrophobic α-helices. Biochemistry 42: 3275–3285.
