Proc Natl Acad Sci U S A. 2018 Nov 13;115(46):E10988-E10997.
doi: 10.1073/pnas.1808790115. Epub 2018 Oct 29.

The in silico human surfaceome

Damaris Bausch-Fluck et al. Proc Natl Acad Sci U S A.

Abstract

Cell-surface proteins are of great biomedical importance, as demonstrated by the fact that 66% of approved human drugs listed in the DrugBank database target a cell-surface protein. Despite this biomedical relevance, there has been no comprehensive assessment of the human surfaceome, and only a fraction of the predicted 5,000 human transmembrane proteins have been shown to be located at the plasma membrane. To enable analysis of the human surfaceome, we developed the surfaceome predictor SURFY, based on machine learning. As a training set, we used experimentally verified high-confidence cell-surface proteins from the Cell Surface Protein Atlas (CSPA) and trained a random forest classifier on 131 features per protein and, specifically, per topological domain. SURFY was used to predict a human surfaceome of 2,886 proteins with an accuracy of 93.5%, which shows excellent overlap with known cell-surface protein classes (i.e., receptors). In deposited mRNA data, we found that between 543 and 1,100 surfaceome genes were expressed in cancer cell lines and maximally 1,700 surfaceome genes were expressed in embryonic stem cells and derivative lines. Thus, the surfaceome diversity depends on cell type and appears to be more dynamic than the nonsurface proteome. To make the predicted surfaceome readily accessible to the research community, we provide visualization tools for intuitive interrogation (wlab.ethz.ch/surfaceome). The in silico surfaceome enables the filtering of data generated by multiomics screens and supports the elucidation of the surfaceome nanoscale organization.

Keywords: SURFY; cell surface protein; machine learning; multiomics; surfaceome.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Surfaceome definition and construction. (A) Visual representation of surfaceome definition. Proteins shown in red are regarded as surfaceome members; those in blue are not. (B) Compositions of nonsurface (negative) and surface (positive) training sets used for the machine-learning model. Subcellular locations of the nonsurface training set are labeled as follows: 1, endoplasmic reticulum; 2, endosome; 3, Golgi apparatus; 4, lysosome; 5, mitochondrion; 6, nucleus; 7, peroxisome; 8, cytosol; and 9, multiple locations. (C) Receiver operating characteristics for the full model derived from out-of-bag error estimates (red line) (Dataset S1, 11.6). Gray line indicates the performance of random guessing. The three SURFY score cutoffs at 1%, 5%, and 15% FPRs are indicated. (D) Distribution of the predicted scores for the training sets (Upper) and for the remaining α-helical TM proteins (Lower). The bars of the nonsurface training set are stacked on top of the bars of the surface training set. Score cutoffs for estimated 1%, 5%, and 15% FPRs are indicated at the bottom. The predicted score distributions for CD antigens in and outside the training set are highlighted in yellow. (E) Gini index scores (41) for the 10 most important features used in building the predictive random forest model used by SURFY. Scores are plotted as means ± SDs. Features used for calculating SURFY scores are highlighted in red. AUC, area under the curve; Avg., average; C-glyc., C-glycosylation; TMD, TM domain.
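The score cutoffs in panels C and D are tied to estimated false-positive rates (FPRs). The following is a minimal sketch of how a cutoff for a target FPR could be read off the scores of a negative (nonsurface) set. It assumes that higher scores mean "more surface-like" and that a protein is called surface when its score is at or above the cutoff. The function name and the toy scores are hypothetical and do not reproduce SURFY's actual procedure or data.

```python
def cutoff_at_fpr(negative_scores, target_fpr):
    """Lowest observed score t such that the fraction of negatives
    scoring >= t does not exceed target_fpr.

    Assumption (not from the source): a protein is predicted "surface"
    when its score is >= the cutoff.
    """
    n = len(negative_scores)
    if n == 0:
        raise ValueError("need at least one negative score")
    for t in sorted(set(negative_scores)):
        false_positives = sum(1 for s in negative_scores if s >= t)
        if false_positives / n <= target_fpr:
            return t
    # No observed score is strict enough: reject every negative.
    return float("inf")


# Toy negative-set scores (hypothetical): 0.00, 0.05, ..., 0.95
toy_negatives = [i / 20 for i in range(20)]
cutoffs = {fpr: cutoff_at_fpr(toy_negatives, fpr) for fpr in (0.01, 0.05, 0.15)}
```

With only 20 toy negatives, a 1% FPR cannot be reached by any observed score, so the function returns infinity for that target; a real negative set of realistic size would give a finite cutoff.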
Fig. 2.
Characterization of the predicted surfaceome. (A–C) Comparison of 2,886 surfaceome proteins in red, with 2,216 nonsurfaceome membrane proteins in blue. (A) Distributions of sequence features in noncytoplasmic domains of surfaceome (upper graphs) and nonsurfaceome membrane (lower graphs) proteins, calculated as frequency per 100 amino acids. A, Left shows the distribution of numbers of N-glycosylation sequence motifs (N-X-S/T) per 100 amino acids. Proteins with more than five motifs were excluded from this graph. A, Center shows the distribution of numbers of C-glycosylation sites per 100 amino acids predicted by using GlycoMine. Proteins with more than four predicted sites were excluded. A, Right shows the distribution of numbers of cysteine residues per 100 amino acids. Proteins with >10 cysteines were excluded. (B) Distribution of the number of α-helical TM domains per protein. The pie chart shows the proportion of GPCRs within the set of surface proteins with seven TM domains (red bar). (C) Distribution of the length of α-helical TM domains. (D) Classification of surface and nonsurface proteins into functional classes. *P < 10^-2. Functional classes are numbered as follows: 1, GPCRs; 2, receptor-type tyrosine kinases; 3, receptors of the Ig superfamily; 4, scavenger receptors; 5, other receptors; 6, channels; 7, solute carrier superfamily; 8, active transporters; 9, auxiliary transport proteins; 10, other transporters; 11, oxidoreductases; 12, transferases; 13, hydrolases; 14, lyases; 15, isomerases; 16, ligases; 17, structure/adhesion proteins; 18, ligand proteins; and 19, proteins of unknown function. (E) Overlap of proteins of the human surfaceome annotated in UniProt, predicted by YLoc (13), predicted by da Cunha et al. (17), and predicted by SURFY. (F) Protter image of human MEGF9. N-X-S/T motifs are marked in light blue, with the corresponding asparagine (N) in dark blue. CSC-identified peptides are marked in purple. (G) Half-life distributions of surfaceome proteins, transcription factors (TF), and all quantified proteins from Mathieson et al. (45). Misc., miscellaneous.
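The per-100-amino-acid frequencies in panel A can be illustrated with a short sketch. It counts N-X-S/T motifs and cysteine residues in a domain sequence and normalizes by length. It assumes X may be any residue, exactly as the caption writes the motif (many definitions of the N-glycosylation sequon additionally exclude proline at X; that filter is not applied here). The function names and the example sequences are hypothetical.

```python
import re

# Lookahead so that overlapping motifs (e.g. "NNTT") are all counted.
# Assumption: X is any residue, following the caption's "N-X-S/T".
NXST_MOTIF = re.compile(r"(?=N[A-Z][ST])")


def nxst_per_100(domain_sequence):
    """N-X-S/T motifs per 100 amino acids of the given sequence."""
    seq = domain_sequence.upper()
    if not seq:
        return 0.0
    count = sum(1 for _ in NXST_MOTIF.finditer(seq))
    return 100.0 * count / len(seq)


def cysteines_per_100(domain_sequence):
    """Cysteine residues per 100 amino acids of the given sequence."""
    seq = domain_sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * seq.count("C") / len(seq)


# Hypothetical toy sequences, not real protein domains.
example_motif_density = nxst_per_100("NGSANNTT")
example_cys_density = cysteines_per_100("ACCA")
```

In the toy sequence "NGSANNTT" the motif starts at positions 1, 5, and 6 (1-based), so three motifs over eight residues are counted.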
Fig. 3.
Surfaceome expression in 610 cancer cell lines. (A) Distribution of cell-specific surfaceome diversity (count of expressed surfaceome genes; left axis), sorted from large to small. Cell lines are colored based on their tissue type, as indicated by the color code. The straight gray line marks the average, and the dashed line marks the median surfaceome diversity. The sum of the expressed surfaceome genes is indicated by the black line corresponding to the right axis. (B) Distribution of surfaceome diversities (count of expressed surfaceome genes) based on tissue type. Tissues are color-coded as in A. The number of cell lines belonging to each tissue is indicated on the horizontal axis. (C) Scatter plot of count of expressed surfaceome genes vs. physical cell size. Squared Pearson correlation coefficient is indicated. (D) Box plots of the surfaceome gene expression level distribution for each cell line, sorted based on surfaceome diversity as in A from large to small. The black range represents the interquartile range; whiskers are depicted in gray. (E) Distribution of log2 expression level of PD-L1. Cell lines with the highest expression are indicated. (F) Surfaceome genes were sorted by the number of cell lines in which each gene is expressed and categorized into five groups. Functional classification for each group of genes based on Almén et al. (1) is shown in the bar chart in F, Inset. Misc., miscellaneous.
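Surfaceome diversity in panels A and B is the count of expressed surfaceome genes per cell line. A minimal sketch of that count follows. The expression threshold `min_level` that decides whether a gene is "expressed" is an assumption, since the caption does not state one, and the gene names, cell-line names, and values are placeholder toy data.

```python
from statistics import mean, median


def surfaceome_diversity(expression, surfaceome_genes, min_level=1.0):
    """Count expressed surfaceome genes per cell line.

    expression: {cell_line: {gene: expression_level}}
    min_level:  hypothetical threshold for calling a gene "expressed".
    """
    return {
        cell_line: sum(
            1
            for gene, level in levels.items()
            if gene in surfaceome_genes and level >= min_level
        )
        for cell_line, levels in expression.items()
    }


# Placeholder toy data; GENE_X stands for a nonsurfaceome gene.
toy_expression = {
    "line_1": {"GENE_A": 5.0, "GENE_B": 0.2, "GENE_C": 3.1, "GENE_X": 9.0},
    "line_2": {"GENE_A": 0.0, "GENE_B": 2.4, "GENE_C": 0.5, "GENE_X": 7.0},
    "line_3": {"GENE_A": 1.5, "GENE_B": 1.0, "GENE_C": 4.0, "GENE_X": 0.0},
}
toy_surfaceome = {"GENE_A", "GENE_B", "GENE_C"}

diversity = surfaceome_diversity(toy_expression, toy_surfaceome)
# Sort from large to small, as in panel A.
ranked = sorted(diversity.items(), key=lambda item: item[1], reverse=True)
average_diversity = mean(diversity.values())
median_diversity = median(diversity.values())
```

The average and median computed at the end correspond to the two reference lines described for panel A.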
Fig. 4.
Voronoi tree maps generated on wlab.ethz.ch/surfaceome. Maps for RAMOS (A), HT-29 (B), and IMR-32 (C) are shown. RPKME values of each cell line were scaled from 0 to 1 and mapped onto the whole in silico surfaceome. Light color indicates low expression; dark color indicates strong expression. White genes are not expressed. Characteristic functional protein groups of these cell lines are highlighted on the right.
Fig. 5.
Surfaceome changes during neurogenesis. (A) Left axis: Surfaceome gene level distribution from day 0 to day 22. Right axis: The red line shows the total number of expressed surfaceome genes, and the brown line shows the sum of expression levels over all expressed surfaceome genes. Transcriptomic data and definition of developmental stages (1, pluripotency stage; 2, differentiation initiation stage; 3, neural commitment stage; 4, NPC proliferation stage; 5, neuronal differentiation stage) were obtained from Li et al. (52). (B) Expression of selected gap junction genes from day 0 to day 22. (C) Identified clusters among surfaceome gene expression profiles based on c-means soft clustering. Red, higher correlation with cluster; light blue, lower correlation with cluster. (D) Voronoi tree map of log2 expression ratios between day 0 and day 22; darker color indicates higher expression at day 0, and brighter color indicates higher expression at day 22. Surfaceome genes are hierarchically grouped by functional classification [receptors (orange), transporters (purple), hydrolases (dark blue), unclassified (blue), and miscellaneous (green)].

References

    1. Almén MS, Nordström KJV, Fredriksson R, Schiöth HB. Mapping the human membrane proteome: A majority of the human membrane proteins can be classified according to function and evolutionary origin. BMC Biol. 2009;7:50.
    2. Reeb J, Kloppmann E, Bernhofer M, Rost B. Evaluation of transmembrane helix predictions in 2014. Proteins. 2015;83:473–484.
    3. Krogh A, Larsson B, von Heijne G, Sonnhammer ELL. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J Mol Biol. 2001;305:567–580.
    4. Jones DT. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics. 2007;23:538–544.
    5. Viklund H, Elofsson A. OCTOPUS: Improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics. 2008;24:1662–1668.
