This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Jul 15:2023.07.18.549602.

doi: 10.1101/2023.07.18.549602.

Contextual AI models for single-cell protein biology

Michelle M Li¹, Yepeng Huang¹, Marissa Sumathipala¹, Man Qing Liang¹, Alberto Valdeolivas², Ashwin N Ananthakrishnan¹,³, Katherine Liao¹,⁴, Daniel Marbach², Marinka Zitnik¹,⁵,⁶,⁷

Affiliations

¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
² Roche Pharma Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland.
³ Division of Gastroenterology, Massachusetts General Hospital, Boston, MA, USA.
⁴ Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, Boston, MA, USA.
⁵ Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Allston, MA, USA.
⁶ Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁷ Harvard Data Science Initiative, Cambridge, MA, USA.

PMID: 37503080
PMCID: PMC10370131
DOI: 10.1101/2023.07.18.549602

Michelle M Li et al. bioRxiv. 2024.


Update in

Contextual AI models for single-cell protein biology.
Li MM, Huang Y, Sumathipala M, Liang MQ, Valdeolivas A, Ananthakrishnan AN, Liao K, Marbach D, Zitnik M. Nat Methods. 2024 Aug;21(8):1546-1557. doi: 10.1038/s41592-024-02341-3. Epub 2024 Jul 22. PMID: 39039335.

Abstract

Understanding protein function and developing molecular therapies require deciphering the cell types in which proteins act as well as the interactions between proteins. However, modeling protein interactions across biological contexts remains challenging for existing algorithms. Here, we introduce Pinnacle, a geometric deep learning approach that generates context-aware protein representations. Leveraging a multi-organ single-cell atlas, Pinnacle learns on contextualized protein interaction networks to produce 394,760 protein representations from 156 cell type contexts across 24 tissues. Pinnacle's embedding space reflects cellular and tissue organization, enabling zero-shot retrieval of the tissue hierarchy. Pretrained protein representations can be adapted for downstream tasks: enhancing 3D structure-based representations for resolving immuno-oncological protein interactions, and investigating drugs' effects across cell types. Pinnacle outperforms state-of-the-art models in nominating therapeutic targets for rheumatoid arthritis and inflammatory bowel diseases, and pinpoints cell type contexts with higher predictive capability than context-free models. Pinnacle's ability to adjust its outputs based on the context in which it operates paves the way for large-scale context-specific predictions in biology.

Conflict of interest statement

Competing interests. D.M. and A.V. are currently employed by F. Hoffmann-La Roche Ltd. The remaining authors declare no competing interests.

Figures

**Figure 1. Overview of Pinnacle.**
(a) Cell type-specific protein interaction networks and metagraph of cell type and tissue organization are constructed from a multi-organ single-cell transcriptomic atlas of humans, a human reference protein interaction network, and a tissue ontology. (b) Pinnacle has protein-, cell type-, and tissue-level attention mechanisms that enable the algorithm to generate contextualized representations of proteins, cell types, and tissues in a single unified embedding space. (c) Pinnacle is designed such that the nodes (i.e., proteins, cell types, and tissues) that share an edge are embedded closer (decreased embedding distance) to each other than nodes that do not share an edge (increased embedding distance); proteins activated in the same cell type are embedded more closely (decreased embedding distance) than proteins activated in different cell types (increased embedding distance); and cell types are embedded closer to their activated proteins (decreased embedding distance) than other proteins (increased embedding distance). (d) As a result, Pinnacle generates protein representations injected with cell type and tissue context; a unique representation is produced for each protein activated in each cell type. Pinnacle simultaneously generates representations for cell types and tissues. (e) Existing methods, however, are context-free. They generate a single embedding per protein, representing only one condition or context for each protein, without any notion of cell type or tissue context. (**f-h**) The Pinnacle algorithm and its outputs enable (f) multi-modal deep learning (e.g., single-cell transcriptomic data with interactomes), (g) context-specific transfer learning (e.g., between proteins, cell types, and tissues), and (h) contextualized predictions (e.g., efficacy and safety of therapeutics).

**Figure 2. Enrichment of Pinnacle’s protein embedding regions.**
**(a-f)** Two-dimensional UMAP plots of contextualized protein representations generated by Pinnacle from six different contexts: (a) medullary thymic epithelial cell, (b) bronchial vessel endothelial cell, (c) mesenchymal stem cell, (d) lung microvascular endothelial cell, (e) kidney epithelial cell, and (f) fibroblast of breast. Each dot is a protein representation. Gray dots are representations of proteins from other cell types, and nongray colors indicate the cell type context. Each protein embedding region is expected to contain enriched neighborhoods that are spatially localized according to cell type context. To quantify this, we compute the spatial enrichment of each protein embedding region using SAFE [31], and provide the mean and maximum neighborhood enrichment scores (NES) and the number of enriched neighborhoods output by the tool (Methods 6 and Supplementary Figures S3–S4). (**g-h**) Distribution of (g) the maximum SAFE NES and (h) the number of enriched neighborhoods for 156 cell type contexts (each context has $p$-value < 0.05; hypergeometric test, adjusted using the Benjamini-Hochberg false discovery rate correction with significance cutoff $α = 0.05$). Ten randomly sampled cell type contexts are annotated, with their maximum SAFE NES or number of enriched neighborhoods in parentheses.
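The Benjamini-Hochberg adjustment named in the caption can be illustrated with a minimal sketch. This is a generic implementation of the standard procedure, not the authors' code or the SAFE tool's internals; the function name `benjamini_hochberg` and the toy p-values are assumptions made for illustration only.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and a boolean mask of rejected hypotheses."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices that sort p ascending
    ranked = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i for rank i
    # Enforce monotonicity from the largest rank downward, then cap at 1.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted_sorted = np.minimum(adjusted_sorted, 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted             # restore the input order
    return adjusted, adjusted <= alpha

# Hypothetical p-values, not taken from the paper.
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted, rejected = benjamini_hochberg(toy_p_values, alpha=0.05)
```

In this toy input, the third and fourth raw p-values fall below 0.05 but their adjusted values do not, which is the behavior the correction is meant to provide.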

**Figure 3. Evaluation of Pinnacle’s contextual representations.**
**(a-b)** Gap between embedding similarities using (a) Pinnacle’s protein representations and (b) a non-contextualized model’s protein representations on $n = 394{,}760$ samples (i.e., cell type-specific protein representations). Similarities are calculated between pairs of proteins in the same cell type (dark shade of color) or different cell types (light shade of color), and stratified by the compartment from which the cell types are derived. We use the two-sided two-sample Kolmogorov-Smirnov test for goodness of fit. Annotations indicate median values. The non-contextualized model is an ablated version of Pinnacle without any notion of tissue or cell type organization (i.e., the cell type and tissue network and all cell type- and tissue-related components of Pinnacle’s architecture and objective function are removed). The bounds of the box show the quartiles of the data, the center indicates the median value of the data, and the whiskers represent the farthest data point within 1.5× IQR. (c) Embedding distance of Pinnacle’s 62 tissue representations as a function of tissue ontology distance. Gray bars indicate a null distribution (refer to Methods 6 for more details). Both the Spearman correlation ($p$-value = 1.85×10⁻¹¹⁹) and Kolmogorov-Smirnov ($p$-value < 0.001) statistical tests are two-sided. Data are represented as mean values with error bars indicating a 95% confidence interval. (d) Prediction task in which protein representations are optimized to maximize the gap between binding and non-binding proteins. (e) Cell type context (provided by Pinnacle) is injected into context-free structure-based protein representations (provided by MaSIF [3], which learns a protein representation from the protein’s 3D structure) via concatenation to generate contextualized protein representations. Lack of cell type context is defined by an average of Pinnacle’s protein representations. (f) Comparison of context-free and contextualized representations in differentiating between binding and non-binding proteins. Scores are computed using cosine similarity on $n = 22$ unique protein pairs (2 binding and 20 non-binding); since Pinnacle generates multiple representations per protein based on context, there are $n = 7{,}956$ pairwise computations (180 binding and 7,776 non-binding) for the contextualized representations. The binding proteins evaluated are PD-1/PD-L1 and B7–1/CTLA-4. Pairwise scores are also calculated between each of these four proteins and proteins that they do not bind (i.e., RalB, RalBP1, EPO, EPOR, C3, and CFH). The gap between the average scores of binding and non-binding proteins is annotated for context-free and contextualized representations. The significance of the score gaps between binding and non-binding proteins is measured using a one-sided non-parametric permutation test. Data are represented as mean values with error bars indicating a 95% confidence interval.
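Panels (e) and (f) describe three operations: concatenating a structure-based embedding with a cell type-specific one, scoring protein pairs by cosine similarity, and testing the binding versus non-binding score gap with a one-sided permutation test. A minimal sketch of those operations follows. All function names, the number of permutations, and the toy scores are assumptions for illustration; the authors' actual pipeline is described in their Methods and may differ.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contextualize(structure_emb, context_emb):
    """Concatenate a context-free embedding with a cell type-specific one."""
    return np.concatenate([structure_emb, context_emb])

def score_gap(binding_scores, nonbinding_scores):
    """Difference between the mean binding and mean non-binding scores."""
    return float(np.mean(binding_scores) - np.mean(nonbinding_scores))

def permutation_test_gap(binding_scores, nonbinding_scores, n_perm=2000, seed=0):
    """One-sided test: is the observed gap larger than under label shuffling?"""
    rng = np.random.default_rng(seed)
    binding = np.asarray(binding_scores, dtype=float)
    nonbinding = np.asarray(nonbinding_scores, dtype=float)
    observed = score_gap(binding, nonbinding)
    pooled = np.concatenate([binding, nonbinding])
    n_bind = binding.size
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        # Small tolerance guards against floating-point ties with the observed gap.
        if score_gap(shuffled[:n_bind], shuffled[n_bind:]) >= observed - 1e-12:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)

# Hypothetical similarity scores, not values from the paper.
toy_binding = [0.9, 0.8]
toy_nonbinding = [0.1, 0.2, 0.0, 0.15, 0.05, 0.1]
gap, p_value = permutation_test_gap(toy_binding, toy_nonbinding)
```

Because the labels are shuffled rather than modeled, the test makes no distributional assumption about the similarity scores, which suits the strongly imbalanced binding and non-binding groups described in the caption.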

**Figure 4. Fine-tuning contextualized protein representations for therapeutic target prioritization.**
**(a)** Workflow to curate positive training examples for rheumatoid arthritis (left) and inflammatory bowel disease (right) therapeutic areas. (b) We construct positive examples by selecting proteins from our protein-protein interaction network (PPIN) that are targeted by compounds that have at least completed phase 2 for treating the therapeutic area of interest. These proteins are deemed safe and potentially efficacious for humans with the disease. We construct negative examples by selecting proteins from our PPIN that do not have associations with the therapeutic area yet have been targeted by at least one existing drug/compound. (c) Cell type-specific protein interaction networks are embedded by Pinnacle, and finetuned for a downstream task. Here, the predictor module (i.e., multi-layer perceptron) finetunes the (pretrained) contextualized protein representations for predicting whether a given protein is a strong candidate for the therapeutic area of interest. Additional insights of our setup include hypothesizing highly predictive cell types for examining candidate therapeutic targets. (**d-e**) Benchmarking of context-aware and context-free approaches for (d) RA and (e) IBD therapeutic areas. Each dot is the performance (averaged across 10 random seeds) of protein representations from a given context (i.e., cell type context for Pinnacle, context-free global reference protein interaction network for GAT and random walk, and context-free multi-modal protein interaction network for BIONIC).

**Figure 5. Performance of contextualized target prioritization for RA and IBD therapeutic areas.**
**(a,d)** Model performance (measured by APR@5) for RA and IBD therapeutic areas, respectively. APR@K (or Average Precision and Recall at K) is a combination of Precision@K and Recall@K (refer to Methods 6 for more details). Each dot is the performance (averaged across 10 random seeds) of Pinnacle’s protein representations from a specific cell type context. The gray and dark orange lines are the performance of the GAT and BIONIC models, respectively. For each therapeutic area, 22 cell types are annotated and colored by their compartment category. Supplementary Figure S8 contains model performance measured by APR@10, APR@15, and APR@20 for RA and IBD therapeutic areas. (**b-c**, **e-f**) Selected proteins for RA and IBD therapeutic areas. Dotted line separates the top and bottom 5 cell types. (**b-c**) Two selected proteins, JAK3 and IL6R, that are targeted by drugs that have completed Phase IV of clinical trials for treating RA therapeutic area. (**e-f**) Two selected proteins, ITGA4 and PPARG, that are targeted by drugs that have completed Phase IV for treating IBD therapeutic area.
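The caption defines APR@K as a combination of Precision@K and Recall@K and defers the exact formula to the Methods. The sketch below therefore computes only the two standard components and does not attempt to reproduce how the authors combine them. The function name and the placeholder protein identifiers are hypothetical.

```python
def precision_recall_at_k(ranked_proteins, positives, k):
    """Precision@K and Recall@K for a ranked list of candidate targets."""
    top_k = ranked_proteins[:k]
    hits = sum(1 for protein in top_k if protein in positives)
    precision = hits / k
    recall = hits / len(positives) if positives else 0.0
    return precision, recall

# Placeholder ranking and labels, purely illustrative.
toy_ranking = ["protein_A", "protein_B", "protein_C", "protein_D", "protein_E", "protein_F"]
toy_positives = {"protein_A", "protein_B", "protein_F"}
precision_at_5, recall_at_5 = precision_recall_at_k(toy_ranking, toy_positives, k=5)
```

Here two of the three positives appear in the top five, so Precision@5 is 2/5 and Recall@5 is 2/3.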

References

1. Lund-Johansen F., Tran T. & Mehta A. Towards reproducibility in large-scale analysis of protein–protein interactions. Nature Methods 18, 720–721 (2021).
2. Kustatscher G. et al. Understudied proteins: opportunities and challenges for functional proteomics. Nature Methods 19, 774–779 (2022).
3. Gainza P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods 17, 184–192 (2019).
4. Barabási A.-L., Gulbahce N. & Loscalzo J. Network medicine: a network-based approach to human disease. Nature Reviews Genetics 12, 56–68 (2010).
5. Wang J. et al. Scaffolding protein functional sites using deep learning. Science 377, 387–394 (2022).
