This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Nov 3:2024.12.10.627665.

doi: 10.1101/2024.12.10.627665.

ProCyon: A multimodal foundation model for protein phenotypes

Owen Queen^{1,2}, Yepeng Huang¹, Robert Calef^{1,3}, Valentina Giunchiglia^{1,4,5}, Tianlong Chen^{1,3}, George Dasoulas¹, LeAnn Tai³, Gianmarco Abbadessa^{4,6}, Owain Howell^{4,7}, Michelle M Li¹, Yasha Ektefaie¹, Ayush Noori¹, Ildiko Farkas⁴, Joseph Brown⁸, Tom Cobley^{2,9}, Karin Hrovatin^{10,11}, Tom Hartvigsen¹², Fabian J Theis^{10,13}, Bradley L Pentelute^{8,14}, James Zou¹⁵, Vikram Khurana^{14,16,17}, David Owen⁴, Richard Nicholas^{4,5,7}, Manolis Kellis^{3,14}, Marinka Zitnik^{1,17,18,19}

Affiliations

¹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
² Department of Computer Science, Stanford University, Stanford, CA, USA.
³ Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA.
⁴ Department of Brain Sciences, Imperial College London, London, UK.
⁵ Centre for Neuroimaging Sciences, King's College London, London, UK.
⁶ University of Campania Luigi Vanvitelli, Naples, Italy.
⁷ Medical School, Swansea University, Swansea, UK.
⁸ Department of Chemistry, MIT, Cambridge, MA, USA.
⁹ Department of Computing, Imperial College London, London, UK.
¹⁰ Institute of Computational Biology, Computational Health Center, Helmholtz Munich, Munich, Germany.
¹¹ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany.
¹² School of Data Science, University of Virginia, VA, USA.
¹³ School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
¹⁴ Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹⁵ Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
¹⁶ Department of Neurology, Brigham and Women's Hospital, Boston, MA, USA.
¹⁷ Harvard Stem Cell Institute, Cambridge, MA, USA.
¹⁸ Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, MA, USA.
¹⁹ Harvard Data Science Initiative, Cambridge, MA, USA.

PMID: 41279541
PMCID: PMC12632626
DOI: 10.1101/2024.12.10.627665

Owen Queen et al. bioRxiv. 2025.


Abstract

Characterizing human proteins remains a major challenge: approximately 29% of human proteins lack experimentally validated functions and even well-annotated proteins often lack context-specific phenotypic insights. To enable universal modeling of protein phenotypes, we present ProCyon, a multimodal foundation model that utilizes protein sequence, structure, and natural language for generating and predicting protein phenotypes across diverse knowledge domains. ProCyon is trained on our novel dataset, ProCyon-Instruct, with 33 million protein phenotype instructions. On dozens of benchmarking tasks, ProCyon performs competitively against single-modal and multimodal models. Further, ProCyon conditionally retrieves proteins via mechanisms of action of small molecule drugs and disease contexts, and it generates candidate phenotypic descriptions for poorly characterized proteins, including those implicated in Parkinson's disease that were identified after ProCyon's knowledge cutoff date. We experimentally confirm ProCyon's predictions in multiple sclerosis using post-mortem brain RNA-seq, identifying novel MS genes and elucidating associated pathway mechanisms consistent with cortical pathology. ProCyon paves the way toward a general approach to generate functional insights into the human proteome.

Conflict of interest statement

Competing interests. F.J.T. consults for Immunai Inc., CytoReason Ltd, Cellarity, BioTuring Inc., and Genbio.AI Inc., and has an ownership interest in Dermagnostix GmbH and Cellarity. Other authors declare no competing interests.

Figures

**Figure 1. Overview of ProCyon model architecture and ProCyon-Instruct dataset.**
a) ProCyon models proteins and phenotypes in a unified latent space, enabling wide applications across knowledge domains (shown are example prompts). b) ProCyon is a multimodal foundation model consisting of a large language model and multimodal molecular encoders, trained to perform both protein retrieval and phenotype generation. For protein retrieval, a prompt is processed into the LLM and compared against a library of proteins. ProCyon then outputs a ranked list of proteins, domains, peptides, or polypeptides. For phenotype generation, the input text and multimodal elements are processed by the LLM via multimodal token decomposition. ProCyon then performs autoregressive text generation to generate protein phenotypes and other free-form answers. c) A comprehensive ProCyon-Instruct dataset with 33,899,528 protein-phenotype instructions is curated from five knowledge domains (Function, Therapeutics, Disease, Protein Domains, and Interaction) and used to train ProCyon. Details of dataset curation, phenotype description rephrasing, and instruction templating can be found in Extended Data Fig. 4, Methods Sec. 1 and 3.1, and Supplementary Note 2. The number displayed shows the number of protein-phenotype description pairs curated from each database.

**Figure 2. ProCyon accurately retrieves proteins from flexible phenotypes.**
a) In protein retrieval, ProCyon receives a phenotype as input and outputs a ranked list of proteins, domains, peptides, or polypeptides. b) Benchmarking ProCyon’s retrieval capabilities against other approaches across distinct knowledge domains. x axes show F_max, the maximum F₁ score across any cutoff. Error bars show bootstrapped 95% confidence intervals. One retrieval task corresponds to retrieving proteins for one phenotype. c) ProCyon’s retrieval performance across splits of ProCyon-Instruct. d) Illustration of pleiotropic protein functions. The FOXC1 protein is implicated in heart development, regulation of HPSC differentiation, and regulation of the mitotic cell cycle. e) Percentiles of potentially pleiotropic proteins ranked by ProCyon using composite prompts composed of two, three, and four pathways, in comparison with aggregated individual ranks. Also compared with ProtST. All comparisons have p-value < 0.001; two-sided Wilcoxon signed-rank test. f) External validation of disease-associated protein retrieval across different lines of evidence for association. y axis shows median retrieval percentile scores for each disease across different lines of evidence. g) Protein retrieval with different disease descriptions. Left, example texts for prompts with gradually increasing information (Disease, Diagnostics, Associated Features supporting Diagnosis and their combinations) for Autistic Spectrum Disorder. Right, median retrieval percentile scores derived for each disease with increasingly informative prompts. h) Left, retrieval percentile of STING as more precise prompts are provided as input. Right, UMAP of unified latent space of phenotypes and proteins, highlighting prompt embeddings and STING protein embedding. Dashed contours indicate distance from STING protein embedding; thicker lines denote closer distance. * p-value < 0.05, ** p-value < 0.005; two-sided Mann-Whitney U test.
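
The Figure 2b caption defines F_max as the maximum F₁ score across any cutoff. A minimal sketch of that quantity for a single retrieval task follows. The function name `f_max`, the top-k cutoff convention, and the toy data are assumptions made for illustration; they are not taken from the ProCyon evaluation code, which may handle score ties or aggregate across tasks differently.

```python
from typing import Sequence


def f_max(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Maximum F1 over all top-k cutoffs for one retrieval task.

    scores: model score per candidate protein (higher = more relevant).
    labels: 1 if the protein is a true match for the phenotype, else 0.
    Assumption: each cutoff keeps the k highest-scoring proteins.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0  # no positives, so F1 is undefined; report 0 by convention

    # Rank candidates from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best = 0.0
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos == 0:
            continue  # F1 is 0 until the first true match is retrieved
        precision = true_pos / k
        recall = true_pos / total_pos
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: four candidate proteins, two of which are true matches.
print(f_max([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

In the toy example the best cutoff keeps the top three proteins, where precision is 2/3 and recall is 1.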

**Figure 3. ProCyon generates accurate responses and phenotype descriptions from multimodal prompts.**
a) In QA, ProCyon receives a multimodal prompt (protein + phenotype description) and outputs a yes/no answer. b) Benchmarking QA accuracy across models and knowledge domains. Error bars show bootstrapped 95% CIs. c) ProCyon’s QA accuracy across different ProCyon-Instruct splits. d) ProCyon tokenizes protein sequences and 3D structures using a multimodal encoder; LLMs must use text surrogates, many of which are not generalizable (gene symbol, UniProt ID). e) QA performance comparison between ProCyon and LLMs using HGNC gene symbols. f) In phenotype generation, ProCyon processes multimodal inputs (proteins, natural language, small molecules) to output open-ended phenotype descriptions. g) Phenotype generation benchmark using reference text similarity (mean BERTScore F₁). Hatching indicates ability to generalize to unseen proteins. h) Phenotype generation benchmark using LLM-based judging. Win rate refers to preference over ProCyon; models grouped by parameter count.

**Figure 4. ProCyon models domains, peptides, and small molecules beyond proteins.**
a) ProCyon predicts the small molecule binding domain on proteins, where the input is the description and molecular structure of a drug, and the output is a ranked list of domains on a target protein. In this example, ProCyon retrieves domains of MGAM given the drug miglitol and correctly prioritizes the binding domain for miglitol as the first among the MGAM’s nine domains. Each domain is identified by the Pfam ID and the index of the first amino acid in the domain. b) ProCyon ranks the correct drug-binding domain (highlighted in orange) on respective target proteins highly. In Q06187, the green domain (PF00017_281) denotes the ProCyon-predicted binding domain, but the orange domain (PF07714_402) is the correct binding domain. ProCyon predicts the binding domain as top-ranked in 24 examples, second-ranked in 5 examples, and so on. c) ProCyon can be finetuned to predict protein-peptide binding. Left, ProCyon is first finetuned on naturally occurring protein-peptide complexes from the Protein Data Bank (PDB), then tested on an experimental dataset of synthetic peptides screened for binding to the ACE2 protein. Right, the predicted binding scores for binders versus non-binders. Two-sided Mann-Whitney U test. d) Leveraging ProCyon to perform indication-specific drug target retrieval. In this example, the input consists of a novel task definition, the description of indications (nicotine addiction or major depressive disorder), and the name and structure of bupropion. ProCyon ranks known targets (norepinephrine transporter (NET), dopamine transporter (DAT), and nicotinic acetylcholinergic receptor (AChR)) of bupropion among the top 40/18,174 human proteins. Shown are 95% confidence intervals derived by perturbing words in the input prompt. ** p-value < 0.001; two-sided Wilcoxon signed rank test.

**Figure 5. ProCyon characterizes poorly-annotated proteins and their functions.**
a) ProCyon generates phenotype predictions for AKNAD1. QA filtering selects the highest-confidence outputs; two of the three predicted phenotypes are experimentally supported. b) Additional ProCyon-generated phenotypes for other poorly characterized proteins. All proteins either have no UniProt-annotated function or have no UniProt function as of the ProCyon knowledge cutoff date. c) External evaluation of ProCyon using perturbation datasets. Parallel pathway analysis on genetic perturbation data is used to validate ProCyon-predicted functions for uncharacterized proteins. d) Top 100 functions prioritized by ProCyon for each protein. Percentile refers to the percentile of that protein among all 18,174 human proteins in our database for the specific function query. Green dots indicate overlap with perturbation-derived pathways; grey dots indicate other predictions. ** p < 0.005, * p < 0.05; hypergeometric test. e) ProCyon-generated pathways for poorly characterized proteins associated with Parkinson’s disease (PD). Left, heatmap of enrichment scores for each pathway compared to established PD-related proteins (“PD assoc.”), proteins expressed in the nervous system but not in PD or other neurodegenerative conditions (“Neuro control”), and proteins not expressed in brain tissues (“General control”), respectively. Right, differences in enrichment scores between “PD assoc.” and “Neuro control” and “General control”. Statistically significant differences shown in green. Pathways are grouped by PD association score provided by human experts (right).
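
The Figure 5d caption describes a percentile as the position of one protein among all 18,174 human proteins for a given function query. One plausible reading is sketched below. The function name `retrieval_percentile`, the strict-inequality convention, and the three-protein toy scores are hypothetical; the caption does not specify how ties or the protein itself are counted.

```python
from typing import Mapping


def retrieval_percentile(scores: Mapping[str, float], target: str) -> float:
    """Percent of the other proteins that score strictly below `target`.

    scores: query-specific relevance score per protein (higher = better).
    Assumption: 100 means the target outranks every other protein.
    """
    others = [s for name, s in scores.items() if name != target]
    if not others:
        raise ValueError("need at least one other protein to rank against")
    below = sum(1 for s in others if s < scores[target])
    return 100.0 * below / len(others)


# Toy library of three proteins scored against one function query.
toy_scores = {"PROT_A": 0.9, "PROT_B": 0.5, "PROT_C": 0.1}
print(retrieval_percentile(toy_scores, "PROT_B"))
```

Under this convention, a protein ranked among the top 40 of 18,174 would sit above the 99.7th percentile.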

**Figure 6. Experimental validation in multiple sclerosis using post-mortem brain RNA-seq with mechanistic characterization by ProCyon.**
a) MS brain tissue was dissected, stained, and RNA-sequenced. b) Blocks from the superior frontal gyrus (SFG) were sampled. c) Luxol fast blue/cresyl violet distinguished grey and white matter. d) MOG staining identified activated microglia/macrophages. e) HLA-D⁺ areas were masked in red. f) Neurons were labeled with HuC/D. g) Differential expression analysis identified HLA⁺-associated genes (FDR < 0.05, shown in red). h) Validation of ProCyon rankings across five MS hallmarks. Left: retrieval prompts. Right: DEG percentiles for each hallmark. Boxplots show interquartile range and median. i) ProCyon-assigned ranks for HLA⁺ DEGs (red) compared to 2000 random non-DEGs. j) ProCyon retrieval percentiles for ‘unexpected’ proteins (Methods Sec. 4.13) across control diseases versus MS. Control diseases vary in similarity to MS (bottom legend). X markers denote known protein–disease associations. MS is marked by a vertical purple bar. * = significant compared to MS control group; ** = significant across three control groups (Methods Sec. 4.13). k) Top 15 ProCyon-predicted pathways for ZNF727, EVI2B, and MS4A7. Highlighted by the percentage of MS-associated proteins (per GO) in each pathway.

Grants and funding

R01 HD108794/HD/NICHD NIH HHS/United States
