[Preprint]. 2025 Nov 3:2024.12.10.627665.
doi: 10.1101/2024.12.10.627665.

ProCyon: A multimodal foundation model for protein phenotypes

Owen Queen et al. bioRxiv.

Abstract

Characterizing human proteins remains a major challenge: approximately 29% of human proteins lack experimentally validated functions and even well-annotated proteins often lack context-specific phenotypic insights. To enable universal modeling of protein phenotypes, we present ProCyon, a multimodal foundation model that utilizes protein sequence, structure, and natural language for generating and predicting protein phenotypes across diverse knowledge domains. ProCyon is trained on our novel dataset, ProCyon-Instruct, with 33 million protein phenotype instructions. On dozens of benchmarking tasks, ProCyon performs competitively against single-modal and multimodal models. Further, ProCyon conditionally retrieves proteins via mechanisms of action of small molecule drugs and disease contexts, and it generates candidate phenotypic descriptions for poorly characterized proteins, including those implicated in Parkinson's disease that were identified after ProCyon's knowledge cutoff date. We experimentally confirm ProCyon's predictions in multiple sclerosis using post-mortem brain RNA-seq, identifying novel MS genes and elucidating associated pathway mechanisms consistent with cortical pathology. ProCyon paves the way toward a general approach to generate functional insights into the human proteome.

Conflict of interest statement

Competing interests. F.J.T. consults for Immunai Inc., CytoReason Ltd, Cellarity, BioTuring Inc., and Genbio.AI Inc., and has an ownership interest in Dermagnostix GmbH and Cellarity. Other authors declare no competing interests.

Figures

Figure 1: Overview of ProCyon model architecture and ProCyon-Instruct dataset.
a) ProCyon models proteins and phenotypes in a unified latent space, enabling wide applications across knowledge domains (shown are example prompts). b) ProCyon is a multimodal foundation model consisting of a large language model and multimodal molecular encoders, trained to perform both protein retrieval and phenotype generation. For protein retrieval, a prompt is processed by the LLM and compared against a library of proteins. ProCyon then outputs a ranked list of proteins, domains, peptides, or polypeptides. For phenotype generation, the input text and multimodal elements are processed by the LLM via multimodal token decomposition. ProCyon then performs autoregressive text generation to produce protein phenotypes and other free-form answers. c) The ProCyon-Instruct dataset, comprising 33,899,528 protein-phenotype instructions, is curated from five knowledge domains (Function, Therapeutics, Disease, Protein Domains, and Interaction) and used to train ProCyon. Details of dataset curation, phenotype description rephrasing, and instruction templating can be found in Extended Data Fig. 4, Methods Sec. 1 and 3.1, and Supplementary Note 2. The numbers displayed show how many protein-phenotype description pairs were curated from each database.
Figure 2: ProCyon accurately retrieves proteins from flexible phenotypes.
a) In protein retrieval, ProCyon receives a phenotype as input and outputs a ranked list of proteins, domains, peptides, or polypeptides. b) Benchmarking ProCyon’s retrieval capabilities against other approaches across distinct knowledge domains. x axes show Fmax, the maximum F1 score across any cutoff. Error bars show bootstrapped 95% confidence intervals. One retrieval task corresponds to retrieving proteins for one phenotype. c) ProCyon’s retrieval performance across splits of ProCyon-Instruct. d) Illustration of pleiotropic protein functions. The FOXC1 protein is implicated in heart development, regulation of HPSC differentiation, and regulation of the mitotic cell cycle. e) Percentiles of potentially pleiotropic proteins ranked by ProCyon using composite prompts composed of two, three, and four pathways, in comparison with aggregated individual ranks. Also compared with ProtST. All comparisons have p-value < 0.001; two-sided Wilcoxon signed-rank test. f) External validation of disease-associated protein retrieval across different lines of evidence for association. y axis shows median retrieval percentile scores for each disease across different lines of evidence. g) Protein retrieval with different disease descriptions. Left, example texts for prompts with gradually increasing information (Disease, Diagnostics, Associated Features supporting Diagnosis and their combinations) for Autistic Spectrum Disorder. Right, median retrieval percentile scores derived for each disease with increasingly informative prompts. h) Left, retrieval percentile of STING as more precise prompts are provided as input. Right, UMAP of unified latent space of phenotypes and proteins, highlighting prompt embeddings and STING protein embedding. Dashed contours indicate distance from STING protein embedding; thicker lines denote closer distance. * p-value < 0.05, ** p-value < 0.005; two-sided Mann-Whitney U test.
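The Fmax metric in panel b is described as the maximum F1 score across any cutoff. A minimal sketch of that computation for a single retrieval task is shown here, assuming each candidate protein has a model score and a binary relevance label. The function name `f_max` and the toy inputs are illustrative and are not taken from the ProCyon codebase; how the authors aggregate across tasks is not stated in this caption.

```python
def f_max(scores, labels):
    """Maximum F1 over all score cutoffs for one retrieval task.

    scores: model scores, higher means more relevant (hypothetical input).
    labels: 1 if the protein is a true match for the phenotype, else 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    # Rank candidates from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    true_pos = 0
    for k, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Tied scores cannot be separated by a cutoff, so evaluate only
        # after the last item of each tie group.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        if true_pos == 0:
            continue
        precision = true_pos / k
        recall = true_pos / total_pos
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    # Toy example: two relevant proteins ranked first and third.
    print(f_max([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

Sweeping the cutoff over ranked positions, rather than over a fixed grid of thresholds, guarantees that every distinct precision/recall trade-off is evaluated once.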
Figure 3: ProCyon generates accurate responses and phenotype descriptions from multimodal prompts.
a) In QA, ProCyon receives a multimodal prompt (protein + phenotype description) and outputs a yes/no answer. b) Benchmarking QA accuracy across models and knowledge domains. Error bars show bootstrapped 95% CIs. c) ProCyon’s QA accuracy across different ProCyon-Instruct splits. d) ProCyon tokenizes protein sequences and 3D structures using a multimodal encoder; LLMs must use text surrogates, many of which are not generalizable (gene symbol, UniProt ID). e) QA performance comparison between ProCyon and LLMs using HGNC gene symbols. f) In phenotype generation, ProCyon processes multimodal inputs (proteins, natural language, small molecules) to output open-ended phenotype descriptions. g) Phenotype generation benchmark using reference text similarity (mean BERTScore F1). Hatching indicates ability to generalize to unseen proteins. h) Phenotype generation benchmark using LLM-based judging. Win rate refers to preference over ProCyon; models grouped by parameter count.
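The error bars in panel b are bootstrapped 95% confidence intervals. One common way to obtain such an interval for QA accuracy is the percentile bootstrap, sketched below under the assumption that each question yields a 0/1 correctness indicator. The function name, resample count, and seed are illustrative choices; the caption does not specify which bootstrap variant the authors used.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.

    correct: list of 0/1 indicators, one per QA example (hypothetical input).
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the examples with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Toy example: 80 correct answers out of 100 questions.
    answers = [1] * 80 + [0] * 20
    print(bootstrap_accuracy_ci(answers))
```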
Figure 4: ProCyon models domains, peptides, and small molecules beyond proteins.
a) ProCyon predicts the small molecule binding domain on proteins, where the input is the description and molecular structure of a drug, and the output is a ranked list of domains on a target protein. In this example, ProCyon retrieves domains of MGAM given the drug miglitol and correctly prioritizes the binding domain for miglitol as the first among MGAM’s nine domains. Each domain is identified by the Pfam ID and the index of the first amino acid in the domain. b) ProCyon ranks the correct drug-binding domain (highlighted in orange) on respective target proteins highly. In Q06187, the green domain (PF00017_281) denotes the ProCyon-predicted binding domain, but the orange domain (PF07714_402) is the correct binding domain. ProCyon predicts the binding domain as top-ranked in 24 examples, second-ranked in 5 examples, and so on. c) ProCyon can be finetuned to predict protein-peptide binding. Left, ProCyon is first finetuned on naturally occurring protein-peptide complexes from the Protein Data Bank (PDB), then tested on an experimental dataset of synthetic peptides screened for binding to the ACE2 protein. Right, the predicted binding scores for binders versus non-binders. Two-sided Mann-Whitney U test. d) Leveraging ProCyon to perform indication-specific drug target retrieval. In this example, the input consists of a novel task definition, the description of indications (nicotine addiction or major depressive disorder), and the name and structure of bupropion. ProCyon ranks known targets (norepinephrine transporter (NET), dopamine transporter (DAT), and nicotinic acetylcholine receptor (AChR)) of bupropion among the top 40/18,174 human proteins. Shown are 95% confidence intervals derived by perturbing words in the input prompt. ** p-value < 0.001; two-sided Wilcoxon signed-rank test.
Figure 5: ProCyon characterizes poorly-annotated proteins and their functions.
a) ProCyon generates phenotype predictions for AKNAD1. QA filtering selects the highest-confidence outputs; two of the three predicted phenotypes are experimentally supported. b) Additional ProCyon-generated phenotypes for other poorly characterized proteins. All proteins either have no UniProt-annotated function or have no UniProt function as of the ProCyon knowledge cutoff date. c) External evaluation of ProCyon using perturbation datasets. Parallel pathway analysis on genetic perturbation data is used to validate ProCyon-predicted functions for uncharacterized proteins. d) Top 100 functions prioritized by ProCyon for each protein. Percentile refers to the percentile of that protein among all 18,174 human proteins in our database for the specific function query. Green dots indicate overlap with perturbation-derived pathways; grey dots indicate other predictions. ** p < 0.005, * p < 0.05; hypergeometric test. e) ProCyon-generated pathways for poorly characterized proteins associated with Parkinson’s disease (PD). Left, heatmap of enrichment scores for each pathway compared to established PD-related proteins (“PD assoc.”), proteins expressed in the nervous system but not in PD or other neurodegenerative conditions (“Neuro control”), and proteins not expressed in brain tissues (“General control”), respectively. Right, differences in enrichment scores between “PD assoc.” and “Neuro control” and “General control”. Statistically significant differences shown in green. Pathways are grouped by PD association score provided by human experts (right).
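Panel d reports a hypergeometric test for the overlap between ProCyon's top-ranked functions and perturbation-derived pathways. A minimal sketch of a one-sided enrichment p-value follows. The argument names and the example counts are hypothetical, and the background universe the authors used is not given in this caption.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, overlap):
    """One-sided p-value P(X >= overlap) for a hypergeometric X.

    population: total number of candidate functions (assumed background).
    successes:  functions supported by the perturbation data.
    draws:      functions prioritized by the model (e.g. a top-100 list).
    overlap:    prioritized functions that are also supported.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 1,000 functions, 50 supported, top 100 chosen,
    # 12 of the chosen functions are supported.
    print(hypergeom_enrichment_p(1000, 50, 100, 12))
```

Exact integer arithmetic via `math.comb` avoids the underflow that can occur when multiplying many small floating-point probabilities.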
Figure 6: Experimental validation in multiple sclerosis using post-mortem brain RNA-seq with mechanistic characterization by ProCyon.
a) MS brain tissue was dissected, stained, and RNA-sequenced. b) Blocks from the superior frontal gyrus (SFG) were sampled. c) Luxol fast blue/cresyl violet distinguished grey and white matter. d) MOG staining identified activated microglia/macrophages. e) HLA-D+ areas were masked in red. f) Neurons were labeled with HuC/D. g) Differential expression analysis identified HLA+-associated genes (FDR < 0.05, shown in red). h) Validation of ProCyon rankings across five MS hallmarks. Left: retrieval prompts. Right: DEG percentiles for each hallmark. Boxplots show interquartile range and median. i) ProCyon-assigned ranks for HLA+ DEGs (red) compared to 2000 random non-DEGs. j) ProCyon retrieval percentiles for ‘unexpected’ proteins (Methods Sec. 4.13) across control diseases versus MS. Control diseases vary in similarity to MS (bottom legend). X markers denote known protein–disease associations. MS is marked by a vertical purple bar. * = significant compared to MS control group; ** = significant across three control groups (Methods Sec. 4.13). k) Top 15 ProCyon-predicted pathways for ZNF727, EVI2B, and MS4A7. Highlighted by the percentage of MS-associated proteins (per GO) in each pathway.
