This is a preprint.
ProCyon: A multimodal foundation model for protein phenotypes
- PMID: 41279541
- PMCID: PMC12632626
- DOI: 10.1101/2024.12.10.627665
Abstract
Characterizing human proteins remains a major challenge: approximately 29% of human proteins lack experimentally validated functions and even well-annotated proteins often lack context-specific phenotypic insights. To enable universal modeling of protein phenotypes, we present ProCyon, a multimodal foundation model that utilizes protein sequence, structure, and natural language for generating and predicting protein phenotypes across diverse knowledge domains. ProCyon is trained on our novel dataset, ProCyon-Instruct, with 33 million protein phenotype instructions. On dozens of benchmarking tasks, ProCyon performs competitively against single-modal and multimodal models. Further, ProCyon conditionally retrieves proteins via mechanisms of action of small molecule drugs and disease contexts, and it generates candidate phenotypic descriptions for poorly characterized proteins, including those implicated in Parkinson's disease that were identified after ProCyon's knowledge cutoff date. We experimentally confirm ProCyon's predictions in multiple sclerosis using post-mortem brain RNA-seq, identifying novel MS genes and elucidating associated pathway mechanisms consistent with cortical pathology. ProCyon paves the way toward a general approach to generate functional insights into the human proteome.
Conflict of interest statement
Competing interests. F.J.T. consults for Immunai Inc., CytoReason Ltd, Cellarity, BioTuring Inc., and Genbio.AI Inc., and has an ownership interest in Dermagnostix GmbH and Cellarity. Other authors declare no competing interests.