Bilingual language model for protein sequence and structure
- PMID: 39633723
- PMCID: PMC11616678
- DOI: 10.1093/nargab/lqae150
Abstract
Adapting language models to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities in a single model. We encode protein structures as token sequences using the 3Di-alphabet introduced by the 3D-alignment method Foldseek. For training, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein 'structure-sequence' T5 (ProstT5), we showed improved performance for subsequent, structure-related prediction tasks, leading to three orders of magnitude speedup for deriving 3Di. This will be crucial for future applications trying to search metagenomic sequence databases at the sensitivity of structure comparisons. Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions and opens new research avenues in the post-AlphaFold2 era.
© The Author(s) 2024. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
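The abstract describes fine-tuning ProtT5 to translate between amino-acid and 3Di token sequences. The sketch below shows how such a sequence-to-3Di translation might be run with the Hugging Face transformers library; it is a minimal illustration, assuming the checkpoint is published as "Rostlab/ProstT5", that "<AA2fold>" is the direction prefix for amino acids to 3Di, and that residues are space-separated with rare amino acids mapped to X (conventions common to Rostlab pLMs, not verified against the paper's released code).

```python
# Minimal sketch: amino-acid sequence -> 3Di structure tokens with a fine-tuned T5.
# Assumptions (hypothetical, not confirmed by the paper text): checkpoint name
# "Rostlab/ProstT5", direction prefix "<AA2fold>", space-separated residues.
import re
import torch
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5")
model = AutoModelForSeq2SeqLM.from_pretrained("Rostlab/ProstT5").to(device).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# Map rare amino acids to X and separate residues with spaces (common pLM convention).
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
# Prepend the assumed translation-direction prefix: amino acids -> 3Di.
inputs = tokenizer("<AA2fold> " + prepared, return_tensors="pt").to(device)

with torch.no_grad():
    out = model.generate(**inputs, max_length=len(sequence) + 5, num_beams=3)

# Decode and strip spaces to obtain one 3Di token per residue.
three_di = tokenizer.decode(out[0], skip_special_tokens=True).replace(" ", "")
print(three_di)
```

The reverse direction (3Di to amino acids) would follow the same pattern with the complementary prefix; generating 3Di directly from sequence is what yields the reported speed-up over running full 3D structure prediction before Foldseek-style searches.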