NAR Genom Bioinform. 2024 Nov 15;6(4):lqae150.
doi: 10.1093/nargab/lqae150. eCollection 2024 Dec.

Bilingual language model for protein sequence and structure


Michael Heinzinger et al. NAR Genom Bioinform.

Abstract

Adapting language models to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities in a single model. We encode protein structures as token sequences using the 3Di-alphabet introduced by the 3D-alignment method Foldseek. For training, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein 'structure-sequence' T5 (ProstT5), we showed improved performance on subsequent, structure-related prediction tasks and a three-orders-of-magnitude speedup in deriving 3Di. This will be crucial for future applications trying to search metagenomic sequence databases at the sensitivity of structure comparisons. Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions and opens new research avenues in the post-AlphaFold2 era.
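To make the feature-extraction use case concrete, the following is a minimal sketch (Python, Hugging Face transformers) of how per-residue embeddings could be pulled from the ProstT5 encoder. The model identifier "Rostlab/ProstT5" and the "<AA2fold>" prefix token are assumptions based on the public ProstT5 release, not details stated in this record; consult the official repository before relying on them.

import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

model_id = "Rostlab/ProstT5"  # assumed Hugging Face Hub identifier
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5EncoderModel.from_pretrained(model_id).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino acid sequence
# T5-style pLMs typically expect space-separated residues; rare amino acids are
# mapped to X. The "<AA2fold>" prefix (an assumption) marks the input as amino acids.
prepared = "<AA2fold> " + " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # shape: (1, n_tokens, hidden_dim)

# Per-residue vectors (after dropping prefix/special tokens) can be fed into
# downstream predictors, as in Figure 3.
print(embeddings.shape)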


Figures

Figure 1.
Sketch of ProstT5. Model architecture: ProstT5 is a T5-based encoder-decoder model initialized with the weights of the ProtT5 model (6). Pre-training: Foldseek (1) converted protein 3D coordinates into 3Di tokens, i.e. 1D descriptions of 3D structure that assign each residue in a protein to one of twenty states, yielding a 1D string of letters. We used 17 million (17M) high-quality, non-redundant and diverse 3D predictions from AFDB (33). ProtT5 served as an already pre-trained starting point for translating between 1D sequence (amino acids, AA) and 3D structure (3Di). First, we applied the original pre-training objective of ProtT5 (span-based denoising) to both AAs and 3Di to teach the model the new 3Di tokens while avoiding catastrophic forgetting of AAs. Second, we continued training the resulting model to translate from AAs to 3Di and vice versa. The final model, ProstT5 (Protein structure-sequence T5), captures information in its internal embeddings that can be input into downstream applications. This includes established feature extraction using only the encoder (6), or bi-directional translation, either from AAs to 3Di ('folding') or from 3Di to AAs ('inverse folding'). Inference: bi-directional translation (AA→3Di or 3Di→AA) can be conducted either in encoder-decoder mode, which requires token-wise decoder inference, or in an optimized inference mode, in which 3Di tokens are predicted directly from the encoder embedding by a convolutional neural network. The optimized 3Di inference mode yields a three-orders-of-magnitude speedup over extracting 3Di from predicted protein structures (Figure 2).
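As an illustration of the 'folding' direction (AA→3Di) described above, here is a minimal encoder-decoder generation sketch. It assumes the released checkpoint is loadable with T5ForConditionalGeneration and reuses the assumed identifier and prefix conventions from the sketch above; the optimized CNN inference mode is not shown.

import re
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "Rostlab/ProstT5"  # assumed Hub identifier
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id).eval()

aa_sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
src = "<AA2fold> " + " ".join(re.sub(r"[UZOB]", "X", aa_sequence))
inputs = tokenizer(src, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=len(aa_sequence) + 2,  # roughly one 3Di token per residue
        num_beams=3,
        early_stopping=True,
    )

# Decoded 3Di string (spaces removed). The reverse direction, 3Di→AA
# ('inverse folding'), works analogously with the opposite prefix.
pred_3di = tokenizer.decode(out[0], skip_special_tokens=True).replace(" ", "")
print(pred_3di)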
Figure 2.
Successful remote homology detection with predicted 3Di. We replicated the Foldseek benchmark (1) on SCOPe40 (57) using 3Di strings generated either by ProstT5 (Foldseek(p3Di)) or by a CNN trained on top of ProstT5's encoder (Foldseek(p3Di-CNN)), and compared the sensitivity up to the first false positive (a protein with a different classification) with the performance of Foldseek on experimental structures (Foldseek (3Di)). At all three levels (from the fine-grained family level on the left, through the superfamily level, to the coarse-grained fold level), ProstT5-predicted 3Di strings sufficed to almost reach the performance obtained with PDB structures while significantly outperforming traditional sequence alignment (MMseqs2 (35)).
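For readers unfamiliar with the benchmark metric, the following illustrative helper (names and data layout are hypothetical, not the actual Foldseek benchmark code) shows how 'sensitivity up to the first false positive' can be computed for a single query from a score-ranked hit list.

def sensitivity_to_first_fp(ranked_hits, same_class, n_possible_tps):
    """Fraction of all possible true positives (hits of the same SCOPe family,
    superfamily or fold, depending on the level) recovered before the first
    false positive appears in the score-ranked hit list."""
    true_positives = 0
    for hit in ranked_hits:
        if same_class(hit):
            true_positives += 1
        else:
            break  # stop counting at the first false positive
    return true_positives / n_possible_tps if n_possible_tps else 0.0

# Example: 3 same-family hits precede the first false positive, out of 5 possible
hits = ["famA", "famA", "famA", "famB", "famA"]
print(sensitivity_to_first_fp(hits, lambda h: h == "famA", 5))  # 0.6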
Figure 3.
Protein prediction tasks exclusively using pLM embeddings. We probed the relevance of the information learned by ProstT5 by inputting its embeddings into subsequent supervised prediction methods, as introduced before (5). In particular, we compared ProstT5 to state-of-the-art general-purpose pLMs that use only amino acid sequences (ProtT5 (6), Ankh (12) and ESM-2 (3B) (8)) on four different prediction tasks, namely the per-residue prediction of secondary structure (A: performance: Q3, three-state per-residue accuracy; data sets: middle: CASP12 (48), lower bar: CASP14 (52), upper bar: NEW364 (6); note: since each set is meant to measure the same performance, the differences between them provide an error estimate), binding residues (B: performance: F1; data: testSet300 (17)), conservation (C: performance: Q9, nine-state per-residue accuracy; data: (53)), and the per-protein prediction of subcellular location (D: performance: Q10, ten-state per-protein accuracy; data: setHARD (18)). As a baseline, we also probed the information content readily available from one-hot-encoded 3Di sequences (OHE (3Di)). For panels B–D, the bars mark the 95% confidence interval, i.e. ±1.96 × standard error, estimated via bootstrapping.
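As context for how such supervised probes are typically built on frozen embeddings, here is a minimal per-residue probe sketch in PyTorch. The two-layer convolutional head and its hyperparameters are illustrative assumptions, not the exact architectures evaluated in this figure.

import torch
import torch.nn as nn

class ResidueProbe(nn.Module):
    """Lightweight head mapping frozen per-residue pLM embeddings to
    per-residue class logits (e.g. three secondary-structure states)."""
    def __init__(self, embed_dim=1024, n_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv1d(embed_dim, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, n_classes, kernel_size=7, padding=3),
        )

    def forward(self, embeddings):            # (batch, seq_len, embed_dim)
        x = embeddings.transpose(1, 2)        # -> (batch, embed_dim, seq_len)
        return self.head(x).transpose(1, 2)   # -> (batch, seq_len, n_classes)

# Example with random stand-in embeddings for a 120-residue protein
probe = ResidueProbe()
dummy_embeddings = torch.randn(1, 120, 1024)
print(probe(dummy_embeddings).shape)  # torch.Size([1, 120, 3])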
Figure 4.
Inverse folding examples. We manually picked four examples from our test set for which both ProstT5 and ProteinMPNN (A and B), only ProstT5 (C), or only ProteinMPNN (D) generated sequences with high structural similarity to their natural counterparts. Structures colored in green show the AlphaFold2 predictions (considered ground truth); blue and orange depict ESMFold predictions of ProstT5-generated (blue) and ProteinMPNN-generated (orange) sequences, respectively. We picked examples that show diversity in their structural composition (beta-strands and alpha-helices) and their place of action (transmembrane (B) versus soluble (A, C, D)). Both methods can produce proteins that share little sequence similarity but high structural similarity with their native counterparts (A and B: lDDT of 76–95 or RMSD of 1.1–2.6 at 23–44% sequence similarity), but there are also cases where only one of them succeeds (C: ProstT5 (lDDT)=68 versus ProteinMPNN (lDDT)=34; D: ProstT5 (lDDT)=56 versus ProteinMPNN (lDDT)=75). For better visibility, we increased transparency for cases with poor structural superposition (C: ProteinMPNN, D: ProstT5).
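For completeness, a minimal sketch of the 'inverse folding' direction (3Di→AA) used to generate such sequences. The "<fold2AA>" prefix, the lowercase 3Di input and the sampling settings are assumptions based on the public ProstT5 release, and the 3Di string below is a toy placeholder; generated sequences would still need to be folded (e.g. with ESMFold) and compared to the target structure as done in this figure.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "Rostlab/ProstT5"  # assumed Hub identifier
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id).eval()

# Toy 3Di string (lowercase, one letter per residue); real input would come
# from Foldseek or from ProstT5's own AA->3Di translation.
three_di = "dpvqlvvlcvvlvvqdpd"
src = "<fold2AA> " + " ".join(three_di)
inputs = tokenizer(src, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=len(three_di) + 2,
        do_sample=True,   # sampling yields diverse candidate sequences
        top_p=0.9,
    )

designed_aa = tokenizer.decode(out[0], skip_special_tokens=True).replace(" ", "")
print(designed_aa)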

References

    1. van Kempen M., Kim S.S., Tumescheit C., Mirdita M., Lee J., Gilchrist C.L.M., Söding J., Steinegger M. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 2024; 42:243–246.
    2. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser L., Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems. 2017; 5998–6008.
    3. Brown T.B., Mann B., Ryder N., Subbiah M., Kaplan J., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A. et al. Language models are few-shot learners. arXiv. 2020; preprint: not peer reviewed. https://arxiv.org/abs/2005.14165.
    4. Ouyang L., Wu J., Jiang X., Almeida D., Wainwright C.L., Mishkin P., Zhang C., Agarwal S., Slama K., Ray A. et al. Training language models to follow instructions with human feedback. arXiv. 2022; preprint: not peer reviewed. https://arxiv.org/abs/2203.02155.
    5. Heinzinger M., Elnaggar A., Wang Y., Dallago C., Nechaev D., Matthes F., Rost B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinf. 2019; 20:723.