Protein language models trained on multiple sequence alignments learn phylogenetic relationships
- PMID: 36273003
- PMCID: PMC9588007
- DOI: 10.1038/s41467-022-34032-y
Abstract
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
© 2022. The Author(s).
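The central measurement described in the abstract can be sketched in a few lines: summarize MSA Transformer's column attentions into a sequence-by-sequence matrix and correlate it with the pairwise Hamming distances between the aligned sequences. The sketch below is illustrative only and does not come from the authors' released code; it assumes `col_attn` is a pre-extracted NumPy array of column attentions with shape (layers, heads, columns, M, M) for an MSA of M sequences, and it uses a plain average over layers, heads and columns in place of the paper's simple learned combinations.

```python
# Minimal sketch (assumptions noted above; not the authors' implementation):
# correlate averaged MSA Transformer column attentions with Hamming distances.
import numpy as np
from scipy.stats import spearmanr


def hamming_distance_matrix(msa):
    """Pairwise Hamming distances (fraction of differing columns) between aligned sequences."""
    arr = np.array([list(seq) for seq in msa])      # shape (M, L)
    m = arr.shape[0]
    dist = np.zeros((m, m))
    for i in range(m):
        dist[i] = (arr != arr[i]).mean(axis=1)      # fraction of mismatched columns vs. sequence i
    return dist


def averaged_column_attention(col_attn):
    """Average column attentions over layers, heads and columns to get an (M, M) matrix."""
    avg = col_attn.mean(axis=(0, 1, 2))             # collapse layers, heads, alignment columns
    return 0.5 * (avg + avg.T)                      # symmetrize, since distances are symmetric


def attention_distance_correlation(col_attn, msa):
    """Spearman correlation between averaged column attention and Hamming distance."""
    att = averaged_column_attention(col_attn)
    dist = hamming_distance_matrix(msa)
    mask = ~np.eye(len(msa), dtype=bool)            # ignore trivial self-comparisons on the diagonal
    rho, _ = spearmanr(att[mask], dist[mask])
    return rho
```

Under these assumptions, a strong Spearman correlation between the averaged column attention and the Hamming distances would reflect the abstract's finding that column attentions encode phylogenetic relatedness between the sequences of the MSA.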
Conflict of interest statement
The authors declare no competing interests.
Similar articles
- The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction. Biomolecules. 2024 Nov 29;14(12):1531. doi: 10.3390/biom14121531. PMID: 39766238. Free PMC article. Review.
- Generative power of a protein language model trained on multiple sequence alignments. Elife. 2023 Feb 3;12:e79854. doi: 10.7554/eLife.79854. PMID: 36734516. Free PMC article.
- Pairing interacting protein sequences using masked language modeling. Proc Natl Acad Sci U S A. 2024 Jul 2;121(27):e2311887121. doi: 10.1073/pnas.2311887121. Epub 2024 Jun 24. PMID: 38913900. Free PMC article.
- Phylogenetic Corrections and Higher-Order Sequence Statistics in Protein Families: The Potts Model vs MSA Transformer. ArXiv [Preprint]. 2025 Mar 1:arXiv:2503.00289v1. PMID: 40365615. Free PMC article. Preprint.
- Are protein language models the new universal key? Curr Opin Struct Biol. 2025 Apr;91:102997. doi: 10.1016/j.sbi.2025.102997. Epub 2025 Feb 7. PMID: 39921962. Review.
Cited by
- Understanding the natural language of DNA using encoder-decoder foundation models with byte-level precision. Bioinform Adv. 2024 Aug 12;4(1):vbae117. doi: 10.1093/bioadv/vbae117. eCollection 2024. PMID: 39176288. Free PMC article.
- The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction. Biomolecules. 2024 Nov 29;14(12):1531. doi: 10.3390/biom14121531. PMID: 39766238. Free PMC article. Review.
- Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods. Pharmaceutics. 2023 Apr 25;15(5):1337. doi: 10.3390/pharmaceutics15051337. PMID: 37242577. Free PMC article.
- Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision. ArXiv [Preprint]. 2024 Aug 22:arXiv:2311.02333v3. PMID: 38410643. Free PMC article. Preprint. Update in: Bioinform Adv. 2024 Aug 12;4(1):vbae117. doi: 10.1093/bioadv/vbae117.
- Computational drug development for membrane protein targets. Nat Biotechnol. 2024 Feb;42(2):229-242. doi: 10.1038/s41587-023-01987-2. Epub 2024 Feb 15. PMID: 38361054. Review.