Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Umberto Lupo et al. Nat Commun. 2022 Oct 22;13(1):6298. doi: 10.1038/s41467-022-34032-y.

Abstract

Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
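
As a point of reference, the normalised Hamming distance between two aligned sequences is simply the fraction of MSA columns at which they differ. A minimal sketch in Python (the toy sequences and function name are illustrative, not taken from the paper):

    def normalised_hamming(seq_a, seq_b):
        """Fraction of MSA columns at which two aligned sequences differ."""
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

    # Two rows of a toy MSA (gaps '-' count like any other symbol)
    print(normalised_hamming("MKV-LSD", "MKVALSE"))  # 2/7 ≈ 0.286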


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. MSA Transformer: column attentions and Hamming distances.
a MSA Transformer is trained using the masked language modeling objective of filling in randomly masked residue positions in MSAs. For each residue position in an input MSA, it assigns attention scores to all residue positions in the same row (sequence) and column (site) in the MSA. These computations are performed by 12 independent row/column attention heads in each of 12 successive layers of the network. b Our approach for Hamming distance matrix prediction from the column attentions computed by the trained MSA Transformer model, using a natural MSA as input. For each i = 1, …, M, j = 0, …, L and l = 1, …, 12, the embedding vector xij(l) is the i-th row of the matrix Xj(l) defined in “Methods – MSA Transformer and column attention”, and the column attentions are computed according to Eqs. (2) and (3).
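For intuition, a generic single-head scaled dot-product attention over the embeddings of one MSA column has the following form. This is only a sketch: the paper's Eqs. (2) and (3), the learned projection weights, and the handling of the start-token column are not reproduced here, and all variable names are placeholders.

    import numpy as np

    def column_attention(X_j, W_q, W_k):
        """Single-head scaled dot-product attention among the M sequences
        (rows) of one MSA column.

        X_j      : (M, d)   embeddings of column j at a given layer (the
                            x_ij of the caption, stacked over sequences i).
        W_q, W_k : (d, d_k) query and key projections of one attention head.
        Returns an (M, M) matrix of attention weights between sequences.
        """
        Q = X_j @ W_q                                  # queries, (M, d_k)
        K = X_j @ W_k                                  # keys,    (M, d_k)
        scores = Q @ K.T / np.sqrt(W_q.shape[1])       # scaled similarities
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        return weights / weights.sum(axis=-1, keepdims=True)  # softmax rows

    # Toy usage with random embeddings and projections (M sequences, d dims)
    rng = np.random.default_rng(0)
    M, d, d_k = 8, 16, 4
    A = column_attention(rng.normal(size=(M, d)),
                         rng.normal(size=(d, d_k)),
                         rng.normal(size=(d, d_k)))
    print(A.shape)   # (8, 8): one weight per ordered pair of sequences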
Fig. 2. Fitting logistic models to predict Hamming distances separately in each MSA.
The column-wise means of MSA Transformer’s column attention heads are used to predict normalised Hamming distances as probabilities in a logistic model. Each MSA is randomly split into a training set comprising 70% of its sequences and a test set composed of the remaining sequences. For each MSA, a logistic model is trained on all pairwise distances in the training set. Regression coefficients are shown for each layer and attention head (first column), as well as their absolute values averaged over heads for each layer (second column). For four example MSAs, ground truth Hamming distances are shown in the upper triangle (blue) and predicted Hamming distances in the lower triangle and diagonal (green), for the training and test sets (third and fourth columns). Darker shades correspond to larger Hamming distances.
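A minimal sketch of this kind of fit, assuming that the features for each pair of sequences are the 144 column-wise mean attention values (12 layers × 12 heads) and that the target is the pair's normalised Hamming distance treated as a probability. The data below are random placeholders, and plain gradient descent on the cross-entropy loss stands in for whatever solver the authors used.

    import numpy as np

    def fit_logistic(F, y, lr=0.1, n_steps=5000):
        """Fit sigmoid(F @ w + b) to targets y in [0, 1] by gradient
        descent on the cross-entropy loss.

        F : (n_pairs, n_features) one feature row per pair of sequences,
            e.g. the 144 column-wise mean attentions (12 layers x 12 heads).
        y : (n_pairs,) normalised Hamming distances of those pairs.
        """
        n, p = F.shape
        w, b = np.zeros(p), 0.0
        for _ in range(n_steps):
            pred = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # predicted distances
            grad = pred - y                            # d(loss)/d(logit)
            w -= lr * (F.T @ grad) / n
            b -= lr * grad.mean()
        return w, b

    # Toy usage: 100 random "pairs" with 144 placeholder attention features
    rng = np.random.default_rng(1)
    F = rng.normal(size=(100, 144))
    y = rng.uniform(size=100)
    w, b = fit_logistic(F, y)
    print(w.shape)   # (144,): one regression coefficient per layer/head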
Fig. 3. Pearson correlations between regression coefficients in larger MSAs.
Sufficiently deep (≥ 100 sequences) and long (≥ 30 residues) MSAs are considered (mean/min/max Pearson correlations: 0.80/0.69/0.87).
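These correlations compare, across pairs of MSAs, the 144 per-layer, per-head regression coefficients fitted as in Fig. 2. A toy sketch of that comparison (the coefficient vectors below are random placeholders, not values from the paper):

    import numpy as np

    # Pearson correlation between the 144 regression coefficients
    # (12 layers x 12 heads) fitted on two different MSAs
    rng = np.random.default_rng(2)
    coeffs_msa_1 = rng.normal(size=144)
    coeffs_msa_2 = 0.8 * coeffs_msa_1 + 0.2 * rng.normal(size=144)
    r = np.corrcoef(coeffs_msa_1, coeffs_msa_2)[0, 1]
    print(round(r, 2))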
Fig. 4. Fitting a single logistic model to predict Hamming distances.
Our collection of 15 MSAs is split into a training set comprising 12 of them and a test set composed of the remaining 3. A logistic regression is trained on all pairwise distances within each MSA in the training set. Regression coefficients (first panel) and their absolute values averaged over heads for each layer (second panel) are shown as in Fig. 2. For the three test MSAs, ground truth Hamming distances are shown in the upper triangle (blue) and predicted Hamming distances in the lower triangle and diagonal (green), also as in Fig. 2 (last three panels). We further report the R2 coefficients of determination for the regressions on these test MSAs—see also Supplementary Table 2.
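For reference, the coefficient of determination reported here compares predicted and ground-truth Hamming distances on each test MSA; a minimal sketch with placeholder values:

    import numpy as np

    def r_squared(y_true, y_pred):
        """Coefficient of determination R^2 for predicted Hamming distances."""
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1.0 - ss_res / ss_tot

    # Placeholder ground-truth and predicted normalised Hamming distances
    y_true = np.array([0.10, 0.35, 0.50, 0.80])
    y_pred = np.array([0.12, 0.30, 0.55, 0.75])
    print(round(r_squared(y_true, y_pred), 3))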
Fig. 5. Correlations from coevolution and from phylogeny in MSAs.
a Natural selection on structure and function leads to correlations between residue positions in MSAs (coevolution). b Potts models, also known as DCA, aim to capture these correlations in their pairwise couplings. c Historical contingency can lead to correlations even in the absence of structural or functional constraints.
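For reference, a Potts model (as used in DCA) assigns each aligned sequence a probability through single-site fields h_i and pairwise couplings J_ij. This is the standard form of such models, not a reproduction of the paper's Methods:

    P(a_1, \dots, a_L) = \frac{1}{Z} \exp\!\left( \sum_{i=1}^{L} h_i(a_i) + \sum_{1 \le i < j \le L} J_{ij}(a_i, a_j) \right)

where Z normalises over all possible sequences. The couplings J_ij are the quantities meant to capture the coevolutionary correlations of panel b, while panel c illustrates that phylogeny alone can also generate correlations in MSAs.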

