Proc Natl Acad Sci U S A. 2024 Jul 2;121(27):e2311887121.
doi: 10.1073/pnas.2311887121. Epub 2024 Jun 24.

Pairing interacting protein sequences using masked language modeling


Umberto Lupo et al. Proc Natl Acad Sci U S A. 2024.

Abstract

Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences that leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves performance competitive with orthology-based pairing.
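
To make the pairing task concrete, here is a toy, brute-force sketch of the underlying paralog-matching problem (not the authors' code): within each species, the paralogs of family A must be put in one-to-one correspondence with those of family B so as to optimize a score. In DiffPALM that score is the masked-language-modeling loss of MSA Transformer on the paired alignment and the search is relaxed into a differentiable optimization; the placeholder `pairing_score` and the exhaustive search below are purely illustrative.

```python
# Toy illustration of within-species paralog matching (not the DiffPALM code).
from itertools import permutations

def pairing_score(pairs):
    """Placeholder scorer (lower is better); a real scorer would be an MLM loss.
    Toy rule: reward pairs whose sequences share the same last character."""
    return -sum(a[-1] == b[-1] for a, b in pairs)

def best_within_species_pairing(seqs_a, seqs_b):
    """Exhaustively search one-to-one pairings of one species' paralogs."""
    best_perm, best_score = None, float("inf")
    for perm in permutations(range(len(seqs_b))):
        pairs = [(seqs_a[i], seqs_b[j]) for i, j in enumerate(perm)]
        score = pairing_score(pairs)
        if score < best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score

# Toy species with three paralogs per family.
seqs_a = ["MKTA", "MRLS", "MQVG"]
seqs_b = ["MEES", "MHKG", "MNPA"]
print(best_within_species_pairing(seqs_a, seqs_b))
```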

Keywords: coevolution; machine learning; protein complex structure; protein language models; protein–protein interactions.


Conflict of interest statement

Competing interests statement: The authors declare no competing interest.

Figures

Fig. 1.
Performance of DiffPALM on small HK-RR MSAs. The performance of two variants of DiffPALM (MRA and IPA; see Materials and Methods, Improving precision: MRA and IPA) is shown versus the number of runs used for the MRA variant, for 40 MSAs comprising about 50 HK-RR pairs. The chance expectation, and the performance of various other methods, are reported as baselines. Three existing coevolution-based methods are considered: DCA-IPA (14), MI-IPA (47), and GA-IPA (50). We also consider a pairing method based on the scores given by the ESM-2 (650M) single-sequence protein language model (5), see Materials and Methods, Pairing Based on a Single-Sequence Language Model. With all methods, a full one-to-one within-species pairing is produced, and performance is measured by precision (also called positive predictive value or PPV), namely, the fraction of correct pairs among predicted pairs. The default score is “precision-100,” where this fraction is computed over all predicted pairs (100% of them). For DiffPALM-MRA, we also report “precision-10,” which is calculated over the top 10% predicted pairs, when ranked by predicted confidence within each MSA (Materials and Methods). For DiffPALM, we plot the mean performance on all MSAs (color shading), and the SE range (shaded region). For our ESM-2-based method, we consider 10 different values of masking probability p from 0.1 to 1.0, and we report the range of precisions obtained (gray shading). For other baselines, we report the mean performance on all MSAs.
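
A minimal sketch of how the two precision scores described in this caption could be computed from a predicted one-to-one pairing; the array names and toy values are ours, not taken from the paper.

```python
# Sketch of precision-100 and precision-10 for one MSA (illustrative only).
import numpy as np

def precision_at(predicted, true, confidence, fraction=1.0):
    """Fraction of correct pairs among the top `fraction` most confident pairs."""
    order = np.argsort(confidence)[::-1]           # most confident first
    k = max(1, int(round(fraction * len(order))))  # e.g. 0.1 -> top 10%
    top = order[:k]
    return float(np.mean(predicted[top] == true[top]))

# Hypothetical data: predicted[i] is the family-B partner index assigned to
# family-A paralog i, true[i] is the correct partner, confidence[i] the score.
predicted = np.array([2, 0, 1, 3])
true = np.array([2, 1, 0, 3])
confidence = np.array([0.9, 0.2, 0.4, 0.8])
print(precision_at(predicted, true, confidence, 1.0))  # precision-100
print(precision_at(predicted, true, confidence, 0.1))  # precision-10
```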
Fig. 2.
Impact of positive examples, MSA depth, and extension to another pair of protein families. We report the performance of DiffPALM with five MRA runs (measured as precision-100 and precision-10, see Fig. 1), for various numbers of positive examples, on the same HK-RR MSAs as in Fig. 1 (Left panel). We also report the performance of DiffPALM (using no positive examples) versus MSA depth for both HK-RR and MALG-MALK pairs (Middle and Right panels). In all cases, we show the mean value over different MSAs and its SE, and we plot the chance expectation for reference. Note that MSA depth can vary by ±10% around the reported value because complete species are used (SI Appendix, Datasets).
Fig. 3.
Performance of AFM using different pairing methods. We use AFM to predict the structure of protein complexes starting from differently paired MSAs, each constructed from the same initial unpaired MSAs. Three pairing methods are considered: the default AFM pairing, pairing only orthologs of the two query sequences, and a single run of DiffPALM (equivalent to one MRA run); a single run was used to limit computational cost. Performance is evaluated using DockQ scores (Top panels), a widely used measure of quality for protein–protein docking (62), and AFM confidence scores (Bottom panels), see SI Appendix, General Points on AFM. The latter are also used as transparency levels in the Top panels, where more transparent markers denote predicted structures with low AFM confidence. For each query complex, AFM is run five times; each run yields 25 predictions ranked by AFM confidence score, and the top five predicted structures are selected from each run, giving 25 predicted structures in total per complex. Out of the 15 complexes listed in SI Appendix, Table S1, we restrict to those for which at least two of the three pairing methods differ by more than 0.1 in DockQ score, averaged over runs, at some fixed within-run rank according to AFM confidence. Panels are ordered by increasing mean DockQ score for the AFM default method.
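
The following sketch illustrates our reading of this selection protocol with made-up confidence and DockQ values; the helper names are hypothetical and this is not the authors' evaluation code.

```python
# Sketch of the Fig. 3 selection protocol with synthetic numbers.
import numpy as np

rng = np.random.default_rng(0)

def select_top5_per_run(confidence, dockq):
    """confidence, dockq: arrays of shape (n_runs, n_predictions_per_run).
    Keep, for each run, the DockQ of its 5 most AFM-confident predictions,
    ordered by within-run rank."""
    kept = []
    for run_conf, run_dockq in zip(confidence, dockq):
        top5 = np.argsort(run_conf)[::-1][:5]  # within-run ranking by confidence
        kept.append(run_dockq[top5])
    return np.array(kept)                      # shape (n_runs, 5)

# Two hypothetical pairing methods for one complex: 5 runs x 25 predictions.
conf_a, dockq_a = rng.random((5, 25)), rng.random((5, 25))
conf_b, dockq_b = rng.random((5, 25)), rng.random((5, 25))
kept_a = select_top5_per_run(conf_a, dockq_a)
kept_b = select_top5_per_run(conf_b, dockq_b)

# The complex is retained if, at some within-run rank, the DockQ averaged over
# runs differs by more than 0.1 between the two methods.
diff = np.abs(kept_a.mean(axis=0) - kept_b.mean(axis=0))
print("retained:", bool((diff > 0.1).any()))
```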
Fig. 4.
Schematic of the DiffPALM method. First, the parameterization matrices X_k are initialized, and then the following steps are repeated until the loss converges: 1) Compute the permutation matrix M(X_k) and use it to shuffle the MSA of family A relative to that of family B; then pair the two MSAs. 2) Randomly mask some tokens from one of the two sides of the paired MSA and compute the MLM loss, see SI Appendix, Eq. S1. 3) Backpropagate the loss and update the parameterization matrices X_k, using the Sinkhorn operator Ŝ for the backward step instead of the matching operator M (SI Appendix, A differentiable formulation of paralog matching).
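
A minimal PyTorch sketch of the differentiable-permutation trick this caption describes: a hard permutation from the matching (Hungarian) operator in the forward pass, with gradients taken through the smooth Sinkhorn operator in the backward pass (a straight-through estimator). Function names, the toy loss, and all numerical choices are ours, not the authors' implementation.

```python
# Straight-through hard permutation with Sinkhorn gradients (illustrative sketch).
import torch
from scipy.optimize import linear_sum_assignment

def sinkhorn(X, n_iter=20, tau=1.0):
    """Sinkhorn operator: alternately normalize rows and columns of exp(X / tau)."""
    log_p = X / tau
    for _ in range(n_iter):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns
    return log_p.exp()  # approximately doubly stochastic

def matching(X):
    """Matching operator: hard permutation matrix maximizing the sum of selected entries."""
    rows, cols = linear_sum_assignment(X.detach().numpy(), maximize=True)
    P = torch.zeros_like(X)
    P[torch.as_tensor(rows), torch.as_tensor(cols)] = 1.0
    return P

def soft_permutation(X):
    """Hard permutation in the forward pass, Sinkhorn gradients in the backward pass."""
    S = sinkhorn(X)
    return S + (matching(X) - S).detach()

# Toy use: X parameterizes the pairing of 4 family-A with 4 family-B paralogs.
X = torch.randn(4, 4, requires_grad=True)
P = soft_permutation(X)
loss = (P * torch.randn(4, 4)).sum()  # stand-in for the MLM loss on the paired MSA
loss.backward()
print(P, X.grad)
```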


References

    1. Rajagopala S. V., et al., The binary protein-protein interaction landscape of Escherichia coli. Nat. Biotechnol. 32, 285–290 (2014).
    2. Jumper J., et al., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
    3. Baek M., et al., Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
    4. Chowdhury R., et al., Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).
    5. Lin Z., et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
