Proc Natl Acad Sci U S A. 2024 Jul 2;121(27):e2311887121.
doi: 10.1073/pnas.2311887121. Epub 2024 Jun 24.

Pairing interacting protein sequences using masked language modeling


Umberto Lupo et al. Proc Natl Acad Sci U S A. 2024.

Abstract

Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences that leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves performance competitive with orthology-based pairing.
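
To make the pairing task concrete, here is a toy, brute-force sketch of the underlying paralog-matching problem (not the authors' code): within each species, the paralogs of family A must be put in one-to-one correspondence with those of family B so as to optimize a score. In DiffPALM that score is the masked-language-modeling loss of MSA Transformer on the paired alignment and the search is relaxed into a differentiable optimization; the placeholder `pairing_score` and the exhaustive search below are purely illustrative.

```python
# Toy illustration of within-species paralog matching (not the DiffPALM code).
from itertools import permutations

def pairing_score(pairs):
    """Placeholder scorer (lower is better); a real scorer would be an MLM loss.
    Toy rule: reward pairs whose sequences share the same last character."""
    return -sum(a[-1] == b[-1] for a, b in pairs)

def best_within_species_pairing(seqs_a, seqs_b):
    """Exhaustively search one-to-one pairings of one species' paralogs."""
    best_perm, best_score = None, float("inf")
    for perm in permutations(range(len(seqs_b))):
        pairs = [(seqs_a[i], seqs_b[j]) for i, j in enumerate(perm)]
        score = pairing_score(pairs)
        if score < best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score

# Toy species with three paralogs per family.
seqs_a = ["MKTA", "MRLS", "MQVG"]
seqs_b = ["MEES", "MHKG", "MNPA"]
print(best_within_species_pairing(seqs_a, seqs_b))
```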

Keywords: coevolution; machine learning; protein complex structure; protein language models; protein–protein interactions.


Conflict of interest statement

Competing interests statement: The authors declare no competing interest.

Figures

Fig. 1.
Performance of DiffPALM on small HK-RR MSAs. The performance of two variants of DiffPALM (MRA and IPA; see Materials and Methods, Improving precision: MRA and IPA) is shown versus the number of runs used for the MRA variant, for 40 MSAs comprising about 50 HK-RR pairs. The chance expectation, and the performance of various other methods, are reported as baselines. Three existing coevolution-based methods are considered: DCA-IPA (14), MI-IPA (47), and GA-IPA (50). We also consider a pairing method based on the scores given by the ESM-2 (650M) single-sequence protein language model (5), see Materials and Methods, Pairing Based on a Single-Sequence Language Model. With all methods, a full one-to-one within-species pairing is produced, and performance is measured by precision (also called positive predictive value or PPV), namely, the fraction of correct pairs among predicted pairs. The default score is “precision-100,” where this fraction is computed over all predicted pairs (100% of them). For DiffPALM-MRA, we also report “precision-10,” which is calculated over the top 10% predicted pairs, when ranked by predicted confidence within each MSA (Materials and Methods). For DiffPALM, we plot the mean performance on all MSAs (color shading), and the SE range (shaded region). For our ESM-2-based method, we consider 10 different values of masking probability p from 0.1 to 1.0, and we report the range of precisions obtained (gray shading). For other baselines, we report the mean performance on all MSAs.
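
A minimal sketch of how the two precision scores described in this caption could be computed from a predicted one-to-one pairing; the array names and toy values are ours, not taken from the paper.

```python
# Sketch of precision-100 and precision-10 for one MSA (illustrative only).
import numpy as np

def precision_at(predicted, true, confidence, fraction=1.0):
    """Fraction of correct pairs among the top `fraction` most confident pairs."""
    order = np.argsort(confidence)[::-1]           # most confident first
    k = max(1, int(round(fraction * len(order))))  # e.g. 0.1 -> top 10%
    top = order[:k]
    return float(np.mean(predicted[top] == true[top]))

# Hypothetical data: predicted[i] is the family-B partner index assigned to
# family-A paralog i, true[i] is the correct partner, confidence[i] the score.
predicted = np.array([2, 0, 1, 3])
true = np.array([2, 1, 0, 3])
confidence = np.array([0.9, 0.2, 0.4, 0.8])
print(precision_at(predicted, true, confidence, 1.0))  # precision-100
print(precision_at(predicted, true, confidence, 0.1))  # precision-10
```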
Fig. 2.
Impact of positive examples, MSA depth, and extension to another pair of protein families. We report the performance of DiffPALM with five MRA runs (measured as precision-100 and precision-10, see Fig. 1), for various numbers of positive examples, on the same HK-RR MSAs as in Fig. 1 (Left panel). We also report the performance of DiffPALM (using no positive examples) versus MSA depth for both HK-RR and MALG-MALK pairs (Middle and Right panels). In all cases, we show the mean value over different MSAs and its SE, and we plot the chance expectation for reference. Note that MSA depth can vary by ±10% around the reported value because complete species are used (SI Appendix, Datasets).
Fig. 3.
Performance of AFM using different pairing methods. We use AFM to predict the structure of protein complexes starting from differently paired MSAs, each constructed from the same initial unpaired MSAs. Three pairing methods are considered: the default AFM pairing, pairing only orthologs of the two query sequences, and a single run of DiffPALM (equivalent to one MRA run); a single run was used to limit computational cost. Performance is evaluated using DockQ scores (Top panels), a widely used measure of quality for protein–protein docking (62), and AFM confidence scores (Bottom panels), see SI Appendix, General Points on AFM. The latter are also used as transparency levels in the Top panels, where more transparent markers denote predicted structures with low AFM confidence. For each query complex, AFM is run five times; each run yields 25 predictions ranked by AFM confidence score, and the top five predicted structures are selected from each run, giving 25 predicted structures in total per complex. Out of the 15 complexes listed in SI Appendix, Table S1, we restrict to those for which at least two of the three pairing methods differ by more than 0.1 in DockQ score, averaged over runs, at some fixed within-run rank according to AFM confidence. Panels are ordered by increasing mean DockQ score for the AFM default method.
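
The following sketch illustrates our reading of this selection protocol with made-up confidence and DockQ values; the helper names are hypothetical and this is not the authors' evaluation code.

```python
# Sketch of the Fig. 3 selection protocol with synthetic numbers.
import numpy as np

rng = np.random.default_rng(0)

def select_top5_per_run(confidence, dockq):
    """confidence, dockq: arrays of shape (n_runs, n_predictions_per_run).
    Keep, for each run, the DockQ of its 5 most AFM-confident predictions,
    ordered by within-run rank."""
    kept = []
    for run_conf, run_dockq in zip(confidence, dockq):
        top5 = np.argsort(run_conf)[::-1][:5]  # within-run ranking by confidence
        kept.append(run_dockq[top5])
    return np.array(kept)                      # shape (n_runs, 5)

# Two hypothetical pairing methods for one complex: 5 runs x 25 predictions.
conf_a, dockq_a = rng.random((5, 25)), rng.random((5, 25))
conf_b, dockq_b = rng.random((5, 25)), rng.random((5, 25))
kept_a = select_top5_per_run(conf_a, dockq_a)
kept_b = select_top5_per_run(conf_b, dockq_b)

# The complex is retained if, at some within-run rank, the DockQ averaged over
# runs differs by more than 0.1 between the two methods.
diff = np.abs(kept_a.mean(axis=0) - kept_b.mean(axis=0))
print("retained:", bool((diff > 0.1).any()))
```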
Fig. 4.
Schematic of the DiffPALM method. First, the parameterization matrices X_k are initialized, and then the following steps are repeated until the loss converges: 1) Compute the permutation matrix M(X_k) and use it to shuffle the MSA of family A relative to that of family B; then pair the two MSAs. 2) Randomly mask some tokens from one of the two sides of the paired MSA and compute the MLM loss, see SI Appendix, Eq. S1. 3) Backpropagate the loss and update the parameterization matrices X_k, using the Sinkhorn operator Ŝ for the backward step instead of the matching operator M (SI Appendix, A differentiable formulation of paralog matching).
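
A minimal PyTorch sketch of the differentiable-permutation trick this caption describes: a hard permutation from the matching (Hungarian) operator in the forward pass, with gradients taken through the smooth Sinkhorn operator in the backward pass (a straight-through estimator). Function names, the toy loss, and all numerical choices are ours, not the authors' implementation.

```python
# Straight-through hard permutation with Sinkhorn gradients (illustrative sketch).
import torch
from scipy.optimize import linear_sum_assignment

def sinkhorn(X, n_iter=20, tau=1.0):
    """Sinkhorn operator: alternately normalize rows and columns of exp(X / tau)."""
    log_p = X / tau
    for _ in range(n_iter):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns
    return log_p.exp()  # approximately doubly stochastic

def matching(X):
    """Matching operator: hard permutation matrix maximizing the sum of selected entries."""
    rows, cols = linear_sum_assignment(X.detach().numpy(), maximize=True)
    P = torch.zeros_like(X)
    P[torch.as_tensor(rows), torch.as_tensor(cols)] = 1.0
    return P

def soft_permutation(X):
    """Hard permutation in the forward pass, Sinkhorn gradients in the backward pass."""
    S = sinkhorn(X)
    return S + (matching(X) - S).detach()

# Toy use: X parameterizes the pairing of 4 family-A with 4 family-B paralogs.
X = torch.randn(4, 4, requires_grad=True)
P = soft_permutation(X)
loss = (P * torch.randn(4, 4)).sum()  # stand-in for the MLM loss on the paired MSA
loss.backward()
print(P, X.grad)
```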


References

    1. Rajagopala S. V., et al., The binary protein-protein interaction landscape of Escherichia coli. Nat. Biotechnol. 32, 285–290 (2014).
    2. Jumper J., et al., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
    3. Baek M., et al., Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
    4. Chowdhury R., et al., Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).
    5. Lin Z., et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
