PairK: Pairwise k-mer alignment for quantifying protein motif conservation in disordered regions

Jackson C Halpin¹, Amy E Keating¹,²,³

Affiliations

¹ Department of Biology, MIT, Cambridge, Massachusetts, USA.
² Department of Biological Engineering, MIT, Cambridge, Massachusetts, USA.
³ Koch Institute for Integrative Cancer Research, Cambridge, Massachusetts, USA.

PMID: 39720898
PMCID: PMC11669117
DOI: 10.1002/pro.70004

Jackson C Halpin et al. Protein Sci. 2025 Jan;34(1):e70004. doi: 10.1002/pro.70004.

Abstract

Protein-protein interactions are often mediated by a modular peptide recognition domain binding to a short linear motif (SLiM) in the disordered region of another protein. To understand the features of SLiMs that are important for binding and to identify motif instances that are important for biological function, it is useful to examine the evolutionary conservation of motifs across homologous proteins. However, the intrinsically disordered regions (IDRs) in which SLiMs reside evolve rapidly. Consequently, multiple sequence alignment (MSA) of IDRs often misaligns SLiMs and underestimates their conservation. We present PairK (pairwise k-mer alignment), an MSA-free method to align and quantify the relative local conservation of subsequences within an IDR. Lacking a ground truth for conservation, we tested PairK on the task of distinguishing biologically important motif instances from background motifs, under the assumption that biologically important motifs are more conserved. The method outperforms both standard MSA-based conservation scores and a modern LLM-based conservation score predictor. PairK can quantify conservation over wider phylogenetic distances than MSAs, indicating that some SLiMs are more conserved than MSA-based metrics imply. PairK is available as an open-source Python package at https://github.com/jacksonh1/pairk. It is designed to be easily adapted for use with other SLiM tools and for diverse applications.

Keywords: conservation; intrinsically disordered proteins; multiple sequence alignment; short linear motif.
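The abstract describes PairK only at a high level, so the following is a minimal sketch of the general idea of MSA-free pairwise k-mer matching, not the package's implementation or API. It assumes a toy identity score (the count of identical positions) and gapless matching; the function names `kmers`, `identity_score`, `best_match`, and `pseudo_alignment` and the example sequences are hypothetical.

```python
def kmers(seq, k):
    """Return every overlapping k-mer in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def identity_score(a, b):
    """Toy score (an assumption): number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


def best_match(query_kmer, homolog_idr):
    """Best-scoring gapless k-mer in one homolog IDR (first wins ties)."""
    candidates = kmers(homolog_idr, len(query_kmer))
    if not candidates:
        return None  # homolog IDR is shorter than k
    return max(candidates, key=lambda c: identity_score(query_kmer, c))


def pseudo_alignment(query_kmer, homolog_idrs):
    """Stack the query k-mer on top of its best match from each homolog."""
    rows = [query_kmer]
    for idr in homolog_idrs:
        match = best_match(query_kmer, idr)
        if match is not None:
            rows.append(match)
    return rows


# Hypothetical IDR fragments, for illustration only.
homologs = ["GSSAELPPPPEDTA", "MKLPPPPSG", "TTQLPAPPKR"]
alignment = pseudo_alignment("LPPPP", homologs)
print(alignment)
```

Because each homolog is searched independently, the motif can be matched wherever it occurs in the homolog's IDR, without the column constraints of an MSA. A substitution matrix could replace the identity score without changing the structure of the sketch.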

Figures

**FIGURE 1**
Detecting evolutionary conservation of short linear motifs is confounded by poor alignment of disordered regions. (a) Slice of an MSA of RIAM, which contains an LPPPP Ena/VASP binding motif (Lafuente et al., 2004), aligned to its vertebrate homolog sequences. (b) *Left* – Part of the SLiM region of the MSA from (a). Matches to the EVH1 motif are highlighted in red. Columns that appear artificially conserved are indicated with a red arrow. *Right* – The apparent conservation of the SLiM residues extracted from the example MSA. X‐axis labels are residues in the human protein. White space in the sequence logo indicates gaps in the corresponding alignment columns. The bar plot shows the conservation scores of the aligned columns. (c) The conservation scores (Shannon entropy from Capra & Singh, 2007) of residues in experimentally verified SLiMs vary with alignment algorithms. Data are from 240 verified SLiM instances (731 residues). Homologs are from metazoans.
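A per-column conservation score based on Shannon entropy, as named in panel (c), can be sketched as follows. This is a simplified illustration, not the Capra & Singh (2007) implementation: it uses no pseudocounts or sequence weighting, and the gap handling (scaling by the non-gap fraction) is an assumption. The name `column_conservation` is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_conservation(column):
    """Return a score in [0, 1]: 1 - normalized Shannon entropy of the
    residues in one alignment column, scaled by the non-gap fraction."""
    residues = [c for c in column if c in AMINO_ACIDS]
    if not residues:
        return 0.0  # all-gap (or empty) column
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    score = 1.0 - entropy / math.log(len(AMINO_ACIDS))
    # Assumption: penalize gaps by the fraction of non-gap characters.
    return score * n / len(column)


print(column_conservation("PPPP"))  # fully conserved column
print(column_conservation("PP--"))  # conserved, but half gaps
```

Under this scheme a motif that is shifted by an MSA into gapped columns scores low even when the motif itself is present in every homolog, which is the failure mode the figure illustrates.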

**FIGURE 2**
The pairwise k‐mer alignment method (PairK) for quantifying the conservation of SLiMs. (a) Schematic of the method. (b) Example z‐scores and sequence logos for an Ena/VASP binding motif from the protein RIAM and its vertebrate homologs. X‐axis labels are residues in the human protein. White space in the sequence logos indicates gaps in the corresponding alignment. The results using an MSA (the same MSA from Figure 1a) are shown at *left*. The positions in the MSA corresponding to the SLiM residues in the human sequence (LPPPP) are extracted and shown in the *middle‐left* panel, with gaps removed. The results from PairK (*middle*‐*right*) and the embedding‐based variant of PairK using ESM2 embeddings (Lin et al., 2023) (*right*) suggest that the LPPPP motif is more conserved than it appears in the MSA.
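The z‐scores in panel (b) express conservation relative to a background distribution. A minimal sketch of that normalization is below; which scores form the background is not stated in this caption, so treating the full list of per-position scores as the background is an assumption, and `z_scores` is a hypothetical name.

```python
import statistics


def z_scores(scores):
    """Standardize scores against their own mean and population SD."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        # No variation in the background: every position is average.
        return [0.0 for _ in scores]
    return [(s - mean) / sd for s in scores]


# Hypothetical per-position conservation scores for one IDR.
background = [0.2, 0.3, 0.25, 0.9, 0.85, 0.3]
print(z_scores(background))
```

Positive values mark positions that are more conserved than the surrounding region, which is what makes the score a relative, local measure.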

**FIGURE 3**
SLiM conservation scoring benchmark. (a) Schematic of the benchmark pipeline. Distributions of the conservation scores of the motifs in the benchmark are shown for the MSA method (b) and PairK (c). Homologous sequences were gathered at the metazoan level. PairK better separates motif matches that are validated (TPs, orange) from background motif matches in the proteome (BG, blue). (d) For each motif in the benchmark except 14‐3‐3, PairK (red) performs better than the MSA method (gray). Error bars are 95% confidence intervals from the bootstrap analysis. The plot is for homologs at the Metazoa level.

**FIGURE 4**
PairK better distinguishes real motifs from background matches for more divergent homologs. (a) Phylogenetic tree of Eukaryotes. (b) Performance (reported as auPRC) for the MSA and PairK methods at different phylogenetic levels. The performance of the Kibby method is replotted at each level for comparison with the other methods; however, it is independent of the phylogenetic level, and only the query sequence is used in its calculation. Error bars are 95% confidence intervals from a bootstrap analysis. (c) The difference in auPRC score for PairK vs. the MSA method for individual motifs at different phylogenetic levels. (d) Example sequence logos and conservation scores for an experimentally verified SLiM from lamellipodin (RAPH1) that binds to the Ena/VASP EVH1 domain. The motif region is highlighted in red. For Vertebrata, the MSA and pairwise k‐mer methods (*top*) perform similarly and show similar sequence profiles. For Metazoa, the MSA (*bottom left*) has a high fraction of gaps, indicated by white space in the logo, while the pairwise k‐mer method (*bottom right*) indicates that the motif is still conserved in metazoans.
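The auPRC values and 95% bootstrap confidence intervals reported in panel (b) can be illustrated with a small sketch. It computes auPRC as average precision and uses a percentile bootstrap; the resampling scheme, the number of resamples, and the names `average_precision` and `bootstrap_ci` are assumptions for illustration, not the authors' benchmark code.

```python
import random


def average_precision(labels, scores):
    """auPRC estimated as average precision (ties keep input order)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for average precision."""
    if sum(labels) == 0:
        raise ValueError("need at least one positive label")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if sum(sample_labels) == 0:
            continue  # skip resamples with no positives
        stats.append(average_precision(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Hypothetical data: 1 = verified motif, 0 = background match.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [2.1, 1.7, 1.5, 0.9, 0.4, 0.1, -0.3, -0.8]
print(average_precision(labels, scores))
print(bootstrap_ci(labels, scores, n_boot=200))
```

auPRC suits this benchmark because background matches greatly outnumber verified motifs, and precision-recall curves are sensitive to performance on the rare positive class.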

**FIGURE 5**
Sequence logos and conservation scores for examples from the benchmark. (a) TRAF6 motif match in G protein‐coupled receptor 179 with homologs from the vertebrate level. (b) Motif matches for the Ena/VASP EVH1 domain (vertebrate level for RIAM, and metazoan level for WASF2 and Roundabout homolog 2). White space in the sequence logos indicates gaps in the alignment. The x‐axis labels are the residues of the human sequence, which was the query sequence. For the MSA plots, the positions corresponding to the human residues were extracted from the MSA (as in Figure 1b) for easier visualization. The sequence shown on the x‐axis labels is the full query k‐mer. In (b) a larger value of k was used (k = 15) for the PairK method to show sequence flanking the motif. The motif residues are highlighted in red.

Update of

PairK: Pairwise k-mer alignment for quantifying protein motif conservation in disordered regions.
Halpin JC, Keating AE. bioRxiv [Preprint]. 2024 Jul 24:2024.07.23.604860. doi: 10.1101/2024.07.23.604860. Update in: Protein Sci. 2025 Jan;34(1):e70004. doi: 10.1002/pro.70004. PMID: 39091826.
