[Preprint]. 2024 Jul 24:2024.07.23.604860.
doi: 10.1101/2024.07.23.604860.

PairK: Pairwise k-mer alignment for quantifying protein motif conservation in disordered regions

Jackson C Halpin et al. bioRxiv.

Abstract

Protein-protein interactions are often mediated by a modular peptide recognition domain binding to a short linear motif (SLiM) in the disordered region of another protein. The ability to predict domain-SLiM interactions would allow researchers to map protein interaction networks, predict the effects of perturbations to those networks, and develop biologically meaningful hypotheses. Unfortunately, sequence database searches for SLiMs yield mostly biologically irrelevant motif matches, or false positives. To improve the prediction of novel SLiM interactions, researchers employ filters to discriminate between biologically relevant and improbable motif matches. One promising criterion for identifying biologically relevant SLiMs is the sequence conservation of the motif, exploiting the fact that functional motifs are more likely to be conserved than spurious motif matches. However, the difficulty of aligning disordered regions has significantly hampered the utility of this approach. We present PairK (pairwise k-mer alignment), an MSA-free method to quantify motif conservation in disordered regions. PairK outperforms both standard MSA-based conservation scores and a modern LLM-based conservation score predictor on the task of identifying biologically important motif instances. PairK can quantify conservation over wider phylogenetic distances than MSAs, indicating that SLiMs may be more conserved than is implied by MSA-based metrics. PairK is available as open-source code at https://github.com/jacksonh1/pairk.
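
The idea of an MSA-free, pairwise k-mer comparison can be illustrated with a small sketch. This is not the PairK package or its API; it is a toy version written under stated assumptions: each homolog contributes its single best gapless match to the query k-mer, similarity is a simple count of identical residues (a real scorer would more plausibly use a substitution matrix or sequence embeddings), and z-scores are computed against the column scores of all k-mers in the query region. All function names here are hypothetical.

```python
from statistics import mean, pstdev


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def identity_score(a, b):
    """Toy similarity (assumption): number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


def best_match(query_kmer, homolog_region):
    """Best-scoring gapless k-mer in one homolog, or None if it is too short."""
    candidates = kmers(homolog_region, len(query_kmer))
    if not candidates:
        return None
    return max(candidates, key=lambda c: identity_score(query_kmer, c))


def pseudo_alignment(query_kmer, homolog_regions):
    """Stack the query k-mer on top of its best match from each homolog."""
    hits = [best_match(query_kmer, h) for h in homolog_regions]
    return [query_kmer] + [h for h in hits if h is not None]


def column_conservation(rows):
    """Per column: fraction of rows that carry the most common residue."""
    scores = []
    for col in zip(*rows):
        top = max(set(col), key=col.count)
        scores.append(col.count(top) / len(col))
    return scores


def motif_z_scores(query_region, motif_start, k, homolog_regions):
    """Z-score the motif k-mer's columns against every k-mer in the query region."""
    per_kmer = [column_conservation(pseudo_alignment(q, homolog_regions))
                for q in kmers(query_region, k)]
    background = [s for cols in per_kmer for s in cols]
    mu, sigma = mean(background), pstdev(background)
    if sigma == 0:
        return [0.0] * k
    return [(s - mu) / sigma for s in per_kmer[motif_start]]


if __name__ == "__main__":
    # Made-up sequences: the LPPPP motif sits at different offsets in each homolog.
    query = "ACDELPPPPGHIK"
    homologs = ["WWWWLPPPPYYYY", "MMMLPPPPNNN"]
    print(pseudo_alignment("LPPPP", homologs))
    print(motif_z_scores(query, motif_start=4, k=5, homolog_regions=homologs))
```

Because each homolog is searched independently, the motif is recovered even when it sits at a different offset in each sequence, which is the situation where a column-based MSA score can break down.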

Figures

Figure 1.
Detecting evolutionary conservation of short linear motifs is confounded by poor alignment of disordered regions. (A) Slice of an MSA of RIAM, which contains an LPPPP Ena/VASP binding motif (37), aligned to its vertebrate homolog sequences. (B) Left – Part of the SLiM region of the MSA from (A). Matches to the EVH1 motif are highlighted in red. Columns that appear artificially conserved are indicated with a red arrow. Right – The apparent conservation of the SLiM residues extracted from the example MSA. X-axis labels are residues in the human protein. White space in the sequence logo indicates gaps in the corresponding alignment columns. Bar plot shows the conservation scores of the aligned columns. (C) The conservation scores (Shannon entropy from Capra et al. (42)) of residues in experimentally verified SLiMs vary with alignment algorithms. Data are from 236 verified SLiM instances (721 residues). Homologs are from metazoans.
Figure 2.
The pairwise k-mer alignment method (PairK) for quantifying the conservation of SLiMs. (A) Schematic of the method. (B) Example z-scores and sequence logos for an Ena/VASP binding motif from the protein RIAM and its vertebrate homologs. X-axis labels are residues in the human protein. White space in the sequence logos indicates gaps in the corresponding alignment. The results using an MSA (the same MSA from Figure 1A) are shown at left. The positions in the MSA corresponding to the SLiM residues in the human sequence (LPPPP) are extracted and shown in the middle-left panel, with gaps removed. The results from PairK (middle-right) and the embedding-based variant of PairK (right) suggest that the LPPPP motif is more conserved than it appears in the MSA.
Figure 3.
SLiM conservation scoring benchmark. (A) Schematic of the benchmark pipeline. Distributions of the conservation scores of the motifs in the benchmark are shown for the MSA method (B) and PairK (C). Homologous sequences were gathered at the metazoan level. PairK better separates motif matches that are validated (TPs, orange) from background motif matches in the proteome (BG, blue). (D) For each motif in the benchmark, PairK (red) performs better than the MSA method (gray). Error bars are 95% confidence intervals from a bootstrap analysis. The plot is for homologs at the Metazoa level.
Figure 4.
PairK better distinguishes real motifs from background matches for more divergent homologs. (A) Phylogenetic tree of Eukaryotes. (B) Performance (reported as auPRC) for the MSA and PairK methods at different phylogenetic levels. The performance of the Kibby method is replotted at each level for comparison with the other methods; however, it is independent of phylogenetic level because only the query sequence is used in its calculation. Error bars are 95% confidence intervals from a bootstrap analysis. (C) The difference in auPRC score for PairK vs. the MSA method for individual motifs at different phylogenetic levels. (D) Example sequence logos and conservation scores for an experimentally verified SLiM from lamellipodin (RAPH1) that binds to the Ena/VASP EVH1 domain. The motif region is highlighted in red. For Vertebrata, the MSA and pairwise k-mer methods (top) perform similarly and show similar sequence profiles. For Metazoa, the MSA (bottom left) has a high fraction of gaps, indicated by white space in the logo, while the pairwise k-mer method (bottom right) indicates that the motif is still conserved in metazoans.
Figure 5.
Sequence logos and conservation scores for examples from the benchmark. (A) TRAF6 motif match in G protein-coupled receptor 179 with homologs from the vertebrate level. (B) Motif matches for the Ena/VASP EVH1 domain (vertebrate level for RIAM, and metazoan level for WASF2 and Roundabout homolog 2). White space in the sequence logos indicates gaps in the alignment. The x-axis labels are the residues of the human sequence, which was the query sequence. For the MSA plots, the positions corresponding to the human residues were extracted from the MSA (as in Figure 1B) for easier visualization. For PairK plots, the sequence shown on the x-axis labels is the full query k-mer. In (B) a larger value of k was used (15) for the PairK method to show sequence flanking the motif. The motif residues are highlighted in red.
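
The benchmark captions above report auPRC with 95% confidence intervals from a bootstrap analysis. As an illustration of how such a number can be computed, the sketch below ranks motif matches by conservation score, estimates the area under the precision-recall curve with the step-wise average-precision formula, and derives a percentile bootstrap interval. The exact estimator and resampling scheme used in the preprint are not given here, so both choices, along with the function names and the example data, are assumptions.

```python
import random


def average_precision(labels, scores):
    """Step-wise auPRC estimate: mean precision at the rank of each positive.

    labels: 1 for a validated motif (TP), 0 for a background match (BG).
    Tied scores are ranked in input order (a simplification).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the auPRC (assumed scheme)."""
    if not any(labels):
        raise ValueError("need at least one positive label")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if not any(sample_labels):
            continue  # resample: auPRC is undefined without positives
        sample_scores = [scores[i] for i in idx]
        stats.append(average_precision(sample_labels, sample_scores))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Made-up conservation scores for 3 validated and 5 background matches.
    labels = [1, 1, 0, 1, 0, 0, 0, 0]
    scores = [2.1, 1.7, 1.5, 1.2, 0.4, 0.1, -0.3, -0.9]
    print(average_precision(labels, scores))
    print(bootstrap_ci(labels, scores))
```

auPRC suits this kind of benchmark because validated motifs are typically far outnumbered by background matches, and precision is sensitive to that imbalance in a way that ROC-based metrics are not.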

References

    1. Kumar M., Michael S., Alvarado-Valverde J., Mészáros B., Sámano-Sánchez H., Zeke A., Dobson L., Lazar T., Örd M., Nagpal A., Farahi N., Käser M., Kraleti R., Davey N. E., Pancsa R., Chemes L. B., Gibson T. J., The Eukaryotic Linear Motif resource: 2022 release. Nucleic Acids Res 50, D497–D508 (2022).
    2. Ball L. J., Kühne R., Hoffmann B., Häfner A., Schmieder P., Volkmer-Engert R., Hof M., Wahl M., Schneider-Mergener J., Walter U., Oschkinat H., Jarchau T., Dual epitope recognition by the VASP EVH1 domain modulates polyproline ligand specificity and binding affinity. EMBO J 19, 4903–4914 (2000).
    3. Hwang T., Parker S. S., Hill S. M., Ilunga M. W., Grant R. A., Mouneimne G., Keating A. E., A distributed residue network permits conformational binding specificity in a conserved family of actin remodelers. eLife 10, e70601 (2021).
    4. Acevedo L. A., Greenwood A. I., Nicholson L. K., A Noncanonical Binding Site in the EVH1 Domain of Vasodilator-Stimulated Phosphoprotein Regulates Its Interactions with the Proline Rich Region of Zyxin. Biochemistry 56, 4626–4636 (2017).
    5. Stevers L. M., de Vink P. J., Ottmann C., Huskens J., Brunsveld L., A Thermodynamic Model for Multivalency in 14-3-3 Protein-Protein Interactions. J Am Chem Soc 140, 14498–14510 (2018).
