Seq2Bind webserver for binding site prediction from sequences using fine-tuned protein language models

Xiang Ma¹,², Supantha Dey³, Vaishnavey Sr³, Casey Zelinski³, Qi Li¹, Ratul Chowdhury³

Affiliations

¹ Department of Computer Science, Iowa State University, Ames, IA 50011, United States.
² Department of Chemistry, Grand View University, Des Moines, IA 50316, United States.
³ Department of Chemical and Biological Engineering, Iowa State University, Ames, IA 50011, United States.

PMID: 41278538
PMCID: PMC12639246
DOI: 10.1093/nargab/lqaf154

Xiang Ma et al. NAR Genom Bioinform. 2025 Nov 22;7(4):lqaf154. doi: 10.1093/nargab/lqaf154. eCollection 2025 Dec.

Abstract

Decoding protein-protein interactions at the residue level is crucial for understanding cellular mechanisms and developing targeted therapeutics. We present the Seq2Bind webserver, a computational framework that leverages fine-tuned protein language models (PLMs) to predict binding affinity between proteins and identify critical binding residues directly from sequences, eliminating the structural requirements that limit affinity prediction tools. We fine-tuned four architectures, ProtBERT, ProtT5, Evolutionary Scale Modeling 2 (ESM2), and a Bidirectional Long Short-Term Memory (BiLSTM) network, on the SKEMPI 2.0 dataset. Through systematic alanine mutagenesis of each residue in 6063 protein dimers from the Protein Data Bank, we evaluated each model's ability to identify interface residues. Performance was assessed using N-factor metrics, where N-factor = 3 evaluates whether true interface residues appear within the top 3n predictions for a complex with n interface residues. ESM2 achieved 67.4% and ProtBERT 68.2% interface-residue recovery at N-factor = 3. On an independent panel of 14 human health-relevant protein complexes, Seq2Bind outperformed docking and mutation-based baselines, with ESM2 (37.2%) and ProtBERT (35.1%) exceeding the structure-based docking tool HADDOCK3 (32.1%) at N-factor = 2. Our sequence-based approach enables rapid screening, handles disordered proteins, and provides comparable accuracy, making Seq2Bind a valuable prior for steering blind docking protocols toward putative binding residues on each protein for therapeutic targets. The Seq2Bind webserver is freely accessible at https://agrivax.onrender.com/seq2bind/scan under StructF-suite.
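The alanine-mutagenesis step described in the abstract can be illustrated with a minimal sketch. It assumes that residues are ranked by the absolute change in predicted affinity when each position is replaced by alanine; the abstract does not state the exact scoring rule, so that choice is an assumption. The names `alanine_scan` and `toy_affinity` are hypothetical, and `toy_affinity` is only a placeholder for a fine-tuned PLM predictor.

```python
from typing import Callable, List, Tuple


def toy_affinity(seq_a: str, seq_b: str) -> float:
    """Hypothetical stand-in for a fine-tuned PLM affinity predictor.

    It simply sums fixed per-residue weights of seq_a (seq_b is ignored);
    a real model would score the pair jointly.
    """
    weights = {"W": 3.0, "K": 1.0}
    return sum(weights.get(res, 0.0) for res in seq_a)


def alanine_scan(
    seq_a: str,
    seq_b: str,
    predict: Callable[[str, str], float],
) -> List[Tuple[int, float]]:
    """Rank positions of seq_a by |affinity change| after mutation to alanine.

    Assumption: a larger absolute change marks a more likely binding residue.
    Returns (0-based position, score) pairs, highest score first.
    """
    baseline = predict(seq_a, seq_b)
    scores = []
    for i, res in enumerate(seq_a):
        if res == "A":
            # Already alanine: the mutation is a no-op, so the score is zero.
            scores.append((i, 0.0))
            continue
        mutant = seq_a[:i] + "A" + seq_a[i + 1:]
        scores.append((i, abs(predict(mutant, seq_b) - baseline)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = alanine_scan("MKWAK", "GG", toy_affinity)
top_position = ranking[0][0]  # position 2 (the tryptophan) under the toy weights
```

With the toy weights, replacing the tryptophan changes the score the most, so position 2 ranks first. Scanning the second chain would repeat the same call with the two sequences swapped.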

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Schematic representation of the deep learning workflow for predicting experimental protein–protein binding affinity. Two protein sequences are tokenized and processed through a Siamese neural network (two identical, weight-sharing pre-trained PLM encoders) to generate sequence embeddings. After max pooling, the vectors are combined by a Hadamard product. The joint representation is then passed to an MLP regression head (the gear icon), followed by a ReLU activation function, to produce an affinity score as the final output. The error is backpropagated to fine-tune the PLM encoders.
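The data flow in Figure 1 can be traced with a small pure-Python sketch: a shared encoder, max pooling, a Hadamard product, and an MLP head with a final ReLU. Everything here is illustrative. `toy_encoder` is a hypothetical stand-in for a pre-trained PLM, the weights are arbitrary constants, the single hidden layer is an assumption, and no backpropagation or fine-tuning is shown.

```python
import random
from typing import List

Vector = List[float]


def toy_encoder(seq: str, dim: int = 4) -> List[Vector]:
    """Hypothetical PLM stand-in: one deterministic vector per residue."""
    embeddings = []
    for res in seq:
        rng = random.Random(ord(res))  # same residue type gives the same vector
        embeddings.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return embeddings


def max_pool(embeddings: List[Vector]) -> Vector:
    """Element-wise maximum over the residue axis."""
    return [max(column) for column in zip(*embeddings)]


def hadamard(u: Vector, v: Vector) -> Vector:
    """Element-wise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]


def relu(x: float) -> float:
    return max(0.0, x)


# Arbitrary illustrative weights: 4 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
B1 = [0.1, 0.0]
W2 = [0.7, 0.9]
B2 = 0.05


def mlp_head(x: Vector) -> float:
    """One hidden layer (an assumption), then a ReLU on the scalar output."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, B1)]
    return relu(sum(w * h for w, h in zip(W2, hidden)) + B2)


def siamese_affinity(seq_a: str, seq_b: str) -> float:
    """Both sequences pass through the same encoder (weight sharing)."""
    pooled_a = max_pool(toy_encoder(seq_a))
    pooled_b = max_pool(toy_encoder(seq_b))
    return mlp_head(hadamard(pooled_a, pooled_b))


score = siamese_affinity("MKWAK", "GGSTL")
```

Because the two branches share one encoder and the Hadamard product is commutative, this sketch returns the same score regardless of the order in which the two sequences are supplied.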

**Figure 2.**
(A) Loss curves for ProtBERT, ProtT5, ESM2, and BiLSTM across training and (B) validation epochs. Both training and validation MAE are shown using distinct marker styles. Among the models evaluated, ProtBERT achieves the lowest validation MAE, indicating the best overall predictive performance. (C) Distribution of −∆G values for the training and validation dataset. (D) MAE of −∆G values for each model on training and validation datasets.

**Figure 3.**
Performance comparison of interface residue prediction models across different N factors. (A) Distribution of success rates (%) for each model (ESM2, BiLSTM, ProtBERT, T5) across analyzed protein complexes for N factors 1, 2, and 3. Success rate is defined as the average percentage of ground-truth interface residues correctly predicted by the model for 6063 protein complexes. (B) Heatmap showing the mean success rate % across different models (x-axis) and N factors (y-axis).

**Figure 4.**
(A) Interface-residue recovery across N-factor thresholds for sequence-based models, docking, and mutation-effect predictors. Bars indicate the success rate (percentage of ground-truth interface residues recovered) aggregated over 14 protein–protein complexes. For sequence models (ESM, LSTM, ProtBERT, T5) and mutation predictors (SAAMBE-3D, DDMut-PPI, mCSM-PPI2, SAAMBE-SEQ, MutationExplorer), three grouped bars correspond to N = 1, 2, and 3, where the top N, 2N, or 3N highest-ranked residues per chain are counted as correct if present in the ground-truth interface. The HADDOCK bar (leftmost) reports its aggregate success rate across complexes. (B) Stacked horizontal bar charts showing the count of correctly predicted interface residues per PDB ID for each model. The total length of the bar represents the sum of correct predictions by all models for that complex (including residues predicted correctly by multiple models). Three different parts represent results for N factors 1, 2, and 3. To prevent overcounting, the multiple models section was used to indicate that the correct residue was predicted by more than one model.
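The N-factor success rate used in these comparisons can be sketched as follows, based on the definition in the abstract and this caption: for a chain with n ground-truth interface residues, the top N·n ranked residues are checked against the ground truth, and the recovered fraction is reported as a percentage. The function name `nfactor_recovery` and the example data are hypothetical.

```python
from typing import Iterable, Sequence


def nfactor_recovery(
    ranked_positions: Sequence[int],
    interface: Iterable[int],
    n_factor: int,
) -> float:
    """Percentage of ground-truth interface residues found in the top N*n ranks.

    ranked_positions: residue positions, most likely binding residue first.
    interface: ground-truth interface positions (n = number of such positions).
    n_factor: the N in "N-factor" (1, 2, or 3 in the figures).
    """
    truth = set(interface)
    n = len(truth)
    if n == 0:
        raise ValueError("interface must contain at least one residue")
    top = set(ranked_positions[: n_factor * n])
    return 100.0 * len(top & truth) / n


# Made-up ranking and interface, for illustration only.
ranked = [7, 2, 9, 4, 1, 5, 3]
interface = {2, 4}

rate_n1 = nfactor_recovery(ranked, interface, 1)  # top 2 ranks: 7, 2
rate_n2 = nfactor_recovery(ranked, interface, 2)  # top 4 ranks: 7, 2, 9, 4
```

In this made-up example, one of the two interface residues falls in the top 2 ranks and both fall in the top 4, so recovery rises from 50% at N = 1 to 100% at N = 2. Averaging this per-complex value over all complexes would give an aggregate success rate of the kind plotted in the figures.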

**Figure 5.**
Interface-residue recovery on two benchmark complexes (PDB IDs 5WBX and 5V89). For each complex, we display the experimentally validated interacting residues as sticks (ground truth) and annotate the positions recovered by at least two of the four models tested for panel (A) 5WBX (SK/IK channel positive modulators) and by our best-performing model, ESM2, for panel (B) 5V89 (a DCN1-like protein 4 PONY domain bound to Cullin-1).
