Brief Bioinform. 2024 Sep 23;25(6):bbae480. doi: 10.1093/bib/bbae480.

Protein language models are performant in structure-free virtual screening

Hilbert Yuen In Lam et al. Brief Bioinform.

Abstract

Hitherto, virtual screening (VS) has typically been performed under a structure-based drug design paradigm. Such methods require molecular docking against high-resolution three-dimensional structures of a target protein, a computationally intensive and time-consuming exercise. This work demonstrates that, by employing protein language models and molecular graphs as inputs to a novel graph-to-transformer cross-attention mechanism, screening power comparable to state-of-the-art structure-based models can be achieved. The implications include greatly expedited VS, owing to the much lower compute required to run this model, and the ability to perform the early stages of computer-aided drug design in the complete absence of 3D protein structures.
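The graph-to-transformer cross-attention the abstract refers to can be sketched in miniature: each ligand-graph node embedding attends over the protein language model's token embeddings. This is an illustrative reconstruction in pure Python, not the authors' code; all names and dimensions are invented for the example.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head cross-attention: each query (e.g. a ligand-graph node
    embedding) attends over all keys/values (e.g. protein language-model
    token embeddings). Lists of floats stand in for tensors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # scaled dot-product score of this ligand node against every protein token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax over protein tokens
        # attention-weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# two ligand-node queries attending over three protein tokens (toy numbers)
ligand_nodes = [[1.0, 0.0], [0.0, 1.0]]
protein_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(ligand_nodes, protein_tokens, protein_tokens)
```

Because each output row is a convex combination of the value vectors, every attended ligand node lands inside the range spanned by the protein token embeddings.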

Keywords: cheminformatics; computer-aided drug design; protein language models; virtual screening.


Figures

Figure 1
BIND is able to perform forward screening, reverse screening, and drug–target affinity (DTA) prediction, all without structural input. BIND is trained solely on the BindingDB dataset, which consists of protein sequences and experimentally determined DTA values. By attaching BIND to a pre-trained ESM-2 protein language model and feeding in a protein sequence and a SMILES molecular representation, which are then converted into graphs, the model can effectively discriminate between active and decoy ligands. This allows BIND to be used in forward and reverse screening while also predicting DTA values.
Figure 2
The BIND architecture incorporates a proposed cross-attention graph block and is trained with both true ligands and decoys taken from other proteins in the same dataset. The cross-attention graph block essentially deconstructs a graph and treats each node as a token for cross-attention; this allows the ligand to 'query' the protein and its important parts. The loss is the sum of a Huber loss on the affinity predictions and a binary cross-entropy loss on the decoy classifier. During training, ESM-2's weights are frozen so that only the 9.26 M parameters of BIND are tuned. Q, K, and V in the diagram represent the query, key, and value inputs characteristic of transformers.
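The training objective in the caption, a Huber loss on affinity plus binary cross-entropy on the decoy classifier, can be sketched as follows. Function names and example values are illustrative only, not taken from the BIND source.

```python
import math

def huber(pred, target, delta=1.0):
    """Huber loss: quadratic near zero, linear for large residuals."""
    r = abs(pred - target)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def bce(logit, label):
    """Binary cross-entropy on a raw logit (label: 1 = true ligand, 0 = decoy)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

def bind_style_loss(affinity_pred, affinity_true, decoy_logit, decoy_label):
    # summation of the two terms, as described in the caption
    return huber(affinity_pred, affinity_true) + bce(decoy_logit, decoy_label)

# toy example: predicted pKd 6.2 vs true 7.0, with a confident "true ligand" logit
loss = bind_style_loss(6.2, 7.0, 2.0, 1)
```

The Huber term keeps the affinity regression robust to outlier measurements, while the cross-entropy term drives the active/decoy separation used for screening.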
Figure 3
The model is performant on docking benchmarks and outperforms the state-of-the-art DTA-only model in enrichment. CASF-2016, DEKOIS 2.0, DUD-AD, DUD-E, and LIT-PCBA were evaluated. (a, b) CASF-2016 1% enrichment factors and 1% success rates; (c–f) DEKOIS 2.0, DUD-AD, DUD-E, and LIT-PCBA 1% enrichment factors; (g–k) probability-normalized histograms of the non-zero-shot BIND classifier logit output, showing separation between true binders and decoys. All enrichment factors and success rates are averaged across the entire datasets evaluated, and the histograms show distributions of logit scores for all proteins and ligands tested. The green bar indicates the enrichment factor of the published SSM-DTA model, which predicts the pIC50 DTA. The zero-shot BIND model is the model in which proteins with >90% homology relative to the evaluation datasets are removed during training.
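The 1% enrichment factor reported throughout the figure measures how over-represented true actives are among the top-scoring 1% of a ranked library, relative to random selection. A minimal sketch of the metric (an illustration, not the paper's evaluation code):

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """EF at a given fraction: (hit rate in the top fraction) / (hit rate
    of the whole library). scores: higher = predicted more likely active;
    labels: 1 for active, 0 for decoy."""
    n = len(scores)
    n_top = max(1, int(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    actives_top = sum(lbl for _, lbl in ranked[:n_top])
    actives_all = sum(labels)
    return (actives_top / n_top) / (actives_all / n)

# toy library: 2 actives among 100 compounds, both ranked at the very top
scores = [1.0, 0.9] + [0.1] * 98
labels = [1, 1] + [0] * 98
ef1 = enrichment_factor(scores, labels, fraction=0.01)
```

With a 2% base rate of actives and the single top-1% slot filled by an active, the EF comes out at 50, the maximum achievable for this library.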
Figure 4
Protein language models can outperform standard reverse docking pipelines that use structures from AlphaFold2. A total of 12 195 proteins and 90 ligands from Luo et al., 2023's AlphaFold2 reverse docking benchmark were individually pairwise scored using BIND, and the ligands were ranked by classification score for each protein. (a, b) The cumulative ranking of BIND on standard and logarithmic plots, respectively; the legend indicates the type of scoring function used, with the software used to determine the pocket locations in parentheses. (c, d) Ranking of 84 ligands against 85 proteins in the reverse docking benchmark on the Astex dataset, with the other benchmark scores obtained from Luo et al., 2017.
Figure 5
Predicted binding affinity has lower enriching power than the classifier in both forward and reverse screening. (a–e) Box-and-whisker plots of BEDROC values across the entire CASF-2016, DEKOIS 2.0, DUD-AD, DUD-E, and LIT-PCBA datasets, respectively, with boxes representing the interquartile range, the median demarcated within the box, whiskers showing the fences, and diamonds showing outliers. (f) Cumulative logarithmic plot of rankings in reverse screening on Luo et al., 2023's AlphaFold2 reverse docking benchmark dataset.
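BEDROC, the metric plotted in panels (a–e), is an early-recognition score in [0, 1] that exponentially rewards actives ranked near the top of the list. The sketch below follows the standard Truchon & Bailey (2007) formulation; it is not the paper's evaluation code, and the example ranks are invented.

```python
import math

def bedroc(active_ranks, n_total, alpha=20.0):
    """BEDROC (Truchon & Bailey, 2007): early-recognition metric in [0, 1].
    active_ranks: 1-indexed ranks of the true actives in the sorted list;
    alpha controls how sharply early ranks are rewarded."""
    n = len(active_ranks)
    ra = n / n_total
    # RIE: exponentially weighted sum over active ranks, normalized by its
    # expectation under a uniform random ranking
    s = sum(math.exp(-alpha * r / n_total) for r in active_ranks)
    rie = s / (ra * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1))
    # map RIE onto [0, 1] using its analytic min/max bounds
    factor = ra * math.sinh(alpha / 2) / (
        math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * ra))
    return rie * factor + 1 / (1 - math.exp(alpha * (1 - ra)))

# two actives ranked 1st and 2nd out of 100: near-perfect early recognition
score = bedroc([1, 2], 100)
```

With alpha = 20, roughly 80% of the score comes from the top 8% of the ranked list, which is why BEDROC separates methods that differ mainly in early enrichment.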
