Adv Inf Retr. 2024 Mar;14609:34-49.
doi: 10.1007/978-3-031-56060-6_3. Epub 2024 Mar 16.

Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search

Kathryn E Kirchoff et al. Adv Inf Retr. 2024 Mar.

Abstract

Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet some of the most commonly used approaches for this task still rely on brute-force search. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally aware embedding, SmallSA, for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks.
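
The core idea in the abstract, an exact nearest neighbor query over low-dimensional embeddings using a k-d tree, can be illustrated with a minimal sketch. This is not the authors' implementation: the 8-dimensional random vectors stand in for real molecular embeddings, and the names `build_kdtree`, `nearest`, and `sq_dist` are hypothetical helpers written for this illustration. The brute-force scan at the end is included only to check that the tree returns the same nearest neighbor.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree from (vector, index) pairs.

    Each node is a tuple (vector, index, left_subtree, right_subtree);
    the splitting axis cycles through the dimensions with depth.
    """
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    vec, idx = points[mid]
    return (vec, idx,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return (squared_distance, index) of the stored vector closest to query."""
    if node is None:
        return best
    vec, idx, left, right = node
    cand = (sq_dist(vec, query), idx)
    if best is None or cand < best:
        best = cand
    axis = depth % len(query)
    diff = query[axis] - vec[axis]
    # Descend first into the half-space that contains the query.
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the other side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

# Assumption: random 8-dimensional vectors stand in for molecular embeddings.
random.seed(0)
DIM = 8
embeddings = [[random.random() for _ in range(DIM)] for _ in range(2000)]
tree = build_kdtree([(v, i) for i, v in enumerate(embeddings)])

query = [random.random() for _ in range(DIM)]
tree_hit = nearest(tree, query)
brute_hit = min((sq_dist(v, query), i) for i, v in enumerate(embeddings))
print("k-d tree:", tree_hit)
print("brute force:", brute_hit)
```

The pruning step is what makes low dimensionality matter: in a few dimensions, most subtrees lie farther away than the current best hit and are skipped, whereas in high-dimensional embeddings the splitting planes are rarely far enough from the query to rule anything out, and the search degrades toward a full scan.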

Keywords: Cheminformatics; Drug discovery; Virtual screening.

Figures

Fig. 1: Overview of the similarity search framework. k-d trees are combined with low-dimensional chemical embeddings to produce a partitioned chemical space, which can be quickly queried for nearest neighbors.
Fig. 2: The average approximate graph edit distance (GED) between the query molecule and its nearest neighbors, per method, shown as a function of the number of neighbors considered. Lower distances are better. Lines are smoothed with a running average for ease of analysis.
Fig. 3: AUROC achieved by each embedding on the RDKit virtual screening benchmark of 69 query targets, grouped by target database. Each database is indicated on the x-axis. Note that most of the 69 targets (50) belong to the ChEMBL database.
Fig. 4: Example of a query molecule and the hits obtained by select high-performing embeddings.

