J Cheminform. 2024 Jul 29;16(1):87.
doi: 10.1186/s13321-024-00850-z.

Hilbert-curve assisted structure embedding method

Gergely Zahoránszky-Kőhalmi et al. J Cheminform.

Abstract

Motivation: Chemical space embedding methods are widely used in research for dimensionality reduction, clustering and effective visualization. The maps generated by the embedding process can give medicinal chemists valuable insight into the relationships between the structural, physicochemical and biological properties of compounds. However, these maps are known to be difficult to interpret, and the "landscape" on the map is prone to "rearrangement" when different sets of compounds are embedded.

Results: In this study we present the Hilbert-Curve Assisted Space Embedding (HCASE) method, which was designed to create maps by organizing structures according to a logic familiar to medicinal chemists. First, a chemical space is created with the help of a set of "reference scaffolds". These scaffolds are sorted according to the medicinal-chemistry-inspired Scaffold-Key algorithm found in prior art. Next, the ordered scaffolds are mapped to a line, which is folded into a higher-dimensional (here: 2D) space. The intricately folded line is referred to as a pseudo-Hilbert-Curve. A compound is embedded by locating its most similar reference scaffold on the pseudo-Hilbert-Curve and assigning the compound to that scaffold's position. Through a series of experiments, we demonstrate the properties of the maps generated by the HCASE method. The embedded compounds were taken from the DrugBank and CANVASS libraries, and the chemical spaces were defined by scaffolds extracted from the ChEMBL database.
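The ordering, folding and lookup steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the HCASE implementation: `hilbert_d2xy` is the standard index-to-coordinate conversion for a Hilbert curve, while `place_scaffolds`, `embed_compound`, the equal-width placement rule, and the numeric toy keys standing in for Scaffold-Keys are hypothetical and introduced here only for illustration.

```python
def hilbert_d2xy(order, d):
    """Convert index d along a Hilbert curve of the given order to (x, y).

    The curve visits every cell of a 2**order by 2**order grid exactly once.
    """
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so consecutive cells stay adjacent.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def place_scaffolds(ordered_scaffolds, order):
    """Assign each scaffold (already sorted) a grid position along the curve.

    Assumption: ranks are spread evenly over the 4**order cells.
    """
    n_cells = 4 ** order
    n = len(ordered_scaffolds)
    return {
        scaffold: hilbert_d2xy(order, (rank * n_cells) // n)
        for rank, scaffold in enumerate(ordered_scaffolds)
    }


def embed_compound(compound_key, scaffold_keys, positions):
    """Give a compound the position of its closest reference scaffold.

    Assumption: a single number stands in for the Scaffold-Key distance.
    """
    nearest = min(scaffold_keys, key=lambda s: abs(scaffold_keys[s] - compound_key))
    return positions[nearest]


# Toy example: 16 hypothetical scaffolds with keys 0..15 on an order-2 curve.
scaffold_keys = {f"S{i}": float(i) for i in range(16)}
ordered = sorted(scaffold_keys, key=scaffold_keys.get)
positions = place_scaffolds(ordered, order=2)
print(embed_compound(4.2, scaffold_keys, positions))
```

Because neighbors in the sorted list land in adjacent grid cells, scaffolds with similar keys stay close together on the 2D map in this sketch.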

Scientific contribution: The novelty of the HCASE method lies in generating robust and intuitive chemical space embeddings that reflect a medicinal chemist's reasoning, and in the precedent-setting use of a space-filling (Hilbert) curve in the process.

Availability: https://github.com/ncats/hcase.

Keywords: Chemical space embedding; Clustering; Dimension reduction; HCASE; Hilbert-curve; Scaffold-Keys.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
k-nearest neighbors (KNNs) of randomly selected molecules. The first column contains the query structures, and subsequent columns contain the k = 5 nearest neighbors (NNs) in decreasing order of similarity. Tanimoto similarity was computed using Morgan fingerprints (radius = 3, length = 2048). The Tanimoto similarity coefficient and the label of each compound are shown after the compound IDs for the NNs. The BMSs of the compounds are highlighted in red
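The similarity ranking behind such a KNN table can be illustrated with a short sketch. It assumes that each fingerprint is represented as a Python set of "on" bit positions; in practice those bits would come from a cheminformatics toolkit, which is outside this sketch. The names `tanimoto` and `k_nearest`, the compound labels, and the bit sets are all hypothetical toy values.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bits: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def k_nearest(query_bits, library, k=5):
    """Return the k most similar library entries as (similarity, name) pairs,
    in decreasing order of similarity (ties broken by name)."""
    scored = [(tanimoto(query_bits, bits), name) for name, bits in library.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Toy fingerprints (hypothetical on-bit positions, not real Morgan bits).
library = {
    "cpd_A": {1, 2, 3, 4},
    "cpd_B": {2, 3, 4, 5},
    "cpd_C": {10, 11},
}
query = {1, 2, 3}

for similarity, name in k_nearest(query, library, k=2):
    print(name, round(similarity, 2))
```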
Fig. 2
Maps generated by t-SNE analysis of drug molecules. DrugBank molecules were embedded with the original t-SNE algorithm at various perplexity values, and the embedding was repeated with a 90%-sized subset of the drug molecules. The five randomly selected molecules are marked by an enlarged (X) symbol. Green: DB00006, orange: DB00849, purple: DB00977, aqua: DB01362, blue: DB04837. The NNs of each molecule are indicated by a (+) symbol of matching color. Molecules are labeled according to Fig. 1. A t-SNE embedding of drug compounds, perplexity = 5. B t-SNE embedding of drug compounds, perplexity = 40. C 90%-sized subset of drug compounds, perplexity = 40
Fig. 3
HCASE method. The process of embedding compounds into a chemical space with the HCASE method is demonstrated. The chemical space is defined by reference scaffolds, which are ordered based on their Scaffold-Keys (SK). The HCASE method maps the reference scaffolds onto a series of pseudo-Hilbert-Curves (PHCs) of increasing order. Then, each compound of the library to be embedded is mapped to its closest scaffold based on Scaffold-Key distance (dSK). A binning step is also included to ensure that each reference scaffold, and hence each compound, can be mapped to one of the possible coordinates in the higher-dimensional space. The number of possible coordinates depends on the order of the PHC the scaffolds are mapped to. A compound highlighted in yellow is tracked through this process. As can be seen, the position of the compound in the 2D space is a function of the order of the PHC it was mapped to. Due to the nature of PHCs, the position of a compound converges to a “stable” position as the order of the PHC increases
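One way to see why the position stabilizes is to write the binning step down explicitly. The formulas below are a sketch under assumptions that are not taken from the paper: there are N reference scaffolds with ranks r = 0, ..., N-1, an order-z curve has 4^z cells on a 2^z by 2^z grid, and ranks are assigned to cells by equal-width binning. The symbol b_z(r) is introduced here only for illustration.

```latex
% Assumed notation (hypothetical, not from the paper):
%   N      number of reference scaffolds, with ranks r = 0, ..., N-1
%   z      order of the pseudo-Hilbert curve (4^z cells, 2^z x 2^z grid)
%   b_z(r) index of the curve cell that rank r is binned into
\begin{align}
  b_z(r) &= \left\lfloor \frac{r \cdot 4^{z}}{N} \right\rfloor,
  \qquad 0 \le b_z(r) \le 4^{z} - 1, \\
  \frac{b_z(r)}{4^{z}} &\le \frac{r}{N} < \frac{b_z(r) + 1}{4^{z}}, \\
  4\, b_z(r) &\le b_{z+1}(r) \le 4\, b_z(r) + 3 .
\end{align}
```

The second line bounds the error of the normalized position along the curve by 4^{-z}. The third line says that raising the order by one only moves a scaffold into one of the four sub-cells of its previous cell. For a consistently oriented Hilbert construction, those four sub-cells tile the parent cell, so each normalized 2D coordinate changes by less than 2^{-z} per step, which is consistent with the convergence described in the caption.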
Algorithm 1
HCASE method
Fig. 4
Tracking the position of the cherry-picked scaffold set on the PHCs in the ChEMBL reference scaffold space. ChEMBL scaffolds were mapped onto PHCs of varying order (the value of z was incremented from 2 to 8 for subfigures a-g, respectively). The order of the PHC is indicated by the suffix in the title of each subfigure. On each PHC we tracked the positions of the BMSs in the cherry-picked scaffold set. The cherry-picked scaffolds and their respective colors are provided in Table 1. The SK-ordering based nearest neighbors have the same color as the corresponding cherry-picked scaffold
Fig. 5
HCASE embedding of drug compounds into ChEMBL scaffold space. Shown is the HCASE embedding of the k = 5 nearest neighbors of 5 randomly selected compounds from the DrugBank dataset. The order of the PHC used for structure embedding is indicated by the suffix in the titles of the subfigures. Enlarged (X) signs indicate the query compounds of the KNN analysis; green: DB00006, orange: DB00849, purple: DB00977, aqua: DB01362, blue: DB04837. (+) signs indicate the NNs of the query compound of the same color. Gray circles indicate other DrugBank compounds. Compounds are labeled according to Fig. 1
Fig. 6
Comparison of the HCASE embeddings of compounds in Natural Product and ChEMBL scaffold space. Blue: CANVASS compounds, yellow: drugs. Overlapping data points appear green-brown because the data points are transparent. A NatProd Scaffold Space, PHC-5 (z = 5). B ChEMBL Scaffold Space, PHC-8 (z = 8)
Fig. 7
Distribution of compounds in the map obtained by HCASE embedding. Compounds were embedded into the NatProd scaffold space with the HCASE method. The intensity of each heatmap cell is proportional to the number of compounds assigned to that cell, i.e., to that position in the embedded space. A Aggregated number of drug compounds embedded into HCASE NatProd space. B Aggregated number of CANVASS compounds embedded into HCASE NatProd space. C Aggregated number of drug compounds embedded into HCASE NatProd space, binarized. D Aggregated number of CANVASS compounds embedded into HCASE NatProd space, binarized
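The aggregation and binarization behind such heatmaps can be sketched as follows. The sketch assumes that each embedded compound is given as an integer (x, y) grid coordinate; `cell_counts`, `binarize`, and the toy coordinates are hypothetical names and values, not taken from the HCASE code.

```python
from collections import Counter


def cell_counts(coords, side):
    """Count how many compounds fall into each cell of a side x side grid.

    Returns a list of rows, indexed as grid[y][x].
    """
    counts = Counter(coords)
    return [[counts.get((x, y), 0) for x in range(side)] for y in range(side)]


def binarize(grid):
    """Mark each cell 1 if at least one compound was assigned to it, else 0."""
    return [[1 if value > 0 else 0 for value in row] for row in grid]


# Toy embedding: three compounds on a 2 x 2 grid, two sharing a cell.
embedded = [(0, 0), (0, 0), (1, 1)]
grid = cell_counts(embedded, side=2)
print(grid)
print(binarize(grid))
```

The count grid corresponds to the aggregated panels, and the binarized grid shows only which positions are occupied, which makes sparsely and densely populated maps easier to compare.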
Fig. 8
Cherry-picked scaffold set and drug molecules in t-SNE chemical spaces. All t-SNE parameters except perplexity were left at their default values (learning rate = 200, number of iterations = 1000). A ChEMBL t-SNE space defined by the t-SNE embedding of ChEMBL scaffolds at perplexity = 40. Highlighted are the BMSs in the cherry-picked scaffold set. B Scaffold t-SNE embedding of the k = 5 nearest neighbors of selected DrugBank molecules into ChEMBL t-SNE space. C Reduced scaffold t-SNE space defined by the t-SNE embedding of the reduced scaffold set. Highlighted are the BMSs in the cherry-picked scaffold set. D Scaffold t-SNE embedding of the k = 5 nearest neighbors of selected DrugBank molecules into reduced scaffold t-SNE space. The cherry-picked scaffold set is colored according to the colors provided in Table 1. The colors of the cherry-picked scaffolds were also used to indicate their respective 100 SK-ordering based nearest neighbors. Enlarged (X) signs in B and D indicate the query compounds of the KNN analysis; green: DB00006, orange: DB00849, purple: DB00977, aqua: DB01362, blue: DB04837. (+) signs indicate the NNs of the query compound of the same color. Compounds are labeled according to Fig. 1
Fig. 9
Cherry-picked scaffold set and drug molecules in HCASE chemical spaces. A ChEMBL scaffolds were mapped onto a PHC of z = 8. Positions of the BMSs belonging to the cherry-picked scaffold set are highlighted on the PHC. B Embedding of the k = 5 nearest neighbors of selected DrugBank molecules with HCASE into ChEMBL space employing a PHC of z = 8. C The reduced scaffold set was mapped onto a PHC of z = 8. Positions of the BMSs belonging to the cherry-picked scaffold set are highlighted on the PHC. D Embedding of the k = 5 nearest neighbors of selected DrugBank molecules with HCASE into reduced scaffold set space employing a PHC of z = 8. The cherry-picked scaffold set is colored according to the colors provided in Table 1. The colors of the cherry-picked scaffolds were also used to indicate their respective 100 SK-ordering based nearest neighbors. Enlarged (X) signs in B and D indicate the query compounds of the KNN analysis; green: DB00006, orange: DB00849, purple: DB00977, aqua: DB01362, blue: DB04837. (+) signs indicate the NNs of the query compound of the same color. Compounds are labeled according to Fig. 1
Fig. 10
HCASE space defined by ChEMBL scaffolds, annotated by structures. The HCASE embedding of ChEMBL scaffolds at z = 8, shown in Fig. 9A, is annotated by structures. The cherry-picked scaffold set is colored according to the colors provided in Table 1. The colors of the cherry-picked scaffolds were also used to indicate their respective 100 SK-ordering based nearest neighbors. For demonstration purposes, structures are shown for each cherry-picked scaffold and for two neighbors randomly selected from its 100 SK-ordering based nearest neighbors. In each group of three scaffolds, the middle one is the cherry-picked scaffold. The structures of all 100 SK-based nearest neighbors are provided in Figs. S14–S22 in the SI
