[Preprint]. 2025 Jul 7:2025.07.03.663077.
doi: 10.1101/2025.07.03.663077.

Undersampling techniques for non-linear chemical space visualization

Akash Surendran et al. bioRxiv.

Abstract

The visualization of high-dimensional chemical space is a critical tool for understanding molecular diversity and structure-property relationships, and for guiding compound selection. However, the performance of non-linear dimensionality reduction (DR) techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and Generative Topographic Mapping (GTM) is often sensitive to the choice of hyperparameters, and training these methods on large datasets is computationally expensive. In this study, we investigated how undersampling affects hyperparameter selection for these non-linear DR methods. Our results demonstrate that selecting small, representative subsets of chemical data not only reduces the computational cost of hyperparameter tuning but also serves as an innovative means to train non-linear DR methods, leading to projections that better preserve the local structure of the chemical space.

Figures

Figure 1:
Brief workflow: 30 CHEMBL datasets curated for 30 different macromolecular targets were selected for this study. Morgan fingerprints were generated for each dataset, and samples (10% and 20% of the original dataset size) were generated using iSIM complementary similarity sampling methods. Nonlinear DR methods were trained using each sample, and the corresponding original dataset was projected in 2D. Neighborhood Preservation Metrics were analyzed to compare the quality of these projections for chemical space visualization.
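The sampling step in this workflow can be illustrated with a minimal sketch. It assumes that a molecule's complementary similarity is the mean pairwise Tanimoto similarity of the set with that molecule left out, that medoid-like molecules have the lowest values, and that outlier-like molecules have the highest. These definitions, the function names, and the toy fingerprints (sets of on-bit indices standing in for Morgan fingerprints) are assumptions for illustration. The brute-force loop is not the iSIM algorithm and only scales to tiny sets.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_pairwise_similarity(fps):
    """Average Tanimoto similarity over all pairs (brute force, needs >= 2 fingerprints)."""
    pairs = list(combinations(fps, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def complementary_similarities(fps):
    """Assumed definition: similarity of the set after leaving each molecule out in turn."""
    return [mean_pairwise_similarity(fps[:i] + fps[i + 1:]) for i in range(len(fps))]


def sample_indices(fps, fraction, mode):
    """Pick a fraction of the set from one end of the complementary-similarity ranking."""
    n_pick = max(1, round(fraction * len(fps)))
    comp = complementary_similarities(fps)
    order = sorted(range(len(fps)), key=lambda i: comp[i])  # ascending
    if mode == "medoid":   # assumed: lowest values are the most central molecules
        return order[:n_pick]
    if mode == "outlier":  # assumed: highest values are the most peripheral molecules
        return order[-n_pick:]
    raise ValueError("mode must be 'medoid' or 'outlier'")


# Hypothetical toy data: five related fingerprints and one unrelated one.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {1, 2, 5}, {1, 2, 3, 5}, {7, 8, 9}]
outlier_sample = sample_indices(toy_fps, 0.2, "outlier")
medoid_sample = sample_indices(toy_fps, 0.2, "medoid")
```

In the toy set the unrelated fingerprint is the one picked by the outlier mode, while the medoid mode picks from the related group. The other schemes named in the figures (Extremes, Quota, Stratified) are not sketched here.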
Figure 2:
t-SNE plots for CHEMBL234 obtained by sampling 10% of the data. Blue regions show the sampled molecules, and grey regions show the projection of the entire dataset. The first three sampling methods target specific portions of the chemical space, whereas the last two cover diverse regions.
Figure 3:
t-SNE, UMAP and GTM projections of the CHEMBL237 dataset trained on 20% samples obtained by Extremes, Medoid, Outlier, Quota and Stratified sampling. The abundance of lighter shades in the GTM heatmap indicates superior local neighborhood preservation. t-SNE, on the other hand, is more sensitive to the choice of sampling.
Figure 4:
DR metrics plot in the order Extremes, Medoid, Outlier, Quota and Stratified. The color scheme is as follows: t-SNE: Yellow, UMAP: Green and GTM: Red. The shaded regions represent the standard deviation across datasets.
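The metrics behind this plot are not specified on this page. As one common way to quantify neighborhood preservation, the sketch below computes the average fraction of each molecule's k nearest neighbors in fingerprint space that remain among its k nearest neighbors in the 2D projection. The function names, the Tanimoto distance on sets of on-bit indices, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import math


def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bit indices."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)


def k_nearest(i, n, k, dist):
    """Indices of the k points closest to point i under the distance function dist(i, j)."""
    others = [j for j in range(n) if j != i]
    return set(sorted(others, key=lambda j: dist(i, j))[:k])


def neighborhood_preservation(fps, coords, k):
    """Mean share of high-dimensional k-nearest neighbors that are kept in the 2D map."""
    n = len(fps)
    total = 0.0
    for i in range(n):
        high = k_nearest(i, n, k, lambda p, q: tanimoto_distance(fps[p], fps[q]))
        low = k_nearest(i, n, k, lambda p, q: math.dist(coords[p], coords[q]))
        total += len(high & low) / k
    return total / n


# Hypothetical toy data: two pairs of similar fingerprints.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
good_map = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # pairs stay together
bad_map = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]   # pairs are split apart

good_score = neighborhood_preservation(toy_fps, good_map, k=1)
bad_score = neighborhood_preservation(toy_fps, bad_map, k=1)
```

A score of 1 means every molecule keeps all of its k nearest neighbors in the projection, and a score of 0 means none are kept. The choice of k is left to the reader.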

