Digit Discov. 2024 May 7;3(6):1160-1171.
doi: 10.1039/d4dd00041b. eCollection 2024 Jun 12.

iSIM: instant similarity

Kenneth López-Pérez et al. Digit Discov. 2024.

Abstract

The quantification of molecular similarity has been part of cheminformatics since its beginning. Although several similarity indices and molecular representations have been reported, all of them ultimately reduce to comparing only two objects at a time. Hence, obtaining the average similarity of a set of molecules requires computing all pairwise comparisons, so the computational cost scales quadratically with the number of molecules. Here we propose an exact alternative: iSIM (instant similarity). iSIM compares multiple molecules at the same time and yields the same value as the average of the pairwise comparisons for molecules represented by binary fingerprints and real-valued descriptors. In this work, we introduce the mathematical framework and several applications of iSIM in chemical sampling, visualization, diversity selection, and clustering.
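
To make the idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary fingerprints stored as rows of a NumPy array and uses the Russell-Rao index (the index behind the iRR label in the figure captions), for which the equivalence with the pairwise average is easy to verify by hand: a bit that is on in k of the N molecules contributes k(k-1)/2 shared "on" positions across all pairs, so the column sums alone determine the average. The function names `isim_russell_rao` and `pairwise_russell_rao` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations

import numpy as np


def isim_russell_rao(fps: np.ndarray) -> float:
    """Average pairwise Russell-Rao similarity from column sums, O(N*M)."""
    n, m = fps.shape  # n molecules (n >= 2), m bits per fingerprint
    k = fps.sum(axis=0).astype(np.int64)  # molecules with each bit on
    shared_on = (k * (k - 1) // 2).sum()  # 1-1 coincidences over all pairs
    n_pairs = n * (n - 1) // 2
    return float(shared_on / (m * n_pairs))


def pairwise_russell_rao(fps: np.ndarray) -> float:
    """Reference value: explicit loop over all pairs, O(N^2 * M)."""
    n, m = fps.shape
    sims = [np.dot(fps[i], fps[j]) / m for i, j in combinations(range(n), 2)]
    return float(np.mean(sims))


# Randomly generated binary fingerprints (illustrative data only).
rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(50, 64))

print(isim_russell_rao(fps), pairwise_russell_rao(fps))
```

The two printed values should agree up to floating-point rounding, while the first function never forms a pair explicitly. Other indices need additional per-column counts (for example 0-0 coincidences and mismatches), which this sketch does not cover.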

Conflict of interest statement

There are no conflicts to declare.

Figures

Fig. 1. iSIM vs. average pairwise similarity for 10 000 randomly generated libraries. Molecules are represented by binary fingerprints.
Fig. 2. iSIM vs. average pairwise similarity for 30 ChEMBL libraries. Molecules are represented by binary MACCS, RDKit, and ECFP4 fingerprints.
Fig. 3. iSIM vs. average pairwise similarity for 10 000 randomly generated libraries. Molecules are represented by randomly generated, normalized continuous descriptors.
Fig. 4. iSIM vs. average pairwise similarity for 30 ChEMBL libraries. Molecules are represented by 208 normalized RDKit numerical descriptors (continuous and discrete).
Fig. 5. Structures of the CHEMBL214 database ranked by increasing complementary similarity using the RDKit fingerprints and iRR similarity index. Structures shown correspond to the top (medoids) and bottom (outliers) three molecules.
Fig. 6. MaxMin (bmax, yellow), iRR (iSIM, blue), and sqrt_iRR (sqrt_iSIM, green) results for the diversity sampling of the CHEMBL214 dataset represented by RDKit fingerprints: (A) pairwise similarity of the selected set, (B) minimum similarity between elements of the selected set, (C) maximum similarity between elements of the selected set.
Fig. 7. (A) iSIMDiv and iSIMRevDiv selections for different data percentages (1–99%, in 1% steps) for the CHEMBL214 dataset represented by RDKit fingerprints and selected by the iRR index. (B) Computing time variation of the diversity selection methods with the data percentage selected.
Fig. 8. Graphical explanation of the medoid, outlier, extreme, stratified and quota sampling methods.
Fig. 9. PCA score plots of the CHEMBL214 dataset represented by RDKit binary fingerprints. Blue points represent the 10% of molecules selected by each selection algorithm, while grey points represent non-selected molecules. iSIM-related methods use the iT similarity index.
Fig. 10. t-SNE plots for the CHEMBL214 dataset represented by RDKit binary fingerprints. Blue points represent the 10% of molecules selected by each selection algorithm, while grey points represent non-selected molecules. iSIM-related methods use the iT similarity index as a metric.
Fig. 11. Dendrograms from hierarchical clustering of molecules in the CHEMBL214 (top) and CHEMBL2835 (bottom) libraries using iSM on MACCS fingerprints. The number of elements in each cluster is indicated in brackets. Coloring corresponds to the final 10 clusters. The dashed red line represents the cut-off for the optimal number of clusters (25 for CHEMBL214, 41 for CHEMBL2835).
Fig. 12. Medoids of each of the 10 colored clusters in the CHEMBL214 (top) and CHEMBL2835 (bottom) libraries using iSM on MACCS fingerprints.
