J Cheminform. 2018 Nov 22;10(1):55.
doi: 10.1186/s13321-018-0311-x.

Statistical-based database fingerprint: chemical space dependent representation of compound databases

Norberto Sánchez-Cruz et al. J Cheminform.

Abstract

Background: Simplified representation of compound databases has several applications in cheminformatics. Herein, we introduce an alternative and general method to build single fingerprint representations of compound databases. The approach is inspired by the previously published modal fingerprints, which aim to capture the most significant bits of a fingerprint representation for a compound data set. The novelty of the proposed statistical-based database fingerprint (SB-DFP) is that it is generated from binomial proportion comparisons, taking as reference the distribution of "1" bits in a large representative set of the chemical space.
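The binomial proportion comparison described above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: a one-sided z-test per bit, the reference frequencies treated as known population proportions, and a significance level `alpha`. The function name `sb_dfp` and its parameters are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def sb_dfp(dataset_fps, reference_freqs, alpha=0.01):
    """Sketch of a statistical-based database fingerprint (assumed test).

    dataset_fps: list of equal-length 0/1 lists, one per compound.
    reference_freqs: fraction of "1" bits per position in the reference set.
    A bit is set to 1 only if it is significantly MORE frequent in the
    data set than in the reference (one-sided z-test, an assumption).
    """
    n = len(dataset_fps)
    if n == 0:
        raise ValueError("data set must contain at least one fingerprint")
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    fingerprint = []
    for i, p0 in enumerate(reference_freqs):
        p_hat = sum(fp[i] for fp in dataset_fps) / n  # observed proportion
        se = sqrt(p0 * (1 - p0) / n)  # standard error under the reference
        if se == 0:
            # Reference bit is always on or always off: no variance to test.
            bit = 1 if p_hat > p0 else 0
        else:
            bit = 1 if (p_hat - p0) / se > z_crit else 0
        fingerprint.append(bit)
    return fingerprint
```

Under these assumptions, a bit that is common in the data set but equally common in the reference is left at 0, which is what distinguishes this construction from a plain frequency threshold on the data set alone.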

Results: To illustrate the method, SB-DFPs were constructed for 28 epigenetic target data sets retrieved from a recently published epigenomics database of interest in probe and drug discovery. For each target data set, the SB-DFPs were built from two representative fingerprints of different design, using as reference a data set of more than 15 million compounds from ZINC. The application of SB-DFP was illustrated and compared to other methods through association relationships among the 28 epigenetic data sets and through similarity searching. Overall, SB-DFPs captured both the common features between data sets and the distinct features of each set. In similarity searching, SB-DFP equaled or outperformed other approaches for at least 20 of the 28 sets.
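As a rough illustration of similarity searching with a single database fingerprint as the query, the sketch below ranks library compounds by their similarity to that fingerprint. The Tanimoto coefficient is an assumption here, since the abstract does not name the similarity measure, and the names `tanimoto` and `rank_by_similarity` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length 0/1 bit lists."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits on in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits on in at least one
    return both / either if either else 0.0


def rank_by_similarity(query_fp, library):
    """Rank library compounds by similarity to a single query fingerprint.

    library: dict mapping a compound identifier to its 0/1 bit list.
    Returns (identifier, score) pairs, most similar first.
    """
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Passing a database fingerprint as `query_fp` scores every library compound against the whole data set in one pass, rather than against each of its members.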

Conclusions: SB-DFP is a general approach based on binomial proportion comparisons to represent a compound data set with a single fingerprint. SB-DFP can be developed, at least in principle, from any fingerprint and reference data set. SB-DFP is a good alternative for exploring relationships between targets through their associated compound data sets and for performing similarity searching.

Keywords: Chemical space; Epi-informatics; Molecular fingerprints; Representation; Similarity searching.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Schematic representation of single fingerprints for a compound database and a hypothetical 20-bit fingerprint. The upper part of the charts shows the binary representation of the generated single fingerprint: (a) database fingerprint (DFP) and (b) statistical-based database fingerprint (SB-DFP).
Fig. 2
Dendrograms for hierarchical clustering of targets computed with different approaches based on two molecular fingerprints, MACCS keys and ECFP4. (a) The ground truth; (b, e) all-compound comparisons (ACC); (c, f) database fingerprint (DFP); (d, g) statistical-based database fingerprint (SB-DFP). The Adjusted Rand Index (ARI) of each clustering is indicated in each panel. See main text for details.
Fig. 3
Early enrichment performance of similarity searches. Average recovery rates (selection set size equal to the number of ADCs) for three search strategies over 28 epigenetic data sets are reported in a histogram representation for (a) MACCS keys and (b) ECFP4. Standard deviations are displayed as error bars.
Fig. 4
General performance of similarity searches. Average AUCs for three search strategies over 28 epigenetic data sets are reported in a histogram representation for (a) MACCS keys and (b) ECFP4. Standard deviations are displayed as error bars.
