J Cheminform. 2017 Feb 6:9:9.
doi: 10.1186/s13321-017-0195-1. eCollection 2017.

Database fingerprint (DFP): an approach to represent molecular databases

Eli Fernández-de Gortari et al. J Cheminform.

Abstract

Background: Molecular fingerprints are widely used in several areas of chemoinformatics, including diversity analysis and similarity searching. The fingerprint-based analysis of chemical libraries, in particular of large collections, usually requires a molecular representation of each compound in the library, which may lead to issues of storage space and redundant calculations. In fact, information redundancy is inherent to the data, resulting in bit positions in the fingerprint that carry no significant information.

Results: A general approach is proposed herein to represent an entire compound library with a single binary fingerprint. The development of the database fingerprint (DFP) is illustrated first using a short fingerprint (MACCS keys) for 10 data sets of general interest in chemistry. The application of the DFP is further shown with PubChem fingerprints for the data sets used in the primary example but with a larger number of compounds, up to 25,000 molecules. The performance of the DFP was studied through differential Shannon entropy, k-means clustering, and DFP/Tanimoto similarity.

Conclusions: The DFP is designed to capture key information of the compound collection and can be used to compare and assess the diversity of molecular libraries. This Preliminary Communication shows the potential of the novel fingerprint for analyzing inter-library relationships. A major future goal is to apply the DFP to virtual screening and to develop DFPs for other data sets based on several different types of fingerprints.

Graphical Abstract: Database fingerprint captures the key information of molecular databases to perform chemical space characterization and virtual screening.

Keywords: Diversity; Information content; Molecular fingerprints; Shannon entropy; Similarity.
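
The idea summarized in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a DFP bit is switched on when the fraction of compounds in the library carrying that bit reaches a threshold (0.5 here, the value mentioned in the figure captions), and it assumes one possible entropy definition, namely the Shannon entropy of the normalized bit-frequency distribution. The function names (database_fingerprint, tanimoto, shannon_entropy) and the toy 4-bit libraries are hypothetical and only serve the illustration.

```python
import math


def database_fingerprint(fingerprints, threshold=0.5):
    """Collapse a library of equal-length binary fingerprints into one bit vector.

    Assumption (not confirmed by the abstract): a bit is set in the DFP when
    the fraction of compounds with that bit on is >= threshold.
    Returns the DFP and the per-bit frequencies.
    """
    if not fingerprints:
        raise ValueError("library is empty")
    n_bits = len(fingerprints[0])
    if any(len(fp) != n_bits for fp in fingerprints):
        raise ValueError("all fingerprints must have the same length")
    n = len(fingerprints)
    frequencies = [sum(fp[i] for fp in fingerprints) / n for i in range(n_bits)]
    dfp = [1 if f >= threshold else 0 for f in frequencies]
    return dfp, frequencies


def tanimoto(a, b):
    """Tanimoto coefficient of two binary vectors: |a AND b| / |a OR b|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def shannon_entropy(frequencies):
    """Shannon entropy (in bits) of the bit frequencies, normalized to sum to 1.

    Assumption: this is one plausible definition, chosen for illustration.
    """
    total = sum(frequencies)
    if total == 0:
        return 0.0
    probs = [f / total for f in frequencies if f > 0]
    return -sum(p * math.log2(p) for p in probs)


# Toy libraries with hypothetical 4-bit fingerprints (real MACCS keys have 166 bits).
library_a = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
library_b = [[1, 0, 1, 0], [1, 0, 1, 1]]

dfp_a, freq_a = database_fingerprint(library_a)
dfp_b, freq_b = database_fingerprint(library_b)
similarity = tanimoto(dfp_a, dfp_b)  # inter-library DFP/Tanimoto similarity
entropy_a = shannon_entropy(freq_a)  # information content of library A's bits
```

Under these assumptions, two libraries are compared by a single Tanimoto calculation on their DFPs instead of pairwise comparisons between all of their compounds, which is the storage and computation saving that the Background motivates.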

Figures

Graphical Abstract. Database fingerprint captures the key information of molecular databases to perform chemical space characterization and virtual screening
Fig. 1. a Schematic representation of a binary and dictionary-based molecular fingerprint. b Schematic representation of a database fingerprint (DFP)
Fig. 2. Overview of the approach implemented in this work
Fig. 3. Probability distributions of MACCS keys (166 bits) of representative data sets studied in this work. The number of compounds, mean MACCS keys/Tanimoto similarity, and Shannon entropy (SE) are shown
Fig. 4. Relationship between Shannon entropy and MACCS keys/Tanimoto similarity for the ten compound data sets in Table 1. A drugs, I general screening, C clinical, G GDB13, D DNMT1, E epigenetic focused, M semi-synthetic, N natural products, B benzimidazole, GR GRAS, R random
Fig. 5. Relationship between Shannon entropy and DFP/Tanimoto similarity, and k-means Euclidean clustering, for the ten compound data sets in Table 2 at a threshold of 0.5
Fig. 6. Probability distribution of the 198 significant bit positions recovered from the original databases represented by PubChem fingerprints at a threshold of 0.5
