J Cheminform. 2021 Mar 23;13(1):27.
doi: 10.1186/s13321-021-00506-2.

Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach

Hiroyuki Kuwahara et al. J Cheminform.

Abstract

Two-dimensional (2D) chemical fingerprints are widely used as binary features for quantifying the structural similarity of chemical compounds, which is an important step in similarity-based virtual screening (VS). Here, using an eigenvalue-based entropy approach, we identified 2D fingerprints that contribute little or nothing to shaping the eigenvalue distribution of the feature matrix, termed related fingerprints, and examined the degree to which they influence molecular similarity scores calculated with the Tanimoto coefficient. Our analysis identified many related fingerprints in publicly available fingerprint schemes and showed that their presence in the feature set can have substantial effects on the similarity scores and bias the outcome of molecular similarity analysis. Our results have implications for the optimal selection of 2D fingerprints for compound similarity analysis and for the identification of potential hits for compounds with target biological activity in VS.

Keywords: 2D fingerprint; Chemoinformatics; Similarity-based virtual screening; Structure-activity relationship; Unsupervised feature selection.
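
The abstract does not spell out how the eigenvalue-based entropy is computed. As a rough sketch only, the Python below assumes it is the Shannon entropy of the normalized eigenvalues of X^T X for a binary fingerprint matrix X (rows are compounds, columns are bit keys), and scores each key by how much that entropy changes when its column is left out. The function names `eigenvalue_entropy` and `leave_one_out_contributions` are hypothetical, and the published method may differ in centering, normalization, and the selection rule.

```python
import numpy as np

def eigenvalue_entropy(X):
    """Shannon entropy of the normalized eigenvalues of X^T X (assumed definition)."""
    X = np.asarray(X, dtype=float)
    # The squared singular values of X are the eigenvalues of X^T X.
    # Note: X is not mean-centered here; that is a choice of this sketch.
    s = np.linalg.svd(X, compute_uv=False)
    lam = s ** 2
    total = lam.sum()
    if total == 0:
        return 0.0
    p = lam / total
    p = p[p > 1e-12]  # treat 0 * log(0) as 0 by dropping numerically zero terms
    return float(-(p * np.log(p)).sum())

def leave_one_out_contributions(X):
    """Entropy change when each column (one fingerprint key) is removed in turn."""
    X = np.asarray(X, dtype=float)
    base = eigenvalue_entropy(X)
    return np.array([
        base - eigenvalue_entropy(np.delete(X, j, axis=1))
        for j in range(X.shape[1])
    ])
```

Under this assumption, keys whose contribution is close to zero would be the candidates for "related" fingerprints; the threshold for "close to zero" is not given in the abstract.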

Conflict of interest statement

We declare that we have no competing interests.

Figures

Fig. 1
An illustrative example of the effects of related fingerprints on similarity measures. A hypothetical fingerprint scheme with nine bit keys (F1 to F9) is used to represent small molecules in a hypothetical compound dataset. The fingerprint matrix of this dataset is found to have perfect multicollinearity in the first four features, with 2F1 = F2 + F3 + F4. The similarity of a query compound to three compounds is computed using the Tanimoto coefficient (Tc) with and without this collinearity. For the results without the collinearity, the Tanimoto coefficient computed without the first four features (F1 to F4) is shown.
Fig. 2
Fingerprint usage patterns of MACCS 166 keys on HMDB metabolite dataset. (a) The on-bit count of each key. (b) The pairwise Pearson’s correlation coefficient value for each pair of 68 MACCS keys with moderate on-bit counts
Fig. 3
Eigenvalue-based analysis of MACCS and Pubchem fingerprint matrices. (a) Normalized eigenvalues of the first 10 components for MACCS and Pubchem fingerprint matrices. (b) The distribution of the eigenvalue-based entropy for MACCS and Pubchem fingerprints
Fig. 4
Comparison of the Tanimoto similarity score based on different reduction levels. (a) The fraction of used fingerprints with respect to given reduced levels for MACCS and Pubchem fingerprints. The reduced level 0 indicates the fraction for the original fingerprints. (b) The comparison of the first 6 principal components with respect to different reduction levels. (c) The comparison of the fraction of metabolite pairs with the absolute similarity score difference between the original set of fingerprints and a given reduced set of fingerprints exceeding specified threshold values
Fig. 5
Illustration of 60 metabolite pairs with high levels of change in Tanimoto similarity measures. Heatmap showing the similarity scores of 60 metabolite pairs (y-axis) at given levels of reduced fingerprint sets (x-axis). From each of the MACCS and Pubchem fingerprint dictionaries, 30 pairs are selected based on the difference between the original set of fingerprints and a reduced set of fingerprints with reduced level 0.3
Fig. 6
Contribution of related fingerprints to the similarity score for 10 randomly selected query compounds in DrugBank. The scatterplot shows two contribution measures from the Tanimoto coefficient (the ratio of the intersecting set to the union) of the drug compounds with the 50 highest similarity scores for each query compound. The x-axis shows the contribution of the related fingerprints to the union set, while the y-axis shows the contribution to the intersecting set. The related fingerprints are defined to be the removed ones based on the reduced level 0.3. (a) MACCS scheme. (b) Pubchem scheme
Fig. 7
Relative changes of the Tanimoto similarity measure with the fingerprints in the reduced level 0.3 with respect to the one with the original fingerprints. The relative similarity changes are shown for the 33 similar-compound pairs with a high consensus by the 143 experts (80%). The error bars indicate the 1st and the 3rd quartiles of 10,000 Tanimoto coefficients computed with random pruning of the original fingerprints. These randomly selected fingerprint vectors have the same length as the one for the reduced level 0.3. (a) MACCS scheme. (b) Pubchem scheme
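
The Fig. 1 and Fig. 6 captions describe computing the Tanimoto coefficient with and without a subset of keys, and splitting the intersection and the union into the share contributed by the related keys. A minimal Python sketch of that arithmetic follows. The two 9-bit vectors are made-up values chosen so that 2F1 = F2 + F3 + F4 holds for each; they are not the data shown in Fig. 1, and the helper names `tanimoto` and `related_share` are hypothetical.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient of two binary vectors: |a AND b| / |a OR b|."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.count_nonzero(a | b)
    # Convention assumed here: two all-zero vectors get similarity 0.
    return np.count_nonzero(a & b) / union if union else 0.0

def related_share(a, b, related):
    """Fraction of the intersection and of the union that lies on `related` bits."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    mask = np.zeros(a.shape, dtype=bool)
    mask[list(related)] = True
    inter, union = a & b, a | b
    n_inter, n_union = np.count_nonzero(inter), np.count_nonzero(union)
    share_inter = np.count_nonzero(inter & mask) / n_inter if n_inter else 0.0
    share_union = np.count_nonzero(union & mask) / n_union if n_union else 0.0
    return share_inter, share_union

# Made-up 9-bit fingerprints (F1..F9); not taken from the paper.
query     = [1, 1, 1, 0, 1, 0, 1, 0, 0]
candidate = [1, 1, 0, 1, 1, 0, 0, 1, 0]
related = [0, 1, 2, 3]  # F1..F4 as zero-based positions
kept = [i for i in range(9) if i not in related]

tc_full = tanimoto(query, candidate)
tc_reduced = tanimoto(np.take(query, kept), np.take(candidate, kept))
share_inter, share_union = related_share(query, candidate, related)
print(tc_full, tc_reduced, share_inter, share_union)
```

For these made-up vectors the score falls from 3/7 with all nine keys to 1/3 once F1 to F4 are removed, and the removed keys account for 2/3 of the intersection and 4/7 of the union.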
