J Cheminform. 2011 Jul 22;3(1):26.
doi: 10.1186/1758-2946-3-26.

PubChem3D: Biologically relevant 3-D similarity

Sunghwan Kim et al. J Cheminform.

Abstract

Background: The use of 3-D similarity techniques in the analysis of biological data and virtual screening is pervasive, but what is a biologically meaningful 3-D similarity value? Can one find statistically significant separation between "active/active" and "active/inactive" spaces? These questions are explored using 734,486 biologically tested chemical structures, 1,389 biological assay data sets, and six different 3-D similarity types utilized by PubChem analysis tools.

Results: The similarity value distributions of 269.7 billion unique conformer pairs from 734,486 biologically tested compounds (all-against-all) from PubChem were utilized to help work towards an answer to the question: what is a biologically meaningful 3-D similarity score? The average and standard deviation for the six similarity measures STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt were 0.54 ± 0.10, 0.07 ± 0.05, 0.62 ± 0.13, 0.41 ± 0.11, 0.18 ± 0.06, and 0.59 ± 0.14, respectively. Considering that this random distribution of biologically tested compounds was constructed using a single theoretical conformer per compound (the "default" conformer provided by PubChem), further study may be necessary using multiple diverse conformers per compound; however, given the breadth of the compound set, the single conformer per compound results may still apply to the case of multi-conformer per compound 3-D similarity value distributions. As such, this work is a critical step, covering a very wide corpus of chemical structures and biological assays, creating a statistical framework to build upon.

The second part of this study explored the question of whether it was possible to realize a statistically meaningful 3-D similarity value separation between reputed biological assay "inactives" and "actives". Using the terminology of noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs to represent comparison of the "active/active" and "active/inactive" spaces, respectively, each of the 1,389 biological assays was examined by its 3-D similarity score differences between the NN and NI pairs and analyzed across all assays and by assay category types. While a consistent trend of separation was observed, this result was not statistically unambiguous after considering the respective standard deviations.
While not all "actives" in a biological assay are amenable to this type of analysis, e.g., due to different mechanisms of action or binding configurations, the ambiguous separation may also be due to employing a single conformer per compound in this study. With that said, there was a subset of biological assays for which a clear separation between the NN and NI pairs was found. In addition, use of combo Tanimoto (ComboT) alone, independent of superposition optimization type, appears to be the most efficient 3-D score type in identifying these cases.
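To make the random-pair baseline concrete, the following minimal Python sketch expresses a 3-D similarity score as a number of standard deviations above the averages quoted in the Results. The dictionary name `BASELINE` and the function `z_score` are illustrative choices, not part of any PubChem tool, and the calculation says nothing about the shape of the underlying distributions.

```python
# Mean and standard deviation of each 3-D similarity measure, as quoted in
# the Results (single conformer per compound, all-against-all).
BASELINE = {
    "STST-opt": (0.54, 0.10),
    "CTST-opt": (0.07, 0.05),
    "ComboTST-opt": (0.62, 0.13),
    "STCT-opt": (0.41, 0.11),
    "CTCT-opt": (0.18, 0.06),
    "ComboTCT-opt": (0.59, 0.14),
}


def z_score(score: float, measure: str) -> float:
    """Return how many standard deviations `score` lies above the
    random-pair average for the given similarity measure."""
    mean, std = BASELINE[measure]
    return (score - mean) / std


# Example: the CT-optimized ComboT of 1.04 quoted in the Figure 15 caption.
print(f"{z_score(1.04, 'ComboTCT-opt'):.2f}")  # prints 3.21
```

The printed value agrees with the Figure 15 caption, which describes a ComboT of 1.04 as more than 3.2 standard deviations beyond the random pair average of 0.59.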

Conclusion: This study provides a statistical guideline for analyzing biological assay data in terms of 3-D similarity and PubChem structure-activity analysis tools. When using a single conformer per compound, a relatively small number of assays appear to be able to separate "active/active" space from "active/inactive" space.
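One way to pick out the small number of assays with unusual separation is the μ ± σ rule described in the captions of Figures 11 and 12. The sketch below assumes that the per-assay difference is taken as the NN average minus the NI average and that a population standard deviation is used; neither detail is confirmed by the text shown here. The assay labels and values are invented for illustration only.

```python
from statistics import mean, pstdev


def flag_outliers(diff_by_assay: dict[str, float]) -> tuple[list[str], list[str]]:
    """Split assays into those below mu - sigma and those above mu + sigma,
    where mu and sigma are taken over all per-assay differences."""
    values = list(diff_by_assay.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    lower, upper = mu - sigma, mu + sigma
    below = sorted(a for a, d in diff_by_assay.items() if d < lower)
    above = sorted(a for a, d in diff_by_assay.items() if d > upper)
    return below, above


# Hypothetical per-assay differences (NN average minus NI average).
# These labels and numbers are made up and do not come from the study.
example = {"A": 0.02, "B": 0.03, "C": 0.01, "D": 0.45, "E": -0.20, "F": 0.02}

below, above = flag_outliers(example)
print("lower-bound outliers:", below)  # prints lower-bound outliers: ['E']
print("upper-bound outliers:", above)  # prints upper-bound outliers: ['D']
```

In this toy input, assay D shows NN pairs that are much more similar than NI pairs (an upper-bound outlier), while assay E shows the reverse.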

Figures

Figure 1
Atom and feature count histograms of biologically tested compounds. Frequency (blue) and percent cumulative frequency (red) of (a) heavy atom count and (b) total feature count for the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database.
Figure 2
Conformer volume and quadrupole histograms of biologically tested compounds. Frequency (blue) and percent cumulative frequency (red) of (a) volume, (b) Qx, (c) Qy, and (d) Qz for the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database.
Figure 3
Individual feature histograms of biologically tested compounds. Frequency (blue) and percent cumulative frequency (red) of respective feature atom count for the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: (a) hydrogen-bond donor count, (b) hydrogen-bond acceptor count, (c) anion count, (d) cation count, (e) hydrophobe count, and (f) ring count.
Figure 4
Overall 3-D similarity statistics between biologically tested compounds. Distribution of 3-D similarity scores of 269,734,474,855 conformer pairs, arising from the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: (a) ST-optimized similarity scores and (b) CT-optimized similarity scores. A single conformer was used for each compound. All values binned in 0.01 increments.
Figure 5
Per-CID shape similarity statistics of biologically tested compounds. Distribution of the average and standard deviation of the ST scores for each of the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: (a) ST-optimized ST (STST-opt) and (b) CT-optimized ST (STCT-opt). All values binned in 0.01 increments.
Figure 6
Per-CID feature similarity statistics of biologically tested compounds. Distribution of the average and standard deviation of the CT scores for each of 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: (a) ST-optimized CT (CTST-opt) and (b) CT-optimized CT (CTCT-opt). All values binned in 0.01 increments.
Figure 7
Per-CID shape plus feature similarity statistics of biologically tested compounds. Distribution of the average and standard deviation of the ComboT scores for each of the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: (a) ST-optimized ComboT (ComboTST-opt) and (b) CT-optimized ComboT (ComboTCT-opt). All values binned in 0.01 increments.
Figure 8
Assay counts by category. Assay count for each assay-type category in the PubChem BioAssay database: (a) for assays that have at least one tested molecule with 3-D information (as of January 28, 2010), (b) for assays that have at least one noninactive-noninactive (NN) pair and one noninactive-inactive (NI) pair, and (c) for assays that have at least six NN pairs and six NI pairs.
Figure 9
μ(XT) per-AID similarity histogram. The distribution of the average similarity scores for noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs of 1,389 AIDs in the PubChem BioAssay database: (a) shape-Tanimoto (ST), (b) color-Tanimoto (CT), and (c) Combo-Tanimoto (ComboT). All values binned in 0.01 increments.
Figure 10
μ(XTNN-NI) per-AID similarity statistics. The distribution of the difference of the average similarity scores for noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs of 1,389 AIDs in the PubChem BioAssay database: (a) shape-Tanimoto (ST), (b) color-Tanimoto (CT), and (c) Combo-Tanimoto (ComboT). All values binned in 0.01 increments.
Figure 11
Assay μ[μ(XTNN-NI)] outlier commonality by 3-D similarity type. The Venn diagrams show the number of AIDs, among the 1,389 AIDs in the PubChem BioAssay database, whose difference between the average similarity scores for noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs falls outside the range from the lower bound to the upper bound, where "lower-bound" corresponds to μ - σ and "upper-bound" corresponds to μ + σ.
Figure 12
Assay μ[μ(ComboTNN-NI)] outlier commonality by superposition optimization type. The Venn diagrams show the number of AIDs, among the 1,389 AIDs in the PubChem BioAssay database, whose difference between the average ComboT similarity scores for noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs falls outside the range from the lower bound to the upper bound, where "lower-bound" corresponds to μ - σ and "upper-bound" corresponds to μ + σ. Upper-bound outliers tend to be shared by both superposition optimization types, while lower-bound outliers are less often shared.
Figure 13
Separation between actives and inactives. An example of clear separation between formula image 3-D similarities of 0.45 (see Table 5), the four active compounds from AID 672: (a) shows 2-D and 3-D similarity dendrograms generated using the PubChem Structure Clustering tool; (b) shows the respective 2-D similarity values (lower triangle) and 3-D similarity values (upper triangle); and (c) shows a representative alignment showing how CID 647501 is 3-D similar to CID 787437 (ST/CT 0.73/0.28), despite low 2-D similarity (0.56).
Figure 14
2-D similarity isolates related chemical series. Dendrogram from the PubChem Structure Clustering tool for 88 of the 92 noninactive pairs from AID 2230 showing two primary clusters (containing 51 and 31 compounds, respectively) at 0.8 Tanimoto using 2-D similarity. Note that all but one compound is related above 0.7 Tanimoto.
Figure 15
3-D similarity interrelates chemical series. Dendrogram from the PubChem Structure Clustering tool for 88 of the 92 noninactive pairs from AID 2230 showing three primary clusters (containing 46, 20, and 13 compounds, respectively) at 1.2 combo Tanimoto (ComboT) using 3-D similarity, CT-optimized. All structures are interrelated at a ComboT of 1.04, more than 3.2 standard deviations beyond the random pair average of 0.59.
Figure 16
Analysis method overview. Pseudo code that describes the process by which the average and standard deviation of the ST-optimized similarity scores for noninactive-inactive (NI) pairs were computed for each individual bioassay. This process was repeated for the CT-optimized similarity scores. The average and standard deviation of the similarity scores for the noninactive-noninactive (NN) pairs were computed in a similar manner, except that only cid_aid_list1 (for noninactives) was searched for both cid1 and cid2.
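The pseudo code of Figure 16 is not reproduced on this page, so the following Python sketch is only a reading of its caption, not the authors' implementation. It assumes that NI pairs combine each noninactive CID with each inactive CID, that NN pairs are unordered pairs drawn from the noninactive list alone, and that the per-assay difference is the NN average minus the NI average. The names `pair_stats`, `nn_ni_stats`, and `TOY_COMBOT`, along with all CIDs and scores, are hypothetical.

```python
from itertools import combinations, product
from statistics import mean, pstdev
from typing import Callable, Iterable

Similarity = Callable[[int, int], float]


def pair_stats(pairs: Iterable[tuple[int, int]], similarity: Similarity) -> tuple[float, float]:
    """Average and (population) standard deviation of the pair scores."""
    scores = [similarity(cid1, cid2) for cid1, cid2 in pairs]
    if not scores:
        raise ValueError("at least one pair is required")
    return mean(scores), pstdev(scores)


def nn_ni_stats(noninactives: list[int], inactives: list[int], similarity: Similarity) -> dict[str, float]:
    """Per-assay NN and NI statistics for one similarity measure."""
    # NN: both cid1 and cid2 come from the noninactive list.
    mu_nn, sd_nn = pair_stats(combinations(sorted(set(noninactives)), 2), similarity)
    # NI: cid1 from the noninactive list, cid2 from the inactive list.
    mu_ni, sd_ni = pair_stats(product(sorted(set(noninactives)), sorted(set(inactives))), similarity)
    return {"mu_nn": mu_nn, "sd_nn": sd_nn, "mu_ni": mu_ni, "sd_ni": sd_ni,
            "diff": mu_nn - mu_ni}  # assumption: difference taken as NN minus NI


# Toy ComboT lookup for placeholder CIDs 1-5; these scores are invented.
TOY_COMBOT = {frozenset(k): v for k, v in {
    (1, 2): 1.2, (1, 3): 1.0, (2, 3): 1.1,
    (1, 4): 0.5, (1, 5): 0.6, (2, 4): 0.7,
    (2, 5): 0.4, (3, 4): 0.6, (3, 5): 0.8,
}.items()}

stats = nn_ni_stats([1, 2, 3], [4, 5], lambda a, b: TOY_COMBOT[frozenset((a, b))])
print({name: round(value, 3) for name, value in stats.items()})
```

In practice the similarity callable would look up precomputed ST, CT, or ComboT scores for a given superposition optimization type; the toy lookup table stands in for that step.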
