J Cheminform. 2015 Jul 7;7:33.
doi: 10.1186/s13321-015-0070-x. eCollection 2015.

PubChem structure-activity relationship (SAR) clusters

Sunghwan Kim et al. J Cheminform. 2015.

Abstract

Background: Developing structure-activity relationships (SARs) of molecules is an important approach in facilitating hit exploration in the early stage of drug discovery. Although information on millions of compounds and their bioactivities is freely available to the public, it is very challenging to infer a meaningful and novel SAR from that information.

Results: Research discussed in the present paper employed a bioactivity-centered clustering approach to group 843,845 non-inactive compounds stored in PubChem according to both structural similarity and bioactivity similarity, with the aim of mining bioactivity data in PubChem for useful SAR information. The compounds were clustered in three bioactivity similarity contexts: (1) non-inactive in a given bioassay, (2) non-inactive against a given protein, and (3) non-inactive against proteins involved in a given pathway. In each context, these small molecules were clustered according to their two-dimensional (2-D) and three-dimensional (3-D) structural similarities. The resulting 18 million clusters, named "PubChem SAR clusters", were delivered in such a way that each cluster contains a group of small molecules similar to each other in both structure and bioactivity.
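The three bioactivity similarity contexts can be pictured as three ways of grouping the same activity records before any structural clustering takes place. The following is a minimal sketch of that grouping step only, not the authors' pipeline. The record layout, the sample values, and the function name `non_inactive_sets` are hypothetical, and treating every outcome other than "inactive" as non-inactive is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical activity records: (CID, AID, protein GI, pathway BSIDs, outcome).
# All identifiers and outcomes below are made up for illustration.
records = [
    (101, 1, 5001, ("P1",), "active"),
    (102, 1, 5001, ("P1",), "inactive"),
    (103, 2, 5001, ("P1", "P2"), "inconclusive"),
    (104, 3, 5002, ("P2",), "active"),
]

def non_inactive_sets(records):
    """Group non-inactive compounds by assay, by protein, and by pathway."""
    by_assay = defaultdict(set)
    by_protein = defaultdict(set)
    by_pathway = defaultdict(set)
    for cid, aid, gi, bsids, outcome in records:
        # Assumption: anything not flagged "inactive" counts as non-inactive.
        if outcome == "inactive":
            continue
        by_assay[aid].add(cid)        # context 1: same bioassay
        by_protein[gi].add(cid)       # context 2: same protein target
        for bsid in bsids:            # context 3: same pathway
            by_pathway[bsid].add(cid)
    return dict(by_assay), dict(by_protein), dict(by_pathway)

assay_sets, protein_sets, pathway_sets = non_inactive_sets(records)
```

Each resulting set would then be clustered separately by 2-D or 3-D structural similarity, which is why the same compound can appear in clusters from several contexts.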

Conclusions: The PubChem SAR clusters, pre-computed using publicly available bioactivity information, make it possible to quickly navigate and narrow down the compounds of interest. Each SAR cluster can serve as a resource for developing a meaningful SAR or for designing or expanding compound libraries. It can also help to predict the potential therapeutic effects and pharmacological actions of less-known compounds from those of well-known compounds (i.e., drugs) in the same cluster.

Keywords: BioSystems; Cluster analysis; MeSH; Molecular similarity; PubChem; PubChem3D; Structure–activity relationship (SAR).

Figures

Figure 1
Number of structure–activity relationship (SAR) clusters. These numbers do not include clusters with only one compound (i.e., singletons). 3-D clusters that have multiple conformers of only one compound were also regarded as singletons and not included in the statistics. N_total indicates the total number of clusters for a given bioactivity similarity context. Numbers in parentheses on the pie charts indicate the percentages of the five cluster types (based on the structural similarity measures used in clustering) with respect to N_total for the corresponding bioactivity similarity context. For all three bioactivity similarity contexts, there are more 3-D clusters than 2-D clusters.
Figure 2
Distribution of 2-D and 3-D cluster sizes in terms of the number of “compounds” per cluster. Panels a, b and c are for assay-, protein-, and pathway-centric clusters, respectively. The proportion of small clusters (e.g., with two or three compounds) is much greater for 3-D clusters than for 2-D clusters. This may be related to the use of multiple conformers per compound for 3-D clustering.
Figure 3
Distribution of 3-D cluster sizes in terms of the number of “conformers” per cluster. Panels a, b and c are for assay-, protein-, and pathway-centric clusters, respectively. Data for 2-D clusters are not shown because 2-D clustering does not use conformers.
Figure 4
Distribution of the number of clusters across all UIDs per compound. The UID indicates AID, GI, and BSID for assay-centric (panel a), protein-centric (panel b), and pathway-centric clusters (panel c), respectively.
Figure 5
Cluster overlap between similarity measures. The overlap between clusters from five different similarity measures is quantified with the average O(i,j) values, where i and j are indices for rows and columns, respectively (see text for the definition).
Figure 6
Number of PubChem SAR clusters with high-value compounds (HVCs). The HVCs have high potencies (blue), MeSH annotations (red), or “Pharmacological Action” annotations (green). Panels a, b, and c are for assay-, protein-, and pathway-centric clusters, respectively. Numbers in parentheses indicate the percentages relative to the respective total cluster counts.
Figure 7
Distribution of the number of high-value compounds (HVCs) per cluster. Panels a, b, and c are for the assay-, target-, and pathway-centric clusters.
Figure 8
ComboT CT-opt and 2-D clusters for AID 47904. Each node represents a non-inactive compound, and an edge between two nodes within a cluster indicates that the distance between the two compounds is below the d_thresh value used for clustering. The node color represents the value of the inhibition constant (K_i) for the compound against human carbonic anhydrase (CA) isozyme II. All singletons are removed.
Figure 9
Collapse of conformer clusters into compound clusters. A compound is represented with a square node and its conformer is represented with a round node of the same color. An edge between two conformer nodes indicates that the distance between them is below the d_thresh value used for clustering, and an edge between two compound nodes indicates that the distance for at least one conformer pair arising from the two compounds is below the d_thresh value. The PubChem 3-D SAR clustering algorithm is initially applied to conformers of non-inactive compounds, resulting in conformer clusters (left panel). Compound clusters are constructed by replacing the conformers with the respective compounds (right panel). As a result, a compound can occur in multiple compound clusters (via its different conformers).
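The collapse step described in this caption can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the PubChem implementation: the function name `collapse_to_compound_clusters`, the conformer labels, and the CIDs are hypothetical, and the optional removal of clusters that contain conformers of only one compound follows the singleton convention stated in the Figure 1 caption.

```python
def collapse_to_compound_clusters(conformer_clusters, conformer_to_cid,
                                  drop_singletons=True):
    """Replace each conformer by its parent compound, cluster by cluster."""
    compound_clusters = []
    for cluster in conformer_clusters:
        # Several conformers of one compound collapse to a single CID.
        cids = {conformer_to_cid[conf] for conf in cluster}
        # A cluster holding conformers of only one compound is a singleton.
        if drop_singletons and len(cids) < 2:
            continue
        compound_clusters.append(cids)
    return compound_clusters

# Hypothetical conformers: the label prefix is the parent CID.
conformer_to_cid = {"1a": 1, "1b": 1, "2a": 2, "2b": 2, "2c": 2, "3a": 3}
conformer_clusters = [{"1a", "2a"}, {"1b", "3a"}, {"2b", "2c"}]

compound_clusters = collapse_to_compound_clusters(conformer_clusters,
                                                  conformer_to_cid)
# Count how many compound clusters contain CID 1.
memberships_of_cid_1 = sum(1 for cluster in compound_clusters if 1 in cluster)
```

In this toy input, CID 1 ends up in two compound clusters because its two conformers fall into different conformer clusters, which mirrors the last sentence of the caption.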
Figure 10
ComboT CT-opt and 2-D clusters for aryl hydrocarbon receptor (AhR; GI 29337198). CID 15625 (2,3,7,8-tetrachlorodibenzo-p-dioxin, also known as TCDD) was tested in two different publications. The numbers in the squares correspond to the CIDs. The colors of the squares indicate the publications from which the data were obtained.
Figure 11
CT CT-opt, ComboT CT-opt, and 2-D clusters for BSID 545294. The nodes are non-inactive compounds in assays involved in BSID 545294. The node colors represent the original literature from which the biological activities of the compounds were extracted (green for PMID 17346963, cyan for PMID 18707087, purple for PMID 21309593, and red for PMID 21591606). The node labels are omitted for brevity, but information on cluster members can be found in Additional file 3.
