J Cheminform. 2012 Nov 7;4(1):28.

doi: 10.1186/1758-2946-4-28.

Effects of multiple conformers per compound upon 3-D similarity search and bioassay data analysis

Sunghwan Kim¹, Evan E Bolton, Stephen H Bryant

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda, MD 20894, USA. bolton@ncbi.nlm.nih.gov.

PMID: 23134593
PMCID: PMC3537644
DOI: 10.1186/1758-2946-4-28

Sunghwan Kim et al. J Cheminform. 2012.


Abstract

Background: To improve the utility of PubChem, a public repository containing biological activities of small molecules, the PubChem3D project adds computationally-derived three-dimensional (3-D) descriptions to the small-molecule records contained in the PubChem Compound database and provides various search and analysis tools that exploit 3-D molecular similarity. Therefore, the efficient use of PubChem3D resources requires an understanding of the statistical and biological meaning of computed 3-D molecular similarity scores between molecules.

Results: The present study investigated effects of employing multiple conformers per compound upon the 3-D similarity scores between ten thousand randomly selected biologically-tested compounds (10-K set) and between non-inactive compounds in a given biological assay (156-K set). When the "best-conformer-pair" approach, in which a 3-D similarity score between two compounds is represented by the greatest similarity score among all possible conformer pairs arising from a compound pair, was employed with ten diverse conformers per compound, the average 3-D similarity scores for the 10-K set increased by 0.11, 0.09, 0.15, 0.16, 0.07, and 0.18 for STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt, respectively, relative to the corresponding averages computed using a single conformer per compound. Interestingly, the best-conformer-pair approach also increased the average 3-D similarity scores for the non-inactive-non-inactive (NN) pairs for a given assay, by comparable amounts to those for the random compound pairs, although some assays showed a pronounced increase in the per-assay NN-pair 3-D similarity scores, compared to the average increase for the random compound pairs.
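The two multi-conformer schemes named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PubChem3D implementation: the function names, the conformer labels, and the toy score table are all hypothetical, and the per-conformer-pair score (ST, CT, or ComboT) is assumed to be supplied by an external 3-D superposition tool.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: any callable that returns a 3-D similarity score
# for one conformer pair. Conformers are represented by string labels here.
PairScore = Callable[[str, str], float]


def all_conformer_pair_scores(
    confs_a: Sequence[str], confs_b: Sequence[str], score: PairScore
) -> List[float]:
    """All-conformer-pair view: one score per conformer pair of a compound pair."""
    return [score(a, b) for a, b in product(confs_a, confs_b)]


def best_conformer_pair_score(
    confs_a: Sequence[str], confs_b: Sequence[str], score: PairScore
) -> float:
    """Best-conformer-pair view: the greatest score among all conformer pairs."""
    return max(all_conformer_pair_scores(confs_a, confs_b, score))


# Toy scores invented for illustration only (not data from the study).
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("a1", "b1"): 0.52,
    ("a1", "b2"): 0.61,
    ("a2", "b1"): 0.78,
    ("a2", "b2"): 0.49,
}


def toy_score(a: str, b: str) -> float:
    return TOY_SCORES[(a, b)]


# Single-conformer score: only the first conformer of each compound is used.
single = toy_score("a1", "b1")
all_scores = all_conformer_pair_scores(["a1", "a2"], ["b1", "b2"], toy_score)
best = best_conformer_pair_score(["a1", "a2"], ["b1", "b2"], toy_score)
print(single, best)
```

If the single conformer is one of the conformers in the multi-conformer set (an assumption of this sketch), the maximum can never fall below the single-conformer score. The best-conformer-pair score therefore either stays the same or rises for every compound pair, which is consistent with the increase in averages reported above.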

Conclusion: These results suggest that the use of ten diverse conformers per compound in PubChem bioassay data analysis using 3-D molecular similarity is not expected to increase the separation of non-inactive from random and inactive spaces "on average", although some assays show a noticeable separation between the non-inactive and random spaces when multiple conformers are used for each compound. The present study is a critical next step to understand effects of conformational diversity of the molecules upon the 3-D molecular similarity and its application to biological activity data analysis in PubChem. The results of this study may be helpful to build search and analysis tools that exploit 3-D molecular similarity between compounds archived in PubChem and other molecular libraries in a more efficient way.

Figures

**Figure 1**
**Similarity distributions for “single-conformer” (*Scenario A*) approach.** Binned distributions in 0.01 increments of the 3-D similarity scores for the unique “conformer-conformer” pairs arising from 10,000 randomly selected biologically tested compounds (10-K set), computed using a single conformer per compound for **(a)** ST-optimized and **(b)** CT-optimized superpositions.

**Figure 2**
**Similarity distributions for multi-conformer “all-conformer-pair” (*Scenario B*) approach.** Binned distributions in 0.01 increments of the 3-D similarity scores for the unique “conformer-conformer” pairs arising from 10,000 randomly selected biologically tested compounds (10-K set), computed using ten diverse conformers per compound and the “all-conformer-pair” approach for **(a)** ST-optimized and **(b)** CT-optimized superpositions.

**Figure 3**
**Similarity distributions for multi-conformer “best-conformer-pair” (*Scenario D*) approach.** Binned distributions in 0.01 increments of the 3-D similarity scores for the unique “conformer-compound” pairs arising from 10,000 randomly selected biologically tested compounds (10-K set), computed using ten diverse conformers per compound and the “best-conformer-pair” approach for **(a)** ST-optimized and **(b)** CT-optimized superpositions.

**Figure 4**
**Similarity distributions for multi-conformer “best-conformer-pair” (*Scenario E*) approach.** Binned distributions in 0.01 increments of the 3-D similarity scores for the unique “compound-compound” pairs arising from 10,000 randomly selected biologically tested compounds (10-K set), computed using ten diverse conformers per compound and the “best-conformer-pair” approach for **(a)** ST-optimized and **(b)** CT-optimized superpositions.

**Figure 5**
**Average and standard deviation distributions for shape-Tanimoto (ST), per “query”.** Binned distributions in 0.01 increments of the average and standard deviation of the shape-Tanimoto (ST) scores per query-type for the five similarity search scenarios tested (see Table 1): *Scenario A* [**(a)** and **(b)**], *Scenario B* [**(c)** and **(d)**], *Scenario C* [**(e)** and **(f)**], *Scenario D* [**(g)** and **(h)**], and *Scenario E* [**(i)** and **(j)**]. The left panels [**(a)**, **(c)**, **(e)**, **(g)**, and **(i)**] are for the ST-optimized ST scores, and the right panels [**(b)**, **(d)**, **(f)**, **(h)**, and **(j)**] are for the CT-optimized ST scores.

**Figure 6**
**Average and standard deviation distributions for color-Tanimoto (CT), per “query”.** Binned distributions in 0.01 increments of the average and standard deviation of the color-Tanimoto (CT) scores per query-type for the five similarity search scenarios tested (see Table 1): *Scenario A* [**(a)** and **(b)**], *Scenario B* [**(c)** and **(d)**], *Scenario C* [**(e)** and **(f)**], *Scenario D* [**(g)** and **(h)**], and *Scenario E* [**(i)** and **(j)**]. The left panels [**(a)**, **(c)**, **(e)**, **(g)**, and **(i)**] are for the ST-optimized CT scores, and the right panels [**(b)**, **(d)**, **(f)**, **(h)**, and **(j)**] are for the CT-optimized CT scores.

**Figure 7**
**Average and standard deviation distributions for combo-Tanimoto (ComboT), per “query”.** Binned distributions in 0.01 increments of the average and standard deviation of the combo-Tanimoto (ComboT) scores per query-type for the five similarity search scenarios tested (see Table 1): *Scenario A* [**(a)** and **(b)**], *Scenario B* [**(c)** and **(d)**], *Scenario C* [**(e)** and **(f)**], *Scenario D* [**(g)** and **(h)**], and *Scenario E* [**(i)** and **(j)**]. The left panels [**(a)**, **(c)**, **(e)**, **(g)**, and **(i)**] are for the ST-optimized ComboT scores, and the right panels [**(b)**, **(d)**, **(f)**, **(h)**, and **(j)**] are for the CT-optimized ComboT scores.

**Figure 8**
**Breakdown of assays by type.** Assay-type counts for the 1,528 bioassays considered in the present study.

**Figure 9**
**Per-AID shape-Tanimoto (ST)-optimized 3-D similarity average values.** Binned distributions in 0.01 increments of the average 3-D similarity scores for non-inactive–non-inactive (NN) pairs of 1,528 AIDs in the PubChem BioAssay database, computed at the shape-Tanimoto-optimized superposition: **(a)** shape-Tanimoto (ST), **(b)** color-Tanimoto (CT), and **(c)** combo-Tanimoto (ComboT). “Single conformer”, “Multiple conformers (all)”, and “Multiple conformers (best)” correspond to search scenarios A, B, and E, respectively (See Table 1).

**Figure 10**
**Per-AID color-Tanimoto (CT)-optimized 3-D similarity average values.** Binned distributions in 0.01 increments of the average 3-D similarity scores for non-inactive–non-inactive (NN) pairs of 1,528 AIDs in the PubChem BioAssay database, computed at color-Tanimoto-optimized superposition: **(a)** shape-Tanimoto (ST), **(b)** color-Tanimoto (CT), and **(c)** combo-Tanimoto (ComboT). “Single conformer”, “Multiple conformers (all)”, and “Multiple conformers (best)” correspond to search scenarios A, B, and E, respectively (See Table 1).

**Figure 11**
**Deviation from random of per-AID shape-Tanimoto (ST)-optimized 3-D similarity average values.** Deviation of the ST-optimized 3-D similarity scores for non-inactive–non-inactive (NN) pairs of 1,528 AIDs from the corresponding average for the random compound pairs, computed using both a single conformer and best multiple (ten) diverse conformers per compound: **(a)** ST-optimized ST, **(b)** ST-optimized CT, and **(c)** ST-optimized ComboT. The deviations are binned with increment of 0.1 standard deviation (σ) unit. “Single” and “Multiple” refer to search scenarios A and E, respectively (See Table 1).

**Figure 12**
**Deviation from random of per-AID color-Tanimoto (CT)-optimized 3-D similarity average values.** Deviation of the CT-optimized 3-D similarity scores for non-inactive–non-inactive (NN) pairs of 1,528 AIDs from the corresponding average for the random compound pairs, computed using both a single conformer and best multiple (ten) diverse conformers per compound: **(a)** CT-optimized ST, **(b)** CT-optimized CT, and **(c)** CT-optimized ComboT. The deviations are binned with increment of 0.1 standard deviation (σ) unit. “Single” and “Multiple” refer to search scenarios A and E, respectively (See Table 1).

**Figure 13**
**Demonstrated multi-conformer effects using AID 1033.** Effects of employing multiple conformers per compound upon 3-D similarity of the non-inactive compounds tested in AID 1033. Eight compounds in panel **(a)** are non-inactive in AID 1033. Panel **(b)** depicts the dendrogram that shows the 2-D similarity among the eight structures, computed using the PubChem subgraph fingerprints. The dendrograms for the 3-D shape-optimized combo-Tanimoto (*ComboT*^ST-opt) similarity are shown in panels **(c)** and **(d)** for a single conformer per compound and ten diverse conformers per compound, respectively. Panel **(e)** compares conformer superpositions between two of the non-inactive compounds (CIDs 668798 and 1246750). LID stands for the local identifier, which represents different conformers of a compound.

**Figure 14**
**Demonstrated multi-conformer effects using AID 491.** Effects of employing multiple conformers per compound upon 3-D similarity of non-inactive compounds tested in AID 491. Panel **(a)** shows the dendrogram based on 2-D similarity among eight compounds selected from 60 non-inactive compounds in AID 491. The dendrograms for the 3-D shape-optimized combo-Tanimoto (*ComboT*^ST-opt) similarity are shown in panels **(b)** and **(c)** for a single conformer per compound and ten diverse conformers per compound, respectively. Panel **(d)** compares conformer superpositions between two of the non-inactive compounds (CIDs 490518 and 505938). LID stands for the local conformer identifier, which represents different conformers of a compound.

**Figure 15**
**Summary comparison of overall average similarity.** Comparison of the overall average 3-D similarity scores, μ[μ(XT)], for the non-inactive–non-inactive (NN) pairs with those for the non-inactive–inactive (NI) pairs and random compound pairs. The words, “Single”, “Best”, and “All”, in the legend box indicate the single-conformer approach (*Scenario A*), “best-conformer-pair” approach (*Scenario E*), and “all-conformer-pair” approach (*Scenario B*), respectively. Study A is the present study, and Study B is a previous study by Kim et al. (Ref. [10]).
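The binning conventions mentioned in the captions (score bins of width 0.01, and deviation bins of width 0.1σ) can be illustrated with the short sketch below. It rests on assumptions: the helper names are hypothetical, all numbers are invented toy values rather than results from the study, and the deviation is assumed to be the per-assay NN average minus the random-pair average, divided by the standard deviation of the random-pair scores.

```python
import math
from collections import Counter
from typing import Iterable


def bin_index(value: float, width: float) -> int:
    """Index k of the bin [k*width, (k+1)*width) that contains value.

    The quotient is rounded to 9 decimals first so that floating-point
    noise (e.g. 0.29 / 0.01 == 28.999999999999996) does not shift a bin.
    """
    return math.floor(round(value / width, 9))


def histogram(values: Iterable[float], width: float) -> Counter:
    """Count how many values fall into each bin of the given width."""
    return Counter(bin_index(v, width) for v in values)


def sigma_deviation(assay_mean: float, random_mean: float, random_std: float) -> float:
    """Deviation of a per-assay average from the random average, in sigma units."""
    return (assay_mean - random_mean) / random_std


# Toy similarity scores (invented), binned in 0.01 increments.
score_bins = histogram([0.503, 0.507, 0.512, 0.298], width=0.01)

# Toy per-assay average compared with a toy random-pair average and spread.
deviation = sigma_deviation(assay_mean=0.62, random_mean=0.54, random_std=0.10)
deviation_bin = bin_index(deviation, width=0.1)
print(score_bins, deviation, deviation_bin)
```

In this toy case the deviation is about 0.8σ, so it falls into the bin covering 0.8σ to 0.9σ.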

Cited by

An update on PUG-REST: RESTful interface for programmatic access to PubChem.
Kim S, Thiessen PA, Cheng T, Yu B, Bolton EE. Nucleic Acids Res. 2018 Jul 2;46(W1):W563-W570. doi: 10.1093/nar/gky294. PMID: 29718389.
Similar compounds versus similar conformers: complementarity between PubChem 2-D and 3-D neighboring sets.
Kim S, Bolton EE, Bryant SH. J Cheminform. 2016 Nov 4;8:62. doi: 10.1186/s13321-016-0163-1. eCollection 2016. PMID: 27872662.
Integration of mass spectral fingerprinting analysis with precursor ion (MS1) quantification for the characterisation of botanical extracts: application to extracts of Centella asiatica (L.) Urban.
Alcazar Magana A, Wright K, Vaswani A, Caruso M, Reed RL, Bailey CF, Nguyen T, Gray NE, Soumyanath A, Quinn J, Stevens JF, Maier CS. Phytochem Anal. 2020 Nov;31(6):722-738. doi: 10.1002/pca.2936. Epub 2020 Apr 12. PMID: 32281154.
Target enhanced 2D similarity search by using explicit biological activity annotations and profiles.
Yu X, Geer LY, Han L, Bryant SH. J Cheminform. 2015 Nov 17;7:55. doi: 10.1186/s13321-015-0103-5. eCollection 2015. PMID: 26583046.
Finding Potential Multitarget Ligands Using PubChem.
Kim S, Shoemaker BA, Bolton EE, Bryant SH. Methods Mol Biol. 2018;1825:63-91. doi: 10.1007/978-1-4939-8639-2_2. PMID: 30334203.

References

1. Bolton EE, Wang Y, Thiessen PA, Bryant SH. PubChem: integrated platform of small molecules and biological activities. In: Ralph AW, David CS, editors. Annual Reports in Computational Chemistry. Volume 4. Amsterdam, the Netherlands: Elsevier; 2008. pp. 217–241.
2. Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH. An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010;38:D255–D266. doi: 10.1093/nar/gkp965.
3. Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Zhou ZG, Han LY, Karapetyan K, Dracheva S, Shoemaker BA, et al. PubChem's BioAssay Database. Nucleic Acids Res. 2012;40:D400–D412. doi: 10.1093/nar/gkr1132.
4. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012;40:D13–D25. doi: 10.1093/nar/gkr1184.
5. PubChem3D Thematic Series. http://www.jcheminf.com/series/pubchem3d.
