Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes

Gergely Zahoránszky-Kőhalmi¹, Cristian G Bologa¹, Tudor I Oprea¹

PMID: 27030802
PMCID: PMC4812625
DOI: 10.1186/s13321-016-0127-5

Gergely Zahoránszky-Kőhalmi et al. J Cheminform. 2016 Mar 30;8:16. doi: 10.1186/s13321-016-0127-5. eCollection 2016.

Affiliation

¹ Translational Informatics Division, University of New Mexico School of Medicine, MSC09 5025, Albuquerque, NM 87131 USA.

Erratum in

Erratum to: Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes.
Zahoránszky-Kőhalmi G, Bologa CG, Ursu O, Oprea TI. J Cheminform. 2016 May 20;8:28. doi: 10.1186/s13321-016-0140-8. eCollection 2016. PMID: 27213018. Free PMC article.

Abstract

Background: Methods based on complex network theory and the emergence of "Big Data" have reshaped the investigation of structure-activity relationships of molecules. This change gave rise to new methods that face an important challenge: how to restructure a large molecular dataset into a network that best serves the subsequent analyses. With a special focus on network clustering, our study addresses this open question by proposing a data transformation method and a clustering framework.

Results: Using the WOMBAT and PubChem MLSMR datasets, we investigated the relationship between the similarity threshold applied to the similarity matrix and the average clustering coefficient of the resulting similarity networks. These similarity networks were then clustered with the InfoMap algorithm. We devised a systematic method to generate so-called "pseudo-reference" clustering datasets, which compensate for the lack of large-scale reference datasets. Using the clustering framework, we observed how varying the similarity threshold affects the average clustering coefficient and the clustering performance.

Conclusions: We observed that the average clustering coefficient versus similarity threshold function can be characterized by the presence of a peak that covers a range of similarity threshold values. This peak is preceded by a steep decline in the number of edges of the similarity network. The maximum of this peak is well aligned with the best clustering outcome. Thus, if no reference set is available, choosing the similarity threshold associated with this peak would be a near-ideal setting for the subsequent network cluster analysis. The proposed method can be used as a general approach to determine the appropriate similarity threshold to generate the similarity network of large-scale molecular datasets.
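The threshold scan described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 6 x 6 similarity matrix is hypothetical, the function names `average_clustering` and `interior_peaks` are invented for this example, nodes with fewer than two neighbors are assumed to contribute a coefficient of 0, and a "peak" is simplified to a plateau of ACC values that is higher than the values on both sides. The abstract does not specify how the peak is delimited, so that rule is an assumption.

```python
from itertools import combinations

def average_clustering(sim, t):
    """ACC of the network that has an edge wherever similarity >= t (no self-loops)."""
    n = len(sim)
    nbrs = [{j for j in range(n) if j != i and sim[i][j] >= t} for i in range(n)]
    total = 0.0
    for i in range(n):
        k = len(nbrs[i])
        if k < 2:
            continue  # assumption: such nodes contribute a coefficient of 0
        links = sum(1 for a, b in combinations(nbrs[i], 2) if b in nbrs[a])
        total += 2.0 * links / (k * (k - 1))
    return total / n

def interior_peaks(thresholds, acc, tol=1e-9):
    """Return (t_start, t_end, acc_value) for every plateau higher than both its neighbors."""
    peaks = []
    i, n = 1, len(acc)
    while i < n - 1:
        j = i
        while j + 1 < n and abs(acc[j + 1] - acc[i]) <= tol:
            j += 1  # extend the plateau of (nearly) equal ACC values
        rises = acc[i] > acc[i - 1] + tol
        falls = j < n - 1 and acc[i] > acc[j + 1] + tol
        if rises and falls:
            peaks.append((thresholds[i], thresholds[j], acc[i]))
        i = j + 1
    return peaks

# Hypothetical data: two tight groups of three molecules and one moderate cross-link.
groups = [0, 0, 0, 1, 1, 1]
sim = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.2)
        for j in range(6)] for i in range(6)]
sim[0][3] = sim[3][0] = 0.5

thresholds = [k / 20 for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95
acc = [average_clustering(sim, t) for t in thresholds]
peaks = interior_peaks(thresholds, acc)
widest = max(peaks, key=lambda p: p[1] - p[0])  # peak covering the largest threshold range
```

At very low thresholds the network is complete and the ACC is trivially 1, which is why the sketch looks for an interior peak rather than the global maximum of the curve.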

Figures

**Fig. 1**
Transforming a similarity matrix to a similarity network. The *upper part* of the figure shows the original similarity matrix and a network representing it. The *lower part* shows the threshold matrix and the corresponding similarity network, derived by applying a similarity threshold of t = 0.7 to the original similarity matrix. Elements of the similarity matrix with similarity coefficients greater than or equal to t = 0.7 are transformed to 1. The remaining elements are transformed to 0 and colored light gray in the threshold matrix. In the resulting similarity network, molecule D is a singleton because its similarity to every other molecule is below the chosen threshold.
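The transformation in this caption can be sketched in a few lines. The similarity values below are hypothetical (the figure's actual matrix is not reproduced here), and zeroing the diagonal to avoid self-loops is an assumption of this sketch.

```python
def threshold_matrix(sim, t):
    """Map similarities >= t to 1 and all others to 0; the diagonal is set to 0."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= t else 0 for j in range(n)]
            for i in range(n)]

labels = ["A", "B", "C", "D"]
# Hypothetical pairwise similarity coefficients, symmetric with 1.0 on the diagonal.
sim = [
    [1.00, 0.80, 0.75, 0.30],
    [0.80, 1.00, 0.70, 0.20],
    [0.75, 0.70, 1.00, 0.65],
    [0.30, 0.20, 0.65, 1.00],
]

adj = threshold_matrix(sim, 0.7)

# Edges of the similarity network, and molecules left without any edge.
edges = [(labels[i], labels[j])
         for i in range(len(labels)) for j in range(i + 1, len(labels))
         if adj[i][j]]
singletons = [labels[i] for i in range(len(labels)) if sum(adj[i]) == 0]
```

With these values, D has no similarity of at least 0.7 to any other molecule, so it ends up as a singleton, as in the figure.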

**Fig. 2**
The influence of edge addition/removal on the average clustering coefficient. Adding edges to or removing edges from a network can change its average clustering coefficient in either direction. (a) An example in which the average clustering coefficient increases following the addition of a new edge, shown as a *red dashed line* in the lower network. (b) A somewhat counterintuitive scenario in which the average clustering coefficient decreases upon the addition of one edge. The added edge is shown as a *red dashed line* in the lower network.
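Both scenarios can be reproduced on toy networks. The graphs below are hypothetical and are not the ones drawn in the figure. The sketch also assumes the common convention that a node with fewer than two neighbors has a local clustering coefficient of 0.

```python
from itertools import combinations

def average_clustering(nodes, edges):
    """Mean of the local clustering coefficients over all nodes."""
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    total = 0.0
    for v in nodes:
        k = len(nbrs[v])
        if k < 2:
            continue  # assumption: such nodes contribute 0
        # Count edges that connect two neighbors of v.
        links = sum(1 for a, b in combinations(nbrs[v], 2) if b in nbrs[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nodes)

# Scenario (a): closing a path into a triangle raises the ACC.
path = [("A", "B"), ("B", "C")]
acc_a_before = average_clustering("ABC", path)                # 0.0
acc_a_after = average_clustering("ABC", path + [("A", "C")])  # 1.0

# Scenario (b): attaching an isolated node to a triangle lowers the ACC.
triangle = [("A", "B"), ("B", "C"), ("A", "C")]
acc_b_before = average_clustering("ABCD", triangle)                # 0.75
acc_b_after = average_clustering("ABCD", triangle + [("C", "D")])  # 7/12, about 0.583
```

In scenario (b) the new edge gives C a third neighbor that is not connected to the other two, so C's local coefficient drops from 1 to 1/3 while no other node gains a triangle.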

**Fig. 3**
Cluster size distribution of pseudo-reference clustering datasets. The x-axis, shown on a log scale, represents the cluster sizes in the pseudo-reference clustering datasets generated from the WOMBAT and PubChem MLSMR datasets. The y-axis represents the relative frequency of each cluster size. A dataset is characterized by the cluster sizes that occur with higher frequency, and the overall frequency of cluster sizes provides its cluster size profile. The cluster size profiles of the two datasets are nearly identical, with small differences in the small-cluster and large-cluster regions.

**Fig. 4**
Average clustering coefficient of similarity networks as a function of the similarity threshold. For all datasets it is possible to identify a peak that stands out from the others by spanning the largest range of the similarity threshold t. The threshold associated with the highest *ACC* value in this peak is denoted t_α, i.e. the so-called obvious local maximum of the *ACC*(t) function. Fingerprint: ECFP_4; similarity measure: Tanimoto similarity coefficient. (a) SCL dataset. (b) WOMBAT dataset. (c) PubChem MLSMR dataset.

**Fig. 5**
Number of edges as a function of the similarity threshold. Fingerprint: ECFP_4; similarity measure: Tanimoto similarity coefficient. For each dataset, the number of edges declines steeply at low values of the similarity threshold. This steep decline is followed by a drastic change in slope over a short range of the similarity threshold. (a) SCL dataset. (b) WOMBAT dataset. (c) PubChem MLSMR dataset.

**Fig. 6**
Clustering performance as a function of the similarity threshold. Each panel shows the sensitivity and specificity values associated with the determined t_α, i.e. the ‘obvious’ local maximum to choose. The *dashed vertical line* indicates the location of t_α on the x-axis. (a) For the SCL dataset, both sensitivity and specificity reach the ideal value of 1 over a range of similarity thresholds (0.19 ≤ t ≤ 0.27 and at t = 0.23). Above t = 0.91 the similarity network consists only of singletons, so the corresponding experimental points are not displayed on the graph. (b) For the WOMBAT dataset, the sensitivity and specificity associated with t_α = 0.40 are 0.8689 and 0.9994, respectively. The deviation between these values and their observed maxima is acceptable. (c) For the PubChem MLSMR dataset, the sensitivity and specificity associated with t_α = 0.50 are 0.4905 and 0.9997, respectively. The deviation between these values and their observed maxima is acceptable.
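One common way to score a clustering against a reference is to count pairs of items. The sketch below assumes that pair-counting definition of sensitivity and specificity; the abstract and captions do not state how the authors computed these values, so this may differ from their procedure. The function name and the example cluster assignments are hypothetical.

```python
from itertools import combinations

def pair_sensitivity_specificity(predicted, reference):
    """Pair-counting scores; both arguments map each item to a cluster label."""
    tp = fp = tn = fn = 0
    for a, b in combinations(sorted(reference), 2):
        same_ref = reference[a] == reference[b]
        same_pred = predicted[a] == predicted[b]
        if same_ref and same_pred:
            tp += 1  # pair correctly placed together
        elif same_ref:
            fn += 1  # pair wrongly separated
        elif same_pred:
            fp += 1  # pair wrongly merged
        else:
            tn += 1  # pair correctly kept apart
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Hypothetical reference clustering and a clustering that splits C off from A and B.
reference = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
predicted = {"A": "x", "B": "x", "C": "y", "D": "z", "E": "z"}

sens, spec = pair_sensitivity_specificity(predicted, reference)  # (0.5, 1.0)
```

Under this definition, a clustering that breaks clusters apart keeps specificity near 1 while sensitivity falls, because pairs from different reference clusters far outnumber pairs from the same cluster.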
