J Cheminform. 2016 Mar 30;8:16.
doi: 10.1186/s13321-016-0127-5. eCollection 2016.

Impact of similarity threshold on the topology of molecular similarity networks and clustering outcomes

Gergely Zahoránszky-Kőhalmi et al. J Cheminform.

Erratum in

Abstract

Background: Methods based on complex network theory, together with the emergence of "Big Data", have reshaped how structure-activity relationships of molecules are investigated. This change gave rise to new methods that face an important challenge: how to restructure a large molecular dataset into a network that best serves the subsequent analyses. Focusing on network clustering, our study addresses this open question by proposing a data transformation method and a clustering framework.

Results: Using the WOMBAT and PubChem MLSMR datasets, we investigated how varying the similarity threshold applied to the similarity matrix affects the average clustering coefficient of the resulting similarity networks. These similarity networks were then clustered with the InfoMap algorithm. We devised a systematic method to generate so-called "pseudo-reference" clustering datasets, which compensate for the lack of large-scale reference datasets. With the clustering framework, we were able to observe how varying the similarity threshold affects the average clustering coefficient and the clustering performance.

Conclusions: We observed that the average clustering coefficient, viewed as a function of the similarity threshold, is characterized by a peak that covers a range of similarity threshold values. This peak is preceded by a steep decline in the number of edges of the similarity network. The maximum of this peak is well aligned with the best clustering outcome. Thus, if no reference set is available, choosing the similarity threshold associated with this peak would be a near-ideal setting for the subsequent network cluster analysis. The proposed method can be used as a general approach to determine an appropriate similarity threshold for generating the similarity network of large-scale molecular datasets.

Figures

Fig. 1
Transforming a similarity matrix to a similarity network. The upper part of the figure shows the original similarity matrix and a network representing it. The lower part of the figure shows a threshold matrix and the corresponding similarity network, derived by applying a similarity threshold of t = 0.7 to the original similarity matrix. Elements of the similarity matrix containing similarity coefficients greater than or equal to t = 0.7 are transformed to 1. The remaining elements are colored light gray in the threshold matrix and their values are transformed to 0. In the resulting similarity network, molecule D is a singleton because its similarity to every other molecule is below the chosen threshold.
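The transformation described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `threshold_network` and the 4 x 4 matrix are hypothetical, and the values were chosen only so that molecule D ends up as a singleton at t = 0.7, as in the figure.

```python
def threshold_network(sim, labels, t):
    """Build an undirected network: i and j are linked iff sim[i][j] >= t."""
    adj = {name: set() for name in labels}
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):  # off-diagonal pairs only, so no self-loops
            if sim[i][j] >= t:
                adj[labels[i]].add(labels[j])
                adj[labels[j]].add(labels[i])
    return adj


# Hypothetical symmetric similarity matrix for four molecules.
labels = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.9, 0.8, 0.3],
    [0.9, 1.0, 0.7, 0.5],
    [0.8, 0.7, 1.0, 0.6],
    [0.3, 0.5, 0.6, 1.0],
]

network = threshold_network(sim, labels, 0.7)
singletons = [m for m in labels if not network[m]]
print(network)     # A, B and C form a triangle
print(singletons)  # D has no neighbor at this threshold
```

Note that the comparison is inclusive (`>=`), matching the "greater than or equal to" wording of the caption.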
Fig. 2
The influence of edge addition/removal on the average clustering coefficient. Adding or removing edges changes a network's average clustering coefficient in ways that are not always intuitive. a An example in which the average clustering coefficient increases following the addition of a new edge, shown as a red dashed line in the lower network. b A somewhat counterintuitive scenario in which the average clustering coefficient of a network decreases upon the addition of one edge. The added edge is shown as a red dashed line in the lower network.
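Both behaviors can be reproduced on toy graphs. The sketch below uses the standard local clustering coefficient (the fraction of a node's neighbor pairs that are themselves linked) and assumes that nodes with fewer than two neighbors contribute 0 to the average; the record does not state which convention the authors used. The graphs are hypothetical examples, not the networks drawn in the figure.

```python
from itertools import combinations


def average_clustering(nodes, edges):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0.0
    for v in nodes:
        k = len(adj[v])
        if k < 2:
            continue  # assumed convention: contributes 0
        # Count neighbor pairs of v that are connected to each other.
        links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nodes)


# Case (a): closing a path A-B-C into a triangle raises the average.
path = [("A", "B"), ("B", "C")]
acc_a_before = average_clustering("ABC", path)
acc_a_after = average_clustering("ABC", path + [("A", "C")])

# Case (b): attaching an isolated node D to a triangle lowers the average,
# because C gains a neighbor that is not linked to C's other neighbors.
triangle = [("A", "B"), ("B", "C"), ("A", "C")]
acc_b_before = average_clustering("ABCD", triangle)
acc_b_after = average_clustering("ABCD", triangle + [("C", "D")])

print(acc_a_before, acc_a_after)
print(acc_b_before, acc_b_after)
```

In case (b) the coefficient of C drops from 1 to 1/3 while D still contributes 0, so the average falls from 3/4 to 7/12 even though an edge was added.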
Fig. 3
Cluster size distribution of pseudo-reference clustering datasets. The x-axis is shown on a log scale and represents the size of clusters in the pseudo-reference clustering datasets generated from the WOMBAT and PubChem MLSMR datasets. The y-axis represents the relative frequency of each cluster size. A given dataset is characterized by the cluster sizes that have a higher frequency. The overall frequency of cluster sizes provides the cluster size profile of a dataset. The cluster size profiles of the two datasets are nearly identical, with small differences in the low and large cluster size regions.
Fig. 4
Average clustering coefficient (ACC) of similarity networks as a function of the similarity threshold. For all datasets it is possible to identify a peak that stands out from the others by spanning the largest range of similarity threshold t. The threshold associated with the highest ACC value in the peak is denoted as t_α, i.e. the so-called obvious local maximum of the ACC(t) function. Fingerprint: ECFP_4, similarity measure: Tanimoto similarity coefficient. a SCL dataset. b WOMBAT dataset. c PubChem MLSMR dataset.
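A threshold sweep of this kind can be sketched as follows. The similarity matrix is a hypothetical four-molecule example, the helper names are illustrative, and nodes with fewer than two neighbors are assumed to contribute 0 to the average. The sketch only lists interior local maxima of the sampled curve; it does not implement the caption's additional criterion of choosing the peak that spans the largest threshold range.

```python
from itertools import combinations


def acc_at_threshold(sim, t):
    """Average clustering coefficient of the network with edges sim[i][j] >= t."""
    n = len(sim)
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if sim[i][j] >= t:
            adj[i].add(j)
            adj[j].add(i)
    total = 0.0
    for v in range(n):
        k = len(adj[v])
        if k >= 2:  # assumed convention: degree < 2 contributes 0
            links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
            total += 2.0 * links / (k * (k - 1))
    return total / n


def interior_local_maxima(thresholds, accs):
    """Thresholds whose ACC is strictly higher than both sampled neighbors."""
    return [
        thresholds[i]
        for i in range(1, len(accs) - 1)
        if accs[i] > accs[i - 1] and accs[i] > accs[i + 1]
    ]


# Hypothetical symmetric similarity matrix (molecules A, B, C, D).
sim = [
    [1.0, 0.9, 0.8, 0.3],
    [0.9, 1.0, 0.7, 0.5],
    [0.8, 0.7, 1.0, 0.6],
    [0.3, 0.5, 0.6, 1.0],
]
thresholds = [0.2, 0.4, 0.55, 0.65, 0.75, 0.85, 0.95]
accs = [acc_at_threshold(sim, t) for t in thresholds]
peaks = interior_local_maxima(thresholds, accs)

for t, acc in zip(thresholds, accs):
    print(f"t = {t:.2f}  ACC = {acc:.4f}")
print("interior local maxima:", peaks)
```

The lowest threshold yields a complete graph with ACC = 1, which is why a global maximum is uninformative here and the sketch looks only at interior local maxima.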
Fig. 5
Number of edges as a function of the similarity threshold. Fingerprint: ECFP_4, similarity measure: Tanimoto similarity coefficient. For each dataset, the number of edges declines steeply at low values of the applied similarity threshold. This steep decline is followed by a drastic change in the slope over a short range of the similarity threshold. a SCL dataset. b WOMBAT dataset. c PubChem MLSMR dataset.
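The Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below applies it to hypothetical fingerprints, represented as sets of "on" bit indices rather than real ECFP_4 output, and counts the edges that survive each threshold. Returning 0.0 for two empty fingerprints is an assumption made here for the example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


def edge_count(fingerprints, t):
    """Number of molecule pairs whose Tanimoto similarity is >= t."""
    return sum(
        1
        for a, b in combinations(fingerprints, 2)
        if tanimoto(a, b) >= t
    )


# Hypothetical fingerprints (bit indices are arbitrary).
fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 6, 7},
    {8, 9},
]
thresholds = [0.0, 0.25, 0.5, 0.75]
counts = [edge_count(fingerprints, t) for t in thresholds]

for t, c in zip(thresholds, counts):
    print(f"t = {t:.2f}  edges = {c}")
```

Because raising the threshold can only remove edges, the edge count is non-increasing in t; the shape of that decline is what the figure reports for the three datasets.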
Fig. 6
Clustering performance as a function of the similarity threshold. Each panel shows the sensitivity and specificity values associated with the determined t_α, i.e. the 'obvious' local maximum to choose. The dashed vertical line indicates the location of t_α on the x-axis. a In the case of the SCL dataset, both sensitivity and specificity reach the ideal value of 1 over a range of similarity thresholds (0.19 ≤ t ≤ 0.27 and at t = 0.23). Note that above t = 0.91 the similarity network consists only of singletons, so the respective experimental points are not displayed on the graph. b In the case of the WOMBAT dataset, the sensitivity and specificity associated with t_α = 0.40 are 0.8689 and 0.9994, respectively. The deviation between these values and their observed maximum is acceptable. c In the case of the PubChem MLSMR dataset, the sensitivity and specificity associated with t_α = 0.50 are 0.4905 and 0.9997, respectively. The deviation between these values and their observed maximum is acceptable.
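One common way to score a clustering against a reference is pair counting: every pair of molecules is a positive if the reference puts both in the same cluster and a negative otherwise. The sketch below computes sensitivity and specificity on that basis. This record does not define how the authors computed these values, so the pair-counting definition, the function names, and the toy cluster assignments are all assumptions made for illustration.

```python
from itertools import combinations


def pair_confusion(reference, predicted):
    """Count pairs by co-membership; both inputs map item -> cluster label."""
    tp = fp = fn = tn = 0
    for a, b in combinations(sorted(reference), 2):
        same_ref = reference[a] == reference[b]
        same_pred = predicted[a] == predicted[b]
        if same_ref and same_pred:
            tp += 1
        elif same_ref:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity_specificity(reference, predicted):
    tp, fp, fn, tn = pair_confusion(reference, predicted)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical reference clustering: {A, B, C} and {D, E}.
reference = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
# Hypothetical predicted clustering: {A, B}, {C}, {D, E}.
predicted = {"A": 0, "B": 0, "C": 2, "D": 1, "E": 1}

confusion = pair_confusion(reference, predicted)
sens, spec = sensitivity_specificity(reference, predicted)
print(confusion)
print(sens, spec)
```

Splitting a reference cluster lowers sensitivity but leaves specificity untouched, which is consistent with the pattern in the caption where specificity stays close to 1 while sensitivity varies by dataset.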
