2015 Oct 30;10(10):e0141756.
doi: 10.1371/journal.pone.0141756. eCollection 2015.

Clustering Acoustic Segments Using Multi-Stage Agglomerative Hierarchical Clustering

Lerato Lerato et al. PLoS One.

Abstract

Agglomerative hierarchical clustering becomes infeasible when applied to large datasets due to its O(N²) storage requirements. We present a multi-stage agglomerative hierarchical clustering (MAHC) approach aimed at large datasets of speech segments. The algorithm is based on an iterative divide-and-conquer strategy: the data is first split into independent subsets, each of which is clustered separately. This reduces the storage required by sequential implementations and allows concurrent computation on parallel computing hardware. The resultant clusters are merged and subsequently re-divided into subsets, which are passed to the following iteration. We show that MAHC can match and even surpass the performance of the exact implementation when applied to datasets of speech segments.
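The divide-and-conquer iterate described in the abstract can be sketched as follows. This is a hedged toy in Python on 1-D points with a scalar distance; the function names (`ahc`, `mahc`), the single-linkage criterion, and the use of cluster means in stage 2 are illustrative assumptions, not the authors' implementation.

```python
import random

def ahc(items, k, dist):
    """Toy single-linkage agglomerative hierarchical clustering:
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: min(dist(a, b)
                              for a in clusters[p[0]]
                              for b in clusters[p[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

def mean(cluster):
    return sum(cluster) / len(cluster)

def mahc(points, n_subsets, k, iterations=2, seed=0):
    """Hypothetical sketch of the multi-stage scheme: split the data
    into subsets and cluster each independently (stage 1), merge
    matching clusters across subsets by clustering the stage-1
    clusters themselves (stage 2), then re-divide for the next pass."""
    rng = random.Random(seed)
    data = list(points)
    for _ in range(iterations):
        rng.shuffle(data)
        # Stage 1: cluster each subset on its own. The subsets are
        # independent, so these calls could run in parallel.
        stage1 = []
        for s in range(n_subsets):
            stage1 += ahc(data[s::n_subsets], k, lambda a, b: abs(a - b))
        # Stage 2: treat each stage-1 cluster as one item and merge
        # clusters whose means are close.
        merged = ahc(stage1, k, lambda ca, cb: abs(mean(ca) - mean(cb)))
        clusters = [[p for part in group for p in part] for group in merged]
        data = [p for c in clusters for p in c]
    return clusters
```

Because each stage-1 run only sees one subset, the quadratic distance matrix shrinks from N×N to roughly (N/m)×(N/m) per subset, which is the storage saving the abstract refers to.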


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. An example of a dendrogram.
Fig 2. The first stage of the MAHC algorithm.
Fig 3. The second stage of the MAHC algorithm.
Fig 4. The complete MAHC algorithm.
Fig 5. Best-fit lines used to locate the knee of the graph in the L method.
Fig 6. AHC results for a small experiment with 29 true clusters. The peak in the F-measure occurs at 24 clusters, while the knee found by the L method lies at 22 clusters.
Fig 7. Distribution of the number of segments per class for the two independent sets, Set A and Set B.
Fig 8. Performance of MAHC and PSC for the small sets in terms of F-measure, using the F-measure to determine thresholds in stage 1. (a) F-measure for Small Set A. (b) MAHC optimal number of clusters for Small Set A. (c) F-measure for Small Set B. (d) MAHC optimal number of clusters for Small Set B.
Fig 9. Performance of MAHC for the small sets in terms of F-measure, using the L method to determine thresholds in stage 1. (a) MAHC and PSC F-measure for Small Set A. (b) MAHC optimal number of clusters for Small Set A. (c) MAHC and PSC F-measure for Small Set B. (d) MAHC optimal number of clusters for Small Set B.
Fig 10. Performance for the Medium Set. (a) MAHC and PSC F-measure. (b) MAHC optimal number of clusters (NC).
Fig 11. Confusion matrix of the base phones of the large TIMIT dataset.
Fig 12. Influence of the number of subsets used by MAHC on execution time. Classical AHC is included as a baseline.
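The L method referred to in the captions of Figs 5, 6 and 9 locates the knee of a metric-versus-number-of-clusters curve by fitting a straight line to each side of every candidate split point and choosing the split with the smallest combined fitting error. A minimal sketch follows; the size-proportional weighting and the convention of reporting the last point of the left segment as the knee are simplifying assumptions.

```python
def fit_rmse(xs, ys):
    """Root-mean-square error of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return (sse / n) ** 0.5

def l_method(x, y):
    """Pick the split c minimizing the size-weighted RMSE of the two
    best-fit lines; the knee is the last x of the left segment."""
    n = len(x)
    best_c, best_err = None, float("inf")
    for c in range(2, n - 1):          # each side needs >= 2 points
        err = (c * fit_rmse(x[:c], y[:c])
               + (n - c) * fit_rmse(x[c:], y[c:])) / n
        if err < best_err:
            best_c, best_err = c, err
    return x[best_c - 1]
```

On a curve that drops steeply and then flattens, the minimum-error split falls where the two linear regimes meet, which is the behaviour Fig 5 illustrates.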

