BMC Bioinformatics. 2007 Feb 7;8:44.

doi: 10.1186/1471-2105-8-44.

Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement

Francisco R Pinto¹, João A Carriço, Mário Ramirez, Jonas S Almeida

Affiliation

¹ Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina de Lisboa, Av. Professor Egas Moniz, 1649-028 Lisboa, Portugal. fpinto@fm.ul.pt

PMID: 17286861
PMCID: PMC1802093
DOI: 10.1186/1471-2105-8-44

Francisco R Pinto et al. BMC Bioinformatics. 2007.

Abstract

Background: Biological information is commonly used to cluster or classify entities of interest such as genes, conditions, species or samples. However, different sources of data can be used to classify the same set of entities. Methods are therefore frequently needed to compare the performance of two data sources, or to determine how well a given classification agrees with another, especially in the absence of a universally accepted "gold standard" classification.

Results: Here, we describe a novel measure, the Ranked Adjusted Rand (RAR) index. RAR differs from existing methods by taking intercluster distances into account when evaluating the extent of agreement between any two groupings. This characteristic is relevant for pairs of entities that are grouped in the same cluster by one method and separated by another: the latter method may assign them to closely neighbouring clusters or, on the contrary, to clusters that are far apart. RAR is applicable even when intercluster distance information is absent for one or both of the groupings. When it is absent for both, RAR is equal to its predecessor, the Adjusted Rand (HA) index. Artificially designed clusterings were used to demonstrate situations in which only RAR was able to detect differences in the grouping patterns. A study with larger simulated clusterings confirmed that, in realistic conditions, RAR effectively integrates distance and partition information. The new method was applied to biological examples to compare 1) two microbial typing methods, 2) two gene regulatory network distances and 3) microarray gene expression data with pathway information. In the first application, one of the methods does not provide intercluster distances while the other produces a hierarchical clustering. RAR proved to be more sensitive than HA in choosing the threshold for defining clusters in the hierarchical method that maximizes agreement between the results of both methods.

Conclusion: The major advantage of RAR is that it combines cluster distance and partition information, whereas the previously available methods used only the latter. RAR should be used in the research problems where HA was previously used, because in the absence of intercluster distance effects it is an equally effective measure, and in the presence of distance effects it is a more complete one.
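For orientation, the Adjusted Rand (HA) index that RAR reduces to when no distance information is available can be sketched as follows. This is a minimal illustration of the standard Hubert and Arabie formula for two flat partitions, not the authors' implementation; the function name `adjusted_rand` and the toy labels are assumptions made for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand (HA) index between two flat partitions of the same items.

    Illustrative sketch only: the name and signature are hypothetical.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        raise ValueError("at least two items are required")

    # Pairs placed together by both partitions, by A alone, and by B alone.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # chance-level agreement
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Only happens when both partitions are all singletons or one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy labels (made up for illustration).
print(adjusted_rand([0, 0, 1, 1], [0, 0, 1, 1]))
print(adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index uses only partition membership: two clusterings that split a cluster in the same way receive the same HA value regardless of how far apart the resulting clusters are, which is the gap RAR is described as addressing.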

Figures

**Figure 1**
**Small clusterings example of *RAR*'s unique properties**. Clustering A divides 9 points (numbered circles) into three clusters identified by rectangles. By splitting the {1, 2, 3, 4} cluster, the clusterings B, C and D were formed. One of the child clusters kept the same location. The second child cluster moved away from the original location. In B and C, the second child cluster has only one entity, while in D it has three. In B and D the two split clusters are nearest neighbours, while in C they are maximally separated. The two-dimensional coordinates of the points in the figure were used to compute average distances between clusters and to calculate *RAR* and other clustering comparison measures. The results are presented in Table 4.

**Figure 2**
**Ranked Adjusted Rand (*RAR*), Adjusted Rand (HA) and Wallace (W) indices for the comparison of *emm* type with PFGE clusterings using different Dice dissimilarity thresholds**. Dice dissimilarity is on a 0–100 scale. The plot at the top indicates the number of PFGE clusters produced by the respective threshold, while the number of *emm* types is always 12. The minimum threshold studied, 1, does not produce 325 clusters because there are sets of isolates whose PFGE band patterns have a Dice dissimilarity of 0. W(*emm-PFGE*) is the probability that a pair of isolates is in the same PFGE cluster given that they have the same *emm* type. Analogously, W(*PFGE-emm*) is the probability that a pair of isolates has the same *emm* type given that they are in the same PFGE cluster. HA reflects the evolution of both Wallace indices. The plateau of maximum HA, between the thresholds of 28 and 41, is a region of compromise where both Wallace indices are high. The curve of *RAR* values shows a more complex behaviour, with a plateau of maximum values between the thresholds of 20 and 29, and a significant decrease between 29 and 41, where HA is nearly constant.
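The two Wallace indices in this caption are conditional probabilities over pairs of isolates, so they can be sketched directly from pair counts. The snippet below is a minimal illustration under that definition; the function name `wallace` and the toy labels are assumptions and do not come from the study's data.

```python
from collections import Counter
from math import comb


def wallace(labels_a, labels_b):
    """W(A-B): probability that a pair shares a cluster in B, given that it
    shares a cluster in A. Illustrative sketch; the name is hypothetical.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    if together_a == 0:
        raise ValueError("partition A groups no pair of items together")
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    return together_both / together_a


# Toy labels (made up): A is one cluster, B splits it in two.
a = [0, 0, 0, 0]
b = [0, 0, 1, 1]
print(wallace(a, b))  # direction A -> B
print(wallace(b, a))  # direction B -> A
```

The index is asymmetric: in the toy case every pair grouped by B is also grouped by A, but only a third of the pairs grouped by A stay together in B.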

**Figure 3**
**Ranked Mismatch Matrix (*RMM*) composition at different Dice dissimilarity thresholds for PFGE clustering**. The *RMMs* for the comparison of *emm* type with PFGE clusterings have dimensions p × 2, where p depends on the number of PFGE clusters and the two columns correspond to isolate pairs with the same or with different *emm* type. The PFGE intercluster distance rank is represented on the horizontal axis. The isolate pairs with the same *emm* type are represented with full lines, while a dashed line was used for pairs with different *emm* type. The frequencies plotted on the vertical axis are relative, meaning that the content of each *RMM* element was divided by the sum of all *RMM* elements. Each value therefore corresponds to the fraction of isolate pairs contributing to the respective *RMM* element. *RMM* composition was studied at three different thresholds (T = 21, 29 and 41) because 21 is an optimal threshold for *RAR* but not for HA, 29 is an optimal threshold for both measures, and 41 is a slightly sub-optimal threshold for HA (it is at the end of the maximal plateau of HA in Figure 2) and a bad threshold for *RAR*. The frequency distributions of isolate pairs with the same *emm* type are similar for the three thresholds. This is not the case for isolate pairs with different *emm* type. Here, as the threshold increases, the frequency peaks become larger and occur at lower cluster distance ranks, thereby contributing to weaker agreement.
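A p × 2 matrix of the kind described in this caption could be tabulated roughly as below. This is a sketch under stated assumptions rather than the paper's definition: the function name `ranked_mismatch_matrix`, the input format for intercluster distances, the use of rank 0 for pairs in the same cluster, and the dense ranking of tied distances are all choices made for the example.

```python
from itertools import combinations


def ranked_mismatch_matrix(labels_a, cluster_dist_a, labels_b):
    """Relative-frequency matrix with one row per distance rank in A and two
    columns: [same cluster in B, different cluster in B].

    cluster_dist_a maps frozenset({c1, c2}) to the distance between clusters
    c1 and c2 of partition A. Hypothetical helper, for illustration only.
    """
    # Assumption: rank 0 is reserved for pairs in the same A cluster, and
    # distinct intercluster distances get ranks 1, 2, ... in ascending order.
    distinct = sorted(set(cluster_dist_a.values()))
    rank_of = {d: r for r, d in enumerate(distinct, start=1)}

    counts = [[0, 0] for _ in range(len(distinct) + 1)]
    for i, j in combinations(range(len(labels_a)), 2):
        if labels_a[i] == labels_a[j]:
            row = 0
        else:
            row = rank_of[cluster_dist_a[frozenset((labels_a[i], labels_a[j]))]]
        col = 0 if labels_b[i] == labels_b[j] else 1
        counts[row][col] += 1

    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("at least two items are required")
    # Divide every element by the sum of all elements, as in the caption.
    return [[c / total for c in row] for row in counts]


# Toy data (made up): three A clusters with distances, and a partition B.
labels_a = ["x", "x", "y", "z"]
dist_a = {
    frozenset(("x", "y")): 1.0,
    frozenset(("y", "z")): 2.0,
    frozenset(("x", "z")): 3.0,
}
labels_b = [0, 0, 0, 1]
for rank, row in enumerate(ranked_mismatch_matrix(labels_a, dist_a, labels_b)):
    print(rank, row)
```

Plotting each column against the row index gives curves analogous to the full and dashed lines described above, with one curve for pairs that share a group in the second partition and one for pairs that do not.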

Cited by

Genomic investigation of Lactococcus formosensis, Lactococcus garvieae, and Lactococcus petauri reveals differences in species distribution by human and animal sources.
Chan Y-X, Cao H, Jiang S, Li X, Fung K-K, Lee C-H, Sridhar S, Chen JH-K, Ho P-L. Microbiol Spectr. 2024 Jun 4;12(6):e0054124. doi: 10.1128/spectrum.00541-24. Epub 2024 Apr 30. PMID: 38687062. Free PMC article.
Development of a Peptide-Based Multiepitope Vaccine from the SARS-CoV-2 Spike Protein for Targeted Immune Response Against COVID-19.
Campelo TA, Noronha Souza PF, Brito DMS, Frota CC, Antas PRZ. Protein Pept Lett. 2025;32(4):299-311. doi: 10.2174/0109298665364226250328084245. PMID: 40231512.
A confidence interval for the wallace coefficient of concordance and its application to microbial typing methods.
Pinto FR, Melo-Cristino J, Ramirez M. PLoS One. 2008;3(11):e3696. doi: 10.1371/journal.pone.0003696. Epub 2008 Nov 11. PMID: 19002246. Free PMC article.
Performance Comparison Between Fourier-Transform Infrared Spectroscopy-based IR Biotyper and Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry for Strain Diversity.
Jun SY, Kim YA, Lee SJ, Jung WW, Kim HS, Kim SS, Kim H, Yong D, Lee K. Ann Lab Med. 2023 Mar 1;43(2):174-179. doi: 10.3343/alm.2023.43.2.174. Epub 2022 Oct 25. PMID: 36281511. Free PMC article.
Evaluation of jackknife and bootstrap for defining confidence intervals for pairwise agreement measures.
Severiano A, Carriço JA, Robinson DA, Ramirez M, Pinto FR. PLoS One. 2011;6(5):e19539. doi: 10.1371/journal.pone.0019539. Epub 2011 May 18. PMID: 21611165. Free PMC article.
