. 2009 Jul;103(1):32-45.

doi: 10.1038/hdy.2009.29. Epub 2009 Apr 1.

An agglomerative hierarchical approach to visualization in Bayesian clustering problems

K J Dawson¹, K Belkhir

Affiliations

PMID: 19337306
PMCID: PMC2705916
DOI: 10.1038/hdy.2009.29

An agglomerative hierarchical approach to visualization in Bayesian clustering problems

K J Dawson et al. Heredity (Edinb). 2009 Jul.

. 2009 Jul;103(1):32-45.

doi: 10.1038/hdy.2009.29. Epub 2009 Apr 1.

Authors

K J Dawson¹, K Belkhir

Affiliation

¹ Centre for Mathematical and Computational Biology, Rothamsted Research, Harpenden, Hertfordshire, UK. kevin.dawson@bbsrc.ac.uk

PMID: 19337306
PMCID: PMC2705916
DOI: 10.1038/hdy.2009.29

Abstract

Clustering problems (including the clustering of individuals into outcrossing populations, hybrid generations, full-sib families and selfing lines) have recently received much attention in population genetics. In these clustering problems, the parameter of interest is a partition of the set of sampled individuals--the sample partition. In a fully Bayesian approach to clustering problems of this type, our knowledge about the sample partition is represented by a probability distribution on the space of possible sample partitions. As the number of possible partitions grows very rapidly with the sample size, we cannot visualize this probability distribution in its entirety, unless the sample is very small. As a solution to this visualization problem, we recommend using an agglomerative hierarchical clustering algorithm, which we call the exact linkage algorithm. This algorithm is a special case of the maximin clustering algorithm that we introduced previously. The exact linkage algorithm is now implemented in our software package PartitionView. The exact linkage algorithm takes the posterior co-assignment probabilities as input and yields as output a rooted binary tree, or more generally, a forest of such trees. Each node of this forest defines a set of individuals, and the node height is the posterior co-assignment probability of this set. This provides a useful visual representation of the uncertainty associated with the assignment of individuals to categories. It is also a useful starting point for a more detailed exploration of the posterior distribution in terms of the co-assignment probabilities.

PubMed Disclaimer

Figures

**figure 1. The exact linkage algorithm**
Here we illustrate the exact linkage algorithm with an example for a sample of 5 individuals. At the initialisation step (iteration t = 0) we create 5 terminal nodes, labelled respectively 1, 2, 3, 4 and 5. This is followed by iterations t = 1, 2, 3, 4. At each iteration t (≥ 1) we begin with an *update* step. This is followed by an *acceptance* step at which we create a new internal node, labelled -t (where t is the iteration at which it is created). At the update step we update the set of proposed nodes. This updated set of proposed nodes is formed from all possible pairs of the root nodes available at the start of the present iteration. In Figure 1, every node (The height of a node is equal to the co-assignment probability of the set of individuals defined by the node) is positioned at a height equal to the co-assignment probability of the set defined by that node. The proposed nodes are arranged on the vertical axis to the right of the forest. (The height of a node is equal to the co-assignment probability of the set of individuals defined by the node.) Notice that it is always the highest proposed node which is accepted. At iteration t = 1 the proposed node (4, 5) is accepted and added to the tree, where it is labelled -1. At iteration t = 2 the proposed node (1, 2) is accepted and added to the tree, where it is labelled -2. At iteration t = 3 the proposed node (-1, 3) is accepted and added to the tree, where it is labelled -3. At iteration t = 4 the proposed node (-3, -2) is accepted and added to the tree, where it is labelled -4.

**figure 2. Example 1. The true matrilineal pedigree of the simulated data set**
The true matrilineal pedigree of the exhaustive sample of all n = N = 100 individuals from a single local population. Outcrossing events are indicated by diamonds. Those selfing lines which are represented in the sample by more than 2 individuals are colour-coded.

**figure 3. Example 1. Output from the exact linkage algorithm**
The rooted binary forest generated by applying the exact linkage algorithm to the output from the IMPPS Markov chain sampler (a pooled sample of 60 000 observations).

**figure 4. Example 2. Output from the maximin agglomerative algorithm with d = 2**
The rooted binary forest generated by applying the maximin agglomerative algorithm, with d = 2, to the output from the HWLER Markov chain sampler.

**figure 5. Example 2. Output from the maximin agglomerative algorithm with d = 3**
The rooted binary forest generated by applying the maximin agglomerative algorithm, with d = 3, to the output from the HWLER Markov chain sampler.

**figure 6. Example 2. Output from the maximin agglomerative algorithm with d = 4**
The rooted binary forest generated by applying the maximin agglomerative algorithm, with d = 4, to the output from the HWLER Markov chain sampler.

**figure 7. Example 2. Output from the exact linkage algorithm**
The rooted binary forest generated by applying the exact linkage algorithm to the output from the HWLER Markov chain sampler.

See this image and copyright information in PMC

Cited by

A spatial dirichlet process mixture model for clustering population genetics data.
Reich BJ, Bondell HD. Reich BJ, et al. Biometrics. 2011 Jun;67(2):381-90. doi: 10.1111/j.1541-0420.2010.01484.x. Epub 2010 Sep 3. Biometrics. 2011. PMID: 20825394 Free PMC article.

References

1. Aigner M. Combinatorial theory. Springer-Verlag; New York: 1979.
1. Almudevar A, Field C. Inference of single generation sibling relationships based on DNA markers. Journal of Agricultural Biology and Environmental statistics. 1999;4:136–165.
1. Anderson EC, Thompson EA. A model-based method for identifying species hybrids using multilocus genetic data. Genetics. 2002;106:1217–1229. - PMC - PubMed
1. Berger JO. Statistical Decision Theory and Bayesian Analysis. 2nd edition Springer-Verlag; New York: 1985.
1. Celeux G, Hurn M, Robert CP. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association. 2000;95:957–970.

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Research Materials
- NCI CPTC Antibody Characterization Program
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

An agglomerative hierarchical approach to visualization in Bayesian clustering problems

Affiliation

An agglomerative hierarchical approach to visualization in Bayesian clustering problems

Authors

Affiliation

Abstract

Figures

Similar articles

Cited by

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Research Materials

Miscellaneous