Bayesian Anal. 2021 Mar;16(1):301-370.
doi: 10.1214/20-BA1197. Epub 2020 Feb 13.

Centered Partition Processes: Informative Priors for Clustering (with Discussion)

Sally Paganin et al. Bayesian Anal. 2021 Mar.

Abstract

There is a very rich literature proposing Bayesian approaches for clustering starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations in terms of Exchangeable Partition Probability Functions (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate-dependence and partial exchangeability, limited consideration has been given to how to include concrete prior knowledge on the partition. For example, we are motivated by an epidemiological application, in which we wish to cluster birth defects into groups and we have prior knowledge of an initial clustering provided by experts. As a general approach for including such prior knowledge, we propose a Centered Partition (CP) process that modifies the EPPF to favor partitions close to an initial one. Some properties of the CP prior are described, a general algorithm for posterior computation is developed, and we illustrate the methodology through simulation examples and an application to the motivating epidemiology study of birth defects.

Keywords: Bayesian clustering; Bayesian nonparametrics; Dirichlet Process; centered process; exchangeable probability partition function; mixture model; product partition model.
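
The idea of modifying a base EPPF to favor partitions close to an initial one can be illustrated on a small example. The sketch below enumerates the 52 set partitions of N = 5 elements and reweights a uniform base prior by a penalty that grows with distance from a centering partition c0. The exponential penalty exp(−ψ · d(c, c0)) and the use of the Variation of Information as the distance d are assumptions made here for illustration; the abstract does not state either. All function names (`set_partitions`, `variation_of_information`, `cp_prior`) and the choice of c0 are hypothetical.

```python
from math import exp, log2


def set_partitions(items):
    """Yield every set partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or into a new block of its own.
        yield [[first]] + partition


def variation_of_information(c1, c2, n):
    """Variation of Information (base-2 logs) between two partitions of n items.

    ASSUMPTION: this distance is chosen for illustration only.
    """
    vi = 0.0
    for a in c1:
        for b in c2:
            overlap = len(set(a) & set(b))
            if overlap:
                p_ab = overlap / n
                vi -= p_ab * (log2(p_ab / (len(a) / n)) + log2(p_ab / (len(b) / n)))
    return vi


def cp_prior(partitions, c0, psi, n, base=lambda c: 1.0):
    """Normalized weights base(c) * exp(-psi * d(c, c0)) over all partitions.

    ASSUMPTION: the exponential penalty form is illustrative. `base` returns
    an unnormalized base prior weight; the default is a uniform base.
    """
    weights = [base(c) * exp(-psi * variation_of_information(c, c0, n))
               for c in partitions]
    total = sum(weights)
    return [w / total for w in weights]


items = [1, 2, 3, 4, 5]
partitions = list(set_partitions(items))
c0 = [[1, 2], [3, 4, 5]]  # hypothetical centering partition

for psi in (0.0, 1.0, 2.0, 3.0):
    probs = cp_prior(partitions, c0, psi, len(items))
    # Locate c0 in the enumeration regardless of block or element order.
    idx = next(i for i, c in enumerate(partitions)
               if {frozenset(b) for b in c} == {frozenset(b) for b in c0})
    print(f"psi={psi}: P(c0) = {probs[idx]:.4f}")
```

With ψ = 0 the weights reduce to the base prior, and as ψ grows the mass concentrates on c0. Brute-force enumeration is only feasible for very small N, since the number of set partitions grows as the Bell numbers.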

Figures

Figure 1:
Hasse diagram for the lattice of set partitions of 4 elements. A line is drawn when two partitions have a covering relation. For example, {1}{2, 3, 4} is connected with the 3 partitions obtained by splitting the block {2, 3, 4} in every possible way, and with the partition {1, 2, 3, 4}, obtained by merging the two clusters.
Figure 2:
Prior probabilities of the 52 set partitions of N = 5 elements for the CP process with uniform base EPPF. In each graph the CP process is centered on a different partition c0 highlighted in blue. The cumulative probabilities across different values of the penalization parameter ψ are joined to form the curves, while the probability of a given partition corresponds to the area between the curves.
Figure 3:
Prior probabilities of the 52 set partitions of N = 5 elements for the CP process with Dirichlet process of α = 1 base EPPF. In each graph the CP process is centered on a different partition c0 highlighted in blue. The cumulative probabilities across different values of the penalization parameter ψ are joined to form the curves, while the probability of a given partition corresponds to the area between the curves.
Figure 4:
Prior probabilities of the 52 set partitions of N = 5 elements for the CP process with Pitman-Yor process base EPPF with σ = 0.25 and α ≈ −0.004, such that the expected number of clusters is equal to log(5) ≈ 1.6. In each graph the CP process is centered on a different partition c0 highlighted in blue. The cumulative probabilities across different values of the penalization parameter ψ are joined to form the curves, while the probability of a given partition corresponds to the area between the curves.
Figure 5:
Prior probabilities of the 52 set partitions of N = 5 elements for the CP process with Pitman-Yor process base EPPF with σ = 0.75 and α ≈ −0.691, such that the expected number of clusters is equal to log(5) ≈ 1.6. In each graph the CP process is centered on a different partition c0 highlighted in blue. The cumulative probabilities across different values of the penalization parameter ψ are joined to form the curves, while the probability of a given partition corresponds to the area between the curves.
Figure 6:
Illustration of results from the local search algorithm based on the Hasse diagram of Π4 starting from c0 = {1}{2,3,4}. Partitions are colored according to the exploration order, following a dark-to-light gradient. Notice that after 3 iterations the space is entirely explored.
Figure 7:
Estimate of the cumulative prior probabilities assigned to different distances from c0 for N = 12 and c0 with configuration {3, 3, 3, 3}, under the CP process with uniform prior on the left and Dirichlet Process on the right. Black dots correspond to the base prior with no penalization, while dots from bottom to top correspond to increasing values of ψ ∈ {5, 10, 15, 20}. Tables report the minimum distance values such that F(δ) ≥ 0.9.
Figure 8:
Results from grouped logistic regressions with DP(α = 1) prior and CP process prior with DP(α = 1) base EPPF for ψ = 15, 17, centered on the true partition. Heatmaps on the left side show the posterior similarity matrix. On the right side, boxplots show the distribution of deviations from the maximum likelihood baseline coefficients and posterior mean estimates for each defect i = 1,…, 12.
Figure 9:
Results from grouped logistic regression using CP process prior with DP(α = 1) base EPPF for ψ = 15 centered on partition c0 = {1,5,9}{2,6,10}{3,7,11}{4,8,12}, which has distance 3.16 from the true one. Heatmaps on the left side show the posterior similarity matrix. On the right side, boxplots show the distribution of deviations from the maximum likelihood baseline coefficients and posterior mean estimates for each defect i = 1,…, 12.
Figure 10:
Posterior allocation matrices obtained using the CP process with a DP(α = 1) prior for different values of ψ ∈ {0, 40, 80, 120}. On the y-axis, labels are colored according to the base grouping information c0, with dots on the diagonal highlighting differences between c0 and the estimated partition ĉ.
Figure 11:
Comparison of significant odds ratios under ψ ∈ {0, 40, 80, 120, ∞} for some exposure factors and 4 selected heart defects in 4 different groups under c0. Dots mark mean posterior log-odds ratios (log-OR) that are significant at 95%, with red encoding risk factors (log-OR > 0) and green encoding protective factors (log-OR < 0).
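
The covering relation described for Figure 1 and the exploration reported for Figure 6 can be checked with a short sketch. Two partitions are treated as neighbors in the Hasse diagram when one is obtained from the other by merging two blocks or by splitting one block into two. The search below is a plain breadth-first traversal, which is an assumption standing in for the authors' local search algorithm and may differ from it in detail; the names `canon`, `neighbors`, and `exploration_depths` are hypothetical.

```python
from collections import deque
from itertools import combinations


def canon(blocks):
    """Canonical hashable form of a partition: a frozenset of frozensets."""
    return frozenset(frozenset(b) for b in blocks)


def neighbors(partition):
    """Partitions linked to `partition` by a covering relation."""
    blocks = list(partition)
    result = set()
    # Coarser neighbors: merge any two blocks.
    for a, b in combinations(blocks, 2):
        rest = [x for x in blocks if x != a and x != b]
        result.add(frozenset(rest + [a | b]))
    # Finer neighbors: split one block into two non-empty parts.
    for block in blocks:
        members = sorted(block)
        anchor, others = members[0], members[1:]
        rest = [x for x in blocks if x != block]
        # Fixing `anchor` in the first part avoids counting each split twice;
        # r < len(others) guarantees the second part is non-empty.
        for r in range(len(others)):
            for keep in combinations(others, r):
                part = frozenset((anchor,) + keep)
                result.add(frozenset(rest + [part, block - part]))
    return result


def exploration_depths(start):
    """Breadth-first search over the Hasse diagram; maps partition -> depth."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current):
            if nxt not in depth:
                depth[nxt] = depth[current] + 1
                queue.append(nxt)
    return depth


c0 = canon([[1], [2, 3, 4]])
depths = exploration_depths(c0)
print("neighbors of c0:", len(neighbors(c0)))
print("partitions reached:", len(depths))
print("maximum depth:", max(depths.values()))
```

Starting from {1}{2,3,4}, the sketch finds 4 neighbors (three splits and one merge), reaches all 15 partitions of 4 elements, and needs a maximum depth of 3, in line with the two captions.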
