J Comput Biol. 2017 Jun;24(6):472-485.
doi: 10.1089/cmb.2016.0138. Epub 2016 Nov 11.

A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets

Chandler Zuo et al. J Comput Biol. 2017 Jun.

Abstract

Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on analyzing each data set independently (i.e., peak calling). This approach ignores the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared toward multisample investigations have limited applicability in settings that aim to integrate hundreds to thousands of ChIP-seq data sets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, Zuo et al. developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq data sets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization-based estimation structure hinders its applicability to large numbers of loci and samples. We address this limitation by developing a MAP-based asymptotic derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploring multiple initialization schemes and flexibility in tuning. Comparison with MBASIC indicates that this speed comes at a relatively insignificant loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq data sets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.
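The abstract does not spell out the MAD-Bayes MBASIC updates, so the following is only a minimal sketch of the general idea behind small-variance asymptotics, using the generic DP-means algorithm as a stand-in: a K-means-like loop in which a penalty decides when a point is far enough from every center to open a new cluster. The names `dp_means` and `penalized_cost`, the penalty `lam`, and the toy data are illustrative assumptions and are not taken from the paper or from the MBASIC software.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def dp_means(data: List[Point], lam: float,
             max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Generic DP-means: K-means-like updates with a new-cluster penalty lam.

    Illustrative only; this is NOT the MAD-Bayes MBASIC algorithm itself.
    """
    centers = [centroid(data)]      # start with a single global cluster
    labels = [0] * len(data)
    for _ in range(max_iter):
        changed = False
        # Assignment step: nearest center, or a new cluster if too far away.
        for i, x in enumerate(data):
            dists = [sq_dist(x, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > lam:
                centers.append(list(x))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Update step: recompute centers and drop clusters left empty.
        members = [[x for x, l in zip(data, labels) if l == k]
                   for k in range(len(centers))]
        keep = [k for k, m in enumerate(members) if m]
        remap = {old: new for new, old in enumerate(keep)}
        centers = [centroid(members[k]) for k in keep]
        labels = [remap[l] for l in labels]
        if not changed:
            break
    return labels, centers


def penalized_cost(data: List[Point], labels: List[int],
                   centers: List[List[float]], lam: float) -> float:
    """Within-cluster squared error plus lam times the number of clusters."""
    fit = sum(sq_dist(x, centers[l]) for x, l in zip(data, labels))
    return fit + lam * len(centers)


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    toy_labels, toy_centers = dp_means(toy, lam=4.0)
    print(toy_labels, len(toy_centers),
          penalized_cost(toy, toy_labels, toy_centers, 4.0))
```

Because each pass is a cheap assign-and-average sweep, rerunning the loop from several starting points or over a grid of `lam` values is inexpensive, which is the kind of flexibility in initialization and tuning the abstract attributes to the K-means-like formulation.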

Keywords: ChIP-Seq; MAD-Bayes; small-variance asymptotics; unified state-space inference and clustering.

Conflict of interest statement

No competing financial interests exist.

Figures

FIG. 1.
Overview of the MBASIC modeling framework. Curves within each panel depict different replicates under the experimental conditions C1, C2, and C3. Loci A and D are in the same cluster.
FIG. 2.
(a) Run-time comparisons on a 64-bit machine with an Intel Xeon 3.0 GHz processor, eight cores, and 64 GB of RAM. (b) State-space prediction error. (c) Clustering accuracy based on the adjusted Rand index. (d) Clustering assignments of the singletons when formula image
FIG. 3.
(a) Comparison of clusters and state labels between MAD-Bayes, Spectacle, and ChromHMM. (b) Jaccard index between MAD-Bayes clusters and ChromHMM states. (c) Jaccard index between MAD-Bayes clusters and Spectacle states. The diagonal blocks indicate agreement between clusters and states; MAD-Bayes clusters and Spectacle states are ordered according to their overlap with the ChromHMM states. MAD-Bayes, MAP-based asymptotic derivations from Bayes.
Appendix FIG. 1.
A graphical interpretation of the conjugacy between formula image and J. We use the K-means initialization to compute surrogate values for formula image for a large collection of formula image. The formula image value that can yield J clusters in the global solution must satisfy formula image. When formula image satisfies this condition, a line with slope formula image passing through formula image on the graph should be tangent to the trace of all formula image values. Although using the surrogate formula image values can cause the curve connecting the formula image values to be non-convex, so that the solution for formula image does not hold for some J, we can use a convex approximation to the trace of formula image so that a formula image exists for each J. A simpler approach is to order formula image from largest to smallest and require the following condition for formula image: formula image. Algorithm 2 essentially applies this idea to select the formula image values. Each J corresponds to a formula image of value formula image that satisfies the conjugacy inequality. The algorithm essentially tries to identify the range of formula image that leads up to formula image number of clusters.
Appendix FIG. 2.
Comparison of the clustering accuracy with the adjusted Rand index by excluding the singleton loci.
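
The figure captions rely on two standard agreement measures: the adjusted Rand index (FIG. 2c and Appendix FIG. 2) and the Jaccard index between clusters and states (FIG. 3b, c). As a rough illustration of how such quantities are commonly computed, here is a minimal sketch. The function names, the boolean `is_singleton` mask, and the toy labels are hypothetical; how the paper actually identifies singleton loci is not stated in this excerpt.

```python
from collections import Counter, defaultdict
from math import comb
from typing import Dict, Hashable, Sequence, Tuple


def adjusted_rand_index(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Adjusted Rand index between two labelings of the same items."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of items grouped together in both, in a only, and in b only.
    together = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / total_pairs
    maximum = 0.5 * (pairs_a + pairs_b)
    if maximum == expected:         # degenerate case, e.g. one cluster each
        return 1.0
    return (together - expected) / (maximum - expected)


def ari_excluding(a: Sequence[Hashable], b: Sequence[Hashable],
                  is_singleton: Sequence[bool]) -> float:
    """ARI computed only on items not flagged by the (assumed) mask."""
    kept = [(x, y) for x, y, s in zip(a, b, is_singleton) if not s]
    return adjusted_rand_index([x for x, _ in kept], [y for _, y in kept])


def jaccard_matrix(clusters: Sequence[Hashable],
                   states: Sequence[Hashable]
                   ) -> Dict[Tuple[Hashable, Hashable], float]:
    """Jaccard index |A & B| / |A | B| for every (cluster, state) pair."""
    by_cluster = defaultdict(set)
    by_state = defaultdict(set)
    for i, (c, s) in enumerate(zip(clusters, states)):
        by_cluster[c].add(i)
        by_state[s].add(i)
    return {(c, s): len(A & B) / len(A | B)
            for c, A in by_cluster.items()
            for s, B in by_state.items()}


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2]
    guess = [5, 5, 7, 7, 5]
    print(adjusted_rand_index(truth, guess))
    print(ari_excluding(truth, guess, [False, False, False, False, True]))
    print(jaccard_matrix([0, 0, 1, 1], ["x", "x", "x", "y"]))
```

A Jaccard matrix of this form, with rows and columns reordered by overlap, is the sort of table in which diagonal blocks indicate agreement between clusters and states.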