Bioinformatics. 2013 Mar 1;29(5):533-41.
doi: 10.1093/bioinformatics/btt012. Epub 2013 Jan 16.

Sparsely correlated hidden Markov models with application to genome-wide location studies

Hyungwon Choi et al. Bioinformatics.

Abstract

Motivation: Multiply correlated datasets have become increasingly common in genome-wide location analysis of regulatory proteins and epigenetic modifications. Their correlation can be directly incorporated into a statistical model to capture underlying biological interactions, but such modeling quickly becomes computationally intractable.

Results: We present sparsely correlated hidden Markov models (scHMM), a novel method for performing simultaneous hidden Markov model (HMM) inference on multiple genomic datasets. In scHMM, a single HMM is assumed for each series, but the transition probability in each series depends not only on its own hidden states but also on the hidden states of other related series. For each series, scHMM uses penalized regression to select a subset of the other data series and estimate their effects on the odds of each transition in the given series. Hidden states are then inferred using a standard forward-backward algorithm, with the transition probabilities adjusted by the model at each position, which keeps the computational cost close to that of fitting independent HMMs (iHMM). Hence, scHMM is a collection of inter-dependent non-homogeneous HMMs, capable of giving a close approximation to a fully multivariate HMM fit. A simulation study shows that scHMM achieves sensitivity comparable to that of the multivariate HMM fit at a much lower computational cost. The method was demonstrated in a joint analysis of 39 histone modifications, CTCF and RNA polymerase II in human CD4+ T cells. In this dataset, scHMM reported fewer high-confidence regions than iHMM but recovered previously characterized histone modifications in relevant genomic regions better than iHMM. In addition, the combinatorial patterns from scHMM could be better mapped to the 51 states reported by the multivariate HMM method of Ernst and Kellis.
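The inference step described above can be illustrated with a minimal sketch. It is not the scHMM package's code. It assumes a two-state HMM (background vs. signal) with Poisson emissions, and it assumes the neighbor effects enter through a logistic model on the log-odds of moving into the signal state. The function names, coefficient values and the use of neighbor posterior probabilities as covariates are all hypothetical choices made for illustration. The penalized regression that selects neighbors is not shown; a zero coefficient stands in for an unselected series.

```python
import math

def poisson_pmf(k, lam):
    # Evaluated through logs for numerical stability.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def transition_matrices(neighbor_probs, intercepts, coefs):
    """Build one 2x2 transition matrix per step t-1 -> t.

    neighbor_probs[j][t]: P(state = 1) of neighbor series j at position t.
    intercepts[s]: baseline log-odds of entering state 1 from state s.
    coefs[s][j]: assumed effect of neighbor j on that log-odds.
    """
    n = len(neighbor_probs[0])
    mats = []
    for t in range(1, n):
        rows = []
        for s in (0, 1):
            eta = intercepts[s] + sum(
                c * p[t] for c, p in zip(coefs[s], neighbor_probs))
            p1 = logistic(eta)
            rows.append([1.0 - p1, p1])
        mats.append(rows)
    return mats

def forward_backward(obs, init, mats, rates):
    """Scaled forward-backward with position-specific transitions.

    Returns P(state = 1 | all observations) for every position.
    """
    n = len(obs)
    emit = [[poisson_pmf(o, rates[s]) for s in (0, 1)] for o in obs]
    alpha = [[0.0, 0.0] for _ in range(n)]
    beta = [[1.0, 1.0] for _ in range(n)]

    # Forward pass, renormalized at each step to avoid underflow.
    a = [init[s] * emit[0][s] for s in (0, 1)]
    z = sum(a)
    alpha[0] = [x / z for x in a]
    for t in range(1, n):
        A = mats[t - 1]
        a = [emit[t][s] * sum(alpha[t - 1][r] * A[r][s] for r in (0, 1))
             for s in (0, 1)]
        z = sum(a)
        alpha[t] = [x / z for x in a]

    # Backward pass, also renormalized; the scale cancels in the posterior.
    for t in range(n - 2, -1, -1):
        A = mats[t]
        b = [sum(A[s][r] * emit[t + 1][r] * beta[t + 1][r] for r in (0, 1))
             for s in (0, 1)]
        z = sum(b)
        beta[t] = [x / z for x in b]

    post = []
    for t in range(n):
        g = [alpha[t][s] * beta[t][s] for s in (0, 1)]
        post.append(g[1] / sum(g))
    return post

# Toy example: one target series and one selected neighbor (values invented).
obs = [4, 5, 6, 11, 12, 9, 5, 4]
neighbor = [[0.1, 0.1, 0.2, 0.9, 0.9, 0.8, 0.1, 0.1]]
mats = transition_matrices(neighbor, intercepts=[-3.0, 1.0],
                           coefs=[[2.0], [1.0]])
posterior = forward_backward(obs, init=[0.9, 0.1], mats=mats,
                             rates=(5.0, 10.0))
print([round(p, 3) for p in posterior])
```

Because each series is still decoded by its own forward-backward pass, the cost per series stays linear in the sequence length, which is the property the abstract points to when comparing scHMM with independent HMMs.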

Availability: The scHMM package can be freely downloaded from http://sourceforge.net/p/schmm/ and is recommended for use in a Linux environment.

Figures

Fig. 1.
Three strategies to model multiple data series: (a) independent HMMs, (b) fully coupled HMM and (c) sparsely correlated HMMs for series data O. Associated with the observed data are hidden states h, which are to be inferred. For (c), the dashed arrows indicate the couplings introduced to adjust the transition kernel of each series.
Fig. 2.
Simulation studies. Each method is represented by a different symbol: squares for iHMM, circles for scHMM and triangles for fullHMM. (a) Independent case (2-fold): short-length signals were planted in random locations in three different series data. (b) One-group case (2-fold): replicate experiments where binding sites are expected to be shared in all experiments. (c) Two-group case (2-fold): two sets of three correlated series. (d) Three-group case (2-fold): three inter-dependent groups of two correlated series. In all panels, signal was simulated from Poisson (10) and background noise was simulated from Poisson (5).
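A simulation in the spirit of the one-group case can be sketched as follows, with short signals planted at shared random locations in three replicate series, Poisson (10) signal and Poisson (5) background. The series length, number of sites, signal width and seed are hypothetical values chosen for illustration; the caption does not state them. The Poisson sampler uses Knuth's multiplication method so that the sketch needs only the standard library.

```python
import math
import random

def sample_poisson(rng, lam):
    # Knuth's multiplication method; adequate for small rates such as 5 or 10.
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_series(n, sites, width, rng, signal=10.0, background=5.0):
    """Return (hidden states, counts) with signal planted at the given sites."""
    states = [0] * n
    for start in sites:
        for i in range(start, min(start + width, n)):
            states[i] = 1
    counts = [sample_poisson(rng, signal if s else background)
              for s in states]
    return states, counts

rng = random.Random(0)            # fixed seed (assumed) for reproducibility
n, width, n_sites = 2000, 20, 10  # assumed sizes, not taken from the paper
shared_sites = rng.sample(range(n - width), n_sites)

# One-group case: three replicates sharing the same planted sites.
replicates = [simulate_series(n, shared_sites, width, rng) for _ in range(3)]

states, counts = replicates[0]
sig = [c for s, c in zip(states, counts) if s == 1]
bg = [c for s, c in zip(states, counts) if s == 0]
mean_sig = sum(sig) / len(sig)
mean_bg = sum(bg) / len(bg)
print(round(mean_sig, 2), round(mean_bg, 2))
```

Drawing a separate set of sites for each series instead of reusing `shared_sites` would correspond to the independent case in panel (a).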
Fig. 3.
Correlation between 39 histone modifications (and RNA Pol II and CTCF) using the probability estimates from iHMM and scHMM.
