Review
Biophys J. 2017 May 23;112(10):2021-2029.
doi: 10.1016/j.bpj.2017.04.027.

An Introduction to Infinite HMMs for Single-Molecule Data Analysis

Ioannis Sgouralis et al. Biophys J.

Abstract

The hidden Markov model (HMM) has been a workhorse of single-molecule data analysis and is now commonly used as a stand-alone tool in time series analysis or in conjunction with other analysis methods such as tracking. Here, we provide a conceptual introduction to an important generalization of the HMM, which is poised to have a deep impact across the field of biophysics: the infinite HMM (iHMM). As a modeling tool, iHMMs can analyze sequential data without a priori setting a specific number of states as required for the traditional (finite) HMM. Although the current literature on the iHMM is primarily intended for audiences in statistics, the idea is powerful and the iHMM's breadth in applicability outside machine learning and data science warrants a careful exposition. Here, we explain the key ideas underlying the iHMM, with a special emphasis on implementation, and provide a description of a code we are making freely available. In a companion article, we provide an important extension of the iHMM to accommodate complications such as drift.
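To make the contrast in the abstract concrete, the following is a minimal sketch of the traditional (finite) HMM as a generative model, in which the number of states must be fixed before any data are analyzed. It is an illustration only and not the code the authors describe. The function name `simulate_hmm`, the Gaussian form of the emission distributions, the shared noise level `sigma`, and all numerical values are assumptions introduced here.

```python
import numpy as np

def simulate_hmm(pi0, P, means, sigma, n_steps, seed=0):
    """Simulate a finite HMM with Gaussian emissions (an assumed emission form).

    pi0   : initial state probabilities, length K
    P     : K x K transition matrix, each row sums to 1
    means : emission mean for each of the K states
    sigma : emission standard deviation, shared by all states (assumption)
    """
    rng = np.random.default_rng(seed)
    P = np.asarray(P, dtype=float)
    means = np.asarray(means, dtype=float)
    K = len(pi0)  # the state count is fixed in advance: this is the finite HMM

    states = np.empty(n_steps, dtype=int)
    states[0] = rng.choice(K, p=pi0)
    for n in range(1, n_steps):
        # The next hidden state depends only on the current hidden state.
        states[n] = rng.choice(K, p=P[states[n - 1]])

    # Each observation is drawn from the emission distribution of its state.
    observations = rng.normal(loc=means[states], scale=sigma)
    return states, observations

# Hypothetical two-state example; every number below is illustrative.
P_example = [[0.95, 0.05],
             [0.10, 0.90]]
states, observations = simulate_hmm([0.5, 0.5], P_example, [0.0, 3.0], 0.8, 500)
noiseless_trace = np.asarray([0.0, 3.0])[states]  # emission means along the state path
```

Here `K` is set by the length of `pi0` and never changes, which is the restriction the iHMM is designed to lift.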


Figures

Figure 1
A synthetic time trace illustrating measurements of a hypothetical biomolecule that undergoes conformational transitions. (Left) The state space consists of conformations depicted discretely as σ1, σ2, …. (Middle) Time series of noisy observations, xn, produced by the biomolecule (blue) and the corresponding noiseless trace (red). Over the time course of the measurements, the biomolecule attains only conformations σ1 through σ5, though additional conformations might be visited at subsequent times. For the sake of concreteness only, we label these states in order of appearance from 1 through 5. (Right) Binning the collected observations reveals “emission distributions,” Fσk, associated with each conformation. These distributions are highlighted with red lines. The centers (mean values) of the emission distributions are used to obtain the noiseless trace in the middle panel. The illustration on the left is created using data from (47) (PDB: 2N4G). To see this figure in color, go online.
Figure 2
Graphical representation of the HMM. In the HMM, a biomolecule of interest transitions between unobserved states sn according to the probability vectors π̃sn and generates observations xn according to the probability distributions Fsn that depend on the parameter ϕsn. Here, following convention, the xn values are shaded to denote that these quantities are observed, whereas the sn values are hidden. Arrows denote the dependences among the model variables and red lines denote the model parameters. To see this figure in color, go online.
Figure 3
Graphical representation of the iHMM. The hidden Markov model that formulates the observations to be analyzed (black lines) is shown together with its priors (red lines). For completeness, we also show the concentration parameters α and γ and the prior probability distribution on the emission parameters, H, that fully characterize the iHMM. The key difference from the HMM shown in Fig. 2 is that now the model parameters π̃σk and ϕσk are treated as random variables similar to the hidden states, sn, and observations, xn. For details, see the main text. To see this figure in color, go online.
Figure 4
Synthetic data sets resembling a hypothetical biomolecule undergoing transitions between discrete states that we analyzed with the iHMM. (Left) Time series x̄ = (x1, …, xN) of noisy observations. During the measuring period, the biomolecule attains five conformations, σ1, …, σ5. The number of conformations is a priori unknown and the iHMM seeks to determine the probability over the number of states, as well as their properties, given the data available. In data set 1, the biomolecule transitions often through every state. By contrast, in data set 2, transitions to some states are rare. As a result, all states in data set 1 are almost equally visited throughout the experiment time course, whereas in data set 2, higher states are visited, by chance, only toward the end of the trace. (Right) The corresponding emission distributions, Fσk, as obtained by simply binning the observations (blue) and plotting the exact ones used for the simulations (red). For both data sets, the emission distributions show significant overlap. In all panels, dotted lines indicate the exact mean values, μσk, of the emission distributions. To see this figure in color, go online.
Figure 5
After some iterations, the sampler used in the iHMM to analyze data set 1 of Fig. 4 eventually converges to the correct number of states. The number of visited states, K(r) (top), and the means of the emission distributions, μσk(r) (bottom), change throughout the sampler’s iterations. Unlike the HMM, which uses a finite and fixed state space, the iHMM learns the number of available states and grows/shrinks the state space as required by the data.
Figure 6
We may use samples from the iHMM posterior probability to infer the size of the state space and the location of each state. In particular, we illustrate histograms for P(K|x̄) (top) and P(μσk|x̄) (bottom) using data set 1 of Fig. 4. In both panels, dashed lines indicate the exact (ground-truth) values used to produce the data in Fig. 4. To see this figure in color, go online.
Figure 7
We may use the iHMM to estimate portions of the complete state space such as those contained in different segments of data set 2 provided in Fig. 4. (Upper) Estimated noiseless traces for two cases: 1) using a limited segment of the full trace; and 2) using the full trace. Although only the latter case allows an estimate of all five states, both cases provide similar estimates over those states that they mutually visit. (Lower) Corresponding estimates of the number of states contained in each trace. To see this figure in color, go online.
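The caption of Figure 3 names the ingredients of the iHMM prior: the concentration parameters α and γ and the prior H on the emission parameters. One common way to draw from such a prior in practice is a truncated stick-breaking (weak-limit) approximation to the hierarchical Dirichlet process, sketched below. This is not necessarily the sampler used in the article or its code. The assignment of γ to the shared base weights and α to the transition rows follows the usual convention in the statistics literature and is an assumption here, as are the function name, the truncation level `L`, and the Gaussian choice for H.

```python
import numpy as np

def sample_truncated_ihmm_prior(alpha, gamma, L, seed=0):
    """Draw one set of iHMM parameters from a truncated prior (illustrative sketch).

    alpha : concentration of each transition row around the base weights (assumed role)
    gamma : concentration of the stick-breaking base weights (assumed role)
    L     : truncation level standing in for the infinite state space (assumption)
    """
    rng = np.random.default_rng(seed)

    # Stick-breaking: break off a Beta(1, gamma) fraction of the remaining stick.
    v = rng.beta(1.0, gamma, size=L)
    v[-1] = 1.0  # close the stick so that the truncated weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    base_weights = v * remaining

    # Each transition row is Dirichlet-distributed around the shared base weights.
    # The small floor keeps every Dirichlet parameter strictly positive.
    concentration = np.maximum(alpha * base_weights, 1e-12)
    transitions = rng.dirichlet(concentration, size=L)

    # Emission means drawn from H, taken here to be N(0, 10^2) purely for illustration.
    emission_means = rng.normal(0.0, 10.0, size=L)
    return base_weights, transitions, emission_means

base_weights, transitions, emission_means = sample_truncated_ihmm_prior(
    alpha=5.0, gamma=3.0, L=10
)
```

Because all rows share the same base weights, states that receive little base weight are rarely entered from anywhere, so only a data-dependent handful of the `L` candidate states ends up being visited.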

References

    1. Rabiner L., Juang B. An introduction to hidden Markov models. IEEE ASSP Mag. 1986;3:4–16.
    2. Eddy S.R. What is a hidden Markov model? Nat. Biotechnol. 2004;22:1315–1316.
    3. Yoon B.J. Hidden Markov models and their applications in biological sequence analysis. Curr. Genomics. 2009;10:402–415.
    4. Krogh A., Brown M., Haussler D. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 1994;235:1501–1531.
    5. Streit R.L., Barrett R.F. Frequency line tracking using hidden Markov models. IEEE Trans. Acoust. Speech Signal Process. 1990;38:586–598.
