Proc Math Phys Eng Sci. 2022 Oct 26;478(2266):20220177.
doi: 10.1098/rspa.2022.0177.

Inferring the shape of data: a probabilistic framework for analysing experiments in the natural sciences

Korak Kumar Ray et al. Proc Math Phys Eng Sci. 2022.

Abstract

A critical step in data analysis for many different types of experiments is the identification of features with theoretically defined shapes in N-dimensional datasets; examples of this process include finding peaks in multi-dimensional molecular spectra or emitters in fluorescence microscopy images. Identifying such features involves determining if the overall shape of the data is consistent with an expected shape; however, it is generally unclear how to quantitatively make this determination. In practice, many analysis methods employ subjective, heuristic approaches, which complicates the validation of any ensuing results, especially as the amount and dimensionality of the data increase. Here, we present a probabilistic solution to this problem by using Bayes' rule to calculate the probability that the data have any one of several potential shapes. This probabilistic approach may be used to objectively compare how well different theories describe a dataset, identify changes between datasets and detect features within data using a corollary method called Bayesian Inference-based Template Search; several proof-of-principle examples are provided. Altogether, this mathematical framework serves as an automated 'engine' capable of computationally executing analysis decisions currently made by visual inspection across the sciences.
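The core idea of using Bayes' rule to rank candidate shapes can be sketched in a few lines. The example below is a simplified illustration, not the authors' formulation: it assumes fully specified templates (no free scale or offset), independent Gaussian noise of known standard deviation `sigma`, and uniform priors unless others are supplied. The function names `log_likelihood` and `shape_posteriors`, the three example templates and the synthetic data are all hypothetical.

```python
import math


def log_likelihood(data, template, sigma):
    """Gaussian log-likelihood of the data given one fully specified template.

    Assumption: independent noise with known standard deviation `sigma`.
    """
    n = len(data)
    sse = sum((d - t) ** 2 for d, t in zip(data, template))
    return -0.5 * sse / sigma**2 - n * math.log(sigma * math.sqrt(2 * math.pi))


def shape_posteriors(data, templates, sigma, priors=None):
    """Posterior probability of each candidate shape via Bayes' rule."""
    names = list(templates)
    if priors is None:
        # Assumption: every shape is equally probable a priori.
        priors = {name: 1.0 / len(names) for name in names}
    log_joint = {
        name: math.log(priors[name]) + log_likelihood(data, templates[name], sigma)
        for name in names
    }
    # Normalize with log-sum-exp to avoid underflow for poor fits.
    peak = max(log_joint.values())
    log_evidence = peak + math.log(sum(math.exp(v - peak) for v in log_joint.values()))
    return {name: math.exp(log_joint[name] - log_evidence) for name in names}


# Hypothetical candidate shapes on a common grid.
x = [i / 10 for i in range(-20, 21)]
templates = {
    "gaussian_peak": [math.exp(-0.5 * (xi / 0.5) ** 2) for xi in x],
    "linear_ramp": [(xi + 2) / 4 for xi in x],
    "flat": [0.5 for _ in x],
}

# Synthetic data: the peak shape plus a small deterministic perturbation.
data = [t + 0.05 * math.sin(7 * i) for i, t in enumerate(templates["gaussian_peak"])]

posteriors = shape_posteriors(data, templates, sigma=0.1)
for name, p in posteriors.items():
    print(f"{name}: {p:.4f}")
```

Because the posteriors are normalized over the listed shapes only, they answer the question "which of these candidates best explains the data", which is why including a null or flat candidate is useful.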

Keywords: Bayesian inference; data analysis; feature detection; machine learning.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1.
Applications of shape calculation to the physical and life sciences. A graphical representation of our mathematical framework (black), with examples of data analysis methods made possible by it (purple), specific tasks these methods enable (blue) and applications of these tasks in specific techniques in the physical and life sciences (dark cyan) along with the specific problems (light cyan) that the application of our framework to these techniques may address. The given examples are not meant to be exhaustive. (Online version in colour.)
Figure 2.
Illustration of the Bayesian inference-based template search (BITS) algorithm. An example of a BITS process is shown, where three different biomolecules are searched for in a two-dimensional image of a cellular environment. Different sets of cellular components are coloured differently for illustrative purposes to demonstrate the expected locations of the different biomolecules (green for cell membrane components, purple for translation machinery, blue for enzymes and yellow and orange for transcription and replication machinery). A set of rotational templates (in grey boxes) generated from different models of the biomolecules of interest (coloured as earlier) is scanned through subsections of the image (white arrow), and the probability that each template best matches the local shape of the data in a specific subsection (white box) is calculated and then marginalized into an aggregate probability for each model that is used to identify the local composition of the image. The probability values shown were chosen to illustrate the example case of the null template being identified in the case where the shape of a subsection cannot be explained by any of the model templates. Adapted from illustrations by David S. Goodsell, RCSB Protein Data Bank (DOIs: 10.2210/rcsb_pdb/goodsell-gallery-028, 10.2210/rcsb_pdb/mom_2000_10, 10.2210/rcsb_pdb/mom_2016_6 and 10.2210/rcsb_pdb/mom_2003_3). (Online version in colour.)
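The scan-and-marginalize procedure described in this caption can be sketched in one dimension. This is a minimal illustration under stated assumptions, not the published implementation: each model owns several template variants (standing in for the rotational templates), windows and templates are mean-subtracted to remove an unknown offset, noise is Gaussian with known `sigma`, and the prior is uniform over models and split evenly among a model's variants. The names `bits_scan`, `window_log_likelihood` and the `step`/`null` models are hypothetical.

```python
import math


def centered(values):
    """Subtract the mean so that an unknown constant offset drops out."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def window_log_likelihood(window, template, sigma):
    """Unnormalized Gaussian log-likelihood of a window given one template."""
    sse = sum((a - b) ** 2 for a, b in zip(centered(window), centered(template)))
    return -0.5 * sse / sigma**2


def bits_scan(signal, models, sigma):
    """Slide every template variant along the signal.

    Returns one dict per window position mapping model name to its
    posterior probability, marginalized over that model's variants.
    """
    width = len(next(iter(models.values()))[0])
    results = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]
        log_joint = {}
        for name, variants in models.items():
            # Uniform prior over models, split evenly among the variants.
            log_prior = -math.log(len(models)) - math.log(len(variants))
            log_joint[name] = [
                log_prior + window_log_likelihood(window, v, sigma) for v in variants
            ]
        peak = max(max(vals) for vals in log_joint.values())
        total = sum(math.exp(v - peak) for vals in log_joint.values() for v in vals)
        # Marginalize: sum the posterior mass of all variants of each model.
        results.append({
            name: sum(math.exp(v - peak) for v in vals) / total
            for name, vals in log_joint.items()
        })
    return results


# Hypothetical models: a step (two orientations) and a flat null template.
models = {
    "step": [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]],
    "null": [[0, 0, 0, 0, 0, 0]],
}

# Synthetic signal: a jump from 0 to 1 at index 10, plus a small wiggle.
signal = [(0.0 if i < 10 else 1.0) + 0.02 * math.sin(3 * i) for i in range(20)]

probabilities = bits_scan(signal, models, sigma=0.2)
best_start = max(range(len(probabilities)), key=lambda i: probabilities[i]["step"])
print("window most consistent with a step starts at index", best_start)
```

Windows that contain no jump are assigned to the null model, mirroring the case in the figure where no model template explains the local shape.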
Figure 3.
Examples of analyses based on shape calculations. (a) A fluorescence lifetime dataset (left) may be modelled as a convolution of an exponential decay of unknown lifetime (τ) and a Gaussian instrument response function of unknown width. By comparing shapes of the data with convolved templates, a joint log-probability map of the two unknowns is constructed (middle) and the deconvolved functions corresponding to the parameters with the maximum probability are plotted (right). (b) A signal versus time trajectory (left) with two discontinuous jumps may be compared in shape with a set of templates corresponding to all possible jump times to generate a map of the joint log-probabilities for the times of the jumps (middle). The times corresponding to the maximum probabilities are overlaid (in blue) over the raw signal and the continuous segments identified are idealized (in red) using a Gaussian filter (right). (c) A fluorescence emitter in a microscope image (left) may be modelled as a Gaussian of known width centred at a certain location. By using a subpixel grid to generate such templates with varying centres, a map of the joint probability for the co-ordinates of the emitter is plotted, along with marginalized probabilities for the x- and y-axes (middle). The co-ordinates with the maximum probability are overlaid (in blue) over the raw image (right). (Online version in colour.)
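Panel (c) lends itself to a compact sketch: build Gaussian templates on a subpixel grid of candidate centres, score each against the image, and normalize to get a joint probability map with marginals for each axis. The code below is an illustration under simplifying assumptions that are not taken from the paper: unit emitter amplitude, zero background, Gaussian pixel noise of known `sigma` and a uniform prior over grid points. The function names, image size, grid spacing and random seed are hypothetical.

```python
import math
import random


def gaussian_spot(cx, cy, size, width):
    """Template image (indexed [y][x]) of a Gaussian of known width at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * width**2)) for x in range(size)]
        for y in range(size)
    ]


def centre_posterior(image, centres, width, sigma):
    """Joint posterior over candidate (cx, cy) centres on a subpixel grid."""
    size = len(image)
    log_like = {}
    for cx in centres:
        for cy in centres:
            template = gaussian_spot(cx, cy, size, width)
            sse = sum(
                (image[y][x] - template[y][x]) ** 2
                for y in range(size) for x in range(size)
            )
            log_like[(cx, cy)] = -0.5 * sse / sigma**2
    # Uniform prior over grid points; normalize with a stabilizing shift.
    peak = max(log_like.values())
    total = sum(math.exp(v - peak) for v in log_like.values())
    return {c: math.exp(v - peak) / total for c, v in log_like.items()}


# Synthetic 7x7 image: an emitter at (3.25, 2.75) plus seeded Gaussian noise.
rng = random.Random(0)
size, width, sigma = 7, 1.0, 0.05
truth = gaussian_spot(3.25, 2.75, size, width)
image = [[value + rng.gauss(0.0, sigma) for value in row] for row in truth]

# Candidate centres from 2.0 to 4.0 in quarter-pixel steps.
centres = [2 + 0.25 * k for k in range(9)]
posterior = centre_posterior(image, centres, width, sigma)

# Marginalize the joint map onto each axis.
marginal_x = {cx: sum(posterior[(cx, cy)] for cy in centres) for cx in centres}
marginal_y = {cy: sum(posterior[(cx, cy)] for cx in centres) for cy in centres}

map_centre = max(posterior, key=posterior.get)
print("most probable centre:", map_centre)
```

A finer grid trades computation for localization resolution; the marginals indicate how sharply each co-ordinate is constrained by the data.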
Figure 4.
Computational efficiency of full-dataset versus BITS analyses. (a) An example signal versus time trajectory (left) with one discontinuous jump analysed by comparing the shape of the data with templates the size of the entire dataset (full template) and using BITS with a small template encoding a ‘step up’. The logarithm of the computational time required to perform an analysis, τ, is shown as a function of the logarithm of the length of the signal versus time trajectory, N. The linearized curves were fit with a first-order polynomial to yield the computational scaling of each calculation. These values match the predicted scaling for one template (R = 1) of $O(N^{R+1}) = O(N^2)$ and $O(N^1)$ for full template and BITS, respectively. (b) As in (a), but with signal versus time trajectories (left) with two discontinuous jumps such as in figure 3b. The computational scaling matches the predicted scaling for two templates (R = 2) of $O(N^3)$ and $O(N^1)$. Together these results demonstrate the linear scaling of BITS with respect to the number of features in a dataset.
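The predicted exponents can be reproduced with a simple operation count. The sketch below is a back-of-envelope cost model, not the authors' timing benchmark: it assumes a full-dataset analysis compares one length-N template for every placement of R jumps, and that BITS compares a fixed-width template at every window position. The function names and the chosen window width of 10 are hypothetical.

```python
import math


def full_template_cost(n, r):
    """Point comparisons for full-size templates with r jumps in n points.

    Assumption: one template per way of placing r jumps among the n - 1
    gaps between points, each compared against all n data points.
    """
    return math.comb(n - 1, r) * n


def bits_cost(n, width, n_templates=1):
    """Point comparisons for sliding small templates across n points."""
    return (n - width + 1) * width * n_templates


def scaling_exponent(cost, n_small, n_large):
    """Slope of log(cost) versus log(N) between two dataset lengths."""
    return math.log(cost(n_large) / cost(n_small)) / math.log(n_large / n_small)


one_jump = scaling_exponent(lambda n: full_template_cost(n, 1), 1000, 10000)
two_jumps = scaling_exponent(lambda n: full_template_cost(n, 2), 1000, 10000)
bits = scaling_exponent(lambda n: bits_cost(n, width=10), 1000, 10000)

print(f"full template, R=1: exponent ~ {one_jump:.2f}")
print(f"full template, R=2: exponent ~ {two_jumps:.2f}")
print(f"BITS:               exponent ~ {bits:.2f}")
```

Under this cost model the full-template exponent grows by one with each additional jump, while the sliding-window count stays linear in N regardless of how many features the trajectory contains.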
