PLoS One. 2020 Oct 14;15(10):e0238835.
doi: 10.1371/journal.pone.0238835. eCollection 2020.

Analyzing the fine structure of distributions

Michael C Thrun et al. PLoS One. 2020.

Abstract

One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a clear picture of the univariate probability density function (PDF) of each feature. Visualization tools for PDFs typically use kernel density estimates and include both the classical histogram and modern tools such as ridgeline plots, bean plots and violin plots. If density estimation parameters remain at their default settings, conventional methods pose several problems when visualizing the PDF of uniform, multimodal and skewed distributions and of distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of explorative distribution analysis. The results of the evaluation are presented using bimodal Gaussian and skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing poses a great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods.

Conflict of interest statement

The authors hereby state that they have no competing interests.

Figures

Fig 1
A sample of 1 000 points drawn from a uniform distribution on the interval [−2, 2], visualized by a ridgeline plot (a, top left) of ggridges on CRAN [41], a violin plot (b, top right), an MD plot (c, bottom left) and a bean plot (d, bottom right). In the ridgeline, violin and bean plots, the density near the borders −2 and 2 is drawn skewed, contrary to the actual number of values around these borders. The bean plot and ridgeline plot indicate multimodality, but Hartigan’s dip statistic [12] disagrees: p(n = 1 000, D = 0.01215) = 0.44.
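The border effect described in this caption can be reproduced with a plain Gaussian kernel density estimate. The following is a minimal sketch, not the code behind any of the plotted tools: the seed, the rule-of-thumb bandwidth and the helper name `gaussian_kde` are assumptions made for illustration. It draws 1 000 uniform values on [−2, 2] and compares the estimated density at the center, at the border and just outside the data range with the true density of 0.25.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function x -> Gaussian kernel density estimate at x."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


rng = random.Random(42)  # arbitrary fixed seed (assumption)
sample = [rng.uniform(-2.0, 2.0) for _ in range(1000)]

# Rule-of-thumb bandwidth, assumed here as a stand-in for a "default setting".
bandwidth = 1.06 * statistics.stdev(sample) * len(sample) ** (-1 / 5)
kde = gaussian_kde(sample, bandwidth)

true_density = 0.25      # 1 / (2 - (-2)) everywhere inside the interval
center = kde(0.0)        # should be close to the true density
edge = kde(2.0)          # roughly half the kernel mass falls outside the data
outside = kde(2.3)       # the estimate is positive where no data can exist

print(f"true={true_density:.3f} center={center:.3f} "
      f"edge={edge:.3f} outside={outside:.3f}")
```

With an unbounded kernel, about half of the kernel mass at a hard border lies beyond the data, so the estimate there drops toward half the true value and leaks outside the interval. This is one plausible reading of why default kernel estimates misrepresent uniform and clipped data.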
Fig 2. Scatterplots of a Monte Carlo simulation in which samples were drawn and testing was performed in a given range of parameters in 100 iterations.
The visualization is restricted to the median and 99th percentile of the p-values for each x value. The test of Hartigan’s dip statistic is highly significant for a mean higher than 2.4 in a sample of size n = 31 000.
Fig 3
Plots of bimodal distributions with a changing mean of the second Gaussian: ridgeline plots (a) of ggridges on CRAN [41], violin plot (b), bean plot (c), and MD plot (d). Bimodality is visible beginning with a mean of 2.4 in the bean plot, ridgeline plot and MD plot, but the MD plot draws a robustly estimated Gaussian (magenta) if statistical testing is not significant, which indicates that the distributions are not unimodal with a mean of two. The bimodality of the distribution is not visible in the violin plot [4] of the implementation [34].
Fig 4. Scatterplots of a Monte Carlo simulation in which samples were drawn and testing was performed in a given range of parameters in 100 iterations.
The visualization is restricted to the median and 99th percentile of the p-values for each x value. The D'Agostino test of skewness [14] was highly significant for skewness outside of the range of [0.95, 1.05] in a sample of n = 15 000. Scatter plots were generated with plotly [32].
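A simulation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the sampler assumes a Fernández-Steel type skew-normal in which the parameter `xi` equals 1 for a symmetric distribution, the p-value uses the standard D'Agostino skewness transformation to a normal score, and the function names, seed, sample size and iteration count are hypothetical and chosen small for speed.

```python
import math
import random
import statistics


def skewtest_pvalue(x):
    """Two-sided p-value of the D'Agostino skewness test (needs n >= 8)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    g1 = m3 / m2 ** 1.5                      # sample skewness
    y = g1 * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))
    beta2 = (3.0 * (n * n + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    w2 = -1.0 + math.sqrt(2.0 * (beta2 - 1.0))
    delta = 1.0 / math.sqrt(0.5 * math.log(w2))
    alpha = math.sqrt(2.0 / (w2 - 1.0))
    z = delta * math.asinh(y / alpha)        # approximately standard normal
    return math.erfc(abs(z) / math.sqrt(2.0))


def sample_skew_normal(xi, n, rng):
    """Draw n values from a two-piece skew-normal; xi = 1 is symmetric."""
    p_right = xi * xi / (1.0 + xi * xi)      # probability of the right half
    values = []
    for _ in range(n):
        z = abs(rng.gauss(0.0, 1.0))
        values.append(xi * z if rng.random() < p_right else -z / xi)
    return values


def simulate(xi, n, iterations, rng):
    """Return (median, 99th percentile) of the p-values over all iterations."""
    pvalues = [skewtest_pvalue(sample_skew_normal(xi, n, rng))
               for _ in range(iterations)]
    p99 = statistics.quantiles(pvalues, n=100, method="inclusive")[98]
    return statistics.median(pvalues), p99


rng = random.Random(0)  # arbitrary fixed seed (assumption)
results = {xi: simulate(xi, n=2000, iterations=30, rng=rng)
           for xi in (1.0, 1.5)}
for xi, (median_p, p99) in results.items():
    print(f"xi={xi}: median p={median_p:.3g}, 99th percentile p={p99:.3g}")
```

Reporting the median together with the 99th percentile shows both the typical test outcome and a near-worst case for each parameter value, which matches the summary described in the caption.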
Fig 5
Plots of skewed normal distribution with different skewness using the R package fGarch [51] on CRAN: Ridgeline plots (a) of ggridges on CRAN [41], violin plot (b), bean plot (c) and MD plot (d). The sample size is n = 15000. The violin plot is less sensitive to the skewness of the distribution. The MD plot allows for an easier detection of skewness by ordering the columns automatically.
Fig 6. MTY feature clipped in the range marked in red with a robustly estimated average of the whole data in magenta (left) and not clipped (right).
The bean plot (a) underestimates the density in the direction of the clipped range [1800, 6000] and draws a density outside of the range of values. Additionally, this leads to the misleading interpretation that the average lies at 4000 instead of 4300. The MD plot (b) visualizes the density independently of the clipping. Note that for a better comparison, we disabled the additional overlaying plots.
Fig 7
Distribution analyses performed on the log of German population’s income in 2003 with ridgeline plots (a) of ggridges on CRAN (37) do not indicate clipping or multimodality.
Fig 8
Distribution analyses performed on the log of German population’s income in 2003 with the violin plot (b), bean plot (a) and MD plot (c). The bean plot and violin plot visualize an additional mode in the range of 4–4.5. The bean plot visualizes a PDF above the maximum value (red line). The multimodality of ITS is not visible with the default binwidth. Only the MD plot visualizes a clearly clipped and skewed multimodal distribution. Note that for a better comparison, we disabled the additional overlaying plots.
Fig 9. MD plots of selected features from 269 companies on the German stock market reporting quarterly financial statements by the Prime Standard.
The features are concave ordered and the same as in Fig 10 and Table A in S2 File. For 8 out of 12 distributions, there is a hard cut at the value zero which overlaps with Table A in S2 File. The features are highly skewed besides net tangible assets, total assets, and total stockholder equity. The latter two are multimodal.
Fig 10
Bean plots of selected features from 269 companies on the German stock market reporting quarterly financial statements by the Prime Standard (top, a) and ridgeline plots (b, bottom) of ggridges on CRAN (37). The features are concave ordered and the same as in Fig 9. There is no hard cut around the value zero (red line), and the features are unimodal or uniform with a large variance and a small skewness. The visualizations disagree with the descriptive statistics in Table A in S2 File. Note that for a better comparison, we disabled the additional overlaying plots in bean plots.
Fig 11
Visualizing the distributions of as few as two features at once is misleading if their ranges vary widely, as shown using the example of the MD plot (a). However, the MD plot enables the user to set simple transformations that allow several distributions to be visualized at once even if the ranges vary (b).
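One simple transformation of this kind is a robust rescaling of each feature to a common range before plotting. The sketch below only illustrates the idea and makes no claim about which transformations the MD plot itself offers: the function name `robust_rescale`, the choice of the 1st and 99th percentiles and the two synthetic features are assumptions.

```python
import random
import statistics


def robust_rescale(values, lower_pct=1, upper_pct=99):
    """Map the lower/upper percentiles of a feature to 0 and 1.

    Values beyond the chosen percentiles fall outside [0, 1] but do not
    stretch the scale, so a few outliers cannot flatten the plot.
    """
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    if high == low:  # constant feature: nothing to scale
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


rng = random.Random(1)  # arbitrary fixed seed (assumption)
# Two hypothetical features whose ranges differ by several orders of magnitude.
small = [rng.gauss(0.0, 1.0) for _ in range(500)]
large = [rng.gauss(50_000.0, 10_000.0) for _ in range(500)]

scaled_small = robust_rescale(small)
scaled_large = robust_rescale(large)

for name, scaled in (("small", scaled_small), ("large", scaled_large)):
    print(f"{name}: median={statistics.median(scaled):.2f}, "
          f"min={min(scaled):.2f}, max={max(scaled):.2f}")
```

After rescaling, both features occupy a comparable range on a shared axis, so neither density collapses into a thin line next to the other.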

References

    1. Michael JR. The stabilized probability plot. Biometrika. 1983;70(1):11–7.
    2. Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika. 1968;55(1):1–17.
    3. Tukey JW. Exploratory data analysis. Mosteller F, editor. United States: Addison-Wesley Publishing Company; 1977. 688 p.
    4. Hintze JL, Nelson RD. Violin plots: a box plot-density trace synergism. The American Statistician. 1998;52(2):181–4.
    5. Kampstra P. Beanplot: A boxplot alternative for visual comparison of distributions. Journal of Statistical Software, Code Snippets. 2008;28(1):1–9. doi: 10.18637/jss.v028.c01
