Entropy Estimation Using a Linguistic Zipf-Mandelbrot-Li Model for Natural Sequences

Andrew D Back et al. Entropy (Basel). 2021 Aug 24;23(9):1100. doi: 10.3390/e23091100.

Abstract

Entropy estimation faces numerous challenges when applied to various real-world problems. Our interest is in divergence and entropy estimation algorithms capable of rapid estimation for natural sequence data such as human and synthetic languages. This typically requires a large amount of data; however, we propose a new approach based on a rank-based analytic Zipf-Mandelbrot-Li probabilistic model. Unlike previous approaches, which do not consider the nature of the probability distribution in relation to language, we introduce a novel analytic Zipfian model which includes linguistic constraints. This provides more accurate distributions for natural sequences such as natural or synthetic emergent languages. Results are given which indicate the performance of the proposed ZML model. We derive an entropy estimation method which incorporates the linguistically constrained Zipf-Mandelbrot-Li model into a new non-equiprobable coincidence counting algorithm, which is shown to be effective for tasks such as entropy rate estimation with limited data.
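The rank-based model underlying this approach follows the classical Zipf-Mandelbrot form, in which the probability of the r-th ranked symbol decays as a power law of its rank, with Li's random-text argument tying the exponent to the alphabet size M. The sketch below illustrates only this baseline form; it is a minimal illustration, not the authors' constrained (cZML) formulation, and the offset parameter q and the exponent choice alpha = log(M+1)/log(M) are assumptions made for the example.

import numpy as np

def li_exponent(M):
    # Exponent implied by Li's random-text argument for an M-symbol alphabet:
    # alpha = log(M + 1) / log(M), which approaches 1 as M grows.
    return np.log(M + 1) / np.log(M)

def zipf_mandelbrot(V, alpha, q=1.0):
    # Ranked probabilities p(r) = C / (r + q)**alpha for r = 1..V,
    # with C chosen so the probabilities sum to one.
    ranks = np.arange(1, V + 1)
    weights = 1.0 / (ranks + q) ** alpha
    return weights / weights.sum()

# Example: a rank distribution over V = 1000 types for an alphabet of M = 26 symbols.
M, V = 26, 1000
p = zipf_mandelbrot(V, alpha=li_exponent(M))
H_true = -np.sum(p * np.log2(p))  # entropy (in bits) of the model distribution
print(f"alpha = {li_exponent(M):.4f}, H = {H_true:.3f} bits")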

Keywords: Zipf–Mandelbrot–Li law; entropy estimation; language models; probabilistic natural sequences.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The inverse model curves (1..40) for the proposed algorithm are shown for the top 40 ranked symbols, using an alphabet size of M=200.
Figure 2
The relative convergence performance λf(M) of the proposed algorithm scaled against a conventional plug-in entropy estimation algorithm, plotted as a function of the symbolic alphabet size M. Note that the improvement is easily several orders of magnitude for alphabet sizes of interest in the range 20–40.
Figure 3
The effect of the proposed linguistic constraints on the modified ZML (cZML) model versus the usual ZML model is shown by contrasting the relative parameters α̃/α against M. It can be readily seen that the effect is most significant for smaller values of M but diminishes quickly as M increases, observed here for values up to M=12.
Figure 4
The effect of the proposed linguistic constraints on the modified cZML model is shown by contrasting the ranked probabilities in each case, with an example given for M=4.
Figure 5
The second form of the linguistically constrained cZML model has the effect of flattening the ranked probabilities. In the example shown here for M=10, it can be observed that the mid-ranked probabilities are increased, while the highest and lowest ranked probabilities are decreased.
Figure 6
The second form of the linguistically constrained cZML model has the effect of flattening the ranked probabilities. In the example shown here for M=26, it can be observed that the mid-ranked probabilities are increased, while the highest and lowest ranked probabilities are decreased.
Figure 7
Performance of the linguistically constrained cZML model on actual English language data as compared to the unconstrained ZML model. In this case, we consider the full set of 676 two-letter bigrams from the Google data set. The nonlinear behavior of the actual data is evidently modeled with a higher degree of accuracy by the constrained model (dashed curve) than by the unconstrained model (dot-dashed curve).
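As a rough illustration of how a rank model of this kind can be fitted to ranked bigram frequencies, the sketch below performs a generic least-squares fit of a Zipf-Mandelbrot curve in log space. This is a hypothetical fitting procedure for illustration only, not the authors' cZML method, and the Google bigram counts are not reproduced here; any array of ranked counts can be substituted for the synthetic data used in the example.

import numpy as np
from scipy.optimize import curve_fit

def log_zipf_mandelbrot(rank, log_c, q, alpha):
    # log p(r) for the Zipf-Mandelbrot form p(r) = C / (r + q)**alpha.
    return log_c - alpha * np.log(rank + q)

def fit_zipf_mandelbrot(counts):
    # Fit (C, q, alpha) to ranked counts by least squares in log space.
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    probs = counts / counts.sum()
    ranks = np.arange(1, len(probs) + 1)
    (log_c, q, alpha), _ = curve_fit(
        log_zipf_mandelbrot, ranks, np.log(probs),
        p0=(np.log(probs[0]), 1.0, 1.0),
        bounds=([-np.inf, 0.0, 0.1], [np.inf, np.inf, 5.0]),
    )
    return np.exp(log_c), q, alpha

# Example with synthetic counts standing in for the 676 ranked bigram frequencies.
rng = np.random.default_rng(1)
toy_counts = rng.zipf(1.3, size=676)
C, q, alpha = fit_zipf_mandelbrot(toy_counts)
print(f"C = {C:.4f}, q = {q:.2f}, alpha = {alpha:.2f}")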
Figure 8
The convergence performance of the proposed entropy estimation algorithm is compared to a regular plug-in estimator for M=30 as a function of the sample size, up to Ns=10^6 samples averaged across Nv=75 trials. It can be noted that the estimate Ĥ0(Ns) from the proposed algorithm rapidly converges to within three decimal places of the true entropy value. In contrast, the conventional plug-in estimator Ĥp(Ns) converges much more slowly towards a biased estimate.
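For reference, the conventional plug-in (maximum-likelihood) baseline used in this comparison simply substitutes empirical symbol frequencies into Shannon's formula, and is known to be biased low at small sample sizes, which is the slow, biased convergence the figure refers to. The sketch below shows only this standard baseline, with an equiprobable M=30 source assumed for illustration; it is not the proposed coincidence counting estimator.

import numpy as np

def plugin_entropy(samples, base=2.0):
    # Conventional plug-in (maximum-likelihood) entropy estimate:
    # empirical frequencies substituted directly into Shannon's formula.
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()
    return float(-np.sum(p_hat * np.log(p_hat)) / np.log(base))

# Example: the bias of the plug-in estimate shrinks only slowly with sample size.
rng = np.random.default_rng(0)
M = 30
p = np.full(M, 1.0 / M)          # equiprobable source, assumed for illustration
H_true = np.log2(M)
for n in (100, 1_000, 100_000):
    x = rng.choice(M, size=n, p=p)
    print(f"Ns = {n:>7}: H_plugin = {plugin_entropy(x):.3f} bits (true = {H_true:.3f})")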

