Entropy Estimation Using a Linguistic Zipf-Mandelbrot-Li Model for Natural Sequences

Andrew D Back et al. Entropy (Basel). 2021 Aug 24;23(9):1100. doi: 10.3390/e23091100.

Abstract

Entropy estimation faces numerous challenges when applied to various real-world problems. Our interest is in divergence and entropy estimation algorithms capable of rapid estimation for natural sequence data such as human and synthetic languages. This typically requires a large amount of data; however, we propose a new approach based on a rank-based analytic Zipf-Mandelbrot-Li probabilistic model. Unlike previous approaches, which do not consider the nature of the probability distribution in relation to language, we introduce a novel analytic Zipfian model which includes linguistic constraints. This provides more accurate distributions for natural sequences such as natural or synthetic emergent languages. Results are given which indicate the performance of the proposed ZML model. We derive an entropy estimation method which incorporates the linguistically constrained Zipf-Mandelbrot-Li model into a new non-equiprobable coincidence counting algorithm, which is shown to be effective for tasks such as entropy rate estimation with limited data.
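The rank-based model underlying this approach follows the classical Zipf-Mandelbrot form, in which the probability of the r-th ranked symbol decays as a power law of its rank, with Li's random-text argument tying the exponent to the alphabet size M. The sketch below illustrates only this baseline form; it is a minimal illustration, not the authors' constrained (cZML) formulation, and the offset parameter q and the exponent choice alpha = log(M+1)/log(M) are assumptions made for the example.

import numpy as np

def li_exponent(M):
    # Exponent implied by Li's random-text argument for an M-symbol alphabet:
    # alpha = log(M + 1) / log(M), which approaches 1 as M grows.
    return np.log(M + 1) / np.log(M)

def zipf_mandelbrot(V, alpha, q=1.0):
    # Ranked probabilities p(r) = C / (r + q)**alpha for r = 1..V,
    # with C chosen so the probabilities sum to one.
    ranks = np.arange(1, V + 1)
    weights = 1.0 / (ranks + q) ** alpha
    return weights / weights.sum()

# Example: a rank distribution over V = 1000 types for an alphabet of M = 26 symbols.
M, V = 26, 1000
p = zipf_mandelbrot(V, alpha=li_exponent(M))
H_true = -np.sum(p * np.log2(p))  # entropy (in bits) of the model distribution
print(f"alpha = {li_exponent(M):.4f}, H = {H_true:.3f} bits")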

Keywords: Zipf–Mandelbrot–Li law; entropy estimation; language models; probabilistic natural sequences.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The inverse model curves (1..40) for the proposed algorithm are shown for the top 40 ranked symbols, using an alphabet size of M=200.
Figure 2
The relative convergence performance λf(M) of the proposed algorithm scaled against a conventional plug-in entropy estimation algorithm, plotted as a function of the symbolic alphabet size M. Note that the improvement is easily several orders of magnitude for alphabet sizes of interest in the range 20–40.
Figure 3
The effect of the proposed linguistic constraints on the modified ZML (cZML) model versus the usual ZML model is shown by contrasting the relative parameters α̃/α against M. It can be readily seen that the effect is most significant for smaller values of M but diminishes quickly as M increases, observed here for values up to M=12.
Figure 4
The effect of the proposed linguistic constraints on the modified cZML model is shown by contrasting the ranked probabilities in each case, with an example given for M=4.
Figure 5
The second form of the linguistically constrained cZML model has the effect of flattening the ranked probabilities. In the example shown here for M=10, it can be observed that the mid-ranked probabilities are increased, while the highest and lowest ranked probabilities are decreased.
Figure 6
The second form of the linguistically constrained cZML model has the effect of flattening the ranked probabilities. In the example shown here for M=26, it can be observed that the mid-ranked probabilities are increased, while the highest and lowest ranked probabilities are decreased.
Figure 7
Performance of the linguistically constrained cZML model on actual English language data as compared to the unconstrained ZML model. In this case, we consider the full set of 676 two-letter bigrams from the Google data set. The nonlinear behavior of the actual data is evidently modeled with a higher degree of accuracy by the constrained model (dashed curve) than by the unconstrained model (dot-dashed curve).
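As a rough illustration of how a rank model of this kind can be fitted to ranked bigram frequencies, the sketch below performs a generic least-squares fit of a Zipf-Mandelbrot curve in log space. This is a hypothetical fitting procedure for illustration only, not the authors' cZML method, and the Google bigram counts are not reproduced here; any array of ranked counts can be substituted for the synthetic data used in the example.

import numpy as np
from scipy.optimize import curve_fit

def log_zipf_mandelbrot(rank, log_c, q, alpha):
    # log p(r) for the Zipf-Mandelbrot form p(r) = C / (r + q)**alpha.
    return log_c - alpha * np.log(rank + q)

def fit_zipf_mandelbrot(counts):
    # Fit (C, q, alpha) to ranked counts by least squares in log space.
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    probs = counts / counts.sum()
    ranks = np.arange(1, len(probs) + 1)
    (log_c, q, alpha), _ = curve_fit(
        log_zipf_mandelbrot, ranks, np.log(probs),
        p0=(np.log(probs[0]), 1.0, 1.0),
        bounds=([-np.inf, 0.0, 0.1], [np.inf, np.inf, 5.0]),
    )
    return np.exp(log_c), q, alpha

# Example with synthetic counts standing in for the 676 ranked bigram frequencies.
rng = np.random.default_rng(1)
toy_counts = rng.zipf(1.3, size=676)
C, q, alpha = fit_zipf_mandelbrot(toy_counts)
print(f"C = {C:.4f}, q = {q:.2f}, alpha = {alpha:.2f}")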
Figure 8
The convergence performance of the proposed entropy estimation algorithm is compared to a regular plug-in estimator for M=30 as a function of the sample size, up to Ns=10^6 samples averaged across Nv=75 trials. It can be noted that the estimate Ĥ0(Ns) from the proposed algorithm rapidly converges to within three decimal places of the true entropy value. In contrast, the conventional plug-in estimator Ĥp(Ns) converges much more slowly towards a biased estimate.
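For reference, the conventional plug-in (maximum-likelihood) baseline used in this comparison simply substitutes empirical symbol frequencies into Shannon's formula, and is known to be biased low at small sample sizes, which is the slow, biased convergence the figure refers to. The sketch below shows only this standard baseline, with an equiprobable M=30 source assumed for illustration; it is not the proposed coincidence counting estimator.

import numpy as np

def plugin_entropy(samples, base=2.0):
    # Conventional plug-in (maximum-likelihood) entropy estimate:
    # empirical frequencies substituted directly into Shannon's formula.
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()
    return float(-np.sum(p_hat * np.log(p_hat)) / np.log(base))

# Example: the bias of the plug-in estimate shrinks only slowly with sample size.
rng = np.random.default_rng(0)
M = 30
p = np.full(M, 1.0 / M)          # equiprobable source, assumed for illustration
H_true = np.log2(M)
for n in (100, 1_000, 100_000):
    x = rng.choice(M, size=n, p=p)
    print(f"Ns = {n:>7}: H_plugin = {plugin_entropy(x):.3f} bits (true = {H_true:.3f})")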

