Comput Struct Biotechnol J. 2024 Dec 11:27:48-56.
doi: 10.1016/j.csbj.2024.11.042. eCollection 2025.

Informational rescaling of PCA maps with application to genetic distance

Nassim Nicholas Taleb et al. Comput Struct Biotechnol J.

Abstract

Principal Component Analysis (PCA) is a powerful multivariate tool allowing the projection of data in low-dimensional representations. Nevertheless, datapoint distances on these low-dimensional projections are challenging to interpret. Here, we propose a computationally simple heuristic to transform a map based on standard PCA (when the variables are asymptotically Gaussian) into an entropy-based map where distances are based on mutual information (MI). Moreover, we show that in certain instances our proposed scaled PCA can improve cluster identification. Rescaling principal component-based distances using MI results in a representation of relative statistical associations when, as in genetics, it is applied on bit measurements between individuals' genomic mutual information. This entropy-rescaled PCA, while preserving order relationships (along a dimension), quantifies relative distances into information units, such as "bits". We illustrate the effect of this rescaling using genomics data derived from world populations and describe how the interpretation of results is impacted.

Keywords: Entropy; Genetic distance; Genetic maps; Information theory; Mutual information.
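
As a worked illustration of why correlation and information scale differently, here is the standard closed form for the mutual information of a bivariate normal pair with correlation ρ, evaluated at two values. The choice of base-2 logarithms (bits) and of the two sample values of ρ is an assumption made for illustration; neither comes from the paper's data.

```latex
% Mutual information of a bivariate normal pair (X, Y) with correlation \rho
I(X;Y) = -\tfrac{1}{2}\,\log_2\!\left(1-\rho^{2}\right) \quad \text{bits}

% Illustrative values (assumed, not from the paper)
\rho = 0.5:\quad I = -\tfrac{1}{2}\log_2(0.75) \approx 0.21 \ \text{bits}
\rho = 0.9:\quad I = -\tfrac{1}{2}\log_2(0.19) \approx 1.20 \ \text{bits}
```

Under these assumptions, raising ρ from 0.5 to 0.9 multiplies the shared information by roughly six, even though the correlation itself is less than doubled.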

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1
The visual intuition for the three possible methods for informational distances. We generate bivariate normal distributions for X and Y and represent the iso-densities on the X and Y axes. Each square is equidistant from the ones to its left and right, above and below it, and on the diagonal, with respect to one of three parameters: 1) correlation, ρ (top left); 2) correlation squared, ρ² (top right); and 3) mutual information, MI (bottom center). The parameters in brackets are {ρ} for the top left, {ρ, ρ²} for the top right, and {MI, ρ, ρ²} for the bottom center. The square of the correlation was selected because it maps to the explained variance in traditional regression analyses. MI seems to match the visual representation of associated randomness.
Fig. 2
A theoretical example showing how entropy-rescaled principal components (PCs) change the relative distances to make them linear to information. This is made possible due to the information-theoretic optimality of the PCs under thin-tailed distributions. The model illustrates how ordinal relationships are conserved on each dimension, under the transformation axis-wise, but the cardinal distances are significantly altered.
Fig. 3
Transformation of PCA maps to accommodate informational distances. Since the maps are built by positioning the correlation (or covariance) with respect to Principal Components PCn and PCm, m > n ≥ 1, on the x and y axes respectively, our correction corresponds to multiplying the values of the axes by −sgn(ρ)·(1/2)·log(1 − ρ²), which is visually equivalent to stretching the map along both the x and y axes.
Fig. 4
Conventional Principal Component Analysis for 5 populations: Buryat, Spanish, Sri Lankan Tamil in the UK (STU), Colombian in Medellín, Colombia (CLM) and Gujarati Indians in Houston, Texas, USA (GIH). While the gap between CLM and GIH appears rather large in conventional PCA, comparable to the distance between CLM and Buryat, rescaling places CLM substantially closer to GIH, shown in (b).
Fig. 5
A different world view: the commonly observed triangular PCA shape of world populations undergoes proximity rearrangements using information-based rescaling. Non-African and non-Asian populations are much closer together in (b).
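
The axis-wise rescaling described in the Fig. 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that each map coordinate can be read as a correlation ρ strictly between -1 and 1, and that the rescaled coordinate is the signed Gaussian mutual information, -sgn(ρ)·(1/2)·log(1 - ρ²), expressed in bits. The function names and the sample points are hypothetical.

```python
import math


def signed_gaussian_mi(rho, base=2.0):
    """Map a correlation-like coordinate to signed mutual information.

    Assumption: rho lies strictly inside (-1, 1) and the underlying
    variables are (asymptotically) Gaussian, so MI = -1/2 * log(1 - rho^2).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("coordinate must lie strictly between -1 and 1")
    if rho == 0.0:
        return 0.0
    # log1p(-rho^2) computes log(1 - rho^2) accurately for small rho.
    mi = -0.5 * math.log1p(-rho * rho) / math.log(base)
    # Keep the sign of rho so that ordering along the axis is preserved.
    return math.copysign(mi, rho)


def rescale_points(points, base=2.0):
    """Apply the rescaling independently to the x and y coordinates."""
    return [
        (signed_gaussian_mi(x, base), signed_gaussian_mi(y, base))
        for x, y in points
    ]


# Hypothetical map coordinates, chosen only for illustration.
points = [(-0.9, 0.1), (0.0, 0.0), (0.5, -0.5), (0.9, 0.3)]
rescaled = rescale_points(points)

for original, new in zip(points, rescaled):
    print(original, "->", tuple(round(v, 3) for v in new))
```

Because the transformation is monotone in ρ and applied to each axis separately, the ordering of points along each axis is unchanged. Only the spacing changes, and points with coordinates near ±1 are pushed outward the most.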