J Comput Biol. 2023 Apr;30(4):420-431.
doi: 10.1089/cmb.2022.0395. Epub 2023 Jan 3.

A Novel Information-Theory-Based Genetic Distance That Approximates Phenotypic Differences

David S Campo et al. J Comput Biol. 2023 Apr.

Abstract

Applying genetic distances to measure phenotypic relatedness is challenging because the relationship between genotype and phenotype is complex. Accurate assessment of proximity among sequences with different phenotypic traits depends on how strongly the chosen distance is associated with structural and functional properties. In this study, we present a new distance measure, Mutual Information and Entropy H (MIH), for categorical data such as nucleotide or amino acid sequences. MIH applies an information matrix (IM), which is calculated from the data and captures the heterogeneity of individual positions, as measured by Shannon entropy, and coordinated substitutions among positions, as measured by mutual information. In general, MIH assigns low weights to differences occurring at high-entropy positions or at dependent positions. The MIH distance was compared with other common distances on two experimental and two simulated data sets. MIH showed the best ability to distinguish cross-immunoreactive sequence pairs from non-cross-immunoreactive pairs of variants of the hepatitis C virus hypervariable region 1 (26,883 pairwise comparisons), and Major Histocompatibility Complex (MHC) binding peptides (n = 181) from non-binding peptides (n = 129). Analysis of 74 simulated RNA secondary structures also showed that the ratio between the MIH distance of sequences from the same RNA structure and the MIH distance of sequences from different structures is three orders of magnitude greater than the corresponding ratio for Hamming distances. These findings indicate that a lower MIH between two sequences is associated with a greater probability that the sequences belong to the same phenotype. Examination of rule-based phenotypes generated in silico showed that (1) MIH is strongly associated with phenotypic differences, (2) the IM of sequences under selection is very different from the IM generated under random scenarios, and (3) the IM is robust to sampling.
In conclusion, MIH strongly approximates structural/functional distances and should have important applications to a wide range of biological problems, including evolution, artificial selection of biological functions and structures, and measuring phenotypic similarity.
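
The two ingredients named in the abstract, per-position Shannon entropy and pairwise mutual information, can be illustrated with a minimal Python sketch. The abstract does not state how the IM is assembled or how MIH is computed from it, so the code below does not implement MIH. The function names and the toy alignment are hypothetical, and the matrix it builds is only an illustration that relies on the identity I(X;X) = H(X), which places each position's entropy on the diagonal. A plain Hamming distance is included because the abstract uses it as a comparison.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy, in bits, of the symbols observed in one column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(col_a, col_b):
    """Mutual information in bits: I(A;B) = H(A) + H(B) - H(A,B)."""
    joint = list(zip(col_a, col_b))  # paired symbols, one pair per sequence
    return shannon_entropy(col_a) + shannon_entropy(col_b) - shannon_entropy(joint)


def entropy_mi_matrix(sequences):
    """Illustrative matrix (an assumption, not the paper's IM definition):
    off-diagonal entries hold pairwise mutual information, and diagonal
    entries equal the position's entropy because I(X;X) = H(X)."""
    columns = list(zip(*sequences))  # transpose: one tuple per position
    size = len(columns)
    return [
        [mutual_information(columns[i], columns[j]) for j in range(size)]
        for i in range(size)
    ]


def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


# Hypothetical toy alignment: positions 0 and 1 vary together,
# position 2 is invariant, and position 3 varies independently.
toy = ["ACGA", "ACGT", "GTGA", "GTGT"]
matrix = entropy_mi_matrix(toy)
```

In this toy alignment, positions 0 and 1 always change together, so they share one bit of mutual information, whereas position 3 varies independently of position 0 and shares none. A weighting scheme of the kind the abstract describes would treat a difference at such coupled positions differently from a difference at an independent one, while Hamming distance counts every difference equally.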

Keywords: Shannon entropy; categorical variables; genetic distance; machine learning; mutual information; natural and artificial selection; phenotype; protein.


Conflict of interest statement

AUTHOR’S DISCLOSURE

The authors declare they have no competing financial interests.

Figures

Figure 1.
Distance comparison among four datasets. A) MHC-binding dataset, comparing 5 different distance types. B) Cross-reactivity dataset, comparing 5 different distance types. C) RNA secondary structure dataset, comparing Hamming and MIH and showing a boxplot of all the ratios. D) Rule-based dataset, comparing Hamming and MIH depending on the size of the selected phenotypes (x-axis). The error bars show the standard deviation among 1000 phenotypes.
Figure 2.
Information matrix of rule-based phenotypes. A) Boxplot of the RMSE between the IM of observed samples and the null IM with three scenarios: rule-based, random and connected. C) RMSE (Root Mean Squared Error) between the IM of subsamples and the null IM or the full IM. The x-axis shows the different sampling levels.
Figure 3.
Fitness and rule-based phenotypes. A) Between-phenotype comparison: Scatterplot between fitness RMSE and Hamming distance. B) Between-phenotype comparison: Scatterplot between fitness RMSE and Hamming distance. C) Within-phenotype scatterplot of RMSE (between Hamming and MIH distances) and the fitness standard deviation of the entire space. D) Within-phenotype scatterplot of the Pearson correlation between fitness differentials and three distance types: Hamming, global MIH, and local MIH (including only one-step neighbors). The x-axis shows the different sizes of the selected rule-based phenotypes.
