Beyond mutations: Accounting for quantitative changes in the analysis of protein evolution

Xiaoyong Wu¹,², Shesh N Rai¹,², Georg F Weber³

Affiliations

¹ Biostatistics and Informatics Shared Resources, University of Cincinnati Cancer Center, College of Medicine, Cincinnati, OH, USA.
² Cancer Data Science Center, University of Cincinnati College of Medicine Department of Biostatistics, Health Informatics and Data Sciences, Cincinnati, OH, USA.
³ University of Cincinnati Cancer Center, College of Pharmacy, Cincinnati, OH, USA.

PMID: 39021584
PMCID: PMC11253266
DOI: 10.1016/j.csbj.2024.06.017

Xiaoyong Wu et al. Comput Struct Biotechnol J. 2024 Jun 21:23:2637-2647. doi: 10.1016/j.csbj.2024.06.017. eCollection 2024 Dec.

Abstract

Molecular phylogenetic research has relied on the analysis of the coding sequences of genes or of the amino acid sequences of the encoded proteins. Counting mismatches, as indicators of mutation, has been central to the pertinent algorithms. Individual amino acids possess quantifiable characteristics that enable the conversion from "words" (strings of letters denoting amino acids or bases) to "waves" (strings of quantitative values representing the physico-chemical properties) or to matrices (coordinates representing the positions in a comprehensive property space). Applying such numerical representations to evolutionary analysis takes into account not only the occurrence of mutations but also their properties as influences that drive speciation: selective pressures favor certain mutations over others, and this predilection is reflected in the characteristics of the incorporated amino acids (it is not borne out solely by the mismatches). Besides being more discriminating inputs for tree-generating algorithms than match/mismatch counts, the number strings can be examined for overall similarity with average mutual information, autocorrelation, and fractal dimension. Bivariate wavelet analysis aids in distinguishing hypermutable from conserved domains of the protein. The matrix depiction is readily subjected to comparisons of distances, and it allows the generation of heat maps or graphs. This analysis preserves the accepted order of taxa where tree construction with standard approaches yields conflicting results (for the protein S100A6). It also aids hypothesis generation about the origin of mitochondrial proteins. These analytical algorithms have been automated in R and are applicable to various processes that are describable in matrix format.

Keywords: Clustering; Heat map; Matrix distance; Phylogenetic tree; Protein sequence; Wavelet analysis.
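
The "words" to "waves" conversion described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' R code. The Kyte-Doolittle hydropathy index is swapped in as a stand-in property scale (the paper works with volume, isoelectric point, and octanol partition coefficient), the three toy sequences are hypothetical, and the function names are made up for illustration. The per-position sum of absolute differences is one assumed reading of a property-based distance.

```python
# Kyte-Doolittle hydropathy index, used only as a stand-in property scale
# (assumption: the paper itself uses volume, pI, and octanol partition coefficient).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def to_wave(sequence, scale=HYDROPATHY):
    """Convert an amino acid string ("word") to a list of property values ("wave")."""
    return [scale[aa] for aa in sequence.upper()]


def mismatch_count(seq_a, seq_b):
    """Classical match/mismatch score for two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def sum_difference(seq_a, seq_b, scale=HYDROPATHY):
    """Sum of absolute per-position property differences (assumed distance)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    wave_a, wave_b = to_wave(seq_a, scale), to_wave(seq_b, scale)
    return sum(abs(x - y) for x, y in zip(wave_a, wave_b))


# Hypothetical aligned toy sequences, each differing from the reference at one position.
reference = "MKVLA"
conservative = "MKILA"  # V -> I: both residues are strongly hydrophobic
radical = "MKDLA"       # V -> D: hydrophobic residue replaced by a charged one

for name, seq in [("conservative", conservative), ("radical", radical)]:
    print(name, mismatch_count(reference, seq), round(sum_difference(reference, seq), 3))
```

Both variants score one mismatch against the reference, yet the property-based distance separates the conservative substitution from the radical one, which is the extra discrimination the abstract refers to.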

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1**
Flow chart for quantitative phylogenetic analysis. Amino acid strings are converted to numerical descriptors. Those enable evolutionary analysis **(A)** as well as structural comparisons **(B)**. In the construction of phylogenetic trees, new input algorithms become available for the calculation of distances. Additionally, numerical comparison for overall relatedness among proteins is feasible, which is inaccessible on the basis of letter strings. The novel structural comparisons comprise heatmaps and bivariate wavelet analyses, both of which result in graphic depictions of mutable versus conserved regions. aa = amino acid, autocorr. = autocorrelation, ami = average mutual information, fractal dim. = fractal dimension, Shannon entr. = Shannon entropy, UPGMA = unweighted pair group method with arithmetic mean.

**Fig. 2**
Overall similarity among S100A6 proteins. Wavelet analysis for individual properties across five species. A) For volume, isoelectric point (pI), octanol partition coefficient (octanol), and their combined average, the physico-chemical properties (property), the inverse of the average mutual information (1/ami), and the inverse of the autocorrelation (1/autocorr) were compared pairwise among all sequences. The color coding displays the lowest values as yellow and the highest values as green. B) Bivariate wavelet analysis was conducted for the comparison between Homo sapiens and the three-toed turtle. For the properties of volume and isoelectric point, displayed are the plots for the average power (av. power), the cross-wavelet power, the coherence, and the phase difference (phase diff.). C) Distances between matrix descriptions of the proteins were compared pairwise among all sequences. The color coding displays the lowest values as yellow and the highest values as green.
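
As a rough illustration of two of the similarity measures named in this caption, the following Python sketch computes a lag autocorrelation for one numeric sequence and a histogram-based mutual information between two sequences. It is written under several assumptions: the equal-width binning, the number of bins, the lag, and the function names are all illustrative choices, and the caption does not state how the authors' R routines estimate these quantities or form the pairwise comparison.

```python
import math
from collections import Counter


def autocorrelation(wave, lag=1):
    """Sample autocorrelation of a numeric sequence at the given lag."""
    n = len(wave)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(wave) - 1")
    mean = sum(wave) / n
    variance = sum((x - mean) ** 2 for x in wave)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    covariance = sum((wave[i] - mean) * (wave[i + lag] - mean) for i in range(n - lag))
    return covariance / variance


def discretize(wave, bins):
    """Assign each value to one of `bins` equal-width bins (assumed binning scheme)."""
    low, high = min(wave), max(wave)
    width = (high - low) / bins or 1.0  # avoid division by zero for constant input
    return [min(int((x - low) / width), bins - 1) for x in wave]


def mutual_information(wave_a, wave_b, bins=4):
    """Mutual information in bits between two equal-length numeric sequences."""
    if len(wave_a) != len(wave_b):
        raise ValueError("sequences must have the same length")
    a, b = discretize(wave_a, bins), discretize(wave_b, bins)
    n = len(a)
    joint, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    # sum over occupied cells of p(i,j) * log2(p(i,j) / (p(i) * p(j)))
    return sum(
        (c / n) * math.log2(c * n / (count_a[i] * count_b[j]))
        for (i, j), c in joint.items()
    )


# Hypothetical property "waves" for two aligned toy sequences.
wave_1 = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
wave_2 = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]

print(autocorrelation(wave_1, lag=1))
print(mutual_information(wave_1, wave_2))
```

Because the caption compares the inverse values (1/ami, 1/autocorr), a larger shared information content translates into a smaller distance-like number.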

**Fig. 3**
Phylogeny of S100A6. A) Conversion of the amino acid sequence to property values to state space coordinates is illustrated on the N-terminal first 20 amino acids of human S100A6. seq = sequence, vol = volume, pI = isoelectric point, oct int = octanol partition coefficient, matrix 1–5 indicates the coordinate values along the 5 state space axes. **B-D)** Conventional algorithms, starting from strings of letters. The internet applications used were phylogeny.fr [http://www.phylogeny.fr/] (B) and two algorithms in Gene Bee [http://www.genebee.msu.su/services/phtree_reduced.html] (C,D). **E-G)** The letter strings were converted to numbers, based on select physico-chemical properties of the individual amino acids. The sum differences were calculated pairwise between species, and the closeness of their values (calculated stepwise after averaging of the two smallest numbers) determined the distances on the trees for octanol partition coefficient (E), volume (F), and isoelectric point (G). **H-I)** For each of the three properties evaluated, average mutual information (H) and autocorrelation (I) were determined pairwise between species. The resultant values were averaged across the three properties. Trees were assembled stepwise from 1/(average mutual information) or 1/autocorrelation, such that the two taxa with the smallest value at each step were combined and their associated numbers were averaged before repeating the process. The positions and distances of the branches in the phylogenetic tree are reflective of the results obtained in the stepwise process. **J-K)** The letter strings were converted to 5-column matrices, based on the amino acid positions in a state space describing their overall properties. The matrix distances were calculated pairwise between species, as Euclidean distance (J) or Frobenius distance (K), and their values provided the input for a stepwise tree generation through combining the closest species, averaging distances, and continuing the process until all distances have been calculated. L) True/false table for the phylogenetic trees B) through K). Under the assumption that the ranking of evolutionary development orders the species under analysis as shown in the rows of the table, the value 1 was assigned for trees that matched this expectation, while a value of 0 was entered for different rankings.
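
The stepwise tree generation described in this caption (combine the two closest taxa, average their distances to the remaining taxa, repeat) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, taxon labels, and distance values are hypothetical, and the unweighted average of the two merged rows is one reading of "averaging distances" (weighting by cluster size would give the UPGMA variant). The `frobenius_distance` helper computes the Frobenius norm of the difference of two equally shaped matrices, which equals the Euclidean distance between the flattened matrices; the caption does not state how panels J and K differ in detail.

```python
import math


def frobenius_distance(matrix_a, matrix_b):
    """Frobenius norm of the element-wise difference of two equally shaped matrices."""
    return math.sqrt(
        sum(
            (x - y) ** 2
            for row_a, row_b in zip(matrix_a, matrix_b)
            for x, y in zip(row_a, row_b)
        )
    )


def stepwise_tree(labels, matrix):
    """Join the two closest clusters repeatedly, averaging their distances.

    labels: list of taxon names; matrix: symmetric list-of-lists of distances.
    Returns (tree as nested tuples, list of (merged subtree, joining distance)).
    """
    nodes = dict(enumerate(labels))  # cluster id -> subtree
    dist = {(i, j): matrix[i][j] for i in nodes for j in nodes if i < j}
    next_id = len(labels)
    merges = []
    while len(nodes) > 1:
        (i, j), d = min(dist.items(), key=lambda item: item[1])
        merged = (nodes.pop(i), nodes.pop(j))
        # Distance from the new cluster to each remaining cluster:
        # simple average of the two merged clusters' distances (assumption).
        updated = {}
        for k in nodes:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            updated[(k, next_id)] = (d_ik + d_jk) / 2
        dist = {pair: v for pair, v in dist.items() if i not in pair and j not in pair}
        dist.update(updated)
        nodes[next_id] = merged
        merges.append((merged, d))
        next_id += 1
    (tree,) = nodes.values()
    return tree, merges


# Hypothetical pairwise distances among four toy taxa.
taxa = ["A", "B", "C", "D"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 8],
    [10, 10, 8, 0],
]
tree, merges = stepwise_tree(taxa, distances)
print(tree)
for subtree, joining_distance in merges:
    print(joining_distance, subtree)
```

Any pairwise distance table can be fed to `stepwise_tree`, whether it comes from property sums, inverse mutual information, or matrix distances such as `frobenius_distance`.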

**Fig. 4**
Phylogeny of mitochondrial proteins. Proteins encoded by the mitochondrial DNA were analyzed by conventional algorithms (Clustal Omega with default settings, left; phylogeny.fr with default settings, second from left; Mega 11 with alignment in NCBI Cobalt, saved in Fasta format, Mega align, UPGMA method, default settings, third from left) as well as after matrix conversion of the letter strings (right, the UPGMA package was used to construct a phylogenetic tree from a distance matrix computed from the sequence alignment of the species). The proteins for Cytochrome b **(A)**, Cytochrome c oxidase I **(B)**, Cytochrome c oxidase III **(C)** and NADPH Dehydrogenase III **(D)** are displayed. Single-cell organisms are highlighted in blue (bacteria) or light blue (cyanobacteria), yellow (archaea), green (amoebae), orange (yeast) or red (endosymbionts). The nodes delineating the advanced organisms have been manually highlighted. To collect the source sequences, the search started with the NCBI landmark model organisms and then sought to add representatives of diverse clades. Limited N- or C-terminal truncations were implemented to reduce the numbers of gaps in the sequence alignments.

**Fig. 5**
Visualization of regional mutability in the matrix depiction of the amino acid sequences. A-C) Heat maps. A) S100A6. For each row (representing amino acids in consecutive positions) the Euclidean distance was calculated to a hypothetical reference matrix that represents the average of the 20 amino acids for each coordinate in the 5-dimensional state space (left). Hierarchical clustering analysis then arranged and put dendrograms on the columns (taxa) and rows (of note, the clustering in this dimension scrambles the amino acid sequence but highlights the extent of differences). B) Osteopontin consensus sequences of nine clades. The Euclidean distances were calculated to a hypothetical reference matrix that represents the average of the 20 amino acids for each coordinate in the 5-dimensional state space. Hierarchical clustering rearranged columns and rows. C) SARS-CoV-2 Spike Glycoprotein. The reference sequence for calculating Euclidean distances was the reported parent sequence from the start of the COVID-19 pandemic. Hierarchical clustering rearranged columns and rows. **D-E)** Connectivity graphs. S100A6. From the source sequences and alignment in Fig. 3, two species were chosen as examples. R routines were applied to generate the graphs [https://davetang.org/muse/2017/03/16/matrix-to-adjacency-list-in-r/]. The melt() function from the reshape2 package created adjacency lists from the input matrices. Then, igraph generated the display items. In each subfigure, the left graph displays the network of matrix-converted amino acids for the species “Tufted_duck_[XP_032060414.1]”; the right graph shows the analogous network for the species “Human_[AAP36486.1]”. D) Network graphs. The igraph package in R was used to display simple graphs. E) Newman-Girvan clustering. The igraph package can detect communities or subgraphs (shown in diverse colors) using the Newman-Girvan algorithm.
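
The heat map input and the adjacency list described in this caption can be approximated with a short Python sketch. It rests on loud assumptions: the coordinate table is a hypothetical 2-dimensional toy with three residues (the paper uses a 5-dimensional state space for all 20 amino acids), the taxa and sequences are made up, and `melt` here is a plain-Python analogue written for illustration, not the reshape2 function. Hierarchical clustering and graph drawing are left out.

```python
import math

# Hypothetical 2-D state space coordinates for three residues (toy values).
TOY_COORDS = {"A": (1.0, 1.0), "L": (4.0, 5.0), "D": (-2.0, -3.0)}


def reference_point(coords):
    """Average of all residues for each coordinate axis."""
    dims = len(next(iter(coords.values())))
    return tuple(sum(point[d] for point in coords.values()) / len(coords) for d in range(dims))


def position_distances(sequence, coords):
    """Euclidean distance of each sequence position to the reference point."""
    ref = reference_point(coords)
    return [math.dist(coords[aa], ref) for aa in sequence]


def melt(matrix, row_labels, col_labels):
    """Flatten a matrix into (row, column, value) triples, i.e. an adjacency list."""
    return [
        (row, col, matrix[i][j])
        for i, row in enumerate(row_labels)
        for j, col in enumerate(col_labels)
    ]


# Hypothetical aligned sequences for two toy taxa.
taxa = {"taxon_1": "ALD", "taxon_2": "AAD"}

# Heat map layout as in the caption: rows are positions, columns are taxa.
columns = [position_distances(seq, TOY_COORDS) for seq in taxa.values()]
heat_map = [list(row) for row in zip(*columns)]
edges = melt(heat_map, row_labels=[1, 2, 3], col_labels=list(taxa))

for row in heat_map:
    print(row)
for edge in edges:
    print(edge)
```

Each triple in `edges` can serve as a weighted edge for a graph library, which is the role the adjacency lists play before the graphs are drawn.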
