Mol Biol Evol. 2018 Jul 1;35(7):1783-1797.
doi: 10.1093/molbev/msy055.

Alignment Modulates Ancestral Sequence Reconstruction Accuracy

Ricardo Assunção Vialle et al. Mol Biol Evol.

Abstract

Accurate reconstruction of ancestral states is a critical evolutionary analysis when studying ancient proteins and comparing biochemical properties between parental or extinct species and their extant relatives. It relies on multiple sequence alignment (MSA), which may introduce biases, and it remains unknown how MSA methodological approaches impact ancestral sequence reconstruction (ASR). Here, we investigate how MSA methodology modulates ASR using a simulation study of various evolutionary scenarios. We evaluate the accuracy of ancestral protein sequence reconstruction for simulated data and compare reconstruction outcomes using different alignment methods. Our results reveal biases introduced not only by aligner algorithms and assumptions, but also by tree topology and the rate of insertions and deletions. Under many conditions we find no substantial differences between the MSAs. However, increasing the difficulty for the aligners can significantly impact ASR. The MAFFT consistency aligners and PRANK variants exhibit the best performance, whereas FSA displays limited performance. We also discover a bias towards reconstructed sequences longer than the true ancestors, deriving from a preference for inferring insertions, in almost all MSA methodological approaches. In addition, we find that measures of MSA quality generally correlate highly with reconstruction accuracy. Thus, we show that MSA methodological differences can affect the quality of reconstructions and propose that MSA methods should be selected with care to accurately determine ancestral states with confidence.

Figures

Fig. 1.
Reconstruction accuracies of MSA tools for simulated scenarios under tree heights of 0.8 and 1.0. Plots show the overall accuracy distribution for each parameter combination using tree heights of 0.8 and 1.0. Blue dots indicate the median, and red dots indicate the mean.
Fig. 2.
Reconstruction accuracies of MSA tools for simulated scenarios under tree heights of 1.2 and 2.0. Plots show the overall accuracy distribution for each parameter combination using tree heights of 1.2 and 2.0. Blue dots indicate the median, and red dots indicate the mean. Highlighted plot (red box) indicates the scenario with 64-taxon trees, tree height 1.2, sampling fraction 0.01, and indel rate 0.05, further explored in figures 4–6 and 8.
Fig. 3.
Number of scenarios with statistically significant differences in overall accuracy between each MSA. The reconstruction accuracies obtained by each MSA tool in 72 scenarios with varying parameter configurations were compared pairwise using a Mann–Whitney–Wilcoxon test. Figure shows counts of scenarios with significant differences (FDR adjusted P value < 0.01), where the entry in the i-th row and j-th column shows the number of times method i was better than method j (higher median accuracy).
Fig. 4.
Reconstruction accuracy by distance to root. Reconstruction accuracy at different distances from the root using simulation parameters of 64 taxa, tree height 1.2, sampling fraction 0.01 and indel rate 0.05. (A) Scatter plots of accuracies for each MSA. (B) Combined chart showing the locally weighted scatterplot smoothing (LOESS) of average reconstruction accuracy by distance to root for each MSA tool.
Fig. 5.
Distributions of insertion and deletion error metrics. Scatterplots show insertion and deletion error metrics for different MSA methods, based on the simulation parameters: 64 taxa, tree height 1.2, sampling fraction 0.01, and indel rate 0.05. Insertions are shown on the x-axis, deletions on the y-axis. Density distribution for each axis is also plotted.
Fig. 6.
Reconstructed sequence lengths and alignment lengths. Distributions of sequence and alignment lengths for each alignment method (simulation parameters: 64 taxa, tree height 1.2, sampling fraction 0.01, indel rate 0.05). (A) Distribution of ratios of reconstructed to true sequence lengths measured for all reconstructed nodes. Values higher than one represent reconstructed sequences longer than expected. (B) MSA length distributions for each method measured for each scenario replicate (100: ten trees, and ten alignments for each tree).
Fig. 7.
Relationship between reconstruction accuracy and MSA quality metrics. Average reconstruction accuracy and average MSA quality scores calculated for each simulated scenario (72 scenarios) using each MSA tool. MSA quality metrics described in the text are computed by comparing the MSA with the true simulated alignment. MetAl was used under the devol metric, which corresponds to a dissimilarity score, so values were subtracted from 1 for ease of comparison. (r: Pearson’s correlation; r²: coefficient of determination).
Fig. 8.
MSA quality scores compared with reconstruction accuracy over different MSA tools. Differences of quality measures between MSA tools under simulation parameters of 64 taxa, tree height 1.2, sampling fraction 0.01, and indel rate 0.05. MSA quality scores (pink) represent values for each scenario replicate (ten trees and ten alignments for each tree). In all plots, reconstruction accuracies (blue) are shown for comparison, representing the expected behavior in terms of differences between tools. Values of reconstruction accuracies were measured as averages of all reconstructed node accuracies in each replicate, and are the same in each chart. MSA tools are ordered by reconstruction accuracy means (best to worst). Spearman rho correlations between MSA quality scores and reconstruction accuracies are shown for each metric. MetAl scores are shown as 1 − devol, to produce a similarity measure.
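
The captions for Figs. 7 and 8 relate MSA quality scores to reconstruction accuracy through Pearson's r, r², and Spearman's rho, after converting the MetAl devol dissimilarity to a similarity as 1 − devol. The following is a minimal sketch of those calculations in plain Python, not the authors' code. The function names and the example values are hypothetical and chosen only for illustration; they are not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))


def devol_to_similarity(devol_scores):
    """Turn a dissimilarity score (devol) into a similarity (1 - devol)."""
    return [1.0 - d for d in devol_scores]


# Hypothetical per-tool averages, for illustration only (not study data).
accuracy = [0.95, 0.90, 0.80, 0.70]
devol = [0.05, 0.10, 0.25, 0.30]

similarity = devol_to_similarity(devol)
r = pearson_r(accuracy, similarity)
r_squared = r ** 2
rho = spearman_rho(accuracy, similarity)
```

Flipping devol to 1 − devol changes only the sign of the correlation, so it leaves r² unchanged and simply puts MetAl on the same "higher is better" footing as the other quality scores.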
