Genome Biol Evol. 2014 Nov 18;6(12):3199-209.
doi: 10.1093/gbe/evu252.

Indel reliability in indel-based phylogenetic inference


Haim Ashkenazy et al. Genome Biol Evol. 2014.

Abstract

It is often assumed that the same insertion or deletion (indel) event is unlikely to occur at the same position in two independent evolutionary lineages, and thus that indel-based inference of phylogeny should be less subject to homoplasy than standard inference based on substitution events. Indeed, indels have been used successfully to resolve debated evolutionary relationships among various taxonomic groups. However, indels are never directly observed but are inferred from the alignment, so indel-based inference may be sensitive to alignment errors. It is hypothesized that phylogenetic reconstruction would be more accurate if it relied only on a subset of reliable indels rather than on the entire indel data. Here, we developed a method to quantify the reliability of indel characters by measuring how often they appear in a set of alternative multiple sequence alignments. Our approach assumes that indels consistently present in most alternative alignments are more reliable than indels that appear in only a small subset of these alignments. Using simulated and empirical data, we studied the impact of filtering and weighting indels by their reliability scores on the accuracy of indel-based phylogenetic reconstruction. The new method is available as a web server at http://guidance.tau.ac.il/RELINDEL/.
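The scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the RELINDEL implementation: the `Indel` key, the function names, and the 0.9 threshold are hypothetical, and the sketch assumes that an indel recovered in two different alignments has already been mapped to the same hashable key (for example, by residue positions rather than alignment columns).

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical indel key: (start position, length, taxa carrying the gap).
# Assumption: the same indel yields the same key in every alignment.
Indel = Tuple[int, int, FrozenSet[str]]


def indel_reliability(
    base_indels: Iterable[Indel],
    alternative_indel_sets: Iterable[Iterable[Indel]],
) -> Dict[Indel, float]:
    """Score each base-alignment indel by the fraction of alternative
    alignments in which the same indel also appears."""
    alternatives = [set(indels) for indels in alternative_indel_sets]
    if not alternatives:
        raise ValueError("at least one alternative alignment is required")
    return {
        indel: sum(indel in alt for alt in alternatives) / len(alternatives)
        for indel in base_indels
    }


def filter_reliable(scores: Dict[Indel, float], threshold: float = 0.9) -> List[Indel]:
    """Keep indels whose score meets the (illustrative) threshold."""
    return [indel for indel, score in scores.items() if score >= threshold]


# Toy data: indel `a` is found in all four alternatives, `b` in only one.
a: Indel = (10, 3, frozenset({"Homo", "Pan"}))
b: Indel = (25, 1, frozenset({"Gorilla"}))
scores = indel_reliability([a, b], [{a, b}, {a}, {a}, {a}])
reliable = filter_reliable(scores, threshold=0.5)
```

The same scores could be passed as character weights to a parsimony program instead of being used as a hard filter, which corresponds to the weighting option mentioned in the abstract.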

Keywords: alignment reliability; indel analysis; multiple sequence alignment; phylogeny.


Figures

Fig. 1.
The agreement regarding indel characters derived from three common MSA algorithms: MAFFT, PRANK, and CLUSTALW (A) using all indels and (B) using the most reliable indel characters identified by RELINDEL.
Fig. 2.
Phylogenetic trees reconstructed using all indel characters coded from MSAs produced by (A) PRANK, (B) MAFFT, and (C) CLUSTALW. When using indels derived from the PRANK MSAs, the obtained tree differed significantly from the accepted primate tree. The red branch shows the misplacement of Gorilla in the PRANK-based inference. Additional statistical information is provided in panel (D) (Informative, number of informative characters; CI, consistency index; RI, retention index).
Fig. 3.
MSAs and corresponding indel character matrices for the first 40 amino acids of the human AGPS gene (ENSG00000018510) as inferred by (A) PRANK, (B) MAFFT, and (C) CLUSTALW. Homoplasious indels, which conflict with the accepted primate tree, are boxed in yellow. The three alignment methods strongly disagree on the placement of these indels. RELINDEL identifies these indels as highly unreliable (see text).
Fig. 4.
Distribution of the indel-reliability scores for (A) PRANK, (B) MAFFT, and (C) CLUSTALW as a function of indel length.
Fig. 5.
Phylogenetic trees reconstructed using the most reliable indel characters coded from MSAs produced by (A) PRANK, (B) MAFFT, and (C) CLUSTALW and filtered by the RELINDEL method. The correct primate phylogeny was reconstructed when using indels derived from either PRANK or MAFFT. Homo is misplaced in the tree reconstructed from the CLUSTALW MSAs (the erroneous branch is marked in red). Additional statistical information is provided in panel (D) (Informative, number of informative characters; CI, consistency index; RI, retention index).
Fig. 6.
ROC curves quantifying the ability of RELINDEL to accurately detect reliable indels based on simulated data. The AUC is given in parentheses next to each alignment algorithm. ROC curves are shown for simulations with (A) a symmetric tree and (B) an asymmetric tree.
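As an illustration of the AUC statistic reported in the ROC analysis, the sketch below computes the area under the ROC curve from reliability scores and true/false labels. It uses the standard rank-based (Mann-Whitney) formulation, not code from the paper. The function name and toy values are hypothetical, and it assumes that in simulations each inferred indel can be labeled as correct or incorrect against the known true alignment.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve: the probability that a randomly chosen
    true indel scores higher than a randomly chosen false one, with ties
    counted as one half."""
    positives = [s for s, is_true in zip(scores, labels) if is_true]
    negatives = [s for s, is_true in zip(scores, labels) if not is_true]
    if not positives or not negatives:
        raise ValueError("both true and false indels are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: every true indel outscores every false one.
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
# Toy data: scores carry no information about correctness.
chance = roc_auc([0.9, 0.2, 0.8, 0.1], [True, False, False, True])
```

An AUC of 1.0 means the scores separate correct from incorrect indels perfectly, while 0.5 corresponds to chance-level separation.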

