Bioinformatics. 2024 Jun 28;40(Suppl 1):i277-i286.
doi: 10.1093/bioinformatics/btae254.

Optimal phylogenetic reconstruction of insertion and deletion events

Sanjana Tule et al. Bioinformatics. 2024.

Abstract

Motivation: Insertions and deletions (indels) influence the genetic code in fundamentally distinct ways from substitutions, significantly impacting gene product structure and function. Despite their influence, the evolutionary history of indels is often neglected in phylogenetic tree inference and ancestral sequence reconstruction, hindering efforts to understand the determinants of biological diversity and to engineer variants for medical and industrial applications.

Results: We frame the problem of determining the optimal history of indel events as a single Mixed-Integer Programming (MIP) problem, spanning all branch points in a phylogenetic tree (subject to its topological constraints) and all sites implied by a given set of aligned, extant sequences. By disentangling the impact on ancestral sequences at each branch point, this approach identifies the minimal set of indel events that jointly explains the diversity in sequences mapped to the tips of that tree. MIP can recover alternate optimal indel histories, if available. We evaluated MIP for indel inference on a dataset comprising 15 real phylogenetic trees associated with protein families ranging from 165 to 2000 extant sequences, and on 60 synthetic trees at comparable scales of data and reflecting realistic rates of mutation. Across relevant metrics, MIP outperformed alternative parsimony-based approaches and reported the fewest indel events, on par with or below their occurrence in synthetic datasets. MIP offers a rational justification for indel patterns in extant sequences; importantly, it uniquely identifies global optima on complex protein data sets without making unrealistic assumptions of independence or evolutionary underpinnings, promising a deeper understanding of molecular evolution and aiding novel protein design.
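To make the idea of a jointly optimal indel history concrete, the following is a minimal Python sketch on a toy tree. It is not the authors' MIP formulation: it replaces the solver with an exhaustive search, and it assumes that one indel event is a maximal run of consecutive sites that change in the same direction along a branch. The function names `indel_events` and `best_history`, the toy tree, and the presence/absence patterns are all hypothetical.

```python
from itertools import product

def indel_events(parent, child):
    """Count indel events on one branch (assumed definition: a maximal run
    of consecutive sites changing in the same direction is one event)."""
    events, prev = 0, 0
    for p, c in zip(parent, child):
        change = c - p  # +1 insertion, -1 deletion, 0 unchanged
        if change != 0 and change != prev:
            events += 1
        prev = change
    return events

def best_history(tree, extants, n_sites):
    """Exhaustively score every joint assignment of ancestral patterns.

    Returns the minimum total score and every assignment achieving it.
    Only feasible for toy inputs; a MIP solver would replace this search.
    """
    ancestors = sorted(tree)  # internal nodes are the keys of `tree`
    patterns = list(product((0, 1), repeat=n_sites))
    best, optima = None, []
    for combo in product(patterns, repeat=len(ancestors)):
        assignment = dict(zip(ancestors, combo))
        state = {**extants, **assignment}
        score = sum(
            indel_events(state[parent], state[child])
            for parent, children in tree.items()
            for child in children
        )
        if best is None or score < best:
            best, optima = score, [assignment]
        elif score == best:
            optima.append(assignment)
    return best, optima

# Hypothetical toy input: 1 = residue present, 0 = gap.
tree = {"N0": ["N1", "c"], "N1": ["a", "b"]}
extants = {
    "a": (1, 1, 0, 0, 1),
    "b": (1, 1, 0, 0, 1),
    "c": (1, 1, 1, 1, 1),
}
score, optima = best_history(tree, extants, n_sites=5)
print(score, len(optima))
```

On this toy input the search finds a minimum score of one event with two equally optimal histories (a two-site deletion on the branch to N1, or a two-site insertion on the branch to c), which illustrates why alternate optima can exist.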

Availability and implementation: The implementation is available via GitHub at https://github.com/santule/indelmip.

Conflict of interest statement

No competing interest is declared.

Figures

Figure 1.
(A) An example MSA; (B) POGs from the example MSA, where green and red nodes represent non-gapped and gapped positions in a sequence, respectively; (C) Overall POAG representing the union of edges present in POGs.
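The construction in this caption can be sketched in a few lines of Python. This is an illustrative reading of the caption, not the authors' code: it assumes that each sequence's POG links every non-gapped column to the next non-gapped column, and that the POAG is simply the set union of those edges. The function names and the three-row alignment are hypothetical.

```python
def pog_edges(row, gap="-"):
    """Edges linking each non-gapped column to the next non-gapped column
    in one aligned sequence (assumed POG edge definition)."""
    cols = [i for i, ch in enumerate(row) if ch != gap]
    return set(zip(cols, cols[1:]))

def poag_edges(msa, gap="-"):
    """Union of the per-sequence POG edges across the whole alignment."""
    edges = set()
    for row in msa:
        edges |= pog_edges(row, gap)
    return edges

# Hypothetical alignment with six columns (0-indexed).
msa = ["MV-LGF", "MVALGF", "M--LGF"]
for row in msa:
    print(row, sorted(pog_edges(row)))
print("POAG", sorted(poag_edges(msa)))
```

Edges such as (1, 3) and (0, 3) skip over gapped columns, so the union records every way an observed sequence bypasses a site.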
Figure 2.
Example of indel events and score for an example tree. A is the ancestor of B and C. 1 denotes a non-gap and 0 denotes a gap at a site in a sequence alignment.
Figure 3.
(A) Indel score for 15 real datasets. X denotes that BEP did not provide a solution for the dataset RNaseZ_624; (B) Indel score comparison between (synthetic) ground truth and the MIP indel solution for trees with 2000 extant sequences with varying mean distance δ and κ.
Figure 4.
Comparison of ancestral indel patterns with synthetic ‘ground truth’ indel patterns (from TrAVIS) for trees t2000d0.8s1 and t2000d1s2. Ancestral nodes with no pattern changes under all four methods are not shown. The ancestral nodes are sorted in ascending order by number of pattern changes in MIP, PSP, BEP and SICP. The heatmap colourbar is scaled to a maximum value of 10.
Figure 5.
(A) Example of ‘incohesive’ indel states at the third site for a sequence MVLGF at an ancestral branch point, where the indel state is distinct from both its descendants and its ancestor. (B) Percentage of incohesive ancestors for 15 real datasets. The symbol ‘X’ indicates that BEP did not find a solution for the RNaseZ_624 dataset.
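The ‘incohesive’ condition in panel (A) can be checked mechanically. The sketch below is one possible reading, assuming that "distinct from both its descendants and its ancestor" means the state at a site differs from the direct parent and from every direct child; the function name and the presence/absence patterns are hypothetical.

```python
def incohesive_sites(parent, node, children):
    """Return 0-indexed sites where `node` differs from its parent and
    from every one of its children (assumed reading of 'incohesive')."""
    return [
        i
        for i, state in enumerate(node)
        if state != parent[i] and all(state != child[i] for child in children)
    ]

# Hypothetical patterns for a five-site sequence such as MVLGF:
# 1 = residue present, 0 = gap.
parent = (1, 1, 1, 1, 1)
node = (1, 1, 0, 1, 1)  # gap at the third site only
children = [(1, 1, 1, 1, 1), (1, 1, 1, 1, 1)]
print(incohesive_sites(parent, node, children))
```

An ancestor would count towards the percentage in panel (B) whenever this list is non-empty, under the same assumption.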
Figure 6.
Percentage of non-conforming ancestors for 15 real datasets. X denotes that a solution was not found.
Figure 7.
(A) The indel events at each branch point for optimal solutions one and two (relative to the ancestor) are shown in green and red, respectively. Names starting with N are ancestral branch points. The total number of indel events in the clade under ancestral branch point N236 is the same in both optimal solutions. The extants are represented by their UniProt identifiers. (B) Visualization of molecular sequence indel patterns for select nodes under the ancestral branch point N236 for sites 980 to 1000. Ancestral branch point N237 is not shown as it has the same indel pattern in both optimal solutions. Sites 991 and 992 accommodate both gapped and ungapped scenarios in a parsimonious manner, as is evident from the same sites in the extant sequences A0A233RAG0, A0A4D4INH5 and A0A4V4R7S6, which sit under the ancestral branch points N238 and N239.
