Bioinformatics. 2024 Feb 1;40(2):btae043.
doi: 10.1093/bioinformatics/btae043.

Statistical framework to determine indel-length distribution

Elya Wygoda et al. Bioinformatics.

Abstract

Motivation: Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of contiguous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased because alignment methods assume that indel lengths follow a geometric distribution.
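
For illustration only, the gap-counting approach that the abstract criticizes can be sketched as below. The function name gap_run_lengths and the use of "-" as the gap character are assumptions made for this example and are not taken from the paper. As the abstract notes, a gap block found this way is not necessarily a single indel, since overlapping or adjacent events can merge into one block.

```python
import re


def gap_run_lengths(aligned_seq, gap_char="-"):
    """Return the lengths of maximal runs of gap characters in one aligned row.

    This is the naive gap-block count, not an estimate of true indel lengths.
    """
    pattern = re.escape(gap_char) + "+"
    return [len(match.group()) for match in re.finditer(pattern, aligned_seq)]


# One aligned row with three gap blocks.
row = "AC--GT---A-"
print(gap_run_lengths(row))
```

A histogram of these lengths over all rows is what a gap-counting method would fit a distribution to.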

Results: We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them.
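
The general idea of Approximate Bayesian Computation for model selection can be sketched with a toy rejection sampler. This is not the authors' pipeline: it simulates indel lengths directly rather than alignments, it omits the machine-learning bias correction, and the truncation point MAX_LEN, the prior ranges, the three summary statistics, and all function names are assumptions chosen for this example.

```python
import math
import random
from collections import Counter

MAX_LEN = 50  # assumed truncation of indel lengths (illustrative choice)


def _normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def zipf_pmf(a):
    return _normalize([k ** -a for k in range(1, MAX_LEN + 1)])


def geometric_pmf(p):
    return _normalize([p * (1 - p) ** (k - 1) for k in range(1, MAX_LEN + 1)])


def poisson_pmf(lam):
    # Shifted by one so the shortest possible indel has length 1.
    return _normalize(
        [lam ** (k - 1) / math.factorial(k - 1) for k in range(1, MAX_LEN + 1)]
    )


# Each model maps to (pmf builder, prior sampler). Prior ranges are assumptions.
MODELS = {
    "zipf": (zipf_pmf, lambda rng: rng.uniform(1.01, 2.5)),
    "geometric": (geometric_pmf, lambda rng: rng.uniform(0.05, 0.95)),
    "poisson": (poisson_pmf, lambda rng: rng.uniform(0.1, 10.0)),
}


def summarize(lengths):
    """Toy summary statistics: mean, share of length 1, share of length >= 10."""
    n = len(lengths)
    return (
        sum(lengths) / n,
        sum(1 for x in lengths if x == 1) / n,
        sum(1 for x in lengths if x >= 10) / n,
    )


def abc_model_selection(observed, n_sims=1000, keep=100, seed=0):
    """Approximate posterior model probabilities by rejection ABC."""
    rng = random.Random(seed)
    target = summarize(observed)
    support = range(1, MAX_LEN + 1)
    sims = []
    for _ in range(n_sims):
        name = rng.choice(sorted(MODELS))  # uniform prior over models
        build_pmf, draw_param = MODELS[name]
        pmf = build_pmf(draw_param(rng))
        fake = rng.choices(support, weights=pmf, k=len(observed))
        dist = sum(abs(a - b) for a, b in zip(summarize(fake), target))
        sims.append((dist, name))
    closest = sorted(sims)[:keep]  # keep the simulations nearest the data
    counts = Counter(name for _, name in closest)
    return {name: counts[name] / len(closest) for name in MODELS}


# Usage: lengths drawn from a Zipf model, then classified.
data_rng = random.Random(1)
observed = data_rng.choices(range(1, MAX_LEN + 1), weights=zipf_pmf(1.5), k=200)
print(abc_model_selection(observed))
```

The share of each model among the retained simulations serves as its approximate posterior probability, and the model with the largest share would be the one selected for that dataset.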

Availability and implementation: The data underlying this article are available on GitHub, at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.

Conflict of interest statement

None declared.

Figures

Figure 1.
Histogram of classified length distributions: (a) YIDB datasets; (b) EggNOG datasets.
Figure 2.
AM score histogram grouped by classified indel-length distribution, within five different AM score intervals (each of width 0.2). (a) YIDB datasets. (b) EggNOG datasets. In each bin, the left-most, middle, and right distributions are Zipf, geometric, and Poisson, respectively.
Figure 3.
Summary of the posterior predictive analysis for each of the summary statistics obtained for: (a) the YIDB datasets; and (b) the EggNOG datasets. Dots and stars represent datasets that were classified as geometric and as Zipf, respectively. The y-axis represents the percentage of datasets for which the posterior predictive p-value was between 0.025 and 0.975. The list of summary statistics is available in Supplementary Table S1.
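
As a rough sketch of the kind of check summarized in Figure 3, a posterior predictive p-value for one summary statistic can be computed as the fraction of simulated values at least as large as the observed one. The function names and the default bounds (assumed here to be the two-sided 95% interval, 0.025 to 0.975) are choices made for this example and are not taken from the paper's code.

```python
def posterior_predictive_pvalue(observed_stat, simulated_stats):
    """Fraction of simulated statistics that are >= the observed statistic."""
    hits = sum(1 for s in simulated_stats if s >= observed_stat)
    return hits / len(simulated_stats)


def is_adequate(p_value, lower=0.025, upper=0.975):
    """True when the observed statistic is not in either tail of the simulations."""
    return lower <= p_value <= upper


# Usage: an observed statistic compared with four simulated values.
p = posterior_predictive_pvalue(3.0, [1.0, 2.0, 3.0, 4.0])
print(p, is_adequate(p))
```

A p-value near 0 or 1 means the fitted model rarely reproduces the observed statistic, which is the sense in which a model would be judged inadequate for that statistic.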
