Comparative Study
Mol Biol Evol. 2009 Feb;26(2):473-80.
doi: 10.1093/molbev/msn275. Epub 2008 Nov 28.

Problems and solutions for estimating indel rates and length distributions

Reed A Cartwright. Mol Biol Evol. 2009 Feb.

Abstract

Insertions and deletions (indels) are fundamental but understudied components of molecular evolution. Here we present an expectation-maximization algorithm built on a pair hidden Markov model that is able to properly handle indels in neutrally evolving DNA sequences. From a data set of orthologous introns, we estimate relative rates and length distributions of indels among primates and rodents. This technique has the advantage of potentially handling large genomic data sets. We find that a zeta power-law model of indel lengths provides a much better fit than the traditional geometric model and that indel processes are conserved between our taxa. The estimated relative rates are about 12-16 indels per 100 substitutions, and the estimated power-law magnitudes are about 1.6-1.7. More significantly, we find that using the traditional geometric/affine model of indel lengths introduces artifacts into evolutionary analysis, casting doubt on studies of the evolution and diversity of indel formation using traditional models and invalidating measures of species divergence that include indel lengths.
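The contrast between the two indel length models can be illustrated with a small numerical sketch. It is not the paper's pair-HMM or its EM procedure; it only compares the tail weight of a zeta (power-law) distribution with that of a geometric distribution. The exponent z = 1.65 is an illustrative value inside the reported 1.6-1.7 range, and the rule for choosing the geometric parameter q (matching the probability of a length-1 indel) is an assumption made here for the comparison, as are all function names.

```python
import math

def zeta_norm(z, terms=100_000):
    """Approximate sum_{k>=1} k^-z for z > 1 (partial sum plus integral tail)."""
    if z <= 1:
        raise ValueError("the zeta distribution needs z > 1")
    partial = math.fsum(k ** -z for k in range(1, terms + 1))
    # Midpoint-rule estimate of the terms beyond `terms`.
    tail = (terms + 0.5) ** (1 - z) / (z - 1)
    return partial + tail

def zeta_pmf(k, z, norm):
    """P(L = k) = k^-z / zeta(z) for indel length k >= 1."""
    return k ** -z / norm

def geometric_pmf(k, q):
    """P(L = k) = (1 - q) * q^(k - 1) for indel length k >= 1."""
    return (1 - q) * q ** (k - 1)

def tail_prob(pmf, k0):
    """P(L >= k0), computed as 1 minus the probability of shorter lengths."""
    return 1.0 - sum(pmf(k) for k in range(1, k0))

z = 1.65  # illustrative value, not an estimate taken from the paper's tables
norm = zeta_norm(z)
# Assumed matching rule: both models assign the same probability to L = 1.
q = 1.0 - zeta_pmf(1, z, norm)

zeta_tail = tail_prob(lambda k: zeta_pmf(k, z, norm), 10)
geo_tail = tail_prob(lambda k: geometric_pmf(k, q), 10)
print(f"P(L >= 10): zeta = {zeta_tail:.4f}, geometric = {geo_tail:.4f}")
```

Under this matching rule the zeta model leaves far more probability on long indels than the geometric model does, which is the kind of difference a length-distribution comparison would be sensitive to.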


Figures

FIG. 1.—
SMUVE, a generalized pair HMM architecture. Match state (M) emits a pair of nucleotides to the alignment via the K2P model, whereas indel states (U and V) emit nucleotides to only one sequence based on either a zeta or geometric distribution. Transition probabilities are evenly distributed out of each node unless otherwise noted; τ = 1/(a + 1) and η = e^(−2rt).
FIG. 2.—
Comparison of optimal and EM estimation procedures. Each line represents one of four procedures: geo-SMUVE, geometric optimal, zeta-SMUVE, and zeta-optimal. RMSE and biases have been normalized using the true value of the parameters. Note that the indel length distribution parameters of the geometric and zeta models are not directly comparable to one another.
FIG. 3.—
Estimated maximum likelihood SMUVE parameters, with 95% confidence intervals. Each bar is an estimate from a pairwise comparison among human, chimp, mouse, or rat. Distribution parameters are not directly comparable between the two models and are plotted on different axes.
FIG. 4.—
Histograms of the observed number of indels of lengths 1–9 in the mouse–rat alignment space, with 95% confidence intervals. Triangles represent MLEs.

