Comparative Study
Genome Res. 2008 Feb;18(2):298-309.
doi: 10.1101/gr.6725608. Epub 2007 Dec 11.

Uncertainty in homology inferences: assessing and improving genomic sequence alignment

Gerton Lunter et al. Genome Res. 2008 Feb.

Abstract

Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human-mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman-Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.

Figures

Figure 1.
Three types of alignment bias. Alignment algorithms are consistently biased toward likely distributions of indels across sequences, despite the occurrence of less-likely configurations at low frequencies. The figure shows four pairs of sequences with their homologies (left) and corresponding most-likely alignments (right), with wrongly aligned bases highlighted. We distinguish between three types of bias: gap wander (A), caused by spurious high-sequence similarity at nonhomologous sites; gap attraction (B,C), occurring when two indels have little separation, and gap annihilation (D), which occurs when two indels of equal size but opposite signature are found near to each other, favoring explanations without indel events.
Figure 2.
Effects of alignment biases in relation to gaps. Alignment biases cause systematic errors in alignments that are non-uniformly distributed with respect to alignment gaps. (A, left) The proportion sequence identity (PID, blue triangles), the true PID (dashed), and the proportion of correctly aligned columns (accuracy, red circles), for realigned sequences evolving under a Jukes–Cantor model, as a function of the distance to the nearest gap in the inferred alignment. The spuriously high PID and low accuracy adjacent to gaps is caused by gap wander. Gap annihilation is responsible for the reduced accuracy, and the slight reduction of PID below the true value away from gaps. (B, right) A histogram of intergap distances (circles), and the best fit to a geometric distribution (red line). The scarcity of closely spaced gaps (less than about 20 nucleotides apart) is due to gap attraction and affects a large number of gaps (note the logarithmic scale).
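The intergap-distance analysis of panel B can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes an alignment is given as two gapped strings with "-" as the gap character, and the function names (intergap_distances, fit_geometric, geometric_pmf) and the maximum-likelihood fit are choices made here for the example.

```python
import re

def intergap_distances(row_a, row_b, gap="-"):
    """Lengths of the ungapped runs lying between two consecutive gaps."""
    # Label each column: 'g' if either row has a gap, 'm' otherwise.
    gapped = "".join("g" if gap in (a, b) else "m" for a, b in zip(row_a, row_b))
    # Keep only runs of 'm' that are flanked by gaps on both sides.
    return [len(run) for run in re.findall(r"(?<=g)m+(?=g)", gapped)]

def fit_geometric(distances):
    """Maximum-likelihood p for P(d) = p * (1 - p)**(d - 1), d = 1, 2, ..."""
    return len(distances) / sum(distances)

def geometric_pmf(d, p):
    """Probability of an intergap distance d under the fitted geometric model."""
    return p * (1.0 - p) ** (d - 1)

# Toy alignment (hypothetical data, far too short for a real fit).
row_a = "AC-GTA--CG-T"
row_b = "ACTGTAGGCGAT"
distances = intergap_distances(row_a, row_b)
p_hat = fit_geometric(distances)
print(distances, p_hat)  # [3, 2] 0.4
```

If gaps were placed independently along the alignment, the observed counts would follow geometric_pmf closely at all distances; a deficit at short distances relative to the fit is the signature the caption attributes to gap attraction.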
Figure 3.
Dependence of alignment accuracy on evolutionary distance. Accuracy decreases with increasing evolutionary distance. Shown are the false-positive fraction (FPF, orange squares); the predicted FPF based on gap wander alone (Fw, green open circles); the sensitivity (blue solid triangles); the proportion of correct alignment columns at distance 15 from the nearest gap (asymptotic accuracy, pink dots); and the average posterior probability at the same distance (asymptotic posterior, brown open triangles). Sequences were simulated for various values of the divergence σ (horizontal axes), and realigned using the same σ value. The substitution/indel rate ratio was fixed at γ = σ/δ = 7.5. Qualitatively the same behavior is seen when realigning using a fixed σ (see Supplemental Fig. S2).
Figure 4.
Suboptimal parameters have minimal impact on alignment accuracy. Shown are: sensitivity to identify homologous nucleotide pairs (blue squares, on left axis), the false-positive fraction (orange triangles, on right axis), and the nonhomologous fraction (green circles, on right axis). Sequences were generated under a Jukes–Cantor model with substitution rate σ = 0.3 and indel rate δ = 0.05, and (A) realigned using a fixed substitution rate σ = 0.3 and a range of indel rates, and (B) using a fixed indel rate δ = 0.05 and variable σ.
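For orientation, the substitution part of such a simulation can be sketched as follows. The caption does not define the units of σ, so this sketch assumes it is the expected number of substitutions per site separating the two sequences; indels are left out entirely, and the names jc_expected_identity and jc_evolve are hypothetical. The only fixed ingredient is the standard Jukes–Cantor result that two sequences separated by d substitutions per site are identical at a site with probability 1/4 + (3/4)·exp(−4d/3).

```python
import math
import random

def jc_expected_identity(d):
    """Expected proportion of identical sites under Jukes-Cantor at distance d."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

def jc_evolve(seq, d, rng):
    """Return a copy of seq diverged by d substitutions per site (no indels)."""
    p_diff = 1.0 - jc_expected_identity(d)
    out = []
    for base in seq:
        if rng.random() < p_diff:
            # Under Jukes-Cantor the three other bases are equally likely.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice("ACGT") for _ in range(20000))
descendant = jc_evolve(ancestor, 0.3, rng)
observed = sum(a == b for a, b in zip(ancestor, descendant)) / len(ancestor)
print(round(jc_expected_identity(0.3), 4), round(observed, 4))
```

Under the stated assumption, σ = 0.3 corresponds to an expected identity of about 0.75 at truly homologous sites, which is the baseline against which spuriously high identity near gaps would be judged.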
Figure 5.
Topology of the pair HMM for probabilistic alignments. (A) The model is implemented as a pair HMM with a match state (center) surrounded by delete (top) and insert (bottom) states. Hash signs (#) signify emissions, dashes (–) represent no emission (rather than the emission of a gap character); circles represent silent states and are included for clarity, and arrows represent allowed transitions. Paths through this HMM correspond to alignments (and dash signs then represent gap characters). Local alignments were computed by surrounding the core HMM by two pairs of “padding” states (P1 to P4) allowing the alignable portion of the sequences to be embedded in nonhomologous sequence. Note that the model allows a single pass through the central pair HMM, and padding sequence is allowed at both ends of the alignment only. (B) The observed indel-length spectrum in BLASTZ human–mouse alignments (right, circles) is better approximated by a mixture of two geometric distributions (red solid line) than by a single geometric distribution (corresponding to affine-gap scores; blue dashed line). This mixture distribution is implemented by duplicating the insert and delete states. Parameters of the model are: δ, the indel probability per aligned site; ε1 and ε2, the parameters governing the indel length distribution; α, the geometric mixture coefficient; and τ, the alignment length parameter. (C) Screenshot of the alignment browser, showing a marginalized posterior decoding (MPD) alignment computed using this model, together with posterior column probabilities. Alignments generally contain columns with low posterior probability, indicating regions where competing alignments contribute a significant fraction of the total likelihood.
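The indel-length mixture of panel B can be written down directly. The sketch below assumes one plausible parametrization, in which ε1 and ε2 are the gap-extension probabilities of the two duplicated state sets and α is the weight of the first; the caption does not give the exact form, and the numerical values are hypothetical, chosen only to show a short-indel and a long-indel component.

```python
def indel_length_pmf(k, alpha, eps1, eps2):
    """P(indel length = k), k >= 1, as a mixture of two geometric distributions."""
    first = (1.0 - eps1) * eps1 ** (k - 1)
    second = (1.0 - eps2) * eps2 ** (k - 1)
    return alpha * first + (1.0 - alpha) * second

def mean_indel_length(alpha, eps1, eps2):
    """Mean of the mixture: the weighted sum of the two geometric means."""
    return alpha / (1.0 - eps1) + (1.0 - alpha) / (1.0 - eps2)

# Hypothetical parameter values, not taken from the paper.
ALPHA, EPS1, EPS2 = 0.8, 0.4, 0.9

total = sum(indel_length_pmf(k, ALPHA, EPS1, EPS2) for k in range(1, 2001))
print(round(total, 6), round(mean_indel_length(ALPHA, EPS1, EPS2), 3))
for k in (1, 5, 20):
    print(k, indel_length_pmf(k, ALPHA, EPS1, EPS2))
```

A single geometric distribution has a log-probability that is linear in k, which is why it corresponds to an affine gap score. The mixture bends that line: the first component dominates at short lengths and the second supplies the heavier tail.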
Figure 6.
Dependence of alignment accuracy on modeling fidelity and inference procedure. Shown are the sensitivity, false-positive fraction, and nonhomologous fraction for three inference algorithms and various alignment models (see Table 1) used to align sequences from the human–mouse evolutionary simulation.
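In a simulation the true homologies are known, so measures of this kind can be computed by comparing sets of aligned position pairs. The sketch below assumes that sensitivity is the fraction of truly homologous pairs recovered by the inferred alignment and that the false-positive fraction is the fraction of inferred pairs that are not truly homologous. These definitions are assumptions, since the caption does not spell them out, and the nonhomologous fraction is omitted because its definition is not given here. Function names are hypothetical.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Set of (i, j) sequence positions placed in the same alignment column."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1  # advance position in the first sequence
        if b != gap:
            j += 1  # advance position in the second sequence
    return pairs

def sensitivity(true_pairs, inferred_pairs):
    """Fraction of true homologous pairs present in the inferred alignment."""
    return len(true_pairs & inferred_pairs) / len(true_pairs)

def false_positive_fraction(true_pairs, inferred_pairs):
    """Fraction of inferred pairs that are not true homologous pairs."""
    return len(inferred_pairs - true_pairs) / len(inferred_pairs)

# Toy example: the inferred alignment puts the gap one column too far right.
true_aln = aligned_pairs("AC-GT", "ACTGT")
inferred_aln = aligned_pairs("ACG-T", "ACTGT")
print(sensitivity(true_aln, inferred_aln))              # 0.75
print(false_positive_fraction(true_aln, inferred_aln))  # 0.25
```

The toy case is a small instance of gap wander: a one-column shift of the gap leaves the alignment length unchanged but misaligns one base.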
Figure 7.
Posterior decoding shows fewer alignment biases. Shown are the intergap distance histograms for alignments obtained by Viterbi decoding (open squares), posterior decoding (open circles) and MPD (filled triangles), applied on the Full model. The scarcity of closely spaced gaps, resulting from gap attraction, is apparent for all decoding algorithms, but is much less pronounced for posterior decoding and MPD than for Viterbi decoding.
Figure 8.
Posterior probability is an excellent indicator of alignment accuracy. Shown are the proportion of correctly aligned nucleotides (squares), the average sequence identity (triangles), and the proportion of nucleotides (histogram bars) across 10 posterior probability quantiles, obtained from realigned simulated human–mouse sequence data. For realignment, we used Viterbi decoding on the basic and full models.
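The posterior probabilities referred to here can be obtained from a pair HMM with the forward and backward algorithms. The sketch below is a deliberately simplified stand-in for the paper's model, not an implementation of it or of MPD: it assumes a global three-state pair HMM (match, gap in y, gap in x) with a single geometric gap-length distribution, no padding states, no termination parameter, and a Jukes–Cantor-style emission table. The parameter names and default values (delta, eps, identity) are hypothetical.

```python
def match_posteriors(x, y, delta=0.05, eps=0.5, identity=0.75):
    """Posterior probability that x[i] and y[j] are aligned, for all i, j."""
    n, m, q = len(x), len(y), 0.25

    def emit(a, b):
        # Joint probability of emitting the pair (a, b) from the match state.
        return q * identity if a == b else q * (1.0 - identity) / 3.0

    t_mm, t_gm = 1.0 - 2.0 * delta, 1.0 - eps  # M->M and gap->M transitions

    def zeros():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] sums all paths emitting x[:i], y[:j], ending in *.
    fM, fX, fY = zeros(), zeros(), zeros()
    fM[0][0] = 1.0  # alignments start from a virtual match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = emit(x[i - 1], y[j - 1]) * (
                    t_mm * fM[i - 1][j - 1]
                    + t_gm * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward pass: b*[i][j] sums all continuations emitting x[i:], y[j:].
    bM, bX, bY = zeros(), zeros(), zeros()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            diag = emit(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            down = q * bX[i + 1][j] if i < n else 0.0
            right = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = t_mm * diag + delta * (down + right)
            bX[i][j] = t_gm * diag + eps * down
            bY[i][j] = t_gm * diag + eps * right

    post = [[fM[i + 1][j + 1] * bM[i + 1][j + 1] / total for j in range(m)]
            for i in range(n)]
    return post, total

post, total = match_posteriors("ACGTTGCA", "ACGTGCA")
for row in post:
    print(" ".join(f"{p:.2f}" for p in row))
```

In the printed matrix, positions next to the single-base length difference share their probability mass among neighboring cells, while positions far from it are close to 1. Low-posterior cells of this kind mark the regions where competing alignments carry a substantial share of the total likelihood.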
Figure 9.
Performance comparison of score-based aligners. Histogram bars show sensitivity (black; top left axis), false-positive fraction (gray, right axis) and nonhomologous fraction (striped, bottom left axis), for simulated sequence based on human–mouse evolutionary parameters. The results for two probabilistic aligners (leftmost two sets) are included for comparison. Histogram bars marked by asterisks are off the scale; nonhomologous fraction for Lagan, 0.212; Mavid, 0.201; ClustalW, 0.223. Note that the axes in Figure 6 have different scales.
