Nucleic Acids Res. 2011 Aug;39(15):6359-68.
doi: 10.1093/nar/gkr334. Epub 2011 May 16.

PSAR: measuring multiple sequence alignment reliability by probabilistic sampling

Jaebum Kim et al. Nucleic Acids Res. 2011 Aug.

Abstract

Multiple sequence alignment, which is of fundamental importance for comparative genomics, is a difficult and error-prone problem. Therefore, it is essential to measure the reliability of alignments and incorporate it into downstream analyses. We propose a new probabilistic sampling-based alignment reliability (PSAR) score. Instead of relying on heuristic assumptions, such as the correlation between alignment quality and guide tree uncertainty in progressive alignment methods, we directly generate suboptimal alignments from an input multiple sequence alignment by a probabilistic sampling method, and compute the agreement of the input alignment with the suboptimal alignments as the alignment reliability score. We construct the suboptimal alignments by an approximate method that is based on pairwise comparisons between each single sequence and the sub-alignment of the input alignment from which the chosen sequence is left out. Using simulation-based benchmarks, we find that our approach is superior to existing ones, which supports the view that suboptimal alignments are a highly informative source for assessing alignment reliability. We apply the PSAR method to the alignments in the UCSC Genome Browser to measure the reliability of alignments in different types of regions, such as coding exons and conserved non-coding regions, and use it to guide cross-species conservation studies.
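The idea of scoring an input alignment by its agreement with sampled suboptimal alignments can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the pair-based definition below, and the function names `aligned_pairs` and `column_agreement`, are assumptions made for illustration only, not the published PSAR definition.

```python
from itertools import combinations

def aligned_pairs(msa, gap="-"):
    """Return, for each column, the set of aligned residue pairs.

    A residue is identified as (sequence index, ungapped position), so the
    same pair can be recognised in alignments with different column layouts.
    """
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        residues = []
        for i, row in enumerate(msa):
            if row[col] != gap:
                residues.append((i, counters[i]))
                counters[i] += 1
        columns.append(set(combinations(residues, 2)))
    return columns

def column_agreement(input_msa, sampled_msas):
    """Per-column fraction of input residue pairs that recur in the samples.

    Assumed scoring rule (hypothetical): each input pair is counted once per
    sampled alignment in which the same two residues share a column.
    """
    sampled_pair_sets = [set().union(*aligned_pairs(s)) for s in sampled_msas]
    scores = []
    for pairs in aligned_pairs(input_msa):
        if not pairs or not sampled_pair_sets:
            scores.append(None)  # fewer than two residues in the column, or no samples
            continue
        hits = sum(len(pairs & sampled) for sampled in sampled_pair_sets)
        scores.append(hits / (len(pairs) * len(sampled_pair_sets)))
    return scores
```

A column whose pairings recur in every sample scores 1.0, and one that never recurs scores 0.0. Columns with fewer than two residues have no pairs to compare and are reported as `None`.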

Figures

Figure 1.
Procedure of probabilistic sampling. (A) Given an input MSA, (B) PSAR first chooses one sequence and makes a sub-alignment by leaving the chosen sequence out. Every gap in the left-out sequence and every gap spanning an entire column in the sub-alignment is removed. (C) Then, the pre-processed left-out sequence and sub-alignment are compared by dynamic programming (DP) based on the pair-HMM shown in Figure 2. Three DP tables, one for each pair-HMM state, are shown; they are filled from the top-left cell to the bottom-right cell. (D) Finally, PSAR probabilistically samples suboptimal alignments by tracing back from the bottom-right cell multiple times. An example of a sampled alignment is shown in (E).
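The pre-processing in step (B) can be sketched as follows. This is a minimal illustration that assumes the MSA is held as a list of equal-length strings with `-` as the gap character; the function name `leave_one_out` is hypothetical and not taken from the PSAR software.

```python
def leave_one_out(msa, k, gap="-"):
    """Split an MSA into the k-th sequence and the remaining sub-alignment.

    Gaps are stripped from the left-out sequence, and columns that contain
    only gaps once that sequence is removed are dropped from the sub-alignment.
    """
    left_out = msa[k].replace(gap, "")
    rest = [row for i, row in enumerate(msa) if i != k]
    # Keep only the columns in which at least one remaining sequence has a residue.
    keep = [c for c in range(len(msa[0])) if any(row[c] != gap for row in rest)]
    sub_alignment = ["".join(row[c] for c in keep) for row in rest]
    return left_out, sub_alignment
```

For example, leaving out the second row of `["A-CG", "AT-G", "A--G"]` gives the sequence `ATG` and the sub-alignment `["ACG", "A-G"]`, because the second column becomes all gaps and is removed.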
Figure 2.
Pair-HMM of the PSAR method. This is a special type of pair-HMM, which is a generative model of an MSA that is constructed from a sequence S, and a sub-alignment A. The state M emits one character in the sequence S and one column in the sub-alignment A. The state IS emits one character in S, whereas the state IA emits one column in A. The state B is a begin state and E is an end state.
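A three-state pair-HMM of this shape, together with forward DP tables and stochastic traceback, might be sketched as below. All numeric parameters (`delta`, `epsilon`, the 0.19/0.02 match emissions, the 0.25 insert emission), the treatment of B as equivalent to M, and the omission of an explicit end transition are assumptions chosen for illustration; the paper's actual parameterisation is not given here. Probabilities are kept in linear space, which would underflow on long inputs.

```python
import random

STATES = ("M", "IS", "IA")
# How far back (in S, in A) each state looks: M consumes both, IS only S, IA only A.
PREV = {"M": (1, 1), "IS": (1, 0), "IA": (0, 1)}

def transitions(delta=0.1, epsilon=0.4):
    """Hypothetical transition probabilities (gap open delta, gap extend epsilon)."""
    return {("M", "M"): 1 - 2 * delta, ("M", "IS"): delta, ("M", "IA"): delta,
            ("IS", "M"): 1 - epsilon, ("IS", "IS"): epsilon,
            ("IA", "M"): 1 - epsilon, ("IA", "IA"): epsilon}

def emit_match(char, column, gap="-"):
    """Assumed emission for state M: average pair probability over the column.

    Relies on the column having at least one residue (no all-gap columns).
    """
    residues = [c for c in column if c != gap]
    return sum(0.19 if c == char else 0.02 for c in residues) / len(residues)

def forward(seq, columns, trans, q=0.25):
    """Fill one forward table per state, from the top-left to the bottom-right cell."""
    n, m = len(seq), len(columns)
    f = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in STATES}
    f["M"][0][0] = 1.0  # begin state B is treated like M (assumption)
    for i in range(n + 1):
        for j in range(m + 1):
            for state in STATES:
                di, dj = PREV[state]
                if (i, j) == (0, 0) or i < di or j < dj:
                    continue
                emit = emit_match(seq[i - 1], columns[j - 1]) if state == "M" else q
                f[state][i][j] = emit * sum(
                    trans.get((p, state), 0.0) * f[p][i - di][j - dj] for p in STATES)
    return f

def sample_path(f, trans, rng):
    """Sample one state path by tracing back from the bottom-right cell."""
    i, j = len(f["M"]) - 1, len(f["M"][0]) - 1
    state = rng.choices(STATES, weights=[f[s][i][j] for s in STATES])[0]
    path = []
    while (i, j) != (0, 0):
        path.append(state)
        di, dj = PREV[state]
        i, j = i - di, j - dj
        # Choose the predecessor in proportion to its contribution to the current cell.
        weights = [trans.get((p, state), 0.0) * f[p][i][j] for p in STATES]
        state = rng.choices(STATES, weights=weights)[0]
    return path[::-1]
```

Each sampled path is a sequence of M, IS and IA states. Reading it left to right places a character of S against a column of A (M), against a new all-gap column (IS), or places a gap in S against a column of A (IA). Calling `sample_path` repeatedly on the same tables yields different suboptimal alignments.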
Figure 3.
Performance of PSAR in comparison with GUIDANCE on the insect benchmark (see ‘Methods’ section). Three MSA programs, Pecan (A), MAFFT (B), and ClustalW (C), were used to generate input MSAs. ROC curves are reported and AUC scores are shown in parentheses in the legend. The GUIDANCE program was run with two MSA programs, MAFFT and ClustalW (‘Guidance.mafft’ and ‘Guidance.clustalw’ in the legend, respectively), to generate perturbed MSAs.
Figure 4.
Performance of PSAR in comparison with GUIDANCE on the mammal benchmark (see ‘Methods’ section). Three MSA programs, Pecan (A), MAFFT (B), and ClustalW (C), were used to generate input MSAs. ROC curves are reported and AUC scores are shown in parentheses in the legend. The GUIDANCE program was run with two MSA programs, MAFFT and ClustalW (‘Guidance.mafft’ and ‘Guidance.clustalw’ in the legend, respectively), to generate perturbed MSAs.
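For readers unfamiliar with the AUC values reported in these benchmarks, the quantity can be computed from reliability scores and known correct/incorrect labels as sketched below. The function name `roc_auc` and the input layout are assumptions for illustration; this is the generic rank-based definition of AUC, not the evaluation code used in the paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen correctly aligned item
    receives a higher score than a randomly chosen incorrectly aligned one,
    with ties counted as one half. Quadratic in the input size, so suitable
    only for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means the score separates correct from incorrect positions perfectly, while 0.5 is what a random score would achieve.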
Figure 5.
Fraction of unreliable alignments in human chromosome 22. (A) Fraction of different types of regions, such as coding exons (‘Coding Exons’), conserved non-coding regions [‘Non-coding (conserved)’], non-coding regions with repetitive elements [‘Non-coding (repeats)’], and the rest of the non-coding regions [‘Non-coding (non-repeats)’]. (B) Fraction of unreliably aligned positions for each type of region as a function of the PSAR score cutoff. ‘Overall’ represents the union of all types of regions (the whole of chromosome 22).
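The curve in panel (B) amounts to counting, for each cutoff, the share of positions whose score falls below it. A minimal sketch follows, assuming per-position scores are available as a list per region type and that "unreliable" means strictly below the cutoff; both the data layout and the strict inequality are assumptions, and the function names are hypothetical.

```python
def unreliable_fraction(scores, cutoffs):
    """Fraction of positions scoring strictly below each cutoff."""
    return {c: sum(s < c for s in scores) / len(scores) for c in cutoffs}

def unreliable_by_region(scores_by_region, cutoffs):
    """Apply unreliable_fraction to each region type and to their union."""
    result = {name: unreliable_fraction(scores, cutoffs)
              for name, scores in scores_by_region.items()}
    pooled = [s for scores in scores_by_region.values() for s in scores]
    result["Overall"] = unreliable_fraction(pooled, cutoffs)
    return result
```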
Figure 6.
phastCons conservation scores and their variability in an example unreliable region (human chromosome 22:37,982,804-37,983,325). The first row shows the PSAR scores. The phastCons conservation scores were computed for each suboptimal alignment in this region, and the mean and standard deviation are shown in the second and third rows, respectively. The alignment at the bottom is the Multiz alignment in the red-dotted box. The low PSAR scores and the variable phastCons conservation scores are probably attributable to the equally likely placements of the gaps highlighted by red rectangles in the Multiz alignment.
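The mean and standard deviation tracks can be sketched as a per-position summary over the score tracks obtained from each suboptimal alignment. The sketch assumes every track has already been projected onto the same reference coordinates and uses the population standard deviation; both choices are assumptions, and `summarize_tracks` is a hypothetical name. Running phastCons itself is outside the scope of this example.

```python
from statistics import fmean, pstdev

def summarize_tracks(tracks):
    """Per-position mean and population standard deviation across tracks.

    `tracks` is a list of equal-length score lists, one per sampled alignment.
    """
    per_position = list(zip(*tracks))  # regroup the scores by position
    means = [fmean(values) for values in per_position]
    sds = [pstdev(values) for values in per_position]
    return means, sds
```

Positions where the standard deviation is high are those whose conservation score depends strongly on which suboptimal alignment is used.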
