PSAR: measuring multiple sequence alignment reliability by probabilistic sampling

Jaebum Kim¹, Jian Ma

PMID: 21576232
PMCID: PMC3159474
DOI: 10.1093/nar/gkr334

Nucleic Acids Res. 2011 Aug;39(15):6359-68. doi: 10.1093/nar/gkr334. Epub 2011 May 16.

Affiliation

¹ Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Abstract

Multiple sequence alignment, which is of fundamental importance for comparative genomics, is a difficult and error-prone problem. It is therefore essential to measure the reliability of alignments and to incorporate that measure into downstream analyses. We propose a new probabilistic sampling-based alignment reliability (PSAR) score. Instead of relying on heuristic assumptions, such as the correlation between alignment quality and guide tree uncertainty in progressive alignment methods, we directly generate suboptimal alignments from an input multiple sequence alignment by a probabilistic sampling method, and compute the agreement of the input alignment with the suboptimal alignments as the alignment reliability score. We construct the suboptimal alignments by an approximate method based on pairwise comparisons between each single sequence and the sub-alignment of the input alignment from which that sequence has been left out. Using simulation-based benchmarks, we find that our approach is superior to existing ones, which supports the view that suboptimal alignments are a highly informative source for assessing alignment reliability. We apply the PSAR method to the alignments in the UCSC Genome Browser to measure the reliability of alignments in different types of regions, such as coding exons and conserved non-coding regions, and use it to guide cross-species conservation studies.
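The abstract does not spell out how the agreement is computed, so the following is only a minimal sketch of one plausible reading: each column of the input alignment is scored by the fraction of its aligned residue pairs that reappear in the sampled alignments. The function names `column_pairs` and `agreement_scores`, and the choice to return `None` for columns with fewer than two residues, are assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations

def column_pairs(alignment):
    """Return, per column, the set of aligned residue pairs.

    A residue is identified as (sequence index, ungapped position), so the
    same residue can be recognised in differently gapped alignments.
    """
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        residues = []
        for seq_idx, ch in enumerate(col):
            if ch != "-":
                residues.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        columns.append(set(combinations(residues, 2)))
    return columns

def agreement_scores(input_aln, sampled_alns):
    """Assumed scoring rule: mean fraction of a column's pairs found in each sample."""
    sampled_pair_sets = [set().union(*column_pairs(s)) for s in sampled_alns]
    scores = []
    for pairs in column_pairs(input_aln):
        if not pairs or not sampled_pair_sets:
            scores.append(None)  # nothing to compare in this column
            continue
        hits = sum(len(pairs & sp) for sp in sampled_pair_sets)
        scores.append(hits / (len(pairs) * len(sampled_pair_sets)))
    return scores
```

With two samples, one identical to the input and one that shifts every residue, each scorable column would receive 0.5 under this rule.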

Figures

**Figure 1.**
Procedure of probabilistic sampling. (A) Given an input MSA, (B) PSAR first chooses one sequence and makes a sub-alignment by leaving the chosen sequence out. Every gap in the left-out sequence and every gap-only column in the sub-alignment are removed. (C) Then, the pre-processed left-out sequence and sub-alignment are compared by dynamic programming (DP) based on the pair-HMM shown in Figure 2. Three DP tables, one for each pair-HMM state, are shown; they are filled from the top-left cell to the bottom-right cell. (D) Finally, PSAR probabilistically samples suboptimal alignments by tracing back from the bottom-right cell multiple times. An example of a sampled alignment is shown in (E).
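Step (B) of the caption can be sketched as follows. This is an illustrative reading only: the alignment is assumed to be a list of equal-length strings with `-` as the gap character, and the name `leave_one_out` is hypothetical.

```python
def leave_one_out(alignment, index):
    """Split an MSA into one ungapped sequence and the remaining sub-alignment.

    Gaps are stripped from the left-out sequence, and columns that contain
    only gaps are dropped from the sub-alignment.
    """
    left_out = alignment[index].replace("-", "")
    rest = [row for i, row in enumerate(alignment) if i != index]
    # Keep only columns with at least one residue.
    kept = [col for col in zip(*rest) if any(ch != "-" for ch in col)]
    if kept:
        sub_alignment = ["".join(chars) for chars in zip(*kept)]
    else:
        sub_alignment = ["" for _ in rest]
    return left_out, sub_alignment
```

For example, leaving out the second row of `["A-CG", "AT-G", "A--G"]` would give the sequence `ATG` and a sub-alignment in which the all-gap second column has been removed.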

**Figure 2.**
Pair-HMM of the PSAR method. This is a special type of pair-HMM: a generative model of an MSA that is constructed from a sequence S and a sub-alignment A. The state M emits one character in the sequence S and one column in the sub-alignment A. The state I_S emits one character in S, whereas the state I_A emits one column in A. The state B is a begin state and E is an end state.
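The DP fill and probabilistic traceback described for this model can be sketched with a textbook three-state pair-HMM forward algorithm. Everything numeric here is an assumption for illustration: the gap-open and gap-extend probabilities `delta` and `eps`, the constant insert emission `q`, the match emission that averages over the residues in a column, the absence of direct transitions between I_S and I_A, and the omission of explicit end-state transitions. None of these values or simplifications comes from the paper.

```python
import random

def match_emission(ch, column, p_match=0.2, p_mismatch=0.0167, q=0.25):
    """Hypothetical emission of character ch against one sub-alignment column."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return q
    return sum(p_match if c == ch else p_mismatch for c in residues) / len(residues)

def forward(seq, columns, delta=0.1, eps=0.3, q=0.25):
    """Fill three forward tables (M, I_S, I_A) from top-left to bottom-right."""
    n, m = len(seq), len(columns)
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fS = [[0.0] * (m + 1) for _ in range(n + 1)]
    fA = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0  # the begin state is treated like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = match_emission(seq[i - 1], columns[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fS[i - 1][j - 1] + fA[i - 1][j - 1]))
            if i > 0:
                fS[i][j] = q * (delta * fM[i - 1][j] + eps * fS[i - 1][j])
            if j > 0:
                fA[i][j] = q * (delta * fM[i][j - 1] + eps * fA[i][j - 1])
    return fM, fS, fA

def sample_path(seq, columns, rng, delta=0.1, eps=0.3):
    """Sample one state path (M, S for I_S, A for I_A) by stochastic traceback."""
    fM, fS, fA = forward(seq, columns, delta, eps)
    i, j = len(seq), len(columns)
    state = rng.choices("MSA", weights=[fM[i][j], fS[i][j], fA[i][j]])[0]
    path = []
    while i > 0 or j > 0:
        path.append(state)
        # Step to the predecessor cell and weight each predecessor state by
        # its forward value times the transition into the current state.
        if state == "M":
            i, j = i - 1, j - 1
            weights = [(1 - 2 * delta) * fM[i][j],
                       (1 - eps) * fS[i][j],
                       (1 - eps) * fA[i][j]]
        elif state == "S":
            i -= 1
            weights = [delta * fM[i][j], eps * fS[i][j], 0.0]
        else:
            j -= 1
            weights = [delta * fM[i][j], 0.0, eps * fA[i][j]]
        if i > 0 or j > 0:
            state = rng.choices("MSA", weights=weights)[0]
    return "".join(reversed(path))
```

Calling `sample_path` repeatedly with the same `random.Random` instance yields different suboptimal paths in proportion to their probability under the model. The sketch works with raw probabilities, so long inputs would need log-space arithmetic or scaling to avoid underflow.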

**Figure 3.**
Performance of PSAR in comparison with GUIDANCE on the insect benchmark (see ‘Methods’ section). Three MSA programs, Pecan (A), MAFFT (B), and ClustalW (C), were used to generate input MSAs. ROC curves are reported, and AUC scores are shown in parentheses in the legend. The GUIDANCE program was run with two MSA programs, MAFFT and ClustalW (‘Guidance.mafft’ and ‘Guidance.clustalw’ in the legend, respectively), to generate perturbed MSAs.

**Figure 4.**
Performance of PSAR in comparison with GUIDANCE on the mammal benchmark (see ‘Methods’ section). Three MSA programs, Pecan (A), MAFFT (B), and ClustalW (C), were used to generate input MSAs. ROC curves are reported, and AUC scores are shown in parentheses in the legend. The GUIDANCE program was run with two MSA programs, MAFFT and ClustalW (‘Guidance.mafft’ and ‘Guidance.clustalw’ in the legend, respectively), to generate perturbed MSAs.
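For readers unfamiliar with the AUC values quoted in Figures 3 and 4, the sketch below computes an AUC from reliability scores and true/false labels using the rank-based (Mann-Whitney) definition. How the benchmark labels a position as correctly aligned is not described here, so the `labels` input is an assumption, as is the function name `auc`.

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a correctly aligned item
    scores higher than an incorrectly aligned one, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form is quadratic in the number of items, which is fine for a small illustration but not for whole-genome benchmarks.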

**Figure 5.**
Fraction of unreliable alignments in human chromosome 22. (A) Fraction of different types of regions, such as coding exons (‘Coding Exons’), conserved non-coding regions [‘Non-coding (conserved)’], non-coding regions with repetitive elements [‘Non-coding (repeats)’], and the rest of the non-coding regions [‘Non-coding (non-repeats)’]. (B) Fraction of unreliably aligned positions for each type of region as a function of the PSAR score cutoff. ‘Overall’ represents the union of all types of regions (the whole of chromosome 22).
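The quantity plotted in panel (B) can be illustrated with a short sketch. It assumes that a position counts as unreliable when its PSAR score falls strictly below the cutoff; the caption does not say whether the comparison is strict, and the function name is hypothetical.

```python
def fraction_unreliable(scores, cutoff):
    """Fraction of positions whose score is below the cutoff (assumed strict)."""
    if not scores:
        raise ValueError("no positions to evaluate")
    return sum(1 for s in scores if s < cutoff) / len(scores)

def fraction_curve(scores, cutoffs):
    """Evaluate the fraction at several cutoffs, e.g. to draw one curve per region type."""
    return [(c, fraction_unreliable(scores, c)) for c in cutoffs]
```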

**Figure 6.**
phastCons conservation scores and their variability in an example unreliable region (human chromosome 22:37,982,804-37,983,325). The first row shows the PSAR scores. The phastCons conservation scores were computed for each suboptimal alignment in this region, and the mean and standard deviation are shown in the second and third rows, respectively. The alignment at the bottom is the Multiz alignment in the red-dotted box. The low PSAR scores and the variable phastCons conservation scores are probably attributable to the equally likely placements of the gaps highlighted by red rectangles in the Multiz alignment.
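The per-position mean and standard deviation in the second and third rows can be sketched as below. The sketch assumes that the conservation scores from every suboptimal alignment have already been mapped onto the same reference positions, and it uses the population standard deviation; the caption does not say which estimator was used.

```python
from statistics import mean, pstdev

def per_position_summary(scores_by_sample):
    """Given one list of per-position scores per sampled alignment,
    return (means, standard deviations) across samples for each position."""
    columns = list(zip(*scores_by_sample))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # assumption: population SD
    return means, sds
```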
