Nucleic Acids Res. 2013 Sep;41(17):e162.
doi: 10.1093/nar/gkt628. Epub 2013 Jul 22.

Graph-based modeling of tandem repeats improves global multiple sequence alignment

Adam M Szalkowski et al. Nucleic Acids Res. 2013 Sep.

Abstract

Tandem repeats (TRs) are often present in proteins with crucial functions, responsible for resistance, pathogenicity and associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, requiring accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination, but current methods assume fixed unit boundaries, and yet are of high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels by unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This ensures enhanced accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events inserting or deleting TR units. We evaluate our method by simulation incorporating TR evolution, by either sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in agriculturally important bacteria Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family.

Figures

Figure 1.
Examples of TR unit duplication: (A) in phase and (B) not in phase. TR unit deletions or insertions (i.e. duplications) can occur at any position of a TR unit. The structure of the protein is retained independently of the ‘phase’ at which a duplication occurs.
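The difference between an in-phase and an out-of-phase duplication can be illustrated on plain strings. The sketch below is an illustration only, not the authors' code; the helper name `tandem_duplicate` and the toy sequences are assumptions made for this example.

```python
def tandem_duplicate(seq: str, start: int, length: int) -> str:
    """Copy seq[start:start+length] and insert the copy directly after it.

    `start` may be any position, so the copied window does not have to
    coincide with a TR unit boundary (hypothetical helper for illustration).
    """
    end = start + length
    return seq[:end] + seq[start:end] + seq[end:]


# Toy TR region with three divergent units of length 3: ABC | DEF | GHI.
region = "ABCDEFGHI"

in_phase = tandem_duplicate(region, 3, 3)      # window is exactly unit DEF
out_of_phase = tandem_duplicate(region, 4, 3)  # window EFG spans two units

print(in_phase)      # ABCDEFDEFGHI
print(out_of_phase)  # ABCDEFGEFGHI

# With perfectly conserved units the phase leaves no trace in the result.
perfect = "ABCABCABC"
print(tandem_duplicate(perfect, 3, 3) == tandem_duplicate(perfect, 4, 3))  # True
```

Both duplications add exactly one unit length to the region, which is consistent with the caption's point that the repeat structure is retained regardless of phase.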
Figure 2.
Not all alignable character pairs can be detected by global alignment. In the shown example, two homologous sequences seqA and seqB are separated by one TR unit duplication and two subsequent deletions. Each character in the TR region of seqA retains a corresponding homologous character in seqB, but the global alignment is unable to detect all such relationships owing to the requirement of retaining the character order of each sequence (character A in seqA is not aligned to the homologous A in seqB). Although such alignment does not fully reflect the homology in terms of aligned pairs, it nevertheless correctly reflects the three indels.
Figure 3.
Possible TR unit indels are inferred from a TR-MSA. This TR-MSA is obtained transparently by running a TR detection program. For each character in the TR region, edges ending at the character after a TR-homologous character are added to the sequence graph.
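One possible reading of this edge rule is sketched below. It is not the ProGraphMSA implementation: the function name `tr_sequence_graph`, the representation of the TR-MSA as rows of sequence positions, and the restriction to forward edges (which keeps the graph acyclic and models unit deletions only) are all assumptions made for illustration.

```python
def tr_sequence_graph(seq_len, tr_msa):
    """Return the edge set of a sequence graph with extra TR edges.

    seq_len : number of characters in the sequence (nodes 0..seq_len-1).
    tr_msa  : list of TR units; each unit is a list with one entry per
              TR-MSA column, holding the sequence position of the character
              in that column or None for a gap (assumed input format).
    """
    # Backbone: every character is connected to its successor.
    edges = {(i, i + 1) for i in range(seq_len - 1)}

    n_cols = len(tr_msa[0])
    for col in range(n_cols):
        # Characters sharing a TR-MSA column are treated as TR-homologous.
        homologs = sorted(u[col] for u in tr_msa if u[col] is not None)
        for src in homologs:
            for hom in homologs:
                target = hom + 1  # the character after the homologous one
                # Assumption: only forward edges are added.
                if hom > src and target < seq_len:
                    edges.add((src, target))
    return edges


# Toy sequence MK ABC ABC ABC Q: three units at positions 2-4, 5-7 and 8-10.
edges = tr_sequence_graph(12, [[2, 3, 4], [5, 6, 7], [8, 9, 10]])
print((4, 8) in edges)  # True: skips positions 5-7, an in-phase unit deletion
print((2, 6) in edges)  # True: skips positions 3-5, an out-of-phase deletion
```

Each added edge bypasses a whole number of unit lengths starting from any position in the TR region, which is how a path through the graph can represent a unit indel that does not respect unit boundaries.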
Figure 4.
Simulated sequences consist of flanking regions of variable length (∼100 aa) and 6–20 TR units. These TR units are either sampled directly from a profile HMM (profile version) or a single unit is sampled and mutated at distance 0.2 expected substitutions per site (duplication version).
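The duplication version of this layout can be sketched as follows. This is a simplified stand-in, not the simulation used in the paper: it assumes uniform amino acid frequencies with an equal-rates (Jukes-Cantor-like) substitution model, a hypothetical unit length of 24, flank lengths drawn uniformly from 80 to 120, and units that are independent mutated copies of one sampled unit.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def p_diff(distance, k=len(AMINO_ACIDS)):
    """Probability that a site differs after `distance` expected
    substitutions per site under an equal-rates model with k states."""
    return (k - 1) / k * (1.0 - math.exp(-k * distance / (k - 1)))


def random_peptide(length, rng):
    """Sample a peptide with uniform amino acid frequencies (assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate(unit, distance, rng):
    """Return a copy of `unit` in which each site is replaced by a different
    residue with probability p_diff(distance)."""
    p = p_diff(distance)
    out = []
    for aa in unit:
        if rng.random() < p:
            out.append(rng.choice([x for x in AMINO_ACIDS if x != aa]))
        else:
            out.append(aa)
    return "".join(out)


def simulate_duplication_version(rng, unit_len=24, distance=0.2):
    """Return (left flank, list of TR units, right flank)."""
    n_units = rng.randint(6, 20)              # 6-20 TR units, as in the caption
    ancestor = random_peptide(unit_len, rng)  # the single sampled unit
    units = [mutate(ancestor, distance, rng) for _ in range(n_units)]
    left = random_peptide(rng.randint(80, 120), rng)   # flank of ~100 aa
    right = random_peptide(rng.randint(80, 120), rng)
    return left, units, right


rng = random.Random(0)
left, units, right = simulate_duplication_version(rng)
sequence = left + "".join(units) + right
```

Under these assumptions, a distance of 0.2 expected substitutions per site means that roughly 18% of the sites in each unit differ from the sampled unit.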
Figure 5.
Only one aligned pair is allowed per ancestral character. Consider the central characters C and A, which are independently duplicated in both leaf sequences. By allowing only one aligned pair per ancestral character, the number of indel events is reconstructed correctly.
Figure 6.
Figure 6.
Results for the profile-based simulation of MSAs with GALA-LRR-like repeats. The vertical facets represent the sequence divergence (i.e. total tree length measured in expected substitutions per site) used in the simulation. ProGraphMSA was executed without any additional information on TRs, whereas ProGraphMSA+TR used TR information detected by TRUST, and ProGraphMSA+realTR was executed with the true TR-MSA provided (as known from simulation). Results for the popular programs MAFFT and MUSCLE are depicted for a qualitative comparison.
Figure 7.
Figure 7.
Results for the duplication-based simulation of MSAs with GALA-LRR-like repeats. The vertical facets represent the sequence divergence (i.e. total tree length measured in expected substitutions per site) used in the simulation. ProGraphMSA was executed without any additional information on TRs, whereas ProGraphMSA+TR used TR information detected by TRUST, and ProGraphMSA+realTR was executed with the true TR-MSA provided (as known from simulation). Results for the popular programs MAFFT and MUSCLE are depicted for a qualitative comparison.
Figure 8.
Real versus estimated number of TR unit indel events for MSAs with GALA-LRRs, tree length 0.5 and high TR unit indel rate of 1.0. As expected, the number of TR unit indel events was usually underestimated because of nested indels on single branches and multiple indels being erroneously merged.
Figure 9.
The evolution of LRR tandem units in GALA proteins from Ralstonia solanacearum. Yellow circles represent the numbers of LRR indels inferred by ProGraphMSA+TR and are mapped to the corresponding nodes of the GALA phylogeny inferred by Remigi et al. (2011). Colored taxonomic ranges represent different paralogous GALA families. Numbers of LRR units in each strain are represented by gray columns.
