BMC Bioinformatics. 2007 Oct 11;8:382.
doi: 10.1186/1471-2105-8-382.

XSTREAM: a practical algorithm for identification and architecture modeling of tandem repeats in protein sequences


Aaron M Newman et al. BMC Bioinformatics.

Abstract

Background: Biological sequence repeats arranged in tandem patterns are widespread in DNA and proteins. While many software tools have been designed to detect DNA tandem repeats (TRs), useful algorithms for identifying protein TRs with varied levels of degeneracy are still needed.

Results: To address limitations of current repeat identification methods, and to provide an efficient and flexible algorithm for the detection and analysis of TRs in protein sequences, we designed and implemented a new computational method called XSTREAM. Running time tests confirm the practicality of XSTREAM for analyses of multi-genome datasets. Each of the key capabilities of XSTREAM (e.g., merging, nesting, long-period detection, and TR architecture modeling) is demonstrated using anecdotal examples, and the utility of XSTREAM for identifying TR proteins was validated using data from a recently published paper.

Conclusion: We show that XSTREAM is a practical and valuable tool for TR detection in protein and nucleotide sequences at the multi-genome scale, and an effective tool for modeling TR domains with diverse architectures and varied levels of degeneracy. Because of these useful features, XSTREAM has significant potential for the discovery of naturally evolved modular proteins with applications for engineering novel biostructural and biomimetic materials, and for identifying new vaccine and diagnostic targets.

Figures

Figure 1
XSTREAM Program Flow Chart. Activity Diagram of XSTREAM modeled using Enterprise Architect version 4.10.739 (Sparx Systems).
Figure 2
Multiple Alignment of TR domain from C. elegans. Standard TR properties are shown above the multiple alignment of a proline/glycine-rich TR domain in the C. elegans hypothetical protein sequence CE22309 from wormpep173. 'Positions' denotes the corresponding input sequence index range of this TR domain and 'Copy N' denotes copy number. The consensus error is 0.13 because nG = 99, cG = 29, mG = 583, and tot = 1595 (see Consensus Building in Appendix). Gap characters are shown in red to emphasize the high indel content of this TR. Below the dashed double line is the consensus sequence followed by the consensus error string shown in blue. Columns of the alignment with 100% character identity have no symbol in the consensus error string. The symbols ':' and '*' denote a column with greater than or equal to 50% character identity and a column with less than 50% character identity, respectively.
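The symbol rule for the consensus error string can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "character identity" is taken to mean the fraction of a column occupied by its most frequent character, gap characters are counted like any other character, and the function name `consensus_error_string` and the toy alignment are hypothetical, not taken from XSTREAM.

```python
from collections import Counter


def consensus_error_string(rows):
    """Return one symbol per alignment column, following the caption's rule.

    Assumption: identity = share of the most frequent character in the column.
    ' ' -> 100% identity, ':' -> at least 50%, '*' -> below 50%.
    """
    symbols = []
    for column in zip(*rows):  # iterate over alignment columns
        top_count = Counter(column).most_common(1)[0][1]
        identity = top_count / len(column)
        if identity == 1.0:
            symbols.append(" ")
        elif identity >= 0.5:
            symbols.append(":")
        else:
            symbols.append("*")
    return "".join(symbols)


# Toy alignment (hypothetical data): four repeat copies of length 4.
toy_alignment = ["PGAP", "PGSP", "PATP", "PGQP"]
error_string = consensus_error_string(toy_alignment)
```

In the toy alignment, the first and last columns are fully conserved, the second column has 75% identity, and the third has 25% identity, so the three symbol classes all appear.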
Figure 3
Discontinuous Domain Merging of TR from A. thaliana. Successful merging of non-overlapping TR regions is shown by a TR domain from A. thaliana predicted gene product gi 9293925. Characters in the intervening degenerate sequence space that do not match the consensus are each represented by 'x'. This TR has a period of 9, a copy number of 8.67, a consensus error of 0.09 [nG = 6, cG = 1, mG = 9, tot = 88 (95-7 x's) (see Consensus Building and Merging in Appendix)], and is located at sequence positions 1 – 85.
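The captions for Figures 2 and 3 report consensus error values together with the counts nG, cG, mG, and tot, but the formula itself is given only in the Appendix, which is not reproduced here. As a hedged check, the relation (nG + cG) / (tot - mG) reproduces both reported values after rounding to two decimals. This relation is inferred from the two captions alone and is an assumption, as is the function name `consensus_error`; the definitions of nG, cG, and mG should be taken from the Appendix.

```python
def consensus_error(nG, cG, mG, tot):
    """Inferred relation (an assumption, not quoted from the paper's Appendix)."""
    return (nG + cG) / (tot - mG)


# Counts reported in the Figure 2 caption (C. elegans CE22309).
fig2_error = consensus_error(nG=99, cG=29, mG=583, tot=1595)

# Counts reported in the Figure 3 caption (A. thaliana gi 9293925),
# where tot = 95 - 7 = 88 after discounting the seven 'x' characters.
fig3_error = consensus_error(nG=6, cG=1, mG=9, tot=95 - 7)

print(round(fig2_error, 2), round(fig3_error, 2))
```

Agreement on two data points does not confirm the formula; it only shows that the reported numbers are consistent with this reading.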
Figure 4
Example of a Nested TR Architecture. A nested TR of two hierarchical levels is illustrated with an example from T. brucei (copy number = 7.78, period = 138, positions = 651–1738). Since a nested TR is, by definition, a TR within another TR, the level of nesting depth corresponds to the number of TR domains that encapsulate a particular nested TR. This example shows nested TRs in two representations: the compressed consensus sequence, with nested TRs denoted within brackets, and a graphical depiction of the hierarchical structure and distribution of nested TRs, with the consensus represented by the brown bottom bar and increasing levels of nesting represented by additional bars moving upward.
Figure 5
14 L. infantum TR Proteins Found by XSTREAM. A colored repeat distribution schematic generated by XSTREAM showing 14 L. infantum TR-containing proteins found by XSTREAM but not by Goto et al. [18]. All protein sequence lengths are normalized and shown from top to bottom in order of decreasing TR period. TR copies are separated by a vertical black line. Each color corresponds to a specific TR domain. Where TR domains of adjacent protein sequences share the same color, those TRs were grouped into the same class by the consensus comparison function (see Appendix).
Figure 6
TR Proteins from A. thaliana. A colored repeat distribution schematic generated by XSTREAM showing the 57 TR-containing proteins from A. thaliana (TAIR6_pep_20060907) with minP = 1 and minimum TR content = 0.7. These protein sequences are ordered by decreasing period from top to bottom. The longest period is shown in the top left panel and the shortest is shown in the bottom right panel. Notice two large classes of protein sequences (polyubiquitins and proline-rich extensin-like family proteins) as determined by grouping their TR domains with the consensus comparison module (see Appendix).
Figure 7
Seed Extension Example. Extension of the seed pair 'KYR' is illustrated using the input sequence S = PQKYRSACYKYRACYFG (|S| = 19) with parameter values L = 3 and g = 1. A tracing of this SE example is shown for the sequence iterator values (x, y) and the compared subwords in S. The SE subroutine used in each step is indicated in parentheses, where M = hashcode array and CW = consensus wobble.
Figure 8
Sequence Alignment using GRDP. The matrix on the left represents GRDP sequence alignment of sequences 'ATTCGA' and 'ATCGAT' with g = 2 and space complexity O(n^2). Since g places an upper bound on traceable matrix width, we only use O(n) space, as shown with the matrix on the right. Notice that because the width of the matrix on the right is 2g + 1, it accommodates all of the relevant information from the matrix on the left. The resulting pairwise alignment is also shown.
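The idea of restricting the alignment matrix to a band of width 2g + 1 can be illustrated with a generic banded dynamic program. The sketch below is not XSTREAM's GRDP implementation: it assumes unit-cost edit distance as the scoring scheme, and the function name `banded_edit_distance` is hypothetical. It only shows how each row can be stored in 2g + 1 cells, so memory grows linearly with sequence length for a fixed g.

```python
def banded_edit_distance(a, b, g):
    """Unit-cost edit distance restricted to diagonals within g of the main one.

    Assumed scoring: match 0, mismatch 1, gap 1 (not taken from the paper).
    Returns None when the length difference alone exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > g:
        return None
    inf = float("inf")
    width = 2 * g + 1
    # band[i][k] is the cost of aligning a[:i] with b[:j], where j = i + k - g.
    band = [[inf] * width for _ in range(n + 1)]
    for i in range(n + 1):
        for k in range(width):
            j = i + k - g
            if j < 0 or j > m:
                continue  # cell lies outside the full matrix
            if i == 0:
                band[i][k] = j  # only insertions so far
                continue
            if j == 0:
                band[i][k] = i  # only deletions so far
                continue
            # Diagonal move from (i-1, j-1), which has the same band offset k.
            best = band[i - 1][k] + (a[i - 1] != b[j - 1])
            # Vertical move from (i-1, j), stored at band offset k + 1.
            if k + 1 < width:
                best = min(best, band[i - 1][k + 1] + 1)
            # Horizontal move from (i, j-1), stored at band offset k - 1.
            if k - 1 >= 0:
                best = min(best, band[i][k - 1] + 1)
            band[i][k] = best
    return band[n][m - n + g]


# The two sequences and the g value from the caption.
distance = banded_edit_distance("ATTCGA", "ATCGAT", g=2)
```

For the caption's sequences, one deletion and one insertion convert 'ATTCGA' into 'ATCGAT', and that path stays within one diagonal of the main diagonal, so a band of g = 2 loses no relevant information.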

References

    1. Landau GM, Schmidt JP, Sokol D. An algorithm for approximate tandem repeats. J Comp Biol. 2001;8:1–18. doi: 10.1089/106652701300099038.
    2. Sokol D, Benson G, Tojeira J. Tandem repeats over the edit distance. Bioinformatics. 2007;23:E30–E35. doi: 10.1093/bioinformatics/btl309.
    3. Cummings CJ, Zoghbi HY. Fourteen and counting: unraveling trinucleotide repeat diseases. Hum Molec Genet. 2000;9:909–916. doi: 10.1093/hmg/9.6.909.
    4. Buard J, Vergnaud G. Complex recombination events at the hypermutable minisatellite CEB1 (D2S90). The EMBO J. 1994;13:3203–3210.
    5. Verstrepen KJ, Jansen A, Lewitter F, Fink GR. Intragenic tandem repeats generate functional variability. Nat Genet. 2005;37:986–990. doi: 10.1038/ng1618.
