Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Comparative Study
. 2002:3:6.
doi: 10.1186/1471-2105-3-6. Epub 2002 Feb 5.

Universal sequence map (USM) of arbitrary discrete sequences

Affiliations
Comparative Study

Universal sequence map (USM) of arbitrary discrete sequences

Jonas S Almeida et al. BMC Bioinformatics. 2002.

Abstract

Background: For over a decade the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but not been fully realized. The basic idea is that any sequence of symbols may define trajectories in the continuous space conserving all its statistical properties. Ideally, such a representation would allow scale independent sequence analysis--without the context of fixed memory length. A simple example would consist on being able to infer the homology between two sequences solely by comparing the coordinates of any two homologous units.

Results: We have successfully identified such an iterative function for bijective mapping psi of discrete sequences into objects of continuous state space that enable scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences with an arbitrary length and arbitrary number of unique units and generates a representation where map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of 4 unit type sequences (like DNA) as an order free Markov chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool:http://bioinformatics.musc.edu/~jonas/usm/.

Conclusions: USM is shown to enable a statistical mechanics approach to sequence analysis. The scale independent representation frees sequence analysis from the need to assume a memory length in the investigation of syntactic rules.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Representation of the USM of the two stanzas, respectively dark and light spheres connected by dashed lines, in a reduced 3-dimension space obtained using the first three principal components, PC1,2,3. In a) the units corresponding to the segment "very fond of" both stanzas are connected by solid lines. The procedure is repeated in b) for the segment "bananas". These figures illustrate the property that similar segments converge in the USM representation, which is reflected by the docking of homologous units. The factorization for dimensionality reduction serves visualization purposes only. The variance represented by each of the three principal components is 40%, 13% and 11%, respectively.
Figure 2
Figure 2
Cross-tabulation of similarity between positions of the two stanzas. The figures can be reproduced using accompanying web based USM tool (see Abstract for URL address, test data also included), a) forward distance, df (Eq. 4); b) backward distance, db (Eq. 4); c) bi-directional similarity, D, compensated for φP3 = 0.55,n = 4.25 (Eq. 11). Notice that the values of diagonals between similar segments estimate the number of units in the segments, although each D value is computed solely from a single pairwise comparison of UCM coordinates; d) Compounded similarity, dc, with a maximum for the mid-position of the similar segments (Eq. 12).
Figure 3
Figure 3
Probability distribution of similarity estimates for the uniformly random sequence null model – e.g. experimental values deviating from this model would indicate real homology, as in Fig. 4. The dark lines represent the numerical solution for the bi-directional over-determination, P3 (Equation 11), for different dimensionalities, n, identified by numbers in the plot. The gray lines represent the numerical solution for the same values of n, for the uni-directional over-determination, P1 (Equation 9). The solution for the dimensionality of the two stanzas, n = log2(19) = 4.25, is highlighted by a thick line, for both P3 (thick dark line) and P1 (thick gray line).
Figure 4
Figure 4
Cumulative distribution of bi-directional similarity, D, between the two stanzas and comparison of genomic and proteomic sequences of E. coli threonine gene A, thrA (2463 base pairs for the genomic sequence and 820 aminoacids for the proteomic sequence), with B, thrB (933 base pairs for the genomic sequence and 310 aminoacids for the proteomic sequence). The null model expectation, that of uniform random distribution of units, is represented by dashed lines, obtained using Eq. 11. for n = 2 (half dimensionality of USM state space for DNA) and n = 4.3 (half dimensionality of USM state space for proteins, n = 4.32, and for the two stanzas, n = 4.25). The solid lines represent the actual cumulative distribution of D values.
Figure 5
Figure 5
Comparison of uni-directional and bi-directional USM implementations for DNA sequences. The similarity matrices for, respectively, df and D values between two portions of E. coli K-12 MG1655 threonine gene A (thrA, genome positions 337–2799) and threonine gene B (thrB, genome positions 2801–3733) are presented. The numbers in the axis identify the position in the gene. Actual values of df and D are shown for the framed region on the table to the right. a) The df values were obtained by a unidirectional implementation of the USM procedure (Eq. 4). By comparing this figure with a similar analysis reported previously [12] for the same sequences (Fig. 10 of that report) it can be seen that they are nearly indistinguishable, even if the exact values vary. The equivalence between unidirectional USM for n = 2 and CGR highlights the property that CGR is a special case of USM. The fact that the latter can be implemented for any value of n or any number of unique units justifies the Universal naming; b) In this plot the same sequences were compared using bidirectional USM, and, accordingly, generate a matrix of D values (Eq.8, 11). It is clearly apparent, and as already noted for Figure 2c, that D-similarity between any two homologous units is an estimate of the length of the entire homologous segment.

References

    1. Román-Roldán R, Bernaola-Galván P, Oliver JL. Application of information theory to DNA sequence analysis: a review. Pattern Recognition. 1996;29:1187–1194. doi: 10.1016/0031-3203(95)00145-X. - DOI
    1. Nady A. Recent investigations into global characteristics of long DNA sequences. Indian Journal of Biochemistry and Biophysics. 1994;31:149–155. - PubMed
    1. Tiòo P. Spatial representation of symbolic sequences through iterative function systems. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans. 1999;29:386–393. doi: 10.1109/3468.769757. - DOI
    1. Enright MC, Knox K, Griffiths D, Crook DWM, Spratt BG. Multilocus sequence typing of Streptococcus pneumoniae directly from cerebrospinal fluid. Eur. 2001;19:627–630. doi: 10.1007/s100960000321. - DOI - PubMed
    1. Jeffrey HJ. Chaos game representation of gene structure. Nucleic Acid Res. 1990;18:2163–2170. - PMC - PubMed

Publication types

MeSH terms

LinkOut - more resources