J Comput Biol. 1996 Fall;3(3):345-60.
doi: 10.1089/cmb.1996.3.345.

Over- and underrepresentation of short DNA words in herpesvirus genomes

M Y Leung et al. J Comput Biol. 1996 Fall.

Abstract

The relative abundance and rarity of DNA words have been recognized in previous biological studies to have implications for the regulation, repair, and evolutionary mechanisms of a genome. In this paper, we review several different measures of abundance and rarity of DNA words, including z-scores, representation ratios, and cross-ratios, that have appeared in the recent literature, and examine the concordance among them using the human cytomegalovirus genome sequence. We then rank all words of length k = 2, ..., 5 of seven herpesvirus genomes according to their abundance, as measured by one of the z-scores based upon a stationary Markov model of order k-2. Using a simple metric on the ranks of 2-words of the seven herpesvirus sequences, we construct an evolutionary tree. Several 3-words are observed to be consistently over- or underrepresented in all seven herpesviruses. Furthermore, clusters of some of the most over- and underrepresented 4- and 5-words in the genomes are identified with functional sites such as the origins of replication and regulatory signals of individual viruses.
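
As a rough illustration of the representation ratio under a Markov model of order k-2, the sketch below compares the observed count of a k-word with the count expected from its two overlapping (k-1)-words and their shared (k-2)-word. The function names are hypothetical, the estimator is the standard maximal-order Markov one rather than anything quoted from the paper, and the variance normalisation needed for the z-scores is not given in the abstract, so it is left out.

```python
from collections import Counter


def count_words(seq, k):
    """Count overlapping occurrences of every k-word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq, word):
    """Expected count of word under a Markov model of order len(word) - 2.

    Uses N(prefix) * N(suffix) / N(middle), where prefix and suffix are the
    two (k-1)-words inside the word and middle is their shared (k-2)-word.
    For k = 2 the middle is empty, so the sequence length is used instead
    (an assumption made here for simplicity).
    """
    k = len(word)
    if k < 2:
        raise ValueError("word must have length >= 2")
    shorter = count_words(seq, k - 1)
    prefix = shorter[word[:-1]]
    suffix = shorter[word[1:]]
    middle = len(seq) if k == 2 else count_words(seq, k - 2)[word[1:-1]]
    return prefix * suffix / middle if middle else 0.0


def representation_ratio(seq, word):
    """Observed count divided by expected count (nan if expected is 0)."""
    observed = count_words(seq, len(word))[word]
    expected = expected_count(seq, word)
    return observed / expected if expected else float("nan")
```

A ratio well above 1 suggests overrepresentation and a ratio well below 1 suggests underrepresentation; judging significance would still require a z-score or a similar normalisation.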

Figures

FIG. 1
(a) Normal qq plot of zL-scores of the 3-word TTG in 100 simulated Markov DNA sequences each with 229,354 bases and transition probabilities estimated from the human cytomegalovirus (HCMV) by maximum likelihood. (b) Normal qq plot of zL-scores of all 64 3-words in one of the simulated Markov sequences.
FIG. 2
The zL-scores of all (a) 2-words and (b) 3-words in lexicographical order for the 100 simulated Markov DNA sequences. The broken curves envelop the simulated zL-scores. Solid dots are the zL-scores of the HCMV sequence.
FIG. 3
Normal qq plots of the zL-scores of the HCMV sequence for all 2-, 3-, 4-, and 5-words.
FIG. 4
Scatter plots of 4-word ranks in HCMV ordered by (a) zL-score versus cross-ratio t (the two words showing most severe difference in zL- and t-rankings are identified); (b) zL-score versus representation ratio r.
FIG. 5
Tree (a), derived from the 2-word zL-ranks, tends to group the viruses in the same family together. In contrast, the 2-word frequency based tree (b) reflects more of the similarity in base composition of the viruses.
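
The idea of comparing genomes by the ranks of their 2-words can be sketched as follows. The abstract does not define the "simple metric", so the sum of absolute rank differences used here is an assumption, and `ranks` and `rank_distance` are hypothetical names. Each input maps a 2-word to its score (for example a zL-score) in one genome, and both inputs are assumed to contain the same words.

```python
def ranks(scores):
    """Rank words from highest score (rank 1) to lowest.

    Ties are broken alphabetically so that the result is deterministic.
    """
    ordered = sorted(scores, key=lambda w: (-scores[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def rank_distance(scores_a, scores_b):
    """Sum of absolute rank differences over all words (assumed metric)."""
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    return sum(abs(ranks_a[w] - ranks_b[w]) for w in ranks_a)
```

A matrix of such pairwise distances could then be passed to any standard clustering method to build a tree; the tree-building step itself is not shown.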
FIG. 6
Word counts in a sliding window of length equal to 0.5% of the genome (rounded to the nearest hundred) are recorded at each occurrence of the word. A typical sliding window plot of the extremal 4- and 5-words looks like graph (a). Graphs (b) through (n) show those extremal words with unusual clusters. These clusters are found to be significant (p < 0.001) by the r-scan statistics (Dembo and Karlin, 1992). An explanation of the application of scan statistics in evaluating clusters of special sequence patterns in a genome is given by Leung et al. (1994).
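
The sliding window counts described in the FIG. 6 caption can be sketched as below. The window length follows the caption (0.5% of the genome, rounded to the nearest hundred); starting each window at the occurrence itself is an assumption, the function names are hypothetical, and the r-scan significance test is not implemented.

```python
from bisect import bisect_left


def occurrences(seq, word):
    """Start positions of all (overlapping) occurrences of word in seq."""
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def window_length(genome_length, fraction=0.005):
    """Window of fraction * genome length, rounded to the nearest hundred."""
    return int(round(genome_length * fraction / 100.0)) * 100


def window_counts(seq, word, window=None):
    """For each occurrence at position p, count occurrences in [p, p + window).

    Returns a list of (position, count) pairs, one per occurrence.
    """
    if window is None:
        window = window_length(len(seq))
    if window <= 0:
        raise ValueError("window must be positive; pass one explicitly "
                         "for short sequences")
    positions = occurrences(seq, word)
    counts = []
    for i, p in enumerate(positions):
        # positions is sorted, so a binary search finds the first
        # occurrence that falls outside the window.
        j = bisect_left(positions, p + window)
        counts.append((p, j - i))
    return counts
```

For a genome of 229,354 bases, `window_length` gives 1100. Plotting the counts against the positions gives a sliding window plot of the kind the caption describes.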

References

    1. Agresti A. Categorical Data Analysis. John Wiley; New York: 1990.
    2. Bhagwat AS, McClelland M. DNA mismatch correction by very short patch repair may have altered the abundance of oligonucleotides in the E. coli genome. Nucl Acids Res. 1992;20(7):1663–1668.
    3. Billingsley P. Probability and Measure. 3rd ed. John Wiley; New York: 1995.
    4. Blaisdell BE. Markov Chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J Mol Evol. 1985;21:278–288.
    5. Brendel V, Beckmann JS, Trifonov EN. Linguistics of nucleotide sequences: Morphology and comparison of vocabularies. J Biomol Struct Dyn. 1986;4(1):11–21.
