Genome inhomogeneity is determined mainly by WW and SS dinucleotides
- PMID: 2004273
- DOI: 10.1093/bioinformatics/7.1.39
Abstract
According to the hypothesis of the modular structure of DNA, genomes consist of modules of various kinds that may differ in their statistical characteristics. Statistical analysis helps to reveal these differences and to predict the modular structure. This raises the question of how much each word of length l (l-tuple) contributes to the inhomogeneity of genetic text. The notion of stationary (i.e. relatively evenly distributed over a genome) versus non-stationary l-tuples was introduced previously. In this paper, the dinucleotide distributions for all long sequences from GenBank were analyzed, and non-stationary dinucleotides were shown to be closely associated with polyW and polyS tracts (W denotes the 'weak' nucleotides A or T, while S denotes the 'strong' nucleotides G or C). Thus, genome inhomogeneity is shown to be determined mainly by the AA, TT, GG, CC, AT, TA, GC and CG dinucleotides. It is demonstrated that neither 'codon usage' nor the 'isochore model' can account for this phenomenon.
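The W/S classification and the idea of an unevenly distributed dinucleotide can be illustrated with a minimal Python sketch. The abstract does not state which statistic the authors used to separate stationary from non-stationary words, so the variance-to-mean ratio of per-window counts below is only an assumed stand-in, and the function names and the window size are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product

WEAK = set("AT")    # W: 'weak' nucleotides
STRONG = set("GC")  # S: 'strong' nucleotides
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_class(dinuc):
    """Return 'WW', 'SS', or 'mixed' for a two-letter DNA word."""
    letters = set(dinuc)
    if letters <= WEAK:
        return "WW"
    if letters <= STRONG:
        return "SS"
    return "mixed"


def window_counts(seq, window):
    """Count overlapping dinucleotides inside consecutive, non-overlapping windows."""
    counts = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        counts.append(Counter(chunk[i:i + 2] for i in range(len(chunk) - 1)))
    return counts


def dispersion_index(seq, window):
    """Variance-to-mean ratio of per-window counts for each dinucleotide.

    Assumption: this ratio is an illustrative proxy for uneven distribution,
    not the statistic used in the paper.
    """
    per_window = window_counts(seq, window)
    if not per_window:
        raise ValueError("sequence is shorter than one window")
    result = {}
    for d in DINUCLEOTIDES:
        values = [c[d] for c in per_window]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        result[d] = var / mean if mean > 0 else 0.0
    return result


if __name__ == "__main__":
    # Toy sequence: a polyA tract followed by a mixed region.
    toy = "AAAAAAAAAA" + "ACGTACGTAC"
    for dinuc, score in sorted(dispersion_index(toy, 10).items()):
        print(dinuc, dinucleotide_class(dinuc), round(score, 2))
```

On the toy sequence, the polyA tract concentrates every AA count in one window, so AA gets a higher ratio than words such as AC that occur only in the mixed region. Real analyses would need long sequences and a proper significance test.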