Informatics for unveiling hidden genome signatures
- PMID: 12671005
- PMCID: PMC430167
- DOI: 10.1101/gr.634603
Abstract
With the increasing amount of available genome sequences, novel tools are needed for comprehensive analysis of species-specific sequence characteristics for a wide variety of genomes. We used an unsupervised neural network algorithm, a self-organizing map (SOM), to analyze di-, tri-, and tetranucleotide frequencies in a wide variety of prokaryotic and eukaryotic genomes. The SOM, which can cluster complex data efficiently, was shown to be an excellent tool for analyzing global characteristics of genome sequences and for revealing key combinations of oligonucleotides representing individual genomes. From analysis of 1- and 10-kb genomic sequences derived from 65 bacteria (a total of 170 Mb) and from 6 eukaryotes (460 Mb), clear species-specific separations of major portions of the sequences were obtained with the di-, tri-, and tetranucleotide SOMs. The unsupervised algorithm could recognize, in most 10-kb sequences, the species-specific characteristics (key combinations of oligonucleotide frequencies) that are signature features of each genome. We were able to classify DNA sequences within one and between many species into subgroups that corresponded generally to biological categories. Because the classification power is very high, the SOM is an efficient and fundamental bioinformatic strategy for extracting a wide range of genomic information from a vast amount of sequences.
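The approach described in the abstract (cutting genomes into fixed-length windows, describing each window by its oligonucleotide frequencies, and letting an unsupervised map arrange the windows) can be illustrated with a small sketch. The abstract does not specify the SOM variant, map size, or learning parameters, so everything below is an assumption for illustration: the function names (`kmer_frequencies`, `windows`, `best_matching_unit`, `train_som`) are hypothetical, the map is a plain online Kohonen SOM with a Gaussian neighborhood, and the default learning-rate and radius schedules are arbitrary choices rather than the authors' settings.

```python
import itertools
import math
import random


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k oligonucleotides of length k in seq."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def windows(seq, size):
    """Non-overlapping windows of length `size`; a shorter tail is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def best_matching_unit(weights, vec):
    """(row, col) of the map node whose weight vector is closest to vec."""
    best, best_d = None, math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, vec))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=50, seed=0):
    """Train a rows x cols online SOM (illustrative schedules, not the paper's)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() / dim for _ in range(dim)]
                for _ in range(cols)] for _ in range(rows)]
    for epoch in range(epochs):
        frac = 1 - epoch / epochs
        lr = 0.5 * frac + 0.01                    # decaying learning rate
        radius = max(rows, cols) / 2 * frac + 0.5  # shrinking neighborhood
        for vec in rng.sample(data, len(data)):    # shuffled pass over the data
            br, bc = best_matching_unit(weights, vec)
            for r in range(rows):
                for c in range(cols):
                    dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-dist2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for j in range(dim):
                        w[j] += lr * h * (vec[j] - w[j])
    return weights
```

In use, each genome would be split with `windows(genome, 10_000)`, each window converted with `kmer_frequencies(window, 4)` (a 256-dimensional vector for tetranucleotides), and the pooled vectors from all genomes passed to `train_som` without species labels. Labels are consulted only afterwards, to see which species' windows land on which map nodes via `best_matching_unit`.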