AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome
- PMID: 34565757
- DOI: 10.1266/ggs.21-00025
Abstract
In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and the use of artificial intelligence (AI) suitable for big data analysis has become increasingly important. Unsupervised AI that can reveal novel knowledge from big data without prior knowledge or particular models is highly desirable for analyses of genome sequences, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain data mining with the BLSOM, an unsupervised AI. As a specific target, we first selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of viral genome sequences have been accumulated via worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4-6-mers) allowed separation into known clades, but longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. Most 15-mers occur only once in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, which showed a high separation ability among closely related species. When analyzing the human genome, we found enrichments in transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.
Keywords: COVID-19; artificial intelligence; metagenome; oligonucleotide composition; self-organizing map.
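The abstract describes clustering sequences by their oligonucleotide composition with a batch-learning self-organizing map. As a rough illustration of that idea only, the Python sketch below turns each sequence into a tetranucleotide frequency vector and trains a very small SOM with batch updates, in which every node is recomputed once per epoch as a neighborhood-weighted average of all input vectors. It is not the authors' BLSOM implementation: the function names (`kmer_frequencies`, `best_matching_unit`, `train_batch_som`), the 2x2 map size, the Gaussian neighborhood, the shrinking schedule, the initialization from input vectors, and the toy sequences are all assumptions made for this example.

```python
from itertools import product
from math import exp


def kmer_frequencies(seq, k=4):
    """Normalized k-mer frequency vector (length 4**k) of a DNA sequence."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows with N or other symbols are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def best_matching_unit(weights, vec):
    """Index of the map node whose weight vector is closest to vec."""
    def sq_dist(w):
        return sum((a - b) ** 2 for a, b in zip(w, vec))
    return min(range(len(weights)), key=lambda n: sq_dist(weights[n]))


def train_batch_som(data, rows=2, cols=2, epochs=10, sigma0=1.5):
    """Train a small rectangular SOM with batch updates; returns (nodes, weights)."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: deterministic start that copies evenly spaced input vectors.
    weights = [list(data[(n * len(data)) // len(nodes)]) for n in range(len(nodes))]
    dim = len(data[0])
    for epoch in range(epochs):
        sigma = max(sigma0 * (1 - epoch / epochs), 0.5)  # shrinking neighborhood
        # Batch step: assign every vector first, then update all nodes at once.
        bmus = [best_matching_unit(weights, x) for x in data]
        new_weights = []
        for (r, c) in nodes:
            num, den = [0.0] * dim, 0.0
            for x, b in zip(data, bmus):
                br, bc = nodes[b]
                h = exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                den += h
                for d in range(dim):
                    num[d] += h * x[d]
            new_weights.append([v / den for v in num])
        weights = new_weights
    return nodes, weights


# Toy input (hypothetical): two AT-rich and two GC-rich repeat sequences.
sequences = {
    "at_1": "ATATTAAT" * 20,
    "at_2": "AATTATAT" * 20,
    "gc_1": "GCGGCCGC" * 20,
    "gc_2": "CCGCGGCG" * 20,
}
vectors = {name: kmer_frequencies(s) for name, s in sequences.items()}
nodes, weights = train_batch_som(list(vectors.values()))
for name, vec in vectors.items():
    print(name, nodes[best_matching_unit(weights, vec)])
```

Because the node assignments are frozen within each epoch and all nodes are updated together, the result of a batch update does not depend on the order in which sequences are presented. For real data, a far larger map and an array library would be needed; this pure-Python version is only meant to make the steps concrete.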