Sensors (Basel). 2022 Jul 31;22(15):5730.
doi: 10.3390/s22155730.

Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification

Gabriel B M Câmara et al. Sensors (Basel).

Abstract

COVID-19, the illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus of the Coronaviridae family with a single-stranded positive-sense RNA genome, has spread around the world and has been declared a pandemic by the World Health Organization. As of 17 January 2022, there were more than 329 million cases and more than 5.5 million deaths. Although COVID-19 has a low mortality rate, its high capacity for contagion, spread, and mutation worries the authorities, especially after the emergence of the Omicron variant, which is highly transmissible and more easily infects even vaccinated people. Such outbreaks require elucidation of the taxonomic classification and origin of the virus from its genomic sequence to support strategic planning, containment, and treatment of the disease. Thus, this work proposes a high-accuracy technique to classify viruses and other organisms from a genome sequence using a deep learning convolutional neural network (CNN). Unlike other approaches in the literature, the proposed approach does not limit the length of the genome sequence. The results show that the proposal accurately distinguishes SARS-CoV-2 from the sequences of other viruses. The results were obtained from 1557 instances of SARS-CoV-2 from the National Center for Biotechnology Information (NCBI) and 14,684 different viruses from the Virus-Host DB. Because a CNN has several adjustable parameters, the tests were performed with forty-eight different architectures; the best of these achieved an accuracy of 91.94 ± 2.62% in classifying viruses into their correct realms, in addition to 100% accuracy in classifying SARS-CoV-2 into its realm, Riboviria. For the subsequent classifications (family, genus, and subgenus), this accuracy increased, which suggests that the proposed architecture may be viable for classifying the virus that causes COVID-19.

Keywords: CNN; COVID-19; SARS-CoV-2; deep learning.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Examples of viral genomes in the form of images created by the process described in Section 2.2: (a) Paramecium bursaria Chlorella virus (NC_043234); (b) Gordonia phage Obliviate (NC_031237); (c) SARS-CoV-2 USA 2020 (MT251977); and (d) SARS-CoV-2 Wuhan-Hu-1 (MN908947).
Figure 2
Graphical representation of the proposed convolutional neural network for virus genome classification.
Figure 3
Overview of the proposed approach.
Figure 4
Number of samples in each realm, and with the unclassified label, before treatment.
Figure 5
Confusion matrices for Experiment1, showing results for the classification of the viruses into their respective realms (top) and of SARS-CoV-2 into the Riboviria realm (bottom).
Figure 6
Confusion matrix for validation of the CNN using Dataset4.
Figure 7
Confusion matrix for the test of the CNN for the classification of SARS-CoV-2 into its correct family (Coronaviridae).
Figure 8
Histogram of the sample distribution for Exp3 before normalization.
Figure 9
Confusion matrices for validation of the CNN using the genera within the family Coronaviridae (left) and for classification of SARS-CoV-2 into its correct genus (Betacoronavirus).
Figure 10
Histogram of the sample distribution across subgenera within Betacoronavirus. The number of samples without classification (Unclassified) is more than 15,000 times larger than the number with a defined subgenus.
Figure 11
Confusion matrices for CNN validation using the subgenus samples within the Betacoronavirus genus (top) and for classification of SARS-CoV-2 into its correct subgenus, Sarbecovirus (bottom).
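
The caption of Figure 1 refers to genomes rendered as images, but the encoding itself (Section 2.2 of the paper) is not reproduced on this page. The following is only a minimal sketch of one way a nucleotide sequence of arbitrary length could be laid out as a 2D grayscale grid. The function name `sequence_to_image`, the intensity table, and the zero padding are all assumptions made for illustration, not the authors' method.

```python
import math

# Hypothetical nucleotide-to-intensity table (an assumption, not from the paper).
INTENSITY = {"A": 64, "C": 128, "G": 192, "T": 255, "U": 255}


def sequence_to_image(seq, width):
    """Lay out a nucleotide string row by row as a grid of grayscale values.

    Symbols outside the table (for example the ambiguity code N) map to 0,
    and the last row is padded with 0 so that every row has `width` pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    values = [INTENSITY.get(base, 0) for base in seq.upper()]
    height = math.ceil(len(values) / width)
    # Pad the tail so the sequence fills a complete rectangle.
    values += [0] * (height * width - len(values))
    return [values[r * width:(r + 1) * width] for r in range(height)]


image = sequence_to_image("ACGTNAC", width=4)
```

In this sketch the image height grows with the sequence length, so no genome has to be truncated; a real pipeline would still need some way to feed variable-sized inputs to the network, which the page does not describe.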
