IEEE Access. 2020 Oct 15:8:195263-195273.
doi: 10.1109/ACCESS.2020.3031387. eCollection 2020.

Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach

Gciniwe S Dlamini et al. IEEE Access. 2020.

Abstract

The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG, CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.
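The feature transformation and the balanced accuracy metric mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes that "dinucleotide relative frequency" means the count of each of the 16 dinucleotides over overlapping windows divided by the total number of valid windows, and that windows containing ambiguous symbols such as N are skipped. The function names `dinucleotide_frequencies` and `balanced_accuracy` are hypothetical, and the paper may use a different normalization.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# The 16 possible dinucleotides in a fixed order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def dinucleotide_frequencies(sequence):
    """Return a dict mapping each dinucleotide to its relative frequency.

    Assumptions (not taken from the paper): overlapping windows of
    length 2 are counted, and windows with non-ACGT symbols are ignored.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    valid = {d: counts.get(d, 0) for d in DINUCLEOTIDES}
    total = sum(valid.values())
    if total == 0:
        # No valid window (empty or fully ambiguous sequence).
        return {d: 0.0 for d in DINUCLEOTIDES}
    return {d: valid[d] / total for d in DINUCLEOTIDES}


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the mean of the per-class recall values."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Toy sequence, not real genome data.
    features = dinucleotide_frequencies("ACGTAC")
    print({d: f for d, f in features.items() if f > 0})
```

Each genome then becomes a 16-value feature vector that a classifier such as XGBoost could take as input. Balanced accuracy is useful when classes are of unequal size, since every class contributes equally to the score regardless of how many samples it has.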

Keywords: Alignment-free sequence analysis; COVID-19; XGBoost; dinucleotide frequencies; feature representations; genomic signatures; human pathogens; machine learning.

Figures

FIGURE 1. Generalized flow diagram showing the methodology.
FIGURE 2. a) PCA and b) t-SNE visualizations of the eight pathogenic species.
FIGURE 3. a) PCA and b) t-SNE visualizations of the seven continents of origin for the SARS-CoV-2 dataset.
FIGURE 4. Dendrogram created from 10 randomly sampled sequences from all classes in the between species analysis.
FIGURE 5. Dendrogram created from 10 randomly sampled sequences from all classes in the within species analysis.
FIGURE 6. Within species XGBoost confusion matrix for a) binary problem b) multi-class classification problem.
