Stat Appl. 2020 Jul;18(1):253-268.
Epub 2020 Jun 30.

Identification of Geographic Specific SARS-Cov-2 Mutations by Random Forest Classification and Variable Selection Methods

Manoj Kandpal et al. Stat Appl. 2020 Jul.

Abstract

RNA viral genomes have very high mutation rates. As an infection spreads through host populations, different viral lineages emerge and acquire independent mutations, which can lead to varied infection and death rates in different parts of the world. Using Random Forest classification and feature selection methods, we developed an analysis pipeline for identifying geographic-specific mutations and classifying viral lineages, focusing on the missense variants that alter the function of the encoded proteins. We applied the pipeline to publicly available SARS-CoV-2 datasets and demonstrated that it accurately identified country- or region-specific viral lineages and the specific mutations that discriminate between lineages. The results presented here can help in designing country-specific diagnostic strategies and in prioritizing mutations for functional interpretation and experimental validation.

Keywords: Classification; Coronavirus; Feature selection; Random forest; SARS-CoV-2.
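
The general idea of classifying genomes by region with a Random Forest and then ranking mutations by importance can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes scikit-learn and NumPy, uses synthetic presence/absence data, and the region labels, sample sizes, and the choice of impurity-based importance are all placeholders chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy data: rows are viral genomes, columns are binary
# presence/absence indicators for missense mutations.
n_per_region, n_mutations = 100, 30
regions = ["region_A", "region_B", "region_C"]  # placeholder labels

X_parts, y_parts = [], []
for k, region in enumerate(regions):
    # Background: every mutation is rare in every region.
    block = rng.random((n_per_region, n_mutations)) < 0.05
    # Planted signal: mutation column k is common only in region k.
    block[:, k] = rng.random(n_per_region) < 0.9
    X_parts.append(block.astype(int))
    y_parts += [region] * n_per_region
X = np.vstack(X_parts)
y = np.array(y_parts)

# Hold out part of the data to estimate classification accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)

# Variable selection step: rank mutations by impurity-based importance
# and keep the highest-ranked columns.
ranking = np.argsort(forest.feature_importances_)[::-1]
top_mutations = sorted(ranking[:3].tolist())

print("held-out accuracy:", round(accuracy, 3))
print("top mutation columns:", top_mutations)
```

In this toy setting the highest-ranked columns should be the planted region-specific mutations. On real data the selected features would still need the functional interpretation and experimental validation that the abstract describes.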

Figures

Figure 1: ROC curve between classes for (a) training set. Classes: USA-NY (Class 1); China (Class 2); Italy, Spain (Class 3); India (Class 4).
Figure 2: Model features and their importance.
Figure 3: Pruned tree representation of the CART model, generated using 42 features selected by the Random Forest feature selection method. The gene name and the UniProt protein products or polypeptide chains (in parentheses) in which each mutation is located are given below that mutation in the tree.
