Life (Basel). 2022 May 28;12(6):806.
doi: 10.3390/life12060806.

Identifying COVID-19 Severity-Related SARS-CoV-2 Mutation Using a Machine Learning Method

Feiming Huang et al. Life (Basel).

Abstract

SARS-CoV-2 shows great evolutionary capacity through a high frequency of genomic variation during transmission. Evolved SARS-CoV-2 often demonstrates resistance to previous vaccines and can cause poor clinical status in patients. Mutations in the SARS-CoV-2 genome affect both structural and nonstructural proteins, and some of these proteins, such as the spike protein, have been shown to be directly associated with the clinical status of patients with severe COVID-19 pneumonia. In this study, we collected genome-wide mutation information of virulent strains together with the severity of COVID-19 pneumonia in the corresponding patients, graded by clinical status. Important protein mutations and untranslated region mutations were extracted using machine learning methods. First, using Boruta and four ranking algorithms (least absolute shrinkage and selection operator (LASSO), light gradient boosting machine (LightGBM), max-relevance and min-redundancy (mRMR), and Monte Carlo feature selection (MCFS)), mutations that were highly correlated with the clinical status of the patients were selected and ranked in four feature lists. Some mutations, such as D614G and V1176F, were shown to be associated with viral infectivity. Moreover, previously unreported mutations, such as A320V of nsp14 and I164ILV of nsp14, were also identified, which suggests that they may also play a role. We then applied the incremental feature selection (IFS) method to each feature list to construct efficient classifiers, which can be directly used to distinguish the clinical status of COVID-19 patients. In addition, four sets of quantitative rules were derived, which make the role of each mutation in differentiating the clinical status of COVID-19 patients easier to interpret. The identified key mutations, which are linked to virologic properties, will help to better understand the mechanisms of infection and will aid in the development of antiviral treatments.

Keywords: SARS-CoV-2; decision rules; feature selection; machine learning; mutation.
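
As an illustration of the incremental feature selection (IFS) step described in the abstract, the following Python sketch evaluates a classifier on growing prefixes of a ranked feature list and keeps the prefix with the highest cross-validated F1-measure. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `incremental_feature_selection`, the use of scikit-learn, the 5-fold cross-validation, the step size of 1, the decision tree classifier, and the synthetic data are all choices introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def incremental_feature_selection(X, y, ranked_features, make_classifier, cv=5, step=1):
    """Score the top-k features for growing k and return the best prefix.

    ranked_features: column indices ordered from most to least important
    (for example, the output of one feature ranking algorithm).
    Returns (best_k, best_f1, curve), where curve is a list of (k, mean F1).
    """
    curve = []
    for k in range(step, len(ranked_features) + 1, step):
        cols = ranked_features[:k]
        scores = cross_val_score(make_classifier(), X[:, cols], y, cv=cv, scoring="f1")
        curve.append((k, float(scores.mean())))
    # max() keeps the first maximum, so ties favour the smaller feature subset.
    best_k, best_f1 = max(curve, key=lambda point: point[1])
    return best_k, best_f1, curve


# Synthetic stand-in data (assumption): 8 binary "mutation present" features,
# with the label depending only on features 0 and 3.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))
y = (X[:, 0] | X[:, 3]).astype(int)
ranked = [0, 3, 1, 2, 4, 5, 6, 7]  # hypothetical ranking, most important first

best_k, best_f1, curve = incremental_feature_selection(
    X, y, ranked, lambda: DecisionTreeClassifier(random_state=0)
)
print(f"best number of features: {best_k}, mean F1: {best_f1:.3f}")
```

Plotting the `curve` values against k would give a curve of the kind the IFS figures describe, one per classification algorithm and feature list.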

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Flowchart of the whole analytical procedure of this research. Genome-wide mutation features of patients were obtained from the GISAID database and the Coronavirus Antiviral Research Database. Each patient was classified as “mild” or “severe” according to clinical status. Four feature lists were obtained by applying Boruta and four feature ranking algorithms. Subsequently, the optimal classifiers and the corresponding optimal features were obtained using the incremental feature selection (IFS) method. Classification rules were then mined from the optimal decision tree (DT) classifiers to obtain the basis for distinguishing the clinical status of different patients.
Figure 2
IFS curves of four classification algorithms based on the LASSO feature list. The four classification algorithms yield the highest F1-measure values of 0.798, 0.732, 0.776 and 0.775 when the top 57, 57, 6 and 6 features in the list are used, respectively.
Figure 3
IFS curves of four classification algorithms based on the LightGBM feature list. The four classification algorithms yield the highest F1-measure values of 0.803, 0.758, 0.785 and 0.783 when the top 24, 52, 24 and 24 features in the list are used, respectively.
Figure 4
IFS curves of four classification algorithms based on the MCFS feature list. The four classification algorithms yield the highest F1-measure values of 0.800, 0.745, 0.760 and 0.758 when the top 43, 55, 10 and 10 features in the list are used, respectively.
Figure 5
IFS curves of four classification algorithms based on the mRMR feature list. The four classification algorithms yield the highest F1-measure values of 0.797, 0.759, 0.757 and 0.756 when the top 53, 52, 24 and 23 features in the list are used, respectively.
Figure 6
Box plot showing the performance of the optimal classifiers based on different classification algorithms and feature lists. Each optimal classifier performed similarly across the different feature lists, and the optimal DT classifier provided the highest performance.
Figure 7
Venn diagram of the optimal feature subsets from the four feature lists. Fifteen features occur in all four feature subsets, indicating their importance for differentiating the clinical status of patients.
Figure 8
Summary of COVID-19 severity-related mutations. UTR, untranslated region; NSP, nonstructural protein; SP, structural protein; S protein, spike protein; N protein, nucleocapsid protein.
Figure 9
Location of some spike protein mutations in the genome. NTD, N-terminal domain; RBD, receptor-binding domain; RBM, receptor-binding motif; SD1, subdomain 1; SD2, subdomain 2; S1/S2, S1/S2 cleavage region.
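
The rule-mining step summarized in the Figure 1 caption, in which classification rules are read from a decision tree (DT) classifier, could look roughly like the Python sketch below. It walks a fitted scikit-learn decision tree and emits one IF-THEN rule per leaf. This is a sketch under stated assumptions rather than the authors' procedure: the helper `extract_rules`, the feature names `mutation_a`, `mutation_b` and `mutation_c`, and the data are hypothetical, and the printed rules say nothing about real SARS-CoV-2 mutations.

```python
import itertools

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def extract_rules(clf, feature_names):
    """Return (conditions, predicted_class) pairs, one per leaf of a fitted tree."""
    tree = clf.tree_
    rules = []

    def walk(node, conditions):
        left, right = tree.children_left[node], tree.children_right[node]
        if left == right:  # leaf node: both child ids are -1
            predicted = clf.classes_[int(np.argmax(tree.value[node][0]))]
            rules.append((conditions, str(predicted)))
            return
        name = feature_names[tree.feature[node]]
        threshold = tree.threshold[node]
        walk(left, conditions + [f"{name} <= {threshold:.1f}"])
        walk(right, conditions + [f"{name} > {threshold:.1f}"])

    walk(0, [])
    return rules


# Synthetic data (assumption): three binary "mutation present" features;
# a sample is labelled "severe" only when the first two are both present.
names = ["mutation_a", "mutation_b", "mutation_c"]  # hypothetical names
X = np.array(list(itertools.product([0, 1], repeat=3)) * 5)
y = np.where((X[:, 0] == 1) & (X[:, 1] == 1), "severe", "mild")

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
for conditions, label in extract_rules(clf, names):
    print("IF " + " AND ".join(conditions) + f" THEN {label}")
```

With binary presence/absence features, each threshold falls at 0.5, so a condition such as `mutation_a > 0.5` simply reads as "mutation_a is present".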

