Comput Biol Med. 2022 Oct:149:105969.

doi: 10.1016/j.compbiomed.2022.105969. Epub 2022 Aug 17.

Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning

Bahrad A Sokhansanj¹, Gail L Rosen²

Affiliations

¹ Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America. Electronic address: bahrad@molhealtheng.com.
² Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America. Electronic address: glr26@drexel.edu.

PMID: 36041271
PMCID: PMC9384346
DOI: 10.1016/j.compbiomed.2022.105969

Bahrad A Sokhansanj et al. Comput Biol Med. 2022 Oct.

Abstract

Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies that do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes "patient status" metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that, in fact, temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper proposes a potential solution: efficient mixed effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrate that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed effect XGBoost, LightGBM, random forests, and elastic net logistic regression models.
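
The idea of treating country as a random effect group can be written schematically as a random-intercept model. This is only a generic sketch and not the paper's exact specification: the symbols $F$, $b_c$, and $\sigma^2$, and the choice of a logit link, are assumptions made here for illustration.

$$
\operatorname{logit} P(y_i = 1 \mid x_i, c_i) = F(x_i) + b_{c_i}, \qquad b_c \sim \mathcal{N}(0, \sigma^2)
$$

Here $y_i = 1$ denotes a Severe record, $x_i$ collects the fixed-effect features (age, sex, and spike sequence positions), $c_i$ is the country of sample $i$, $F$ stands for the boosted tree ensemble, and $b_c$ is a country-level intercept. Under this reading, a prediction that sets $b_c$ to its prior mean of zero depends only on $F(x_i)$, which would be consistent with the note in the caption of Fig. 7 that GPBoost predictions do not vary by country.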

Keywords: Bioinformatics; COVID-19; Machine learning; SARS-CoV-2; Viral genomics.

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1**
**Overview of age, sample collection date, and country metadata trends in GISAID data. (A – Upper Left)** Mean case severity (where 0 is Mild and 1 is Severe, which equates to the probability of a severe case) by patient age in the GISAID database. The bars show the count of samples for each age. Increasing age trends with increasing severity, as expected, with differences at extremely young and old ages characterized by low sample counts. **(B – Upper Right)** Mean clinical severity (probability of severe case) by sample collection date recorded in the GISAID data. For clarity, data have been binned over time periods; the bars indicate the number of samples. Over time, the proportion of severe cases has declined, although that trend has been less consistent since Fall 2021. **(C – Lower Left)** Proportion of sequences in the GISAID patient data set (sequences with patient metadata) for principal variants, including B.1 (the ancestral lineage with the D614G mutation, which emerged in Northern Italy and New York in February–March 2020) and its sublineages, Alpha (B.1.1.7 and “Q” sublineages), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2 and AY sublineages), and the two major Omicron lineages, BA.1 and BA.2 (and their sublineages). The bars indicate the mean case severity for each date bin. The trends of sequential lineage waves in GISAID patient data appear to be consistent with the larger GISAID data set, i.e., showing successive Alpha, Delta, and Omicron waves. **(D – Lower Right)** Mean case severity of samples separated by GISAID metadata for the country where the sequence was collected for selected countries. The total number of sequences in the GISAID patient data set per country is shown within parentheses in the legend. Fluctuations in severity observed in countries appear to be due to systemic issues or differences in where samples are collected (e.g., in hospitals or outside settings) at different times.
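
The mean case severity described in this caption is simply the average of the 0/1 labels within a group, reported alongside the group's sample count. A minimal sketch of that calculation follows; the function name, the 10-year bin width, and the toy records are hypothetical and are not taken from the paper or from GISAID.

```python
from collections import defaultdict


def mean_severity_by_bin(records, bin_width=10):
    """Return {bin_start: (mean_severity, count)} for (age, severity) pairs.

    severity is 0 for Mild and 1 for Severe, so the mean within a bin
    is the observed fraction of Severe records in that bin.
    """
    totals = defaultdict(lambda: [0, 0])  # bin_start -> [sum of labels, count]
    for age, severity in records:
        bin_start = (age // bin_width) * bin_width
        totals[bin_start][0] += severity
        totals[bin_start][1] += 1
    return {b: (s / n, n) for b, (s, n) in sorted(totals.items())}


# Hypothetical toy records: (age in years, 0 = Mild / 1 = Severe)
toy_records = [(25, 0), (27, 0), (34, 1), (38, 0),
               (61, 1), (65, 1), (68, 0), (69, 1)]

summary = mean_severity_by_bin(toy_records)
for bin_start, (mean_sev, count) in summary.items():
    print(f"ages {bin_start}-{bin_start + 9}: "
          f"mean severity {mean_sev:.2f} (n={count})")
```

Setting `bin_width=1` gives one group per age; the same grouping applies to date bins if the key is a collection period instead of an age.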

**Fig. 2**
**Patient age and gender metadata trends in GISAID data. (A – Left)** Mean clinical severity over time for patients in different age groups, showing that the overall trends are generally consistent across age groups, with older patients having higher mean severity as shown in Panel A. **(B – Middle)** Mean clinical severity, separating male and female samples, showing consistent trends across gender with male patients generally having a somewhat higher ratio of severe cases. **(C – Right)** Number of mild and severe cases across all samples split by gender, showing that there are more mild cases than severe among samples from female patients.

**Fig. 3**
**GISAID patient status metadata trends over time. (A – Upper Left)** Fraction of cases categorized as Hospitalized or Released (from hospital) over time, binning dates as indicated by the bars. The definitions of hospitalizations and releases based on patient metadata are provided in 1 and 1. **(B – Upper Right)** Fraction of samples annotated as being from dead individuals in the GISAID patient status metadata field, binned by date as in Panel A. The cases in Panels A and B are collectively classified as Severe. **(C – Lower Left)** Fraction of cases categorized as Mild according to 1. **(D – Lower Right)** Fraction of cases categorized as Asymptomatic or Screening according to 1. Panels C and D are collectively classified as Mild. The subgroups of Mild and Severe classifications show similar trends, showing that the overall trends in Fig. 1 are not due to changes in how metadata are described and characterized.

**Fig. 4**
**Comparison of classification metrics for different machine learning methods.** Metrics are computed on test samples collected from December 26, 2021 through April 10, 2022, for models trained on samples from July 17 through December 25, 2021. The top row shows, at left, the accuracy of the Mild/Severe classification and, at right, the balanced F1-score, which is the harmonic mean of precision (true positives divided by all positive predictions) and recall (true positive rate, i.e., sensitivity). The middle row shows the precision for the Mild and Severe class predictions separately, and the bottom row shows the recall. Metrics are shown for models trained with country metadata used as a feature and without, as indicated in the labeled axes below, except for GPBoost, which takes into account the country metadata by using it as the groups of random effects. All models otherwise use age, gender, and each sequence position as a feature. Error bars show the standard deviations across three runs with different random number seeds, and in some cases are not visible. Statistics for GPBoost are computed based on the mean of the response. GPBoost and LightGBM/XGBoost including country as a feature consistently outperform other methods.
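
The metrics named in this caption can be computed directly from true and predicted labels. The sketch below uses the standard definitions and hypothetical toy labels; it is not the paper's evaluation code, and the function name is an assumption made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Precision: true positives divided by all positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: true positive rate, i.e., sensitivity.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels: 1 = Severe, 0 = Mild
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

severe = classification_metrics(y_true, y_pred, positive=1)
mild = classification_metrics(y_true, y_pred, positive=0)
print("Severe:", severe)
print("Mild:  ", mild)
```

Calling the function once per class mirrors the middle and bottom rows of the figure, where precision and recall are reported separately for Mild and Severe predictions.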

**Fig. 5**
**Receiver operating characteristic curves for best-performing modeling methods.** ROC curves were obtained using the scikit-learn package version 1.0.2 for test samples and trained models as described for Fig. 4 for XGBoost and LightGBM (with and without country metadata) and GPBoost (using country as a random effects group). The data are shown for one run; run-to-run variation was found to be insignificant. GPBoost performs better than either LightGBM or XGBoost, unless country metadata are used for the latter methods.

**Fig. 6**
**Comparison of SHAP dependence plots to severity for sequence positions 452 and 681 for the best-performing models.** LightGBM and XGBoost with country as a feature are compared to the GPBoost mixed effect model, trained on data as described in Fig. 4. The predicted SHAP values for each of the samples used to generate the SHAP estimate (sequences collected from March 8 through April 10, 2022) are plotted for the 452 and 681 sequence positions in the left and right columns respectively, showing the SHAP values for predictions with sequences of the indicated amino acid at that position, i.e., L (leucine, ancestral), M (methionine), and R (arginine) for residue 452; P (ancestral), H, R, and Y for residue 681; and ‘*’ for a missing amino acid in the sample. A positive SHAP value indicates that an amino acid change is positively related to increased severity. The interaction of the patient age feature is shown by the coloring of the points, where more red points are from older patients and blue points from younger. GPBoost indicates increased severity as expected from validated experiments of L $\to$ R for this time period.

**Fig. 7**
**Predictions of Omicron subvariant severity.** Trained GPBoost, LightGBM, and XGBoost models are simulated for representative BA.1, BA.2, BA.2.12.1, BA.4, and BA.5 sequences from a 60-year-old male patient obtained in the United States, France, and Mexico. The GISAID accession numbers of the sequences are: EPI_ISL_6590782 (BA.1), EPI_ISL_7852877 (BA.2), EPI_ISL_12048110 (BA.2.12.1), EPI_ISL_11674447 (BA.4), and EPI_ISL_12029894 (BA.5). The predictions shown here are for models trained on training data as shown in Fig. 4 and Fig. 6, where country is a feature for LightGBM and XGBoost. The GPBoost predictions shown here are for the mean of the model response, which does not vary by country, since country is not a fixed effect in the mixed effects model trained using GPBoost. By contrast, LightGBM and XGBoost predictions fluctuate significantly by simulated country. Emerging Omicron subvariants are uniformly predicted to be more severe than BA.1.

**Fig. 8**
**SHAP force plot showing impact of features on BA.2.12.1 severity prediction by GPBoost.** The “force plot” is a visualization which shows, based on SHAP values estimating the log-odds contribution of features to the model prediction, how much a specific feature tends to weigh the decision between binary classes. This plot is based on a simulated 30-year-old male patient, and thus the Age feature tends to weigh the model towards a Mild prediction for this sample. Other features tend to weigh towards a more Severe prediction, such as mutations at sites characteristic of BA.2, including positions 371 and 408.
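
Because SHAP values for a binary classifier of this kind are additive on the log-odds scale, the picture a force plot draws can be reproduced arithmetically: a base value plus the per-feature contributions gives the predicted log-odds, and the logistic function turns that into a probability. The sketch below illustrates only that arithmetic. All numbers, feature names, and the function name are hypothetical; none come from the paper's model.

```python
import math


def probability_from_contributions(base_value, contributions):
    """Combine additive log-odds contributions into a probability of Severe."""
    log_odds = base_value + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values chosen only to illustrate the arithmetic.
base_value = -0.4          # assumed average log-odds over a background set
contributions = {
    "age": -0.9,           # a younger patient pushes towards Mild
    "pos_371": 0.5,        # a mutation pushing towards Severe
    "pos_408": 0.3,        # another mutation pushing towards Severe
}

p_severe = probability_from_contributions(base_value, contributions)
print(f"predicted probability of Severe: {p_severe:.3f}")
```

In this toy case the negative age contribution outweighs the two positive mutation contributions, so the combined log-odds stay below zero and the probability stays below 0.5.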
