Using intrahost single nucleotide variant data to predict SARS-CoV-2 detection cycle threshold values

PLoS One. 2024 Oct 30;19(10):e0312686. doi: 10.1371/journal.pone.0312686. eCollection 2024.

Lea Duesterwald¹˒², Marcus Nguyen²˒³˒⁴, Paul Christensen⁵˒⁶, S Wesley Long⁵˒⁶, Randall J Olsen⁵˒⁶, James M Musser⁵˒⁶, James J Davis²˒³˒⁴

Affiliations

¹ College of Engineering, Cornell University, Ithaca, NY, United States of America.
² Northwestern-Argonne Institute for Science and Engineering, Evanston, IL, United States of America.
³ Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States of America.
⁴ Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, United States of America.
⁵ Laboratory of Human Molecular and Translational Human Infectious Diseases, Center for Infectious Diseases, Houston Methodist Research Institute and Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Houston, TX, United States of America.
⁶ Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York City, NY, United States of America.

PMID: 39475880
PMCID: PMC11524481
DOI: 10.1371/journal.pone.0312686

Lea Duesterwald et al. PLoS One. 2024.

Abstract

Over the last four years, each successive wave of the COVID-19 pandemic has been caused by variants with mutations that improve the transmissibility of the virus. Despite this, we still lack tools for predicting clinically important features of the virus. In this study, we show that it is possible to predict the PCR cycle threshold (Ct) values from clinical detection assays using sequence data. Ct values often correspond with patient viral load and the epidemiological trajectory of the pandemic. Using a collection of 36,335 high quality genomes, we built models from SARS-CoV-2 intrahost single nucleotide variant (iSNV) data, computing XGBoost models from the frequencies of A, T, G, C, insertions, and deletions at each position relative to the Wuhan-Hu-1 reference genome. Our best model had an R² of 0.604 [0.593-0.616, 95% confidence interval] and a Root Mean Square Error (RMSE) of 5.247 [5.156-5.337], demonstrating modest predictive power. Overall, we show that the results are stable relative to an external holdout set of genomes selected from SRA and are robust to patient status and the detection instruments that were used. This study highlights the importance of developing modeling strategies that can be applied to publicly available genome sequence data for use in disease prevention and control.
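For readers less familiar with the two metrics reported above, the following is a minimal sketch of how R² and RMSE are conventionally computed from actual and predicted Ct values. The function names and the Ct values are illustrative assumptions, not the authors' code or data, and the sketch does not reproduce how the reported confidence intervals were obtained.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean square error, in the same units as Ct (cycles)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Made-up Ct values for illustration only (not data from the study).
actual_ct = [18.0, 22.0, 25.0, 30.0, 35.0]
predicted_ct = [20.0, 21.0, 27.0, 28.0, 33.0]

r2_value = r_squared(actual_ct, predicted_ct)
rmse_value = rmse(actual_ct, predicted_ct)
```

Because RMSE is expressed in cycles, an RMSE of about 5 means that predictions are typically off by roughly five PCR cycles, which is consistent with the authors' description of the predictive power as modest.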

Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Histograms showing the distributions of the 20 most common variants in the dataset.**

**Fig 2. Histograms showing the distributions of Ct values in the dataset.**
Ct values were sorted into bins of 3 with an inclusive lower bound and exclusive upper bound.

**Fig 3. Schematic of the matrix generation for the XGBoost models.**
Pileup files were generated by aligning reads against the reference genome, and the iSNV frequencies of A, T, G, C, insertions, and deletions at each position were computed per genome and normalized with respect to read depth. Normalized iSNV values and the one-hot-encoded clinical detection instruments were used to create the matrix that was used to generate the XGBoost models.
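The matrix construction in this caption can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the per-position counts are assumed to be already parsed from a pileup into dictionaries, read depth is approximated as the sum of the six character counts, and the names `CHARACTERS`, `normalized_frequencies`, `one_hot`, and `build_row` are hypothetical. The instrument names are the three mentioned in Fig 4.

```python
# Six feature characters per genomic position (labels are an assumption).
CHARACTERS = ["A", "T", "G", "C", "INS", "DEL"]

def normalized_frequencies(pileup_counts):
    """Flatten per-position character counts into depth-normalized frequencies.

    pileup_counts: list of dicts, one per position, mapping character -> read count.
    Depth is approximated here as the sum of the six counts (an assumption).
    """
    row = []
    for counts in pileup_counts:
        depth = sum(counts.get(ch, 0) for ch in CHARACTERS)
        for ch in CHARACTERS:
            row.append(counts.get(ch, 0) / depth if depth else 0.0)
    return row

def one_hot(instrument, instruments):
    """One-hot encode the detection instrument against a fixed ordering."""
    return [1.0 if instrument == name else 0.0 for name in instruments]

def build_row(pileup_counts, instrument, instruments):
    """One matrix row: iSNV frequencies followed by the instrument encoding."""
    return normalized_frequencies(pileup_counts) + one_hot(instrument, instruments)

# Toy two-position "genome" for illustration only.
toy_pileup = [{"A": 98, "G": 2}, {"T": 45, "DEL": 5}]
instrument_names = ["Alinity", "Panther", "Cepheid"]
row = build_row(toy_pileup, "Panther", instrument_names)
```

Each genome contributes one such row, and the stacked rows, paired with the measured Ct values, would form the input to a gradient-boosted regressor.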

**Fig 4. Scatterplots of predicted versus actual Ct values.**
Scatterplots were constructed for models trained and tested on the following instruments: a) All instruments, b) Alinity, c) Panther, d) Cepheid. Points are colored by variant, with samples of the 10 most frequently occurring variants colored according to the key shown on the right and samples of other variants colored gray. The line y = x is shown across the center diagonal of the figure for reference. Data are from a single fold.

**Fig 5. Confusion matrices for models by instrument using a single fold.**
Model predictions were binned into Ct value ranges of 3 cycles with an inclusive lower bound and exclusive upper bound. Coloring and values in each cell represent the fraction of the actual Ct values predicted in the given interval. Empty cells with no predictions or actual values in that range are gray. Confusion matrices were constructed for models trained and tested on the following instruments: a) All instruments, b) Alinity-only, c) Panther-only, and d) Cepheid-only.
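The binning described in this caption can be sketched as below: Ct values are assigned to 3-cycle intervals with an inclusive lower bound and exclusive upper bound, and each row of the matrix is normalized so that its cells give the fraction of actual values in that interval predicted into each interval. The function names and Ct values are illustrative assumptions.

```python
def ct_bin(ct, width=3):
    """Index k of the half-open bin [k*width, (k+1)*width) containing ct."""
    return int(ct // width)

def binned_confusion(actual, predicted, width=3):
    """Return {actual_bin: {predicted_bin: fraction}}, each row summing to 1."""
    counts = {}
    for a, p in zip(actual, predicted):
        row = counts.setdefault(ct_bin(a, width), {})
        b = ct_bin(p, width)
        row[b] = row.get(b, 0) + 1
    return {
        a_bin: {p_bin: n / sum(row.values()) for p_bin, n in row.items()}
        for a_bin, row in counts.items()
    }

# Made-up Ct values for illustration only.
actual_ct = [18.0, 19.5, 20.9, 21.0, 23.9]
predicted_ct = [18.5, 21.2, 19.0, 22.0, 24.0]
matrix = binned_confusion(actual_ct, predicted_ct)
```

In this toy example, bin index 6 covers Ct values in [18, 21) and bin index 7 covers [21, 24), so a value of exactly 21.0 falls in the upper of the two bins.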

**Fig 6**
Dot plot depicting the average Ct value (A) and XGBoost feature information gain (B) for each position and character used by the all-instrument model. Each base at a given position is colored according to the key. Genomic positions correspond to the SARS-CoV-2 Wuhan-Hu-1 reference genome. For each position, only genomes where ≥40% of the characters in the column corresponded to a given nucleotide were used to generate the average gain in order to reduce noise in the image. Additionally, only statistically significant bases are included; significance was computed based on the 95% confidence interval of the average Ct value of genomes with a given base and those without. No INDEL features met this significance requirement. The spike protein corresponds to genomic coordinates 21563–25384.
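The caption does not spell out the exact significance test, so the following is only one plausible reading, sketched under explicit assumptions: a normal-approximation 95% confidence interval (mean ± 1.96 standard errors) is computed for the mean Ct of genomes with a given base and of genomes without it, and the base is treated as significant when the two intervals do not overlap. The function names and values are hypothetical.

```python
import math
import statistics

def mean_ci95(values):
    """Mean and normal-approximation 95% CI (mean +/- 1.96 * standard error)."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width

def intervals_disjoint(with_base, without_base):
    """True when the two 95% CIs do not overlap (assumed significance rule)."""
    _, lo_a, hi_a = mean_ci95(with_base)
    _, lo_b, hi_b = mean_ci95(without_base)
    return hi_a < lo_b or hi_b < lo_a

# Made-up Ct values for illustration only.
ct_with_base = [18.0, 19.0, 20.0, 21.0, 22.0]
ct_without_base = [28.0, 29.0, 30.0, 31.0, 32.0]
ct_similar = [19.0, 20.0, 21.0, 22.0, 23.0]

clearly_different = intervals_disjoint(ct_with_base, ct_without_base)
not_different = intervals_disjoint(ct_with_base, ct_similar)
```

Non-overlapping intervals are a conservative criterion; the study may have used a different rule, so this should be read as an aid to interpreting the figure rather than a reproduction of the analysis.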

**Fig 7**
The A) R² (red line), and B) RMSE (blue line) for a holdout set of Omicron genomes using models trained on increasing percentages of Omicron genomes in the training set. The green dashed line depicts the R² and RMSE for the training set, which contains no Omicron genomes.

Grants and funding

75N93019C00076/AI/NIAID NIH HHS/United States
