PLoS One. 2024 Oct 30;19(10):e0312686.
doi: 10.1371/journal.pone.0312686. eCollection 2024.

Using intrahost single nucleotide variant data to predict SARS-CoV-2 detection cycle threshold values

Lea Duesterwald et al. PLoS One.

Abstract

Over the last four years, each successive wave of the COVID-19 pandemic has been caused by variants with mutations that improve the transmissibility of the virus. Despite this, we still lack tools for predicting clinically important features of the virus. In this study, we show that it is possible to predict the PCR cycle threshold (Ct) values from clinical detection assays using sequence data. Ct values often correspond with patient viral load and the epidemiological trajectory of the pandemic. Using a collection of 36,335 high-quality genomes, we trained XGBoost models on SARS-CoV-2 intrahost single nucleotide variant (iSNV) data, using as features the frequencies of A, T, G, C, insertions, and deletions at each position relative to the Wuhan-Hu-1 reference genome. Our best model had an R² of 0.604 [0.593-0.616, 95% confidence interval] and a root mean square error (RMSE) of 5.247 [5.156-5.337], demonstrating modest predictive power. Overall, we show that the results are stable relative to an external holdout set of genomes selected from SRA and are robust to patient status and the detection instruments that were used. This study highlights the importance of developing modeling strategies that can be applied to publicly available genome sequence data for use in disease prevention and control.
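The two headline metrics can be made concrete with a minimal sketch. It assumes R² means the coefficient of determination (1 - SS_res / SS_tot); the abstract does not state which definition the authors used, and the Ct values below are made up for illustration rather than taken from the study.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted Ct values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical Ct values, for illustration only (not data from the study).
actual_ct = [18.0, 22.0, 26.0, 30.0]
predicted_ct = [20.0, 22.0, 25.0, 29.0]

print(f"RMSE = {rmse(actual_ct, predicted_ct):.3f}")
print(f"R2   = {r_squared(actual_ct, predicted_ct):.3f}")
```

Because RMSE is expressed in PCR cycles, the reported value of 5.247 means a typical prediction is off by roughly five cycles, which is consistent with the authors' description of the predictive power as modest.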

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Histograms showing the distributions of the 20 most common variants in the dataset.
Fig 2. Histograms showing the distributions of Ct values in the dataset.
Ct values were sorted into bins of 3 with an inclusive lower bound and exclusive upper bound.
Fig 3. Schematic of the matrix generation for the XGBoost models.
Pileup files were generated by aligning reads against the reference genome, and the iSNV frequencies of A, T, G, C, insertions, and deletions at each position were computed per genome and normalized with respect to read depth. The normalized iSNV values and the one-hot-encoded clinical detection instruments were combined into the matrix used to train the XGBoost models.
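The matrix construction described in this caption might look like the following sketch for a single genome. It is not the authors' pipeline: the helper names are hypothetical, the per-position counts are made up, and read depth is assumed here to be the sum of the six counted characters, which the caption does not specify. The instrument names are those listed for Fig 4.

```python
# Characters counted at every reference position (order fixes the column layout).
CHARACTERS = ["A", "T", "G", "C", "INS", "DEL"]
INSTRUMENTS = ["Alinity", "Panther", "Cepheid"]

def position_frequencies(counts):
    """Normalize one position's character counts by read depth.

    Assumption: depth is the sum of the six counted characters.
    """
    depth = sum(counts.get(ch, 0) for ch in CHARACTERS)
    if depth == 0:
        return [0.0] * len(CHARACTERS)  # no coverage at this position
    return [counts.get(ch, 0) / depth for ch in CHARACTERS]

def one_hot(value, categories):
    """One-hot encode a categorical value against a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

def feature_row(per_position_counts, instrument):
    """Concatenate normalized frequencies for all positions with the instrument encoding."""
    row = []
    for counts in per_position_counts:
        row.extend(position_frequencies(counts))
    row.extend(one_hot(instrument, INSTRUMENTS))
    return row

# Hypothetical pileup counts for a two-position toy genome.
toy_counts = [{"A": 90, "G": 10}, {"C": 45, "DEL": 5}]
row = feature_row(toy_counts, "Panther")
print(len(row), row)
```

Stacking one such row per genome would give a matrix with six columns per reference position plus one column per instrument, which is the shape the caption describes.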
Fig 4. Scatterplots of predicted versus actual Ct values.
Scatterplots were constructed for models trained and tested on the following instruments: a) All instruments, b) Alinity, c) Panther, d) Cepheid. Points are colored by variant, with samples of the 10 most frequently occurring variants colored via the key shown on the right and samples of other variants colored gray. The line y = x is shown across the center diagonal of the figure for reference. Data are from a single fold.
Fig 5. Confusion matrices for models by instrument using a single fold.
Model predictions were binned into Ct value ranges of 3 cycles with an inclusive lower bound and exclusive upper bound. Coloring and values in each cell represent the fraction of the actual Ct values predicted in the given interval. Empty cells with no predictions or actual values in that range are gray. Confusion matrices were constructed for models trained and tested on the following instruments: a) All instruments, b) Alinity-only, c) Panther-only, and d) Cepheid-only.
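The binning and row normalization described in this caption can be sketched as follows. The bin width of 3 and the inclusive-lower, exclusive-upper convention come from the caption; aligning bin edges to multiples of 3 is an assumption, and the Ct values are made up.

```python
import math
from collections import defaultdict

BIN_WIDTH = 3  # cycles per bin, as stated in the caption

def ct_bin(ct, width=BIN_WIDTH):
    """Return the [lower, upper) bin containing ct.

    Assumption: bin edges fall on multiples of the width.
    """
    lower = math.floor(ct / width) * width
    return (lower, lower + width)

def binned_confusion(actual, predicted, width=BIN_WIDTH):
    """Map each actual bin to the fraction of its samples predicted in each bin."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, p in zip(actual, predicted):
        counts[ct_bin(a, width)][ct_bin(p, width)] += 1
    fractions = {}
    for actual_bin, row in counts.items():
        total = sum(row.values())
        fractions[actual_bin] = {pred_bin: n / total for pred_bin, n in row.items()}
    return fractions

# Hypothetical values: 21.0 falls in [21, 24), not [18, 21).
actual_ct = [18.0, 19.5, 20.9, 21.0]
predicted_ct = [18.5, 22.0, 19.0, 23.9]
matrix = binned_confusion(actual_ct, predicted_ct)
for actual_bin in sorted(matrix):
    print(actual_bin, matrix[actual_bin])
```

Each row sums to 1, so a cell reads as "of the samples whose actual Ct fell in this bin, this fraction was predicted in that bin". Bins with no samples are simply absent from the dictionary, corresponding to the gray cells in the figure.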
Fig 6. Dot plot depicting the average Ct value (A) and XGBoost feature information gain (B) for each position and character used by the all-instrument model. Each base at a given position is colored according to the key. Genomic positions correspond to the SARS-CoV-2 Wuhan-Hu-1 reference genome. For each position, only genomes where ≥40% of the characters in the column corresponded to a given nucleotide were used to generate the average gain in order to reduce noise in the image. Additionally, only statistically significant bases are included; significance was computed based on the 95% confidence interval of the average Ct value of genomes with a given base and those without. No INDEL features met this significance requirement. The spike protein corresponds to genomic coordinates 21563–25384.
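One way to read the significance criterion is sketched below. The caption does not give the exact procedure, so several choices here are assumptions: a genome counts as "having" a base when that base's frequency at the position is at least 0.40, the interval is a normal-approximation 95% confidence interval (mean ± 1.96 standard errors), and a base is called significant when the two intervals do not overlap. The function names and data are hypothetical.

```python
import math
import statistics

def mean_ci95(values):
    """Mean with a normal-approximation 95% confidence interval (assumed method)."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width

def base_is_significant(base_freqs, ct_values, min_freq=0.40):
    """Compare mean Ct of genomes with and without a base at one position.

    Assumption: a base is significant when the two 95% intervals do not overlap.
    """
    with_base = [ct for f, ct in zip(base_freqs, ct_values) if f >= min_freq]
    without_base = [ct for f, ct in zip(base_freqs, ct_values) if f < min_freq]
    if len(with_base) < 2 or len(without_base) < 2:
        return False  # too few genomes to estimate an interval
    _, lo_with, hi_with = mean_ci95(with_base)
    _, lo_without, hi_without = mean_ci95(without_base)
    return hi_with < lo_without or hi_without < lo_with

# Hypothetical per-genome frequencies of one base at one position, with Ct values.
freqs = [0.95, 0.90, 0.85, 0.02, 0.01, 0.05]
cts = [18.0, 19.0, 20.0, 28.0, 29.0, 30.0]
print(base_is_significant(freqs, cts))
```

Non-overlapping intervals are a conservative test; a formal two-sample test would flag more bases, so this sketch should not be taken as reproducing the figure's filter exactly.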
Fig 7. A) R² (red line) and B) RMSE (blue line) for a holdout set of Omicron genomes, using models trained on increasing percentages of Omicron genomes in the training set. The green dashed line depicts the R² and RMSE for the training set, which contains no Omicron genomes.
