Sci Rep. 2025 Jul 4;15(1):23961.
doi: 10.1038/s41598-025-09911-1.

Tracing the evolutionary pathway of SARS-CoV-2 through RNA sequencing analysis

Mostafa Rezapour et al. Sci Rep. 2025.

Abstract

The COVID-19 pandemic, driven by the Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), has underscored the need to understand the virus's evolution due to its global health impact. This study employed RNA sequencing (RNA-Seq) to analyze gene expression differences across multiple SARS-CoV-2 variants. We used publicly available datasets from the Gene Expression Omnibus (GEO) with IDs GSE157103, GSE171110, GSE189039, and GSE201530, which contain RNA-Seq data extracted from white blood cells, whole blood, or PBMCs of individuals infected with the Original Wuhan variant (both hospitalized and non-hospitalized), the French variant (hospitalized), the Beta variant (hospitalized), and the Omicron variant (moderate and mild cases), along with COVID-negative controls. Our first objective was to examine differences in gene expression dynamics using Generalized Linear Models with Quasi-Likelihood F-tests and the Magnitude-Altitude Scoring (GLMQL-MAS) technique, followed by Gene Ontology (GO) and pathway analyses. Our second objective was to employ Cross-MAS to identify a robust set of genes indicative of SARS-CoV-2 infection regardless of the variant and to assess their classification performance. GO and pathway analyses revealed a significant evolutionary shift in how SARS-CoV-2 interacts with the host. Early variants such as the Original Wuhan and French cases primarily affected pathways related to viral replication, including Eukaryotic Translation Elongation and Viral mRNA Translation. In contrast, later variants like Beta and Omicron showed a strategic shift toward modulating and evading the host immune response, engaging immune-related pathways such as Interferon Alpha/Beta signaling and Cytokine signaling in the immune system. 
To evaluate the classification potential of the identified genes, we tested them on held-out datasets GSE152418, PMC8202013, GSE161731, and GSE166190, which contain RNA-Seq data from whole blood or PBMCs of COVID-positive and healthy individuals. Using top-ranked genes such as IFI27, CDC20, RRM2, HJURP, and CDC45 in linear models including logistic regression and linear SVM, we achieved 97.31% accuracy, with precision and recall rates of 0.97 and 0.99, respectively. These signatures also achieved perfect classification (100% accuracy, precision, and recall) in two additional datasets: GSE294888, which includes blood-derived plasmacytoid dendritic cells (pDCs) and type 2 conventional dendritic cells (DC2s) stimulated with Delta or Omicron variants, and GSE239595, which features Omicron-infected nasopharyngeal tissue. These findings demonstrate the potential of transcriptomic signatures for variant-agnostic COVID-19 detection and provide a foundation for flexible diagnostic and therapeutic approaches in response to SARS-CoV-2 evolution.
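The accuracy, precision, and recall figures quoted above are standard confusion-matrix ratios. The following is a minimal sketch of how such metrics can be computed per dataset and then pooled by summing counts, which is one plausible reading of the aggregated results reported here. The labels are made up for illustration and are not the study's data, and `ConfusionMatrix`, `from_labels`, and `aggregate` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix; 1 = COVID-positive, 0 = COVID-negative."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def precision(self) -> float:
        # Fraction of predicted positives that are truly positive.
        predicted_pos = self.tp + self.fp
        return self.tp / predicted_pos if predicted_pos else 0.0

    @property
    def recall(self) -> float:
        # Fraction of true positives that the classifier recovered.
        actual_pos = self.tp + self.fn
        return self.tp / actual_pos if actual_pos else 0.0


def from_labels(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionMatrix:
    """Count the four outcome types from paired true/predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    return ConfusionMatrix(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
    )


def aggregate(matrices: Iterable[ConfusionMatrix]) -> ConfusionMatrix:
    """Pool several per-dataset matrices by summing their counts."""
    ms = list(matrices)
    return ConfusionMatrix(
        tp=sum(m.tp for m in ms),
        fp=sum(m.fp for m in ms),
        tn=sum(m.tn for m in ms),
        fn=sum(m.fn for m in ms),
    )


# Made-up labels for two hypothetical held-out datasets (not the study's data).
dataset_a = from_labels([1, 1, 1, 0, 0, 1, 0, 1], [1, 1, 1, 0, 1, 1, 0, 0])
dataset_b = from_labels([1, 1, 0, 0], [1, 1, 0, 0])
pooled = aggregate([dataset_a, dataset_b])
print(f"accuracy={pooled.accuracy:.4f} "
      f"precision={pooled.precision:.4f} recall={pooled.recall:.4f}")
```

Pooling counts before computing the ratios weights each sample equally, so a large dataset influences the combined metrics more than a small one; averaging per-dataset metrics would weight each dataset equally instead.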

Keywords: Diagnostic biomarkers; Gene Ontology (GO); Gene expression analysis; Machine learning; Pathway analysis; RNA sequencing (RNA-Seq); SARS-CoV-2 variants.

Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
This figure depicts the analysis process employed in this study, starting from the use of RNA-Seq datasets to analyze gene expression dynamics across multiple SARS-CoV-2 variants, including the Original Wuhan variant, severe hospitalized French cases, severe Beta cases, and moderate Omicron cases. The workflow incorporates the GLMQL-MAS technique to identify differences in gene expression, followed by pathway analyses that aim to provide insights into the virus’s evolution and molecular differences. The second phase involves identifying common significant gene signatures across all variants using Cross-MAS and assessing these genes’ potential as signatures via principal component analysis and linear classifiers.
Fig. 2
All panels in this figure apply the same criteria for identifying significant genes: BH-adjusted p-value < 0.05 and [formula not shown]. (a) Volcano plot of differentially expressed genes in the Original Wuhan non-hospitalized cohort (GSE157103), highlighting significantly upregulated (red) and downregulated (blue) genes. The top 10 upregulated and top 10 downregulated genes, selected based on the highest Magnitude-Altitude Scores (MAS), are annotated, with u1-u10 denoting the top-ranked upregulated genes and d1-d10 representing the top-ranked downregulated genes. (b) Summary of the total number of significant upregulated and downregulated genes, along with the top-ranked genes in each direction, for the non-hospitalized cohort (GSE157103). Panels (a) and (b) present equivalent information in terms of the top 10 annotated genes and the total number of significant upregulated and downregulated genes. (c) Total number of significant genes in the Original Wuhan hospitalized group (GSE157103). (d) Total number of significant genes in the French dataset (GSE171110). (e) Total number of significant genes in the Beta dataset (GSE189039). (f) Total number of significant genes in the Omicron dataset (GSE201530). (g) Network plot showing pairwise overlaps of significant genes across all five datasets. Each node is labeled with the total number of significant genes for that variant. Edge labels indicate the number of overlapping significant genes between dataset pairs. Node size and edge color are used for visualization purposes only and do not encode quantitative values.
Fig. 3
(a) Top 10 enriched Gene Ontology (GO) biological process terms (q < 0.05) for each dataset based on upregulated genes. Dot size reflects the number of genes associated with each term (Count), and color indicates statistical significance (-log(q-value)). (b) Top enriched Reactome pathways from each dataset, with dot size representing the number of intersecting genes and color denoting -log(p-value).
Fig. 4
Distribution of significant GO terms across different patient groups. Panel (a) depicts the common and unique significant GO terms associated with upregulated BH-significant genes ([formula not shown]), including detailed top-5 descriptions for select combinations. Panel (b) focuses on BH-significant genes with downregulated expression ([formula not shown]).
Fig. 5
(a) Distribution and ranking of BH-significant genes meeting the LogFC threshold of 1 across five patient groups (Original-Non ICU, Original-ICU, French-ICU, Beta-severe, and Omicron-Moderate), with the top-5 genes per group highlighted. (b) Overview of BH-significant genes meeting the LogFC threshold of -1, showing commonalities and differences among the groups. (c) Top 20 enriched biological terms from a functional enrichment analysis of 37 significant genes, displayed by -log10(p-value) to emphasize the most relevant pathways and processes. (d) Protein–protein interaction network constructed from the Cross-MAS selected genes, showing only nodes with a connectivity degree of five or more to highlight key biological interactions within the data.
Fig. 6
(a,b) PCA visualizations show the impact of the Cross-MAS selected genes on distinguishing COVID-negative from positive cases in ICU patients infected with the original Wuhan strain. (c–e) Separation of COVID-negative and positive cases using the first two principal components of Cross-MAS selected genes across French-hospitalized/ICU, Beta-hospitalized, and Omicron-moderate variants. (f) Cumulative aggregated confusion matrix for logistic regression and SVM analyses using the first 5 PCs of the selected genes across training datasets GSE157103, GSE171110, GSE189039, and GSE201530.
Fig. 7
(a–d) Individual confusion matrices, along with the corresponding accuracy, precision, and recall metrics for logistic regression and SVM with a linear kernel, applied to held-out test datasets GSE152418, PMC8202013, GSE161731, and GSE166190. Each panel illustrates the models’ performance on a separate dataset. (e) Aggregated confusion matrices and combined metrics of accuracy, precision, and recall from the datasets mentioned.
Fig. 8
Classification performance of Cross-MAS selected genes in datasets GSE294888 and GSE239595. Panels (a) and (b) show PCA plots of DC2 and pDC populations, respectively, from GSE294888, stimulated in vitro with SARS-CoV-2 Delta or Omicron BA.1, along with non-infected controls. Panels (c) and (d) display confusion matrices from logistic regression models using the first two principal components, demonstrating perfect classification between non-infected and variant-stimulated cells in both pDCs and DC2s (accuracy = 100%, recall = 1.00, precision = 1.00). Panel (e) shows PCA of nasopharyngeal (NP) samples from SARS-CoV-2 Omicron-positive patients and healthy controls from GSE239595. Panel (f) presents the corresponding confusion matrix, again indicating perfect classification performance (accuracy = 100%, recall = 1.00, precision = 1.00).
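The gene-selection step referred to in the captions for Fig. 2 and Fig. 5 (a BH-adjusted p-value below 0.05 combined with a LogFC threshold, followed by ranking on a Magnitude-Altitude Score) can be sketched roughly as below. Several points are assumptions rather than facts from this record: the fold-change rule is taken to be |log2FC| ≥ 1, the score is taken to be the product of |log2FC| and |log10(adjusted p)| because the MAS formula is not stated here, and the gene names, values, and the helper names `bh_adjust` and `rank_significant` are hypothetical.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple


class ScoredGene(NamedTuple):
    name: str
    log2fc: float
    p_adj: float
    score: float


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rank_significant(
    genes: Sequence[Tuple[str, float, float]],
    alpha: float = 0.05,
    lfc_cutoff: float = 1.0,  # assumed threshold, see the lead-in
) -> List[ScoredGene]:
    """Filter (name, log2fc, raw p) records and rank them by the assumed score."""
    p_adj = bh_adjust([p for _, _, p in genes])
    kept = []
    for (name, lfc, _), q in zip(genes, p_adj):
        if q < alpha and abs(lfc) >= lfc_cutoff:
            # Assumed score: magnitude (|log2FC|) times altitude (|log10 q|).
            # The floor on q avoids log10(0) for extremely small p-values.
            score = abs(lfc) * abs(math.log10(max(q, 1e-300)))
            kept.append(ScoredGene(name, lfc, q, score))
    return sorted(kept, key=lambda g: g.score, reverse=True)


# Hypothetical differential-expression results (not taken from the study).
toy_results = [
    ("GENE_A", 2.5, 0.0001),
    ("GENE_B", -1.8, 0.002),
    ("GENE_C", 0.4, 0.0005),   # significant p-value but small fold change
    ("GENE_D", 1.2, 0.2),      # large fold change but not significant
    ("GENE_E", -3.0, 0.01),
]
ranked = rank_significant(toy_results)
upregulated = [g.name for g in ranked if g.log2fc > 0]
downregulated = [g.name for g in ranked if g.log2fc < 0]
print("up:", upregulated, "down:", downregulated)
```

Splitting the ranked list by the sign of the fold change mirrors the separate upregulated and downregulated rankings described for the volcano plot, although the exact ranking rule used by the authors may differ from this sketch.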
