Virus Evol. 2025 Jan 17;11(1):veae115.

doi: 10.1093/ve/veae115. eCollection 2025.

SARS-CoV-2 CoCoPUTs: analyzing GISAID and NCBI data to obtain codon statistics, mutations, and free energy over a multiyear period

Affiliations

¹ Hemostasis Branch 1, Division of Hemostasis, Office of Plasma Protein Therapeutics CMC, Office of Therapeutic Products, Center for Biologics Evaluation and Research, Food and Drug Administration, 10903 New Hampshire Ave, Silver Spring, MD 20993, USA.
² High-performance Integrated Virtual Environment (HIVE), Center for Biologics Evaluation and Research (CBER), Food and Drug Administration (FDA), 10903 New Hampshire Ave, Silver Spring, MD 20993, USA.
³ Rockville, MD 20853, USA.
⁴ Afeka Tel-Aviv Academic College of Engineering, Mivtsa Kadesh St 38, Tel Aviv-Yafo 6998812, Israel.
⁵ Department of Biological, Geological and Environmental Sciences, Center for Gene Regulation in Health and Disease, Cleveland State University, 2121 Euclid Avenue, SR 259, Cleveland, OH 44115, USA.
⁶ Department of Biochemistry and Center for RNA Science and Therapeutics, School of Medicine, Case Western Reserve University, 10900 Euclid Ave, Cleveland, OH 44106, USA.

PMID: 39882309
PMCID: PMC11776705
DOI: 10.1093/ve/veae115

Nigam H Padhiar et al. Virus Evol. 2025.

Abstract

A consistent area of interest since the beginning of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been the sequence composition of the virus and how it has changed over time. Many resources have been developed for the storage and analysis of SARS-CoV-2 data, such as GISAID (Global Initiative on Sharing All Influenza Data), NCBI, Nextstrain, and outbreak.info. However, relatively little has been done to compile codon usage data, codon-level mutation data, and secondary structure data into a single database. Here, we assemble these data and many additional virus attributes in a new database entitled SARS-CoV-2 CoCoPUTs. We begin with an overview of the composition of, and overlap between, two of the largest sources of SARS-CoV-2 sequence data: GISAID and NCBI Virus (GenBank). We then evaluate different sequence curation strategies for reducing the dataset of millions of sequences to one sequence per Pango lineage variant. Next, we perform specific analyses on the coding sequences (CDSs), including calculating codon usage, codon pair usage, dinucleotides, junction dinucleotides, mutations, GC content, effective number of codons (ENCs), and effective number of codon pairs (ENCPs). We also perform whole-genome RNA secondary structure predictions for each variant, using the LinearPartition software and modified selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE) data that are available online. Finally, we compile all the data into our resource, SARS-CoV-2 CoCoPUTs, and pair many of the resulting statistics with variant proportion data over time in order to derive trends in viral evolution.
Although the overall codon usage of SARS-CoV-2 did not change drastically, in line with the previous literature on this subject, we did observe that while overall GC% content decreased, GC% of the third position in the codon was more positive relative to overall GC% content between February 2021 and July 2023. Over the same interval, we noted that both synonymous and nonsynonymous mutations increased in number, with nonsynonymous mutations outpacing synonymous mutations at a rate of 3:1. We noted that the predicted whole-genome secondary structures nearly all contained the previously described virus-activated inhibitor of translation (VAIT) stem loops, validating for the first time their existence in a whole-genome secondary structure prediction for many SARS-CoV-2 variants (as opposed to previous local secondary structure predictions). We also separately produced a synonymous mutation-deprived set of SARS-CoV-2 variant sequences and repeated the secondary structure calculations on this set. This revealed an interesting trend of reduced ensemble free energy compared to the unaltered variant structures, indicating that synonymous mutations play a role in increasing the free energy of viral RNA molecules. These data both validate previous studies describing increases in viral free energy in human viruses over time and indicate a possible role for synonymous mutations in viral biology.
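
Several of the per-sequence statistics named above (codon usage, GC%, position-specific GC%, and junction dinucleotides) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names and the toy sequence are hypothetical, and the sketch assumes an in-frame coding sequence with any trailing partial codon ignored.

```python
from collections import Counter


def codons(cds):
    """Split an in-frame coding sequence into complete codons."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_usage(cds):
    """Relative frequency of each codon observed in the CDS."""
    counts = Counter(codons(cds))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def gc_percent(cds, position=None):
    """GC% over all codon bases, or over one codon position (1, 2, or 3)."""
    triplets = codons(cds)
    if position is None:
        bases = "".join(triplets)
    else:
        bases = "".join(t[position - 1] for t in triplets)
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def junction_dinucleotides(cds):
    """Count dinucleotides spanning codon boundaries (third base + next first base)."""
    triplets = codons(cds)
    return Counter(a[2] + b[0] for a, b in zip(triplets, triplets[1:]))


toy_cds = "ATGGCGGCCTAA"  # hypothetical toy sequence, not SARS-CoV-2 data
usage = codon_usage(toy_cds)
gc_all = gc_percent(toy_cds)
gc3 = gc_percent(toy_cds, position=3)
junctions = junction_dinucleotides(toy_cds)
```

For the toy sequence, the third-position GC% (75%) is higher than the overall GC% (about 58%), which is the kind of position-specific contrast the abstract describes for GC3% versus overall GC%.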

Keywords: SARS-CoV-2; VAIT; bioinformatics; codon usage; secondary structure.

Published by Oxford University Press 2025. This work is written by (a) US Government employee(s) and is in the public domain in the US.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Beginning with raw sequence data and metadata from GISAID and NCBI, we QC sequences using Nextclade software, split sequences into individual FASTA files for each Pango variant, identify the overlap (matched metadata and sequence identity) between GISAID and NCBI to create a deduplicated combined sequence resource, use a pseudo-ancestral sequence curation strategy to reduce the large cohort of sequences to one sequence per variant, and finally perform our sequence composition and secondary structure/free energy calculations for presentation on our website and within this manuscript.

**Figure 2.**
(a) Five different categories of “matching” between NCBI and GISAID sequences prior to QC, ranging from a complete sequence and metadata match to a complete sequence and metadata mismatch; (b) the regional distribution (by continent) of the entire combined dataset; (c–d) the regional distribution of sequences from GISAID, NCBI, and the overlap (only) between the two.

**Figure 3.**
(a) The submissions per month from GISAID, NCBI, and the overlap between them; (b) The monthly submissions by country, where only the top 10 countries by total volume of submissions are included, while all other countries are grouped into the “All Others” category; (c) The total submissions by country, again showing only the top 10; (d) The same countries as a percentage of the total human population (based on publicly available UN data).

**Figure 4.**
(a) The initial landing page, with the option to search multiple SARS-CoV-2 strains in a single query or split them across multiple queries; the former will result in their results being aggregated together, whereas the latter will result in them being compared against each other; (b) One of the “Results” panels, which in this case shows a bar graph of codon usage between two different queries (e.g. lineages “A.2 and AL.1” and lineage “AY.4”).

**Figure 5.**
(a–c) Respectively, the percent abundance of each individual codon, dinucleotide, or junction dinucleotide was first calculated for each variant, after which the difference in abundance was calculated compared to the wild-type reference, and these difference values were computed for each time point based on the variant composition in that month; (d) The difference in overall GC content (GC%) as well as the difference in GC content for the first, second, and third positions within each codon (GC1–3%) compared to the wild-type for each time point, based on the variant proportions at that time point.
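
One way to read the Figure 5 calculation is as a proportion-weighted average of per-variant differences from the wild-type reference. The sketch below assumes exactly that; the weighting scheme, the function name, the variant labels, and all numbers are illustrative assumptions rather than details confirmed by the caption.

```python
def weighted_difference(variant_values, reference_value, proportions):
    """Proportion-weighted mean difference from the reference for one time point.

    variant_values: statistic (e.g. GC%) per variant
    reference_value: the same statistic for the wild-type reference
    proportions: share of each variant circulating at this time point
    """
    total = sum(proportions.values())  # normalise in case shares do not sum to 1
    return sum(
        (variant_values[name] - reference_value) * share / total
        for name, share in proportions.items()
    )


# Hypothetical values for a single month (not taken from the paper).
gc_by_variant = {"variant_A": 37.0, "variant_B": 38.5}
wild_type_gc = 38.0
month_shares = {"variant_A": 0.75, "variant_B": 0.25}

delta_gc = weighted_difference(gc_by_variant, wild_type_gc, month_shares)
```

Repeating the call with each month's variant shares would give one difference value per time point.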

**Figure 6.**
(a–b) Changes relative to wild-type in CAI and CPAI (respectively) over a 46-month period for both the whole-genome CDS (first row) and all individual genes (subsequent rows); (c) Changes relative to wild-type in ENC, ENC with GC correction, ENCP, and ENCP with GC correction over the same time period; for all panels, red color is used to depict points where the CAI or CPAI exceeded the maximum range of the color bar (only applies to ORF8).

**Figure 7.**
(a) Changes in free energy over a 46-month period for both unmodified SARS-CoV-2 variants and a special “synonymous removed” set of variant sequences; (b) changes in both synonymous and nonsynonymous mutations over time; (c) autocorrelation statistic for the unmodified free energy time series.
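
The caption does not specify which autocorrelation statistic is plotted in panel (c). As a minimal sketch, assuming the standard sample autocorrelation at a given lag, the calculation for a monthly time series might look like the following; the function name and the toy series are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a time series at a non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance_sum / variance_sum


toy_series = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical values, not free energies from the paper
r0 = autocorrelation(toy_series, 0)
r1 = autocorrelation(toy_series, 1)
```

A slowly decaying autocorrelation across lags would indicate a persistent trend in the series rather than month-to-month noise.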

**Figure 8.**
(a–f) The number of nucleotides attributable to the 5ʹ unpaired end, stems, interior loops, multiloops, hairpin loops, and the 3ʹ unpaired end (respectively) over time.

**Figure 9.**
(a) The secondary structure conservation across the thousands of SARS-CoV-2 variants in our dataset; (b) the number of stem-loops identified over a 46-month period, with each time point representing the number of stem-loops for the unique distribution of SARS-CoV-2 variants at that time.

Grants and funding

R01 HL151392/HL/NHLBI NIH HHS/United States
