Mol Biol Evol. 2021 Dec 9;38(12):5211-5224.
doi: 10.1093/molbev/msab246.

Recovery of Deleted Deep Sequencing Data Sheds More Light on the Early Wuhan SARS-CoV-2 Epidemic

Jesse D Bloom. Mol Biol Evol.

Abstract

The origin and early spread of SARS-CoV-2 remain shrouded in mystery. Here, I identify a data set containing SARS-CoV-2 sequences from early in the Wuhan epidemic that has been deleted from the NIH's Sequence Read Archive. I recover the deleted files from the Google Cloud and reconstruct partial sequences of 13 early epidemic viruses. Phylogenetic analysis of these sequences in the context of carefully annotated existing data further supports the idea that the Huanan Seafood Market sequences are not fully representative of the viruses in Wuhan early in the epidemic. Instead, the progenitor of currently known SARS-CoV-2 sequences likely contained three mutations relative to the market viruses that made it more similar to SARS-CoV-2's bat coronavirus relatives.

Keywords: COVID-19; SARS-CoV-2; Sequence Read Archive; forensic bioinformatics; phylogenetics.

Conflict of interest statement

The author consults for Moderna on SARS-CoV-2 evolution and epidemiology, consults for Flagship Labs 77 on viral evolution and deep mutational scanning, and has the potential to receive a share of IP revenue as an inventor on a Fred Hutch licensed technology/patent (application WO2020006494) related to deep mutational scanning of viral proteins.

Figures

Fig. 1.
Accessions from deep sequencing project PRJNA612766 have been removed from the SRA. Shown is the result of searching for “SRR11313485” in the SRA search toolbar. This result has been digitally archived on the Wayback Machine at https://web.archive.org/web/20210502131630/https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11313485.
Fig. 2.
The reported collection dates of SARS-CoV-2 sequences in GISAID versus their relative mutational distances from the RaTG13 bat coronavirus outgroup. Mutational distances are relative to the putative progenitor proCoV2 inferred by Kumar et al. (2021), which itself differs from RaTG13 by 1,132 mutations—so a sequence with a relative mutational distance of 2 actually has 1,134 differences from RaTG13. Note that the lower-right point in the middle (green) panel corresponds to a sequence (Guangdong/FS-30-P00502/2020) reportedly collected in late February that is actually two mutations more similar to RaTG13 than proCoV2. The plot shows only sequences in GISAID collected no later than February 28, 2020. Sequences that the joint WHO-China report (WHO 2021) describes as being associated with the Wuhan Seafood Market are plotted with squares. Points are slightly jittered on the y-axis. Go to https://jbloom.github.io/SARS-CoV-2_PRJNA612766/deltadist.html for an interactive version of this plot that enables toggling of the outgroup to RpYN06 and RmYN02, mouseovers to see details for each point including strain name and mutations relative to proCoV2, and adjustment of the y-axis jittering. Static versions of the plot with RpYN06 and RmYN02 outgroups are in supplementary fig. S3, Supplementary Material online.
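The arithmetic behind the relative mutational distance in this caption can be sketched as follows. This is a minimal illustration, not the author's pipeline: the function names and the toy sequences are hypothetical, the sequences are assumed to be pre-aligned and of equal length, and gaps or ambiguous bases are not handled. Only the offset of 1,132 is taken from the caption.

```python
# Offset stated in the caption: proCoV2 differs from RaTG13 by 1,132 mutations.
PROCOV2_TO_RATG13 = 1132


def count_differences(a: str, b: str) -> int:
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def relative_distance(seq: str, progenitor: str, outgroup: str) -> int:
    """Distance of seq from the outgroup, minus the progenitor's distance.

    Positive values mean seq is further from the outgroup than the
    progenitor; negative values mean it is closer.
    """
    return count_differences(seq, outgroup) - count_differences(progenitor, outgroup)


def absolute_distance(relative: int, baseline: int = PROCOV2_TO_RATG13) -> int:
    """Convert a relative distance back to total differences from the outgroup."""
    return baseline + relative


# Toy 8-nucleotide sequences (hypothetical, for illustration only).
outgroup = "ACGTACGT"
progenitor = "ACGTACGA"  # one difference from the outgroup
further = "TCGTACCA"     # two extra differences on top of the progenitor's
closer = "ACGTACGT"      # reverts the progenitor's one difference

print(relative_distance(further, progenitor, outgroup))
print(relative_distance(closer, progenitor, outgroup))
print(absolute_distance(2))
```

With the caption's numbers, a relative distance of 2 maps back to 1,134 total differences from RaTG13, and a negative relative distance marks a sequence that is more similar to the outgroup than proCoV2 is.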
Fig. 3.
Phylogenetic trees of SARS-CoV-2 sequences in GISAID collected before February 2020. The trees are identical except they are rooted to make the progenitor each of the three sequences with highest identity to the RaTG13 bat coronavirus outgroup. Nodes are shown as pie charts with areas proportional to the number of observations of that sequence and colored by where the viruses were collected. The mutations on each branch are labeled, with mutations toward the nucleotide identity in the outgroup in purple. The labels at the top of each tree give the first known virus identical to each putative progenitor, as well as mutations in that progenitor relative to proCoV2 (Kumar et al. 2021) and Wuhan-Hu-1. The monophyletic group containing C28144T is collapsed into a node labeled “clade B” in concordance with the naming scheme of Rambaut et al. (2020); this clade contains Wuhan-Hu-1. Singleton mutations (mutations observed only once in the sequence set) are removed as described in more detail in the Methods. Supplementary figs. S4 and S5, Supplementary Material online, show identical results are obtained if the outgroup is RpYN06 or RmYN02.
Fig. 4.
Relative mutational distance from RaTG13 bat coronavirus outgroup calculated only over the region of the SARS-CoV-2 genome covered by sequences from the deleted data set (21,570–29,550). Because the calculated distances here are only over a portion of the genome, there are more negative points than in fig. 2. The plot shows sequences in GISAID collected before February 2020, as well as the 13 early Wuhan epidemic sequences in table 1. Mutational distance is calculated relative to proCoV2, and points are jittered on the y-axis. Go to https://jbloom.github.io/SARS-CoV-2_PRJNA612766/deltadist_jitter.html for an interactive version of this plot that enables toggling the outgroup to RpYN06 or RmYN02, mouseovers to see details for each point, and adjustment of jittering.
Fig. 5.
Phylogenetic trees like those in fig. 3 with the addition of the early Wuhan epidemic sequences from the deleted data set, and Guangdong patients infected in Wuhan prior to January 5 annotated separately. Because the deleted sequences are partial, they cannot all be placed unambiguously on the tree. Therefore, they are added to each compatible node proportional to the number of sequences already in that node. The deleted sequences with C28144T (clade B) or C29095T (putative progenitor in middle tree) can be placed relatively unambiguously as defining mutations occur in the sequenced region, but those that lack either of these mutations are compatible with a large number of nodes including the proCoV2 putative progenitor. Supplementary figs. S4 and S5, Supplementary Material online, demonstrate that the results are identical if RpYN06 or RmYN02 is instead used as the outgroup.
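The proportional placement described in this caption can be sketched roughly as below. This is an illustrative assumption about how such a rule might be written, not the author's code: the function name, the node labels, and the counts are all hypothetical, and the use of fractional weights is a choice made for this sketch.

```python
def place_partial_sequence(node_counts, compatible_nodes, weight=1.0):
    """Split one partial sequence across its compatible nodes.

    Each compatible node receives a share of `weight` proportional to the
    number of sequences already assigned to that node.
    """
    total = sum(node_counts[node] for node in compatible_nodes)
    if total == 0:
        raise ValueError("compatible nodes contain no sequences to weight by")
    return {node: weight * node_counts[node] / total for node in compatible_nodes}


# Hypothetical tree nodes and existing sequence counts.
node_counts = {"progenitor": 6, "node_1": 2, "clade_B": 12}

# A partial sequence lacking the clade-defining mutation is compatible
# with several nodes, so its weight is shared between them.
ambiguous = place_partial_sequence(node_counts, ["progenitor", "node_1"])

# A partial sequence carrying a defining mutation fits a single node.
unambiguous = place_partial_sequence(node_counts, ["clade_B"])

print(ambiguous)
print(unambiguous)
```

A sequence with a defining mutation in the sequenced region ends up entirely in one node, while one lacking such a mutation is spread over every node it is compatible with.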
Fig. 6.
A redacted version of the e-mails from Wuhan University to the SRA staff requesting deletion of the sequencing data. This e-mail was provided to me by the NIH’s NCBI Director Stephen Sherry on June 19, 2021, the day after I e-mailed the NIH an advance copy of this manuscript. The redactions and highlighting were done by the NIH, and I am showing the e-mail exactly as it was provided to me.
