Mol Biol Evol. 2021 Dec 9;38(12):5211-5224.

doi: 10.1093/molbev/msab246.

Recovery of Deleted Deep Sequencing Data Sheds More Light on the Early Wuhan SARS-CoV-2 Epidemic

Jesse D Bloom¹

PMID: 34398234
PMCID: PMC8436388
DOI: 10.1093/molbev/msab246

Jesse D Bloom. Mol Biol Evol. 2021.

Affiliation

¹ Fred Hutchinson Cancer Research Center, Howard Hughes Medical Institute, Seattle, WA.

Erratum in

Correction to: Recovery of Deleted Deep Sequencing Data Sheds More Light on the Early Wuhan SARS-CoV-2 Epidemic.
[No authors listed]. Mol Biol Evol. 2023 Sep 1;40(9):msad201. doi: 10.1093/molbev/msad201. PMID: 37772800.

Abstract

The origin and early spread of SARS-CoV-2 remain shrouded in mystery. Here, I identify a data set containing SARS-CoV-2 sequences from early in the Wuhan epidemic that has been deleted from the NIH's Sequence Read Archive. I recover the deleted files from the Google Cloud and reconstruct partial sequences of 13 early epidemic viruses. Phylogenetic analysis of these sequences in the context of carefully annotated existing data further supports the idea that the Huanan Seafood Market sequences are not fully representative of the viruses in Wuhan early in the epidemic. Instead, the progenitor of currently known SARS-CoV-2 sequences likely contained three mutations relative to the market viruses that made it more similar to SARS-CoV-2's bat coronavirus relatives.

Keywords: COVID-19; SARS-CoV-2; Sequence Read Archive; forensic bioinformatics; phylogenetics.

Conflict of interest statement

The author consults for Moderna on SARS-CoV-2 evolution and epidemiology, consults for Flagship Labs 77 on viral evolution and deep mutational scanning, and has the potential to receive a share of IP revenue as an inventor on a Fred Hutch licensed technology/patent (application WO2020006494) related to deep mutational scanning of viral proteins.

Figures

**Fig. 1.**
Accessions from deep sequencing project PRJNA612766 have been removed from the SRA. Shown is the result of searching for “SRR11313485” in the SRA search toolbar. This result has been digitally archived on the Wayback Machine at https://web.archive.org/web/20210502131630/https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11313485.

**Fig. 2.**
The reported collection dates of SARS-CoV-2 sequences in GISAID versus their relative mutational distances from the RaTG13 bat coronavirus outgroup. Mutational distances are relative to the putative progenitor proCoV2 inferred by Kumar et al. (2021), which itself differs from RaTG13 by 1,132 mutations—so a sequence with a relative mutational distance of 2 actually has 1,134 differences from RaTG13. Note that the lower-right point in the middle (green) panel corresponds to a sequence (Guangdong/FS-30-P00502/2020) reportedly collected in late February that is actually two mutations more similar to RaTG13 than proCoV2. The plot shows only sequences in GISAID collected no later than February 28, 2020. Sequences that the joint WHO-China report (WHO 2021) describes as being associated with the Wuhan Seafood Market are plotted with squares. Points are slightly jittered on the y-axis. Go to https://jbloom.github.io/SARS-CoV-2_PRJNA612766/deltadist.html for an interactive version of this plot that enables toggling of the outgroup to RpYN06 and RmYN02, mouseovers to see details for each point including strain name and mutations relative to proCoV2, and adjustment of the y-axis jittering. Static versions of the plot with RpYN06 and RmYN02 outgroups are in supplementary fig. S3, Supplementary Material online.
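The arithmetic behind the relative mutational distance in this caption can be sketched as follows. This is a minimal illustration, not the author's code: it assumes pre-aligned sequences of equal length, counts only positions where both bases are A, C, G, or T, and uses hypothetical function names and toy sequences.

```python
VALID_BASES = set("ACGT")


def mutational_distance(seq, outgroup):
    """Count aligned sites where both bases are A/C/G/T and differ.

    Assumption: seq and outgroup are already aligned and equal in length.
    """
    if len(seq) != len(outgroup):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq.upper(), outgroup.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


def relative_distance(seq, progenitor, outgroup):
    """Distance of seq from the outgroup, minus the progenitor's distance."""
    return mutational_distance(seq, outgroup) - mutational_distance(
        progenitor, outgroup
    )


# Toy 10-nt sequences (hypothetical, for illustration only).
toy_outgroup = "ACGTACGTAC"
toy_progenitor = "ACGTACGTTT"  # 2 differences from the outgroup
toy_further = "ACGAACGTTT"     # 3 differences: relative distance +1
toy_closer = "ACGTACGTAT"      # 1 difference: relative distance -1

print(relative_distance(toy_further, toy_progenitor, toy_outgroup))
print(relative_distance(toy_closer, toy_progenitor, toy_outgroup))
```

With the numbers in the caption, a sequence with 1,134 differences from RaTG13 has a relative distance of 1,134 - 1,132 = 2, and a negative value marks a sequence that is closer to the outgroup than proCoV2 is.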

**Fig. 3.**
Phylogenetic trees of SARS-CoV-2 sequences in GISAID collected before February 2020. The trees are identical except they are rooted to make the progenitor each of the three sequences with highest identity to the RaTG13 bat coronavirus outgroup. Nodes are shown as pie charts with areas proportional to the number of observations of that sequence and colored by where the viruses were collected. The mutations on each branch are labeled, with mutations toward the nucleotide identity in the outgroup in purple. The labels at the top of each tree give the first known virus identical to each putative progenitor, as well as mutations in that progenitor relative to proCoV2 (Kumar et al. 2021) and Wuhan-Hu-1. The monophyletic group containing C28144T is collapsed into a node labeled “clade B” in concordance with the naming scheme of Rambaut et al. (2020); this clade contains Wuhan-Hu-1. Singleton mutations (mutations observed only once in the sequence set) are removed as described in more detail in the Methods. Supplementary figs. S4 and S5, Supplementary Material online, show identical results are obtained if the outgroup is RpYN06 or RmYN02.

**Fig. 4.**
Relative mutational distance from RaTG13 bat coronavirus outgroup calculated *only* over the region of the SARS-CoV-2 genome covered by sequences from the deleted data set (21,570–29,550). Because the calculated distances here are only over a portion of the genome, there are more negative points than in fig. 2. The plot shows sequences in GISAID collected before February 2020, as well as the 13 early Wuhan epidemic sequences in table 1. Mutational distance is calculated relative to proCoV2, and points are jittered on the y-axis. Go to https://jbloom.github.io/SARS-CoV-2_PRJNA612766/deltadist_jitter.html for an interactive version of this plot that enables toggling the outgroup to RpYN06 or RmYN02, mouseovers to see details for each point, and adjustment of jittering.

**Fig. 5.**
Phylogenetic trees like those in fig. 3 with the addition of the early Wuhan epidemic sequences from the deleted data set, and Guangdong patients infected in Wuhan prior to January 5 annotated separately. Because the deleted sequences are partial, they cannot all be placed unambiguously on the tree. Therefore, they are added to each compatible node proportional to the number of sequences already in that node. The deleted sequences with C28144T (clade B) or C29095T (putative progenitor in middle tree) can be placed relatively unambiguously as defining mutations occur in the sequenced region, but those that lack either of these mutations are compatible with a large number of nodes including the proCoV2 putative progenitor. Supplementary figs. S4 and S5, Supplementary Material online, demonstrate that the results are identical if RpYN06 or RmYN02 is instead used as the outgroup.
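The proportional placement rule described in this caption can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the author's implementation: the function name, the node labels, and the counts are hypothetical, and deciding which nodes are compatible with a partial sequence is taken as given.

```python
def place_partial_sequence(node_counts, compatible_nodes):
    """Split one partial sequence across its compatible nodes.

    Each compatible node receives a fraction of the sequence proportional
    to the number of sequences already assigned to that node.
    """
    total = sum(node_counts[node] for node in compatible_nodes)
    if total == 0:
        raise ValueError("compatible nodes contain no sequences")
    return {node: node_counts[node] / total for node in compatible_nodes}


# Hypothetical tree with three nodes and their existing sequence counts.
toy_counts = {"node_A": 6, "node_B": 3, "node_C": 1}

# A partial sequence compatible with node_A and node_C only.
toy_weights = place_partial_sequence(toy_counts, ["node_A", "node_C"])
print(toy_weights)
```

A sequence carrying a defining mutation inside the sequenced region is compatible with few nodes, so nearly all of its weight lands in one place; a sequence lacking such mutations is spread across many nodes.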

**Fig. 6.**
A redacted version of the e-mail from Wuhan University to the SRA staff requesting deletion of the sequencing data. This e-mail was provided to me by the NIH’s NCBI Director Stephen Sherry on June 19, 2021, the day after I e-mailed the NIH an advance copy of this manuscript. The redactions and highlighting were done by the NIH, and I am showing the e-mail exactly as it was provided to me.
