This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Nov 5:2024.04.29.591666.

doi: 10.1101/2024.04.29.591666.

Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny

Martin Hunt^{1,2,3,4}, Angie S Hinrichs⁵, Daniel Anderson¹, Lily Karim^{5,6}, Bethany L Dearlove⁷, Jeff Knaggs^{1,2,3,4}, Bede Constantinides^{2,4}, Philip W Fowler^{2,3,4}, Gillian Rodger^{2,4}, Teresa Street^{2,3}, Sheila Lumley^{2,8}, Hermione Webster², Theo Sanderson⁹, Christopher Ruis^{10,11}, Benjamin Kotzen¹², Nicola de Maio¹, Lucas N Amenga-Etego¹³, Dominic S Y Amuzu¹³, Martin Avaro¹⁴, Gordon A Awandare¹³, Reuben Ayivor-Djanie^{15,16}, Timothy Barkham¹⁷, Matthew Bashton¹⁸, Elizabeth M Batty^{19,20}, Yaw Bediako¹³, Denise De Belder²¹, Estefania Benedetti¹⁴, Andreas Bergthaler⁷, Stefan A Boers²², Josefina Campos²¹, Rosina Afua Ampomah Carr^{16,23}, Yuan Yi Constance Chen¹⁷, Facundo Cuba²¹, Maria Elena Dattero¹⁴, Wanwisa Dejnirattisai²⁴, Alexander Dilthey²⁵, Kwabena Obeng Duedu^{16,26}, Lukas Endler⁷, Ilka Engelmann²⁷, Ngiambudulu M Francisco²⁸, Jonas Fuchs²⁹, Etienne Z Gnimpieba³⁰, Soraya Groc³¹, Jones Gyamfi^{16,32}, Dennis Heemskerk²², Torsten Houwaart²⁵, Nei-Yuan Hsiao³³, Matthew Huska³⁴, Martin Hölzer³⁴, Arash Iranzadeh³⁵, Hanna Jarva³⁶, Chandima Jeewandara³⁷, Bani Jolly^{38,39}, Rageema Joseph³⁵, Ravi Kant^{40,41,42}, Karrie Ko Kwan Ki⁴³, Satu Kurkela³⁶, Maija Lappalainen³⁶, Marie Lataretu³⁴, Jacob Lemieux¹², Chang Liu^{44,45}, Gathsaurie Neelika Malavige³⁷, Tapfumanei Mashe⁴⁶, Juthathip Mongkolsapaya^{20,44,45}, Brigitte Montes³¹, Jose Arturo Molina Mora⁴⁷, Collins M Morang'a¹³, Bernard Mvula⁴⁸, Niranjan Nagarajan^{49,50}, Andrew Nelson⁵¹, Joyce M Ngoi¹³, Joana Paula da Paixão²⁸, Marcus Panning²⁹, Tomas Poklepovich²¹, Peter K Quashie¹³, Diyanath Ranasinghe³⁷, Mara Russo¹⁴, James Emmanuel San^{52,53}, Nicholas D Sanderson^{2,3}, Vinod Scaria^{39,54}, Gavin Screaton², October Michael Sessions⁵⁵, Tarja Sironen^{40,41}, Abay Sisay⁵⁶, Darren Smith¹⁸, Teemu Smura^{40,41}, Piyada Supasa^{44,45}, Chayaporn Suphavilai⁴⁹, Jeremy Swann², Houriiyah Tegally⁵⁷, Bryan Tegomoh^{58,59,60}, Olli Vapalahti^{40,41}, Andreas Walker⁶¹, Robert J Wilkinson^{9,62,63}, Carolyn Williamson³⁵, Xavier Zair⁵⁵; IMSSC2 Laboratory Network Consortium; Tulio de Oliveira^{57,64}, Timothy Ea Peto², Derrick Crook², Russell Corbett-Detig^{5,6}, Zamin Iqbal^{1,65}

Affiliations

¹ European Molecular Biology Laboratory - European Bioinformatics Institute, Hinxton, UK.
² Nuffield Department of Medicine, University of Oxford, Oxford, UK.
³ National Institute of Health Research Oxford Biomedical Research Centre, John Radcliffe Hospital, Headley Way, Oxford, UK.
⁴ Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, University of Oxford, Oxford, UK.
⁵ Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA.
⁶ Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA.
⁷ Institute for Hygiene and Applied Immunology, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Vienna 1090, Austria.
⁸ Department of Infectious Diseases and Microbiology, John Radcliffe Hospital, Oxford, UK.
⁹ Francis Crick Institute, London, UK.
¹⁰ Victor Phillip Dahdaleh Heart & Lung Research Institute, University of Cambridge, Cambridge, UK.
¹¹ Department of Veterinary Medicine, University of Cambridge, Cambridge, UK.
¹² Department of Infectious Diseases, Massachusetts General Hospital, Boston, Massachusetts, USA.
¹³ West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Ghana, Accra, Ghana.
¹⁴ Servicio de Virus Respiratorios, Instituto Nacional Enfermedades Infecciosas, ANLIS "Dr. Carlos G. Malbrán", Buenos Aires, Argentina.
¹⁵ Laboratory for Medical Biotechnology and Biomanufacturing, International Centre for Genetic Engineering and Biotechnology, Trieste, Italy.
¹⁶ Department of Biomedical Sciences, University of Health and Allied Sciences, Ho, Ghana.
¹⁷ Tan Tock Seng Hospital, Singapore.
¹⁸ The Hub for Biotechnology in the Built Environment, Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, Newcastle upon Tyne, NE1 8ST, UK.
¹⁹ Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, UK.
²⁰ Mahidol-Oxford Tropical Medicine Research Unit, Bangkok, Thailand.
²¹ Unidad Operativa Centro Nacional de Genómica y Bioinformática, ANLIS "Dr. Carlos G. Malbrán", Buenos Aires, Argentina.
²² Dept. Medical Microbiology, Leiden University Medical Center, Albinusdreef 2, 2333 ZA, Leiden, The Netherlands.
²³ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
²⁴ Division of Emerging Infectious Disease, Research Department, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkoknoi, Bangkok 10700, Thailand.
²⁵ Institute of Medical Microbiology and Hospital Hygiene, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
²⁶ College of Life Sciences, Birmingham City University, Birmingham, UK.
²⁷ Pathogenesis and Control of Chronic and Emerging Infections, Univ Montpellier, INSERM, Etablissement Français du Sang, Virology Laboratory, CHU Montpellier, Montpellier, France.
²⁸ Grupo de Investigação Microbiana e Imunológica, Instituto Nacional de Investigação em Saúde (National Institute for Health Research), Luanda, Angola.
²⁹ Institute of Virology, Freiburg University Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany.
³⁰ Biomedical Engineering Department, University of South Dakota, Sioux Falls, SD 57107.
³¹ Virology Laboratory, CHU Montpellier, Montpellier, France.
³² School of Health and Life Sciences, Teesside University, Middlesbrough, UK.
³³ Division of Medical Virology, University of Cape Town and National Health Laboratory Service.
³⁴ Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany.
³⁵ Computational Biology Division, University of Cape Town.
³⁶ HUS Diagnostic Center, Clinical Microbiology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland.
³⁷ Allergy Immunology and Cell Biology Unit, Department of Immunology and Molecular Medicine, University of Sri Jayewardenepura, Nugegoda, Sri Lanka.
³⁸ Karkinos Healthcare Private Limited (KHPL), Aurbis Business Parks, Bellandur, Bengaluru, Karnataka, 560103, India.
³⁹ Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, India.
⁴⁰ Department of Veterinary Biosciences, University of Helsinki, 00014 Helsinki, Finland.
⁴¹ Department of Virology, University of Helsinki, 00014 Helsinki, Finland.
⁴² Department of Tropical Parasitology, Institute of Maritime and Tropical Medicine, Medical University of Gdansk, 81-519 Gdynia, Poland.
⁴³ Department of Microbiology, Singapore General Hospital, Singapore.
⁴⁴ Chinese Academy of Medical Science (CAMS) Oxford Institute (COI), University of Oxford, Oxford, UK.
⁴⁵ Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK.
⁴⁶ Health System Strengthening Unit, World Health Organisation, Harare, Zimbabwe.
⁴⁷ Centro de investigación en Enfermedades Tropicales & Facultad de Microbiología, Universidad de Costa Rica, Costa Rica.
⁴⁸ Public Health Institute of Malawi, Ministry of Health, Malawi.
⁴⁹ Genome Institute of Singapore, Agency for Science, Technology and Research (A*STAR), Singapore.
⁵⁰ Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
⁵¹ Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, Newcastle upon Tyne, NE1 8ST, UK.
⁵² Duke Human Vaccine Institute, Duke University, Durham, NC 27710.
⁵³ University of KwaZulu Natal, Durban, South Africa, 4001.
⁵⁴ Vishwanath Cancer Care Foundation (VCCF), Neelkanth Business Park Kirol Village, West Mumbai, Maharashtra, 400086, India.
⁵⁵ Saw Swee Hock School of Public Health, National University of Singapore.
⁵⁶ Department of Medical Laboratory Sciences, College of Health Sciences, Addis Ababa University, P.O.Box 1176, Addis Ababa, Ethiopia.
⁵⁷ Centre for Epidemic Response and Innovation (CERI), Stellenbosch University, South Africa.
⁵⁸ Centre de Coordination des Opérations d'Urgences de Santé Publique, Ministere de Sante Publique, Cameroun.
⁵⁹ University of California, Berkeley, Berkeley, California, USA.
⁶⁰ Nebraska Department of Health and Human Services, Lincoln, Nebraska, USA.
⁶¹ Institute of Virology, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
⁶² Centre for Infectious Diseases Research in Africa, University of Cape Town.
⁶³ Imperial College London, UK.
⁶⁴ KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), University of KwaZulu-Natal, South Africa.
⁶⁵ Milner Centre for Evolution, University of Bath, UK.

PMID: 38746185
PMCID: PMC11092452
DOI: 10.1101/2024.04.29.591666

Martin Hunt et al. bioRxiv. 2024.

Abstract

The SARS-CoV-2 genome occupies a unique place in infection biology: it is the most highly sequenced genome on earth (making up over 20% of public sequencing datasets), with fine-scale information on sampling date and geography, and has been subject to unprecedentedly intense analysis. As a result, these phylogenetic data are an incredibly valuable resource for science and public health. However, the vast majority of the data was sequenced by tiling amplicons across the full genome, with amplicon schemes that changed over the pandemic as mutations in the viral genome interacted with primer binding sites. In combination with the disparate set of genome assembly workflows and the lack of consistent quality control (QC) processes, the current genomes have many systematic errors that have evolved with the virus and amplicon schemes. These errors have significant impacts on the phylogeny, and therefore, over the last few years, many thousands of hours of researchers' time have been spent "eyeballing" trees, looking for artefacts, and then patching the tree. Given the huge value of this dataset, we set out to reprocess the complete set of public raw sequence data in a rigorous amplicon-aware manner and build a cleaner phylogeny. Here we provide a global tree of 4,471,579 samples, built from a consistently assembled set of high-quality consensus sequences from all available public data as of June 2024, viewable at https://viridian.taxonium.org. Each genome was constructed using a novel assembly tool called Viridian (https://github.com/iqbal-lab-org/viridian), developed specifically to process amplicon sequence data, eliminating artefactual errors and masking the genome at low-quality positions. We provide simulation and empirical validation of the methodology, and quantify the improvement in the phylogeny. We hope the tree, consensus sequences and Viridian will be a valuable resource for researchers.

Conflict of interest statement

Gavin Screaton sits on the GSK Vaccines Scientific Advisory Board, consults for AstraZeneca, and is a founding member of RQ Biotechnology.

Figures

**Figure 1: Assemblers which wrongly default to the reference base in the absence of data cause reversions in the phylogeny.**
a) Cartoon phylogeny built from perfect genomes, with leaves coloured by genotype at a specific position X (purple - ancestral base, green - derived base). Just one mutation at this site, shown as a white star, is needed to explain the data. b) Cartoon showing the effect of assembly software assuming that a genome is identical to the reference genome when there is no data - here the amplicon containing position X is dropped in the lowest-but-one genome on the tree, creating one lone purple leaf. The tool which infers the phylogeny looks for a parsimonious explanation for this colour distribution, and concludes it was caused by a mutation (white star) followed by a “reversion” back to the ancestral base (red star). Errors in assembly caused by reference-bias tend to create enrichments of reversions. c) Part of the current UShER SARS-CoV-2 phylogeny, coloured by genotype at genome position 22813 (spike codon 417). Blow-up shows multiple reversions back to the ancestral purple. A non-exhaustive set of artefactual mutations (reversions, unreversions, re-reversions etc) are shown with red stars, where there is a flip back and forth from green to/from purple.

**Figure 2:**
Timeline of the SARS-CoV-2 pandemic from December 2019 to July 2023, with selected events relating to problems with sequencing and consensus calling labelled a-e. Releases of ARTIC primer schemes (versions 1, 2, 3, 4, 4.1, 5.3.2) are marked with green triangles. a) Primer dimers cause amplicon dropouts [10], and 28% of GISAID [11] sequences deposited in September 2020 have at least one gap of length at least 200bp [12]. b) A 9bp deletion in the primer binding region of ARTIC V3 amplicon 73 causes missing data [13]. c) Dropouts cause artefacts at Spike 95 and 142 [14]. d) The ARTIC V4 rollout triggers artefactual mutations in some pipelines [15]. e) Omicron samples cause ARTIC V4 amplicon dropout, triggering the update to ARTIC V4.1 [16].

**Figure 3: Errors across the genome in consensus sequences from the “Early Omicron” African dataset, split by sequencing technology and amplicon scheme.**
Plots show the percent of consensus sequences with an error (y-axis), taking the maximum value in windows of length 50bp (x-axis). Error here is defined as where the consensus sequence has an A/C/G/T call, the read depth passes Viridian’s default filters (see methods), and the reads support a different A/C/G/T call. Results are shown for Viridian, the original assemblies, and for the ARTIC-ILM and ARTIC-ONT assembly workflows.
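
The windowing described in this caption can be sketched as follows. This is only an illustration, assuming a per-position list of error percentages and non-overlapping windows (the caption does not say whether windows overlap); `windowed_max` is a hypothetical helper name, not part of Viridian.

```python
from typing import List


def windowed_max(per_position_pct: List[float], window: int = 50) -> List[float]:
    """Return the maximum value in each consecutive, non-overlapping window.

    per_position_pct[i] is assumed to be the percent of consensus sequences
    with an error at genome position i. The final window may be shorter
    than `window` if the genome length is not a multiple of it.
    """
    return [
        max(per_position_pct[start:start + window])
        for start in range(0, len(per_position_pct), window)
    ]


# Toy input (not data from the paper), using a window of 2 for readability.
toy_windows = windowed_max([0.0, 1.0, 5.0, 2.0, 0.0, 3.0], window=2)
```

Taking the maximum rather than the mean keeps a single badly affected position visible even when the rest of its window is error-free.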

**Figure 4: Most variable sites cause fewer reversions in the Viridian tree than the GenBank tree.**
a) Plot showing how many positions in the genome (y axis) have at least N reversions (x axis) in each tree (Viridian in blue, GenBank in red). The Viridian curve drops faster, having fewer positions that create many reversions. b) Scatterplot comparing the count of reversion mutations found in the GenBank dataset (y axis) and the Viridian dataset (x axis). Note that (0,0) is slightly indented from the origin of the plot. Each point represents a position of the SARS-CoV-2 genome. Three points below the line y=x are highlighted (labelled by genomic coordinates: 22786, 8835, 15521) where Viridian has particularly high numbers of reversions, and one (labelled 21987) for GenBank. c) Blow-up of the dotted square from panel b), showing that the vast majority of variable sites in the genome lie above the line y=x.
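
The curve in panel a) is a simple "at least N" count over positions. A minimal sketch, assuming the reversion counts per genome position are already available as a dictionary (the function name and the toy numbers are hypothetical, not taken from the paper):

```python
from typing import Dict, List


def positions_with_at_least(reversions_per_position: Dict[int, int], max_n: int) -> List[int]:
    """For N = 1..max_n, count the positions that have at least N reversions.

    Element N-1 of the returned list is the y value plotted at x = N.
    """
    return [
        sum(1 for count in reversions_per_position.values() if count >= n)
        for n in range(1, max_n + 1)
    ]


# Toy counts keyed by genome position (illustrative only).
toy_curve = positions_with_at_least({100: 3, 200: 1, 300: 0}, max_n=3)
```

A tree with fewer reversion-prone sites gives a curve that falls towards zero at smaller N.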

**Figure 5: Comparison of uncertainty in growth estimates for different lineages when based on either the Viridian or GenBank tree.**
Panels a) (left) and b) (right) plot the same data in two ways; each point represents one lineage. Panel a) plots the difference in standard deviation of the posterior density of the relative growth rate estimate ΔlogR (i.e. standard deviation using the Viridian tree minus standard deviation using the GenBank tree). Negative values here show that, on average, the Viridian tree yields lower uncertainty than the GenBank tree. Panel b) shows the standard deviation of the posterior density of the relative growth rate estimate ΔlogR based on the GenBank tree (left) and Viridian tree (right). The median standard deviation of strain growth rate using the GenBank tree is 2.967, while the median standard deviation using the Viridian tree is 0.859. This difference is statistically significant (p < 0.01, paired t-test). Box-plots show first and third quartiles (lower and upper boundaries of box), and whiskers are set to the farthest point that is within 1.5 times the inter-quartile range from the box. Legend labels denote parent lineage.

**Figure 6:**
Overview of the Viridian pipeline, from input sequencing reads to output files.

**Figure 7:**
Method to score an amplicon scheme, using mapped fragments. a) Example of one mapped fragment, where its left end is 3bp from the start of the primer, and its right end is 0bp from the end of the right primer. b) The plot generated from the fragment in a). The right end of the fragment increments the counter for zero distance from a primer, and the left end of the fragment increments the counter for 3bp distance from a primer. The information from all fragments in the sample is added in this way, to make the distribution of distances from nearest primer ends. c) The cumulative plot from b) after adding all fragments. d) Plot c) is normalised by taking distance to primer end as a percentage of the mean amplicon length (x axis), and fragment counts as percent of total fragments (y axis). The red line indicates a typical curve where the reads match the scheme, whereas the blue line shows a scheme that does not match. The scheme’s score is the sum of differences between the calculated line and the y = x line (shown as a dashed line).
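
The scoring idea in panels a) to d) can be sketched in a few lines. This is an illustration under stated assumptions, not Viridian's implementation: amplicons and fragments are given as inclusive `(start, end)` coordinates, left fragment ends are compared with amplicon starts and right ends with amplicon ends, the normalised curve is sampled at integer percentages from 0 to 100, and the score sums signed differences from the y = x line. All function names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def nearest_distance(pos: int, sorted_coords: Sequence[int]) -> int:
    """Distance from pos to the closest coordinate in a non-empty sorted list."""
    i = bisect_left(sorted_coords, pos)
    candidates = sorted_coords[max(i - 1, 0): i + 1]
    return min(abs(pos - c) for c in candidates)


def scheme_score(
    amplicons: List[Tuple[int, int]],
    fragments: List[Tuple[int, int]],
) -> float:
    """Score how well mapped fragments match an amplicon scheme.

    Higher scores mean fragment ends cluster close to primer ends.
    """
    starts = sorted(start for start, _ in amplicons)
    ends = sorted(end for _, end in amplicons)
    mean_len = sum(end - start + 1 for start, end in amplicons) / len(amplicons)

    # Distance of each fragment end to the nearest primer end,
    # expressed as a percentage of the mean amplicon length (panel d, x axis).
    dists = []
    for left, right in fragments:
        dists.append(100 * nearest_distance(left, starts) / mean_len)
        dists.append(100 * nearest_distance(right, ends) / mean_len)

    # Cumulative percent of fragment ends within x percent (panel d, y axis),
    # compared against the y = x line at each integer x.
    score = 0.0
    for x in range(101):
        cumulative_pct = 100 * sum(1 for d in dists if d <= x) / len(dists)
        score += cumulative_pct - x
    return score


# Toy scheme with two overlapping amplicons (illustrative coordinates only).
toy_scheme = [(0, 99), (80, 179)]
score_matching = scheme_score(toy_scheme, [(0, 99), (80, 179), (3, 99)])
score_mismatch = scheme_score(toy_scheme, [(40, 140), (45, 130)])
```

In this toy case the fragments that start and stop at primer ends produce a curve that rises almost immediately, so `score_matching` exceeds `score_mismatch`, mirroring the red and blue lines in panel d).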

**Figure 8:**
Example scheme identification score plot from Viridian. Made from run accession ERR8959196, which is Nanopore reads sequenced using ARTIC-V4.1 primers.

**Figure 9:**
Consensus sequence construction methods. See main text for details. a) The starting point is primer and amplicon positions, and reads mapped to the consensus sequence. b) The consensus sequence of each amplicon is generated independently, using Racon. c) The amplicon sequences are overlapped using perfect matches (if they exist), making contigs. d) The contigs are scaffolded against the reference genome, adding gaps where needed.

**Figure 10:**
Consensus sequence pileup/masking methods. Two amplicons are shown with fragments (either Illumina read pairs, or unpaired Nanopore reads) mapped to the consensus. The fragments from amplicon 1 contribute to pileup at B-E, and do not count towards the primer regions A-B or E-F. Similarly, the fragments from amplicon 2 contribute to pileup at D-G (but not to C-D or G-H).
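
A minimal sketch of this primer-aware pileup, assuming each fragment has already been assigned to an amplicon and that coordinates are 0-based and inclusive (both are assumptions for illustration; the `Amplicon` class and `primer_masked_depth` are hypothetical names, not Viridian's API):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Amplicon:
    inner_start: int  # first position after the left primer (B for amplicon 1)
    inner_end: int    # last position before the right primer (E for amplicon 1)


def primer_masked_depth(
    genome_length: int,
    amplicons: List[Amplicon],
    fragments: List[Tuple[int, int, int]],
) -> List[int]:
    """Per-position depth, counting each fragment only between the primers
    of the amplicon it came from.

    Each fragment is (amplicon_index, fragment_start, fragment_end).
    """
    depth = [0] * genome_length
    for amp_idx, frag_start, frag_end in fragments:
        amp = amplicons[amp_idx]
        # Clip the fragment to the primer-free interior of its own amplicon.
        lo = max(frag_start, amp.inner_start)
        hi = min(frag_end, amp.inner_end)
        for pos in range(lo, hi + 1):
            depth[pos] += 1
    return depth


# Two overlapping toy amplicons on a 14bp "genome" (illustrative only).
toy_depth = primer_masked_depth(
    14,
    [Amplicon(2, 7), Amplicon(6, 11)],
    [(0, 0, 9), (1, 4, 13)],
)
```

Here positions 6 and 7 receive depth from both fragments, while the bases of each fragment that fall in its own amplicon's primer regions add nothing, which is the behaviour the caption describes for the overlap D-E.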

References

1. Turakhia Yatish, De Maio Nicola, Thornlow Bryan, Gozashti Landen, Lanfear Robert, Walker Conor R., Hinrichs Angie S., Fernandes Jason D., Borges Rui, Slodkowicz Greg, Weilguny Lukas, Haussler David, Goldman Nick, and Corbett-Detig Russell. Stability of SARS-CoV-2 phylogenies. PLOS Genetics, 16(11):e1009175, November 2020.
2. De Maio Nicola, Walker Conor, Borges Rui, Weilguny Lukas, Slodkowicz Greg, and Goldman Nick. Issues with SARS-CoV-2 sequencing data, https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473. May 2020.
3. Henn Matthew R., Boutwell Christian L., Charlebois Patrick, Lennon Niall J., Power Karen A., Macalalad Alexander R., Berlin Aaron M., Malboeuf Christine M., Ryan Elizabeth M., Gnerre Sante, Zody Michael C., Erlich Rachel L., Green Lisa M., Berical Andrew, Wang Yaoyu, Casali Monica, Streeck Hendrik, Bloom Allyson K., Dudek Tim, Tully Damien, Newman Ruchi, Axten Karen L., Gladden Adrianne D., Battis Laura, Kemper Michael, Zeng Qiandong, Shea Terrance P., Gujja Sharvari, Zedlack Carmen, Gasser Olivier, Brander Christian, Hess Christoph, Günthard Huldrych F., Brumme Zabrina L., Brumme Chanson J., Bazner Suzane, Rychert Jenna, Tinsley Jake P., Mayer Ken H., Rosenberg Eric, Pereyra Florencia, Levin Joshua Z., Young Sarah K., Jessen Heiko, Altfeld Marcus, Birren Bruce W., Walker Bruce D., and Allen Todd M. Whole Genome Deep Sequencing of HIV-1 Reveals the Impact of Early Minor Variants Upon Immune Recognition During Acute Infection. PLOS Pathogens, 8(3):e1002529, March 2012.
4. Holmes Edward. Novel 2019 coronavirus genome, https://virological.org/t/novel-2019-coronavirus-genome/319/1. January 2020.
5. Wu Fan, Zhao Su, Yu Bin, Chen Yan-Mei, Wang Wen, Song Zhi-Gang, Hu Yi, Tao Zhao-Wu, Tian Jun-Hua, Pei Yuan-Yuan, Yuan Ming-Li, Zhang Yu-Ling, Dai Fa-Hui, Liu Yi, Wang Qi-Min, Zheng Jiao-Jiao, Xu Lin, Holmes Edward C., and Zhang Yong-Zhen. A new coronavirus associated with human respiratory disease in China. Nature, 579(7798):265–269, March 2020.
