BMC Genomics. 2022 Feb 11;23(1):121.
doi: 10.1186/s12864-022-08358-2.

Pango lineage designation and assignment using SARS-CoV-2 spike gene nucleotide sequences

Áine O'Toole et al. BMC Genomics. 2022.

Abstract

Background: More than 2 million SARS-CoV-2 genome sequences have been generated and shared since the start of the COVID-19 pandemic and constitute a vital information source that informs outbreak control, disease surveillance, and public health policy. The Pango dynamic nomenclature is a popular system for classifying and naming genetically-distinct lineages of SARS-CoV-2, including variants of concern, and is based on the analysis of complete or near-complete virus genomes. However, for several reasons, nucleotide sequences may be generated that cover only the spike gene of SARS-CoV-2. It is therefore important to understand how much information about Pango lineage status is contained in spike-only nucleotide sequences. Here we explore how Pango lineages might be reliably designated and assigned to spike-only nucleotide sequences. We survey the genetic diversity of such sequences, and investigate the information they contain about Pango lineage status.

Results: Although many lineages, including the main variants of concern, can be identified clearly using spike-only sequences, some spike-only sequences are shared among tens or hundreds of Pango lineages. To facilitate the classification of SARS-CoV-2 lineages using subgenomic sequences we introduce the notion of designating such sequences to a "lineage set", which represents the range of Pango lineages that are consistent with the observed mutations in a given spike sequence.
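The observation that one spike-only sequence can be shared among many lineages can be illustrated with a minimal sketch. It assumes the input is a list of (spike sequence, designated lineage) pairs; the function name, the toy sequences, and the pairing of sequences with lineages are hypothetical and are not taken from the paper's data or software.

```python
from collections import defaultdict


def lineages_per_spike_sequence(records):
    """Map each spike nucleotide sequence to the set of lineages it appears in.

    `records` is an iterable of (spike_sequence, lineage) pairs.
    This input layout is an assumption made for illustration.
    """
    seen = defaultdict(set)
    for spike_seq, lineage in records:
        seen[spike_seq].add(lineage)
    return dict(seen)


# Toy records: the sequences are placeholders, not real spike genes.
records = [
    ("ATGTTT", "B.1"),
    ("ATGTTT", "B.1.1"),
    ("ATGTTC", "B.1.1.7"),
]

candidates = lineages_per_spike_sequence(records)
for seq, lineages in candidates.items():
    print(seq, sorted(lineages))
```

In this toy input the first sequence is consistent with two lineages, so it could only be resolved to a set of candidates, while the second maps to a single lineage.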

Conclusions: We find that many lineages, including the main variants of concern, can be reliably identified from the spike sequence alone, and we define lineage sets to represent the lineage precision that can be achieved using spike-only nucleotide sequences. These data provide a foundation for the development of software tools that can assign newly generated spike nucleotide sequences to Pango lineage sets.

Keywords: Genomic surveillance; Lineage; Pango; SARS-CoV-2; Spike.

Conflict of interest statement

MEA and EJK are employees of AstraZeneca and own stock.

Figures

Fig. 1
Genetic diversity in the spike protein of all available SARS-CoV-2 genome sequences. The vertical axis of each plot shows the number of designated sequences that exhibit mutations at each amino acid position (log10-scaled). A Distribution of non-synonymous variation across amino acid positions of the spike protein (horizontal axis). B Distribution of synonymous variation across nucleotide sites in the spike protein (horizontal axis). C Distribution of insertion and deletion variation (indels) across nucleotide sites in the spike protein (horizontal axis). Each point represents the 5′ nucleotide site at the start of each indel. In each plot, genetic variation is determined by comparison with a lineage A reference sequence (Wuhan/WH04/2020, EPI_ISL_406801)
Fig. 2
Genetic diversity in the spike protein of SARS-CoV-2 sequences for each of the Pango lineages that correspond to the four main variants of concern. Each coloured dot shows the genomic position of a non-synonymous (green), synonymous (beige), or indel (orange) mutation. The vertical axis shows the number of designated sequences that exhibit non-synonymous variation at each amino acid position (log10-scaled). In each plot, mutations were defined by comparing sequences with a common reference strain, Wuhan/WH04/2020 (EPI_ISL_406801). A Lineage B.1.1.7 (alpha). B Lineage B.1.351 (beta). C Lineage P.1 (gamma). D Lineage B.1.617.2 (delta)
Fig. 3
Histogram of the proportion of spike protein nucleotide sites that are ambiguous (i.e. contain at least one IUPAC ambiguity code). The distribution is calculated for all SARS-CoV-2 sequences in GISAID that (i) have been designated to a Pango lineage, and (ii) have N at < 5% of sites across the whole genome, excluding UTRs
Fig. 4
We observed 8 Pango lineages in which every designated sequence contains at least one ambiguity code in its spike nucleotide sequence. The number of sequences designated to these lineages is typically small (range 6-31). For each sequence in these 8 lineages, the y-axis shows the proportion of spike nucleotide sites in that sequence represented by an IUPAC ambiguity code (i.e. not A, C, G, T, or a gap)
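The ambiguity proportion described in the captions for Fig. 3 and Fig. 4 can be sketched as follows. This is a minimal illustration that assumes any character other than A, C, G, T, or a gap ("-") counts as an IUPAC ambiguity code; the function name is hypothetical and the paper's own implementation may differ.

```python
# Characters treated as unambiguous, following the caption's definition
# (A, C, G, T, or a gap). Using "-" as the gap symbol is an assumption.
UNAMBIGUOUS = set("ACGT-")


def ambiguous_fraction(seq):
    """Return the proportion of sites in `seq` that are not A, C, G, T, or a gap."""
    if not seq:
        return 0.0
    seq = seq.upper()
    ambiguous = sum(ch not in UNAMBIGUOUS for ch in seq)
    return ambiguous / len(seq)


# Toy sequence: 4 of 8 sites (N, N, R, Y) are ambiguity codes.
print(ambiguous_fraction("ACGTNNRY"))
```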
Fig. 5
Distribution of the number of lineages in which a given spike nucleotide sequence (SNS) is observed. Most SNSs are found in only one lineage. However, a few SNSs are found in many lineages
Fig. 6
Number of distinct spike nucleotide sequences (SNS) in designated Pango lineages with complete spike sequences. Some of the very large lineages, for instance B.1.1.7 (count = 3893), B.1.177 (count = 2974), and B.1 (count = 2763) have many SNS, whereas the majority of lineages have few
Fig. 7
Distribution of the number of Pango lineages in each lineage set, using a mutation frequency threshold of X = 95%. Three hundred thirty-seven sets contain only a single lineage and so are uniquely distinguishable by their consensus spike haplotype. Twenty-eight sets contain two Pango lineages, and so on
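One way to picture how a mutation frequency threshold X could yield lineage sets is sketched below. It assumes that a lineage's consensus spike haplotype is the set of mutations present in at least a fraction X of its sequences, and that lineages with identical consensus haplotypes fall into the same lineage set. Both the "at least" comparison and the grouping rule are assumptions for illustration, and the lineage names and mutation labels are hypothetical placeholders.

```python
from collections import Counter, defaultdict


def consensus_spike_haplotype(mutation_sets, threshold=0.95):
    """Mutations carried by at least `threshold` of a lineage's sequences.

    `mutation_sets` holds one set of mutation labels per sequence.
    """
    n = len(mutation_sets)
    if n == 0:
        return frozenset()
    counts = Counter(m for muts in mutation_sets for m in set(muts))
    return frozenset(m for m, c in counts.items() if c / n >= threshold)


def lineage_sets(lineage_to_mutation_sets, threshold=0.95):
    """Group lineages whose consensus spike haplotypes are identical."""
    groups = defaultdict(set)
    for lineage, mutation_sets in lineage_to_mutation_sets.items():
        haplotype = consensus_spike_haplotype(mutation_sets, threshold)
        groups[haplotype].add(lineage)
    return dict(groups)


# Hypothetical lineages "X.1" to "X.3" with placeholder mutations "m1" to "m3".
data = {
    "X.1": [{"m1", "m2"}, {"m1", "m2"}],
    "X.2": [{"m1", "m2"}, {"m1", "m2", "m3"}],  # m3 is in only half the sequences
    "X.3": [{"m2"}, {"m2"}],
}

sets_at_95 = lineage_sets(data, threshold=0.95)
for haplotype, lineages in sets_at_95.items():
    print(sorted(haplotype), sorted(lineages))
```

In this toy input, X.1 and X.2 share a consensus haplotype at X = 95% and so form a two-lineage set, while X.3 is uniquely distinguishable.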
Fig. 8
Phylogeny consisting of a single tip per designated Pango lineage. Shading represents lineage sets, coloured by major lineages they represent. Significant lineages or those with many sublineages shown in the central panel. Lineage sets containing more than 10 lineages are indicated on the right, with their respective spike constellation described
Fig. 9
Plot showing the number of mutations in the consensus spike haplotype (CSH) for a given lineage as a function of the mutation frequency threshold (X) used to define the CSH. Lineages shown are those that contain > 20 designated sequences with complete spike nucleotide sequences and > 5 spike mutations. The lineages that correspond to the four main variants of concern (VOCs) are coloured individually, whilst all other included lineages are shown in green
Fig. 10
Plot showing the trade-off arising from the choice of mutation frequency threshold, X. A The mean number of lineages in each lineage set increases as X increases from 0 to 100% (pink). The mean is calculated across all spike-only sequences that can be designated to a lineage set. The mean is dominated by one very large lineage set; green dots show the actual set sizes for all lineage sets. B The total number of sequences that can be designated to a lineage set increases as X increases from 0 to 100% (orange). The mean percentage of sequences per lineage that cannot be designated to a lineage set is also shown in order to normalise the effect of lineage size (green). Error bars represent 95 percentiles of this distribution
