Comparative Study
Brief Bioinform. 2024 May 23;25(4):bbae296.
doi: 10.1093/bib/bbae296.

Comparing full variation profile analysis with the conventional consensus method in SARS-CoV-2 phylogeny

Regina Nóra Fiam et al. Brief Bioinform.

Abstract

This study proposes a novel approach to studying severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) mutations through the comparison of sequencing data. Traditional consensus-based methods, which keep only the most common nucleotide at each position, can overlook or obscure low-frequency variants. Our method, in contrast, retains all sequenced nucleotides at each position, forming a genomic matrix. Using simulated short reads from genomes with specified mutations, we compared our genomic matrix approach with the consensus sequence method. Across multiple simulated datasets, the matrix methodology accurately reflected the known mutations, with an average accuracy improvement of 20% over the consensus method. In real-world tests using data from GISAID and NCBI-SRA, our approach increased reliability, reducing the error margin by approximately 15%. The genomic matrix approach offers a more accurate representation of viral genomic diversity, thereby providing superior insights into virus evolution and epidemiology.

Keywords: NGS sequencing; SARS-CoV-2; full variation profile analysis; viral variants.
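The contrast drawn in the abstract between a consensus sequence and a genomic matrix can be illustrated with a small sketch. It is not the authors' implementation: the read format (`(start, sequence)` pairs already aligned to a reference), the symbol set `ACGT-`, and the two distance functions are assumptions chosen only to show how a low-frequency variant disappears from the consensus yet remains visible in per-position frequencies.

```python
from collections import Counter

ALPHABET = "ACGT-"  # '-' marks a deletion; this symbol set is an assumption


def genome_matrix(reads, genome_length):
    """Return per-position relative frequencies of each symbol.

    `reads` is a list of (start, sequence) pairs already aligned to the
    reference; this input format is hypothetical and chosen for brevity.
    """
    counts = [Counter() for _ in range(genome_length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < genome_length and base in ALPHABET:
                counts[pos][base] += 1
    matrix = []
    for c in counts:
        depth = sum(c.values())
        matrix.append([c[b] / depth if depth else 0.0 for b in ALPHABET])
    return matrix


def consensus(matrix):
    """Most frequent symbol per position; 'N' where there is no coverage."""
    out = []
    for row in matrix:
        best = max(row)
        out.append("N" if best == 0.0 else ALPHABET[row.index(best)])
    return "".join(out)


def consensus_distance(s1, s2):
    """Fraction of positions at which two consensus strings differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def matrix_distance(m1, m2):
    """Mean per-position half-L1 distance (an assumed metric, range 0..1)."""
    total = sum(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    return total / (2 * len(m1))


# Sample A is homogeneous; sample B carries a C->T variant in 1 of 4 reads.
reads_a = [(0, "ACGT")] * 4
reads_b = [(0, "ACGT")] * 3 + [(0, "ATGT")]

matrix_a = genome_matrix(reads_a, 4)
matrix_b = genome_matrix(reads_b, 4)

print(consensus(matrix_a), consensus(matrix_b))  # ACGT ACGT
print(consensus_distance(consensus(matrix_a), consensus(matrix_b)))  # 0.0
print(matrix_distance(matrix_a, matrix_b))  # 0.0625
```

In this toy case the two samples have identical consensus sequences, so a consensus-based comparison reports no difference, while the frequency matrices still separate them.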

Figures

Figure 1. Schematic diagram of the data processing and the simulation construction.
Figure 2. Visualization of the three approaches. The numbers on the axes refer to the order of the simulated sequencing samples listed in Table 1.
Figure 3. Deletions around the genome marked with ‘-’ in purple.
Figure 4. Visualization of the dendrograms of the genome matrix, consensus-based approach, and the true composition’s distance matrices. It can be seen that compared with the true composition, the consensus-based approach amplifies or smoothens the differences, whereas based on genome matrices, we do not obtain such outliers in the relationships.
Figure 5. Visualization of the similarity of real data using heatmaps.
Figure 6. Deletions along the EPI_ISL_665636 genome marked with ‘-’ in purple.
Figure 7. Zooming into the deletion sites within the ERR4892461 genome. In the left column, it can be seen that ‘holes’ appear in the number of nucleotides at the deletion sites. On the right, zooming into these positions, we can see that there are also alternative nucleotides here, which however, stand out from the noise of the non-deletion sites.
Figure 8. The 16th sample greatly differs from the others, but according to the genome matrix, it shows similarity with several samples. The consensus-based approach organizes the samples into fewer clusters. Accordingly, we concluded here, similar to what was observed in simulations where the true composition was known, that the representation of variability decreases when using consensus.
Figure 9. Entropy sequences along the entire genome: visualization of the information content of genomes.
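
The entropy sequences mentioned in the Figure 9 caption can be pictured with a short sketch. The caption does not state which entropy definition the authors used, so Shannon entropy in bits is assumed here, and the function name and input layout (one row of symbol frequencies per genome position) are hypothetical.

```python
import math


def positional_entropy(freq_rows):
    """Shannon entropy in bits for each row of symbol frequencies.

    Rows are normalized by their sum, so raw counts work as well as
    relative frequencies. Rows with no coverage yield 0.0.
    """
    entropies = []
    for row in freq_rows:
        total = sum(row)
        h = 0.0
        for value in row:
            if value > 0 and total > 0:
                p = value / total
                h -= p * math.log2(p)
        entropies.append(h)
    return entropies


# Three positions: fully conserved, an even two-way mix, an even four-way mix.
rows = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25, 0.0],
]
print(positional_entropy(rows))  # [0.0, 1.0, 2.0]
```

Under this assumption, a conserved position carries zero entropy, and positions with mixed nucleotides score higher, which is the kind of per-position variability a consensus sequence cannot express.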
