[Preprint]. 2021 Sep 6:2021.09.03.458038.
doi: 10.1101/2021.09.03.458038.

Rescuing Low Frequency Variants within Intra-Host Viral Populations directly from Oxford Nanopore sequencing data


Yunxi Liu et al. bioRxiv. 2021.


Abstract

Infectious disease monitoring on Oxford Nanopore Technologies (ONT) platforms offers rapid turnaround times and low cost, exemplified by well over half a million ONT SARS-CoV-2 datasets. Tracking low frequency intra-host variants has provided important insights into within-host viral population dynamics and transmission. However, given the higher error rate of ONT, accurate identification of intra-host variants with low allele frequencies remains an open challenge with no viable solutions available. In response to this need, we present Variabel, a novel approach and the first method designed for rescuing low frequency intra-host variants from ONT data alone. We evaluated Variabel on both within-patient and across-patient paired Illumina and ONT datasets; our results show that Variabel can accurately identify low frequency variants with allele frequencies below 0.5, outperforming existing state-of-the-art ONT variant callers for this task. Variabel is open-source and available for download at: www.gitlab.com/treangenlab/variabel.


Figures

Figure 1. Illustration of Variabel algorithm and workflow.
a) Sequencing reads from ONT are aligned to the SARS-CoV-2 reference genome with Minimap2, and variants are then called from the alignments using Lofreq. b) The cross-sample AF variation filter identifies variants that are shared between samples. Variant calls with a maximum AF less than 0.65 and a maximum AF variation less than 0.05 are classified as false calls. In this example, variants A and B pass the filter while variant C fails. c) The low-entropy filter calculates the Shannon entropy H of two subsequences (Seq A and Seq B) of the reference genome around the position where the indel call occurs. The length of the subsequences is determined by the length of the indel. The product of the two subsequence entropies is used to determine whether the indel is a false positive. d) Workflow of Variabel for detecting intra-host variants in ONT sequences.
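The two filters in panels b) and c) can be illustrated with a minimal Python sketch. This is not the Variabel implementation. It assumes that "AF variation" means the range (maximum minus minimum) of a variant's allele frequency across samples, that Seq A and Seq B are the reference windows immediately to the left and right of the indel position, and that each window is exactly as long as the indel. The function names and the `min_product` threshold are hypothetical; only the 0.65 and 0.05 cutoffs come from the caption.

```python
import math
from collections import Counter


def passes_af_filter(afs, max_af_cutoff=0.65, variation_cutoff=0.05):
    """Cross-sample AF variation filter (sketch).

    afs: allele frequencies of one variant in the samples where it was called.
    A call is treated as false when both its maximum AF and its AF variation
    fall below the cutoffs given in the figure caption.
    """
    max_af = max(afs)
    # Assumption: "AF variation" is the max-min range across samples.
    variation = max(afs) - min(afs)
    is_false_call = max_af < max_af_cutoff and variation < variation_cutoff
    return not is_false_call


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base composition of seq."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq.upper())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def indel_entropy_product(reference, pos, indel_len):
    """Product of the entropies of the two windows flanking an indel.

    Assumption: each window has the same length as the indel, so a
    single-base indel always scores 0 under this simplified reading.
    """
    left = reference[max(0, pos - indel_len):pos]
    right = reference[pos:pos + indel_len]
    return shannon_entropy(left) * shannon_entropy(right)


def passes_entropy_filter(reference, pos, indel_len, min_product=1.0):
    """Low-entropy filter (sketch); min_product is a hypothetical threshold."""
    return indel_entropy_product(reference, pos, indel_len) >= min_product


# Toy reference: a homopolymer run flanked by higher-complexity sequence.
reference = "ACGTACGT" + "AAAAAAAA" + "ACGTACGT"

low_stable = passes_af_filter([0.10, 0.12])    # low AF, little variation
low_varying = passes_af_filter([0.10, 0.40])   # low AF, large variation
in_homopolymer = passes_entropy_filter(reference, pos=12, indel_len=4)
in_complex = passes_entropy_filter(reference, pos=4, indel_len=4)
```

In this toy example the low, stable call is rejected while the call whose AF varies across samples is kept, and the indel inside the homopolymer run is rejected because both flanking windows have zero entropy.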
Figure 2. Variants called by Lofreq with Illumina and ONT sequences before and after applying Variabel.
In each of the subfigures, the left plot shows the variant calls before applying Variabel and the right plot shows the variant calls after applying Variabel. The x-axis shows the position of the variant on the reference genome. The y-axis shows the minimum allele frequency of the variant found in multiple samples. Variants found in the Illumina sequences only are marked in blue, and variants found in the ONT sequences only are marked in red. Variants that are shared between both Illumina and ONT data are shown in green. The size of the dot represents the number of samples supporting the variant. a) Variant calls of the time series dataset. b) Variant calls of the cross patient dataset.
Figure 3. Intra-host variant detection on time series dataset.
a) Venn diagram showing counts of variant calls shared between Lofreq on Illumina sequencing runs and Variabel and Clair3 on nanopore sequencing runs of the same samples from the time series dataset. b) False positive rates at different variant allele frequencies and cumulative counts of false positive variant calls of Variabel and Clair3 for the time series dataset. c) Precision, recall, and F-score comparison of Lofreq default, Clair3, and Variabel on both the time series dataset and the cross patient dataset. Significance between Clair3 and Variabel was calculated using the two-sided paired t-test. Significance labeling: n.s. (P>0.05), * (P≤0.05), ** (P≤0.01), *** (P≤0.001).
Figure 4. Intra-host variant detection on cross patient dataset.
a) Venn diagram showing counts of variant calls shared between Lofreq on Illumina sequencing runs and Variabel and Clair3 on nanopore sequencing runs of the same samples from the cross patient dataset. b) Simulation of the fraction of shared variants recovered from collections of SARS-CoV-2 samples of different sizes.

