Nucleic Acids Res. 2019 Oct 10;47(18):e104.
doi: 10.1093/nar/gkz657.

Long-read amplicon denoising


Venkatesh Kumar et al. Nucleic Acids Res. 2019.

Abstract

Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies from PacBio reads. Called 'amplicon denoising', this problem has been extensively studied for short-read sequencing technologies, but current solutions do not always generalize successfully to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium-length reads and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On two Mock Virus Community datasets with ground truth, each sequenced on a different PacBio instrument, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method, successfully discriminating templates that differ by just a single nucleotide. Julia implementations of Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD), and a webserver interface, are freely available.


Figures

Figure 1.
Under a simple error model, with constant per-base error probabilities (P), the probability that a sequence will have no errors decreases exponentially with the sequence length, with the slope of this decrease determined by P.
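The relationship in this caption can be written out directly. Under the stated model, a read of length L with independent, constant per-base error probability P is error-free with probability (1 - P)^L, so on a log scale the curve is a straight line in L with slope log(1 - P). The sketch below is only an illustration of that formula: the function name `prob_error_free` and the per-base error rates are assumptions chosen for the example, while the read lengths echo the ~2.6 kb and 9 kb datasets mentioned elsewhere on this page.

```python
def prob_error_free(length: int, per_base_error: float) -> float:
    """Probability that a read of `length` bases has no errors, assuming
    independent errors with a constant per-base probability."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if not 0.0 <= per_base_error <= 1.0:
        raise ValueError("per_base_error must lie in [0, 1]")
    return (1.0 - per_base_error) ** length


# Illustrative (assumed) per-base error rates; lengths follow the
# ~2.6 kb and 9 kb amplicons described in the figure captions.
for p in (0.001, 0.01):
    for n in (2600, 9000):
        print(f"P={p}, L={n}: {prob_error_free(n, p):.3e}")
```

Even a modest per-base error rate leaves very few error-free reads at these lengths, which is why exact-match approaches designed for short reads become hard to apply.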
Figure 2.
Our kmer distance approximates edit distance as mutations are introduced, starting from the 2599 bp NL4-3 HIV-1 env sequence. When only substitutions are introduced, edit distance is approximated extremely well. When indels are introduced, the kmer distance underestimates edit distance. This is desirable when the sequencing error process is dominated by indels, because indels are then downweighted in the distance function.
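One way to see why a kmer-based distance can track substitutions closely while undercounting indels is a toy calculation on kmer count vectors. The sketch below is an assumption-laden illustration, not the paper's corrected kmer distance: the function names and the normalization by 2k are hypothetical choices made here. The reasoning is that a single substitution in a region of unique kmers removes k kmers and creates k new ones (squared Euclidean distance 2k), whereas a single inserted base removes only k - 1 kmers and creates k (squared distance 2k - 1).

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping kmer of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def approx_edits(a: str, b: str, k: int = 6) -> float:
    """Squared Euclidean distance between kmer count vectors, scaled by
    2k so that one isolated substitution scores about 1.
    Assumed normalization for illustration only."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for kmers absent from one of the sequences.
    sq_dist = sum((ca[key] - cb[key]) ** 2 for key in set(ca) | set(cb))
    return sq_dist / (2 * k)


# Tiny hand-checkable cases with k=2.
print(approx_edits("ACGT", "ACTT", k=2))   # one substitution -> 1.0
print(approx_edits("ACGT", "ACAGT", k=2))  # one insertion    -> 0.75
```

In this toy setting the substitution is scored as exactly one edit while the insertion is scored below one, matching the qualitative behavior the caption describes. Repeated kmers, as in homopolymer runs, change these counts, so the figures here should not be read as properties of real amplicons.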
Figure 3.
Sequence Mutation Distance (SMD) accurately approximates the average error rate when computed between a set of templates and a set of sequences derived from those templates by some noisy process.
Figure 4.
(A) Error rates (measured by Sequence Mutation Distance) of reconstructions against ground truth for a number of datasets. 2.6 kb MVC is a real sequencing dataset, using primer barcodes to obtain the ground truth clustering. P018 (∼2.6 kb) comprises reads simulated in silico from a set of templates obtained from an HIV+ donor, from a low diversity, early time point, and a later, more diverse, time point. The 9 kb dataset comprises a set of closely related templates, with long reads simulated from these, using a higher error rate profile. The dashed horizontal gray line shows the threshold for an expected error rate (by SMD) of 1 bp per sequence. Also shown below are run times of the various methods. (B) From the 2.6 kb MVC full dataset, we show a phylogeny depicting the ground truth templates, as well as the inferred templates for FAD and RAD.
Figure 5.
Visualizing the structure of the FAD-denoised variants from the single-chain Fragment variable (scFv) phage library after 4 rounds of selection. Variant frequency is depicted by bubble size, and variants with ≤2.36% corrected kmer distance (the minimum distance between any pre- and post-selection variant) are connected in the network. We also show the three largest connected components, coloring each variant according to whether the scFv linker was short (blue) or long (red).
