Genome Biol. 2019 Jan 8;20(1):8.
doi: 10.1186/s13059-018-1618-7.

An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

Nathan D Grubaugh et al. Genome Biol. 2019.

Abstract

How viruses evolve within hosts can dictate infection outcomes; however, reconstructing this process is challenging. We evaluate our multiplexed amplicon approach, PrimalSeq, to demonstrate how virus concentration, sequencing coverage, primer mismatches, and replicates influence the accuracy of measuring intrahost virus diversity. We develop an experimental protocol and computational tool, iVar, for using PrimalSeq to measure virus diversity using Illumina and compare the results to Oxford Nanopore sequencing. We demonstrate the utility of PrimalSeq by measuring Zika and West Nile virus diversity from varied sample types and show that the accumulation of genetic diversity is influenced by experimental and biological systems.

Keywords: Amplicon sequencing; Intrahost evolution; SNP calling; Viral sequencing; West Nile; Zika.

Conflict of interest statement

Ethics approval and consent to participate

Research on human subjects was conducted in compliance with existing regulations relating to the protection of human subjects and was evaluated and approved (#IRB-15-6664) by the Institutional Review Board/Ethics Review Committee at The Scripps Research Institute. Clinical samples were obtained from the Florida Department of Health (DOH) and Antibody Systems Inc. Samples collected in Florida were collected under a waiver of consent granted by the Florida DOH Human Research Protection Program. The work received a non-human subjects research designation (category 4 exemption) by the Florida DOH since this research was performed with leftover clinical diagnostic samples involving no more than minimal risk. Hence, written informed consent was not obtained. All samples were de-identified prior to receipt by the study investigators. The experimental methods used comply with the Helsinki Declaration.

Research involving Indian origin rhesus macaques was conducted at the California National Primate Research Center, and experimental infections of mice upon which Ae. aegypti mosquitoes fed were performed at the University of California, Davis, School of Veterinary Medicine. Both institutes are fully accredited by the Association for the Assessment and Accreditation of Laboratory Animal Care International. Animals were cared for in accordance with the National Research Council Guide for the Care and Use of Laboratory Animals and the Animal Welfare Act. Animal experiments were approved by the Institutional Animal Care and Use Committee of UC Davis (protocols #19211 and #19695 for rhesus macaques, protocol #19404 for mice). All macaque samples used in this study were from approved studies [70], and none were generated specifically for this work.

Consent for publication

Not applicable.

Competing interests

NJL has received travel and accommodation expenses from Oxford Nanopore Technologies to attend meetings, and an honorarium to speak at an internal company meeting. NJL has previously received free-of-charge reagents and consumables in support of research projects from Oxford Nanopore Technologies. The other authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Measurements of intrahost variant frequencies are more accurate at high frequencies and are susceptible to input concentrations and coverage depths. a We created genetically diverse virus populations by mixing two Zika virus isolates with 159 consensus nucleotide differences to test the effects of PCR amplification prior to sequencing to measure intrahost single-nucleotide variant (iSNV) frequencies. For these initial experiments, we amplified three ~400 bp regions of the Zika virus genome using primers without any mismatches to either of the mixed viruses (shown as amplicons 5, 24, and 33). "Amplicon 5" contains 5 iSNV sites, "amplicon 24" contains 8 iSNV sites, and "amplicon 33" contains 5 iSNV sites. b We created virus populations containing 50%, 25%, 14%, 7%, 3%, 1.5%, and 0.8% virus #2 to test the impact of PCR amplification prior to sequencing on measuring ranges of iSNV frequencies. The data points represent individual iSNVs amplified and sequenced in triplicate from each population (colored by amplicon 5, 24, or 33 as shown in a). c We 10-fold serially diluted a mixed population containing 14% of virus #2 (expected, dotted line) from 100,000 to 10 copies to test the effects of input concentrations on accurate iSNV measurements. d We randomly downsampled the datasets generated from 1000 input virus RNA copies containing 3% virus #2 to set coverage depths (sequenced nucleotides [nt] per genome position) to determine the minimum coverage needed to yield accurate iSNV measurements. For c and d, Levene’s test was used to assess equality among variances of iSNV measurements from each coverage depth (ns, not significant; *, p < 0.05). Data shown as means with standard deviations
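The downsampling idea in panel d can be illustrated with a small sketch. This is not the authors' procedure: the pileup below is hypothetical (a single site at 1000× depth with a 3% variant), and `downsample_frequency` is an illustrative helper name. It only shows how a frequency estimate is recomputed after randomly subsampling the base calls at one position to a lower depth.

```python
import random

def downsample_frequency(bases, variant, depth, rng):
    """Estimate the variant frequency from a random subsample of base calls."""
    if depth > len(bases):
        raise ValueError("requested depth exceeds available coverage")
    sample = rng.sample(bases, depth)  # sampling without replacement
    return sample.count(variant) / depth

# Hypothetical pileup at one site: 30 variant calls out of 1000 (3%).
rng = random.Random(0)
bases = ["T"] * 30 + ["C"] * 970

# Re-estimate the frequency at several coverage depths.
estimates = {d: downsample_frequency(bases, "T", d, rng) for d in (100, 400, 1000)}
for depth, freq in estimates.items():
    print(depth, round(freq, 3))
```

Repeating the subsampling many times at each depth would give the spread of estimates whose variances a test such as Levene's compares.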
Fig. 2
Measures of intrahost variant frequencies are sensitive to primer mismatches. a To assess the impacts of primer mismatches on accurately measuring intrahost single-nucleotide variants (iSNVs), we sequenced a mixed Zika virus population using 35 overlapping PCR amplicons (see “Amplicon scheme” above panel). The virus population contained 10% virus #2 (Expected) and 1000 virus RNA copies were amplified and sequenced in triplicate. The amplicons and iSNVs are colored according to the number of mismatches in the primer sequences used to generate that amplicon. Data shown as means and ranges. b To account for unequal iSNV sites within each amplicon, the iSNV frequencies on each amplicon were averaged to produce a haplotype frequency for virus #2 mixed at 10% (Expected). Data shown as means and ranges. c We calculated the deviations between the measured and expected virus #2 haplotype frequencies (absolute value of the log2 fold change) to assess the bias introduced during PCR of amplicons containing primer mismatches to virus #2 (*, Welch’s t test, p < 0.05). Data shown as means and standard deviations. d We plotted the deviations from expected haplotype frequencies by the distance of mismatches from the 3′ end of the primer to investigate the impact of mismatch location. If more than one mismatch was present on a primer pair (orange), the data are shown using the closest mismatch to the 3′ end. Mismatches closer to the 3′ end of the primer are more likely to decrease the accuracy of iSNV or haplotype measurements from that amplicon (correlation by Pearson r, p < 0.05). Data shown as the mean from all three replicates
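The two calculations named in panels b and c (averaging iSNV frequencies per amplicon into a haplotype frequency, then taking the absolute log2 fold change from the expected value) can be sketched as follows. The amplicon names and frequencies are made up for illustration, and the function names are hypothetical rather than part of iVar.

```python
import math
from statistics import mean

def haplotype_frequency(isnv_freqs):
    """Average the iSNV frequencies measured on one amplicon."""
    return mean(isnv_freqs)

def abs_log2_fold_change(measured, expected):
    """Deviation from the expected frequency as |log2(measured / expected)|."""
    return abs(math.log2(measured / expected))

# Hypothetical per-amplicon iSNV frequencies for a 10% virus #2 mixture.
expected = 0.10
amplicons = {
    "amplicon_A": [0.10, 0.11, 0.09],  # e.g. no primer mismatches
    "amplicon_B": [0.05, 0.05],        # e.g. a mismatch suppressing virus #2
}

deviations = {
    name: abs_log2_fold_change(haplotype_frequency(freqs), expected)
    for name, freqs in amplicons.items()
}
for name, dev in deviations.items():
    print(name, round(dev, 3))
```

A deviation of 1 corresponds to a two-fold over- or underestimate of the haplotype frequency, which is why the absolute log2 scale treats both directions of bias symmetrically.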
Fig. 3
False positive intrahost variants caused by sequencing errors can be removed by technical replicates and frequency cutoffs. We sequenced a mixed Zika virus population containing 10% of virus #2 in triplicate, limited our analysis to the regions only covered by perfect PCR primer matches, and removed sites with intrahost single-nucleotide variants (iSNVs) detected at > 1% frequency in either of the Zika virus isolates. This left us with 61 true positive (10% frequency) iSNV sites and 3940 sites not expected to be variable to investigate false positives (> 0.1% frequency). a The locations of false positives on the sequencing read position were mapped and shown as the distribution within 25 nt bins by percent of sites with false positive iSNV calls. Each color represents data from an independent replicate. Inset: read positions > 150 nt had a significantly higher false positive rate than positions < 150 nt (*, Wilcoxon test, p < 0.05). b The iSNV frequencies from each false positive were also plotted by position on the sequencing read. Each color represents data from an independent replicate. Inset: False positive iSNV frequencies were significantly higher at read positions > 150 nt than < 150 nt (*, Mann-Whitney test, p < 0.05). c True and false iSNVs were plotted by frequency for each individual replicate (A, B, and C) and combined as technical duplicates and triplicates showing the mean frequencies of iSNVs only found in all replicates. Data shown as means and standard deviations. The line indicates the proposed cutoff at 3% based on removing false positives from the replicate data while still in the range of high accuracy (Fig. 1b)
Fig. 4
PCR amplification prior to sequencing leads to similar overall measurements of genetic diversity. a We compared PrimalSeq, which enriches for specific virus sequences, to the current ‘gold standard’ for measuring intrahost genetic diversity, metagenomics; and we compared sequencing the amplicons using the Illumina and Oxford Nanopore platforms. The schematic outlines the general workflow for all approaches. b We sequenced our mixed Zika virus population (1000 virus RNA copies) containing 10% virus #2 (Expected) in triplicate using both approaches and platforms to compare the accuracy of measuring known intrahost single-nucleotide variants (iSNVs). We only analyzed regions of the Zika virus #1 and #2 genomes (Fig. 1a) that were perfect matches to the PCR primer sequences, leaving 61 iSNV sites. Data shown as mean and range of triplicate tests. c We combined the frequency measurements for each iSNV site and replicate (n = 183) to compare the accuracy between the two approaches and platforms. Dashed line shows the expected true iSNV frequencies at 10%. Data shown as means and standard deviations. The mean frequencies were not significantly different (ns, Welch’s t test, p > 0.05), but the variances were not equal (*, Levene’s test, p < 0.05). d We analyzed the frequency of false positive iSNVs > 3% (cutoff determined in Fig. 3c) from each sequencing method and technical replicate (“A, B, C”) from 4173 sites that are expected to be true negatives. From our metagenomics and PCR-Illumina sequencing data, the same false positive iSNVs > 3% frequency are not found in multiple technical replicates; however, many are found in the PCR-Nanopore replicates (see Fig. 5). Dashed line shows the iSNV cutoff at 3%
Fig. 5
High false discovery rates of intrahost variants using Nanopore sequencing. a iSNV false discovery rates (FDR) from Oxford Nanopore sequencing data. We analyzed 54 true positive and 4173 true negative sites, and determined the proportion of true and false positive iSNV calls from datasets containing 1, 2, or 3 technical replicates using either a 3% frequency cutoff or a logistic regression of iSNV frequency and strand bias. b A receiver operating characteristic (ROC) curve showing a logistic regression model that incorporates allele frequency and strand bias as features and the presence or absence of an iSNV as the response variable. The model was trained and tested using a 10-fold cross validation scheme. The model was run using frequency or strand bias alone, and using the two features combined. Post-filter iSNV frequencies of true and false positive calls using c a 3% cutoff or d a logistic regression of iSNV frequency and strand bias. Data shown as the means and 95% confidence intervals
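As a rough illustration of the false discovery rate in panel a, the sketch below uses the standard definition, FDR = false positives / all positive calls, which the caption does not spell out and is therefore an assumption here. The site positions are invented and `false_discovery_rate` is a hypothetical helper.

```python
def false_discovery_rate(called_sites, true_sites):
    """Fraction of called sites that are not in the set of known true sites."""
    called = set(called_sites)
    if not called:
        return 0.0  # no calls, so no false discoveries
    false_positives = called - set(true_sites)
    return len(false_positives) / len(called)

# Hypothetical genome positions: three known iSNV sites, four calls made.
true_sites = {1012, 2540, 7301}
called_sites = {1012, 2540, 7301, 8820}

fdr = false_discovery_rate(called_sites, true_sites)
print(fdr)
```

Applying the same function to calls retained by one, two, or three replicates would show how requiring agreement across replicates changes the rate.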
Fig. 6
Experimental and computational workflow for measuring intrahost virus diversity using PrimalSeq
Fig. 7
PrimalSeq can be used to measure intrahost variants from a variety of sample types. a We sequenced technical duplicates of Zika virus populations (1000 virus RNA copies each) to identify intrahost single-nucleotide variants (iSNVs) > 3% within each sample. In vitro and in vivo samples were generated using Zika virus strain PRVABC59 (isolated from Puerto Rico, 2015) during infection of Ae. aegypti Aag2 cells (derived from embryos), human HeLa cells (derived from cervical epithelial cells), Ae. aegypti mosquitoes (orally infected), and Indian origin rhesus macaques (subcutaneously infected). For the in vitro and in vivo samples, where the reference population sequence is known, the iSNV frequencies were calculated by change in frequency from pre- to post-infection. Field Zika virus samples from pooled Ae. aegypti and human clinical samples were collected from Florida during the 2016 Zika virus outbreak. b Culex mosquitoes and dead American crows were collected from San Diego County, CA, during 2015 to sequence West Nile virus from field samples (10,000 virus RNA copies each). The iSNV frequencies from the field samples are the minor allele frequencies (maximum frequency = 0.5) because the reference virus sequence was not known. For both (a and b), analysis was limited to regions of the genome with > 400× coverage depth in the protein coding sequence and we masked amplicons with primer mismatches from our analysis (gray regions) for direct comparisons of intrahost genetic diversity
Fig. 8
Intrahost virus genetic diversity is dependent on the experimental and biological system. Variants called from Zika and West Nile virus populations derived from in vitro, in vivo, and field studies (Fig. 7) were used to compare intrahost virus diversity from mosquito vectors (Ae. aegypti and Culex species) and vertebrate hosts (primates or birds). We compared a richness (the number of intrahost single-nucleotide variant [iSNV] sites; Fig. 7a), b complexity (uncertainty associated with randomly sampling an allele, measured by Shannon entropy [Sn]), and c distance (the sum of all iSNV frequencies). The mosquito and vertebrate-derived populations were compared using unpaired Mann-Whitney rank tests (ns, not significant; *, p < 0.05). Data shown as mean and standard deviation. d The proportion of Zika virus iSNVs detected in the Ae. aegypti and rhesus macaque in vivo samples were distributed by frequency. Bin width is 0.05. e Our combined data suggest that intrahost virus diversity is dependent upon the experimental system (i.e., in vitro, in vivo, or field samples)
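The three diversity measures in panels a to c can be sketched under stated assumptions. Richness and distance follow the caption directly. For complexity, the caption only names Shannon entropy, so this sketch assumes biallelic sites, the natural logarithm, and a sum over sites; the authors' exact formula may differ. All function names and frequencies are illustrative.

```python
import math

def richness(isnv_freqs):
    """Number of iSNV sites."""
    return sum(1 for f in isnv_freqs if f > 0)

def site_entropy(freq):
    """Shannon entropy of one biallelic site (assumed form, natural log)."""
    if freq <= 0.0 or freq >= 1.0:
        return 0.0
    return -(freq * math.log(freq) + (1.0 - freq) * math.log(1.0 - freq))

def complexity(isnv_freqs):
    """Sum of per-site entropies (assumption: summed rather than averaged)."""
    return sum(site_entropy(f) for f in isnv_freqs)

def distance(isnv_freqs):
    """Sum of all iSNV frequencies."""
    return sum(isnv_freqs)

# Hypothetical iSNV frequencies from one sample.
freqs = [0.5, 0.1]
print(richness(freqs), round(complexity(freqs), 4), round(distance(freqs), 4))
```

Richness counts sites regardless of frequency, whereas complexity and distance both weight each site by how far the variant has risen, so the three measures can rank samples differently.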
Fig. 9
Overview of iVar pipeline. iVar was used to construct two pipelines for calling intrahost single-nucleotide variants (iSNVs) from samples with and without a known reference sequence. The nodes in the chart are colored based on the usage of iVar, bwa, and SAMtools at each step. For samples with a known reference sequence, the primer sequences are trimmed from the sequenced reads, followed by quality trimming. A consensus sequence for the sample is called by merging the aligned BAM files from each replicate. The primer sequences are then aligned to this consensus sequence and mismatches are identified by iVar after performing variant calling on the aligned primers. The reads corresponding to the mismatched primers are removed from the aligned BAM file of each replicate to ensure that any bias introduced in the iSNV frequencies is removed. The iSNVs are then called for each replicate individually, with a minimum frequency threshold of 3%, and the intersection of the iSNVs across all the replicates is considered to be the set of “true” iSNVs. For samples with an unknown reference sequence, the iSNVs cannot be called directly using a reference sequence. In this case, after generating the consensus sequence, reads from each replicate are aligned back to this consensus sequence and these realigned BAM files are used for the same subsequent steps as in the case of samples with a known reference sequence
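The final step described above (a 3% minimum frequency per replicate, then the intersection across replicates) can be sketched in a few lines. This is an illustration of the logic only, not iVar's implementation: the data layout (a dict keyed by position and alternate base) and the function names are assumptions, and the frequencies are invented.

```python
from statistics import mean

def call_isnvs(freqs_by_site, min_freq=0.03):
    """Sites in one replicate at or above the minimum frequency threshold."""
    return {site for site, freq in freqs_by_site.items() if freq >= min_freq}

def intersect_replicates(replicates, min_freq=0.03):
    """Keep iSNVs called in every replicate; report their mean frequency."""
    called = [call_isnvs(rep, min_freq) for rep in replicates]
    shared = set.intersection(*called) if called else set()
    return {site: mean(rep[site] for rep in replicates) for site in sorted(shared)}

# Hypothetical replicates keyed by (genome position, alternate base).
replicate_a = {(100, "T"): 0.10, (250, "G"): 0.04, (300, "A"): 0.01}
replicate_b = {(100, "T"): 0.12, (400, "C"): 0.05}

true_isnvs = intersect_replicates([replicate_a, replicate_b])
print(true_isnvs)
```

Here only the site at position 100 survives: the calls at 250 and 400 each appear in a single replicate, and the call at 300 falls below the threshold.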
