PLoS Comput Biol. 2018 Dec 13;14(12):e1006498.
doi: 10.1371/journal.pcbi.1006498. eCollection 2018 Dec.

Full-Length Envelope Analyzer (FLEA): A tool for longitudinal analysis of viral amplicons

Kemal Eren et al. PLoS Comput Biol. 2018.

Abstract

Next generation sequencing of viral populations has advanced our understanding of viral population dynamics, the development of drug resistance, and escape from host immune responses. Many applications require complete gene sequences, which can be impossible to reconstruct from short reads. HIV env, the protein of interest for HIV vaccine studies, is exceptionally challenging for long-read sequencing and analysis due to its length, high substitution rate, and extensive indel variation. While long-read sequencing is attractive in this setting, the analysis of such data is not well handled by existing methods. To address this, we introduce FLEA (Full-Length Envelope Analyzer), which performs end-to-end analysis and visualization of long-read sequencing data. FLEA consists of both a pipeline (optionally run on a high-performance cluster) and a client-side web application that provides interactive results. The pipeline transforms FASTQ reads into high-quality consensus sequences (HQCSs) and uses them to build a codon-aware multiple sequence alignment. The resulting alignment is then used to infer phylogenies, selection pressure, and evolutionary dynamics. The web application provides publication-quality plots and interactive visualizations, including an annotated viral alignment browser, time series plots of evolutionary dynamics, visualizations of gene-wide selective pressures (such as dN/dS) across time and across protein structure, and a phylogenetic tree browser. We demonstrate how FLEA may be used to process Pacific Biosciences HIV env data and describe recent examples of its use. Simulations show how FLEA dramatically reduces the error rate of this sequencing platform, providing an accurate portrait of complex and variable HIV env populations. A public instance of FLEA is hosted at http://flea.datamonkey.org. The Python source code for the FLEA pipeline can be found at https://github.com/veg/flea-pipeline. The client-side application is available at https://github.com/veg/flea-web-app. A live demo of the P018 results can be found at http://flea.murrell.group/view/P018.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of the FLEA pipeline, broken into conceptual sub-pipelines.
The Quality and Consensus sub-pipelines process each time point separately. Duplicate steps in other time points are grayed out. CCS stands for “circular consensus sequences”, QCS for “quality-controlled sequences”, and HQCS for “high-quality consensus sequences”.
Fig 2. Quality and consensus sub-pipelines.
These steps are repeated independently on each time point. Numbers are reported from the analysis of sequences from the first time point (V03) of donor P018, which is three months post infection. Percentages give the fraction of sequences retained after filtering. Each task indicates whether it uses the third-party tools USEARCH or MAFFT.
Fig 3. Hidden Markov model used for trimming poly-A and poly-T heads and tails.
The poly-A head and tail states have a small probability (p = 0.01) of emitting non-A bases, and similarly for the poly-T states. The body state emits all four bases with equal probability. The start and stop states emit nothing.
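The trimming model in Fig 3 can be illustrated with a small Viterbi decoder. The sketch below is not taken from the FLEA source: it handles only the poly-A case (the poly-T case would swap the base), and the function name `trim_poly_a`, the transition probability `p_stay`, the equal start probabilities, and the rule that a path must end in the body or tail state are all assumptions made for illustration. Only the emission probabilities (0.01 for non-A bases in head and tail, uniform in the body) come from the caption.

```python
import math

def trim_poly_a(seq, p_mismatch=0.01, p_stay=0.9):
    """Remove a poly-A head and tail from seq using Viterbi decoding.

    States are head -> body -> tail (left to right). Transition values
    are illustrative assumptions; emissions follow the Fig 3 caption.
    """
    if not seq:
        return seq
    neg_inf = float("-inf")
    states = ("head", "body", "tail")
    # Allowed transitions (assumed probabilities).
    trans = {
        ("head", "head"): p_stay, ("head", "body"): 1 - p_stay,
        ("body", "body"): p_stay, ("body", "tail"): 1 - p_stay,
        ("tail", "tail"): 1.0,
    }
    # The silent start state may enter head or body (assumed equal odds).
    start = {"head": math.log(0.5), "body": math.log(0.5), "tail": neg_inf}

    def emit(state, base):
        if state == "body":
            return math.log(0.25)  # all four bases equally likely
        if base == "A":
            return math.log(1 - p_mismatch)
        return math.log(p_mismatch / 3)  # rare non-A emission

    score = {s: start[s] + emit(s, seq[0]) for s in states}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev, best = None, neg_inf
            for prev in states:
                if (prev, s) in trans and score[prev] > neg_inf:
                    cand = score[prev] + math.log(trans[(prev, s)])
                    if cand > best:
                        best_prev, best = prev, cand
            pointers[s] = best_prev
            new_score[s] = best + emit(s, base) if best_prev else neg_inf
        back.append(pointers)
        score = new_score

    # The silent stop state is reached from body or tail (assumption).
    state = max(("body", "tail"), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return "".join(b for b, s in zip(seq, path) if s == "body")
```

Because the head and tail states tolerate rare mismatches, a homopolymer run interrupted by a single sequencing error can still be trimmed as one unit, which a plain "strip leading A characters" rule would not do.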
Fig 4. Screenshot of the multidimensional scaling plot.
The embedding in two dimensions preserves pairwise evolutionary distances between HQCSs. Node area is proportional to copy number, and color corresponds to time point. The increasing genetic diversity of the population is visible as time goes on.
Fig 5. Screenshot of the evolutionary trajectory report.
Four evolutionary metrics (dS divergence, dN divergence, total divergence, and total diversity) and two phenotype metrics (length and possible N-linked glycosylation sites) are shown for gp160.
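One of the two phenotype metrics, the count of possible N-linked glycosylation sites, can be sketched as follows. This assumes the standard sequon definition (N-X-S/T, where X is any residue except proline); the function name `count_pngs` and the gap-stripping step are illustrative and are not taken from the FLEA source.

```python
import re

# N followed by any residue except P, then S or T. The lookahead keeps
# matches from consuming residues, so overlapping sequons are all counted.
SEQUON = re.compile(r"N(?=[^P][ST])")

def count_pngs(protein):
    """Count possible N-linked glycosylation sites in a protein sequence.

    Alignment gaps ("-") are removed first so that a sequon split by a
    gap column is still recognized.
    """
    ungapped = protein.replace("-", "").upper()
    return len(SEQUON.findall(ungapped))
```

For example, `count_pngs("MNGTNPSANNSS")` counts the sequons NGT, NNS, and NSS but skips NPS, because proline in the middle position blocks glycosylation.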
Fig 6. Screenshot of the amino acid sequence viewer.
Sequences are grouped by identity, with aggregate copy number and population percentage shown to the right. An overview of the amplicon, optionally annotated with region names, provides fast access to different locations of the alignment. Selecting columns of the alignment interactively updates the amino acid dynamics plot, showing the dynamics of the selected motif over time. In this case, the trajectory shows changes in the N332 glycan supersite. Sites inferred by FUBAR to be undergoing positive selection are selectable.
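The grouping described in this caption can be sketched in a few lines of Python. This is a hypothetical illustration rather than the web application's code: the function name `group_by_identity` and the `(sequence, copy_number)` input format are assumptions.

```python
from collections import defaultdict

def group_by_identity(records):
    """Collapse identical sequences and report copy number and percentage.

    records: iterable of (sequence, copy_number) pairs (assumed format).
    Returns (sequence, total_copies, percent_of_population) tuples,
    most abundant first. Assumes the total copy number is positive.
    """
    totals = defaultdict(int)
    for seq, copies in records:
        totals[seq] += copies
    grand_total = sum(totals.values())
    rows = [(seq, n, 100.0 * n / grand_total) for seq, n in totals.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows
```

Applied to the residues of a selected motif at each time point, the same aggregation would yield the values plotted as motif dynamics over time.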
Fig 7. Screenshots of the interactive three-dimensional Env structure, colored according to JS divergence (left) and dN/dS values (right).
Positions imputed to be undergoing more positive selection (dN/dS > 1) are darker red, and positions undergoing more purifying selection (dN/dS < 1) are darker blue. The right structure also shows motif positions highlighted in the sequence viewer.
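The JS (Jensen-Shannon) divergence used to color the left structure compares two probability distributions. The caption does not say which distributions FLEA compares, so the sketch below assumes amino acid frequencies at one alignment column from two time points; the function names and the use of base-2 logarithms are likewise illustrative choices.

```python
import math
from collections import Counter

def frequencies(column):
    """Relative frequency of each residue in an alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    p and q map residues to probabilities. The result lies in [0, 1]:
    0 for identical distributions, 1 for non-overlapping ones.
    """
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(dist):
        # Kullback-Leibler divergence from dist to the mixture.
        return sum(v * math.log2(v / mix[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)
```

A column that is entirely N at one time point and entirely D at another gives the maximum value of 1, while an unchanged column gives 0.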
Fig 8. Screenshot of dN/dS values mapped to protein positions and separated by time point.
Fig 9. Screenshot of the phylogenetic tree viewer.
Leaf node size corresponds to sequence copy number. Node color corresponds to time point. Since ancestral sequences have been inferred, ancestral nodes are colored according to the selected motif, which in this case is the N332 glycan supersite.
Fig 10. Comparison of true sequence abundances versus copy numbers inferred by FLEA for each time point of the simulated P018 data.
Each node represents one sequence, with the area denoting its relative abundance in the population. The true population (top) is colored green. For each true sequence, the matching HQCS appears below it in blue. Red nodes denote false negatives and false positives. The most common false negative for each time point is annotated with its abundance.
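The comparison in Fig 10 amounts to matching the true sequences against the inferred HQCSs for one time point. A minimal sketch follows, assuming exact string matching and dictionary inputs keyed by sequence; the function name `compare_populations` and this input format are hypothetical and may differ from how the published simulations were evaluated.

```python
def compare_populations(true_abundance, inferred_copies):
    """Compare a true population with inferred HQCSs for one time point.

    true_abundance: dict mapping sequence -> true relative abundance.
    inferred_copies: dict mapping sequence -> inferred copy number.
    Sequences are matched by exact identity (an assumption).
    """
    true_set = set(true_abundance)
    inferred_set = set(inferred_copies)
    false_negatives = true_set - inferred_set  # true but not recovered
    false_positives = inferred_set - true_set  # recovered but not true
    top_fn = None
    if false_negatives:
        # The most abundant true sequence that was missed.
        seq = max(false_negatives, key=lambda s: true_abundance[s])
        top_fn = (seq, true_abundance[seq])
    return {
        "matched": sorted(true_set & inferred_set),
        "false_negatives": sorted(false_negatives),
        "false_positives": sorted(false_positives),
        "most_common_false_negative": top_fn,
    }
```

The `most_common_false_negative` entry corresponds to the annotation mentioned in the caption: the missed sequence with the highest true abundance.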
