PLoS One. 2015 Jun 22;10(6):e0129059.
doi: 10.1371/journal.pone.0129059. eCollection 2015.

Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data

Gordon M Daly et al. PLoS One. 2015.

Abstract

The use of next generation sequencing (NGS) to identify novel viral sequences from eukaryotic tissue samples is challenging. Issues can include the low proportion and copy number of viral reads and the high number of contigs produced by assembly, both of which make subsequent viral analysis difficult. A comparison of assembly algorithms, combined with pre-assembly host-mapping subtraction using a short-read mapping tool, a k-mer frequency-based filter and a low-complexity filter, has been validated for viral discovery with Illumina data derived from naturally infected liver tissue and with simulated data. Assembled contig numbers were significantly reduced (by up to 99.97%) by the application of these pre-assembly filtering methods. This approach provides a validated method for maximizing viral contig size as well as reducing the total number of assembled contigs that require downstream analysis as putative viral nucleic acids.
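To make the pre-assembly filtering idea concrete, here is a minimal Python sketch of a k-mer based host filter followed by a low-complexity filter. It is an illustration only and not the authors' pipeline: the abstract does not name the tools or parameters used, so the function names, the k-mer size, the host-fraction threshold and the entropy-based complexity measure are all assumptions chosen for readability.

```python
from collections import Counter
from math import log2


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_host_index(host_seqs, k):
    """Collect the set of all k-mers seen in the host reference sequences."""
    index = set()
    for s in host_seqs:
        index.update(kmers(s, k))
    return index


def host_fraction(read, host_index, k):
    """Fraction of the read's k-mers that also occur in the host index."""
    ks = list(kmers(read, k))
    if not ks:
        return 0.0
    return sum(km in host_index for km in ks) / len(ks)


def entropy(read, k=2):
    """Shannon entropy (bits) of the read's k-mer composition.

    Used here as a stand-in low-complexity score (an assumption):
    homopolymers and short repeats score close to zero.
    """
    counts = Counter(kmers(read, k))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts.values())


def filter_reads(reads, host_index, k, max_host_fraction=0.5, min_entropy=1.5):
    """Keep reads that look neither host-derived nor low-complexity.

    Both thresholds are hypothetical defaults, not values from the paper.
    """
    kept = []
    for r in reads:
        if host_fraction(r, host_index, k) > max_host_fraction:
            continue  # mostly host k-mers: subtract
        if entropy(r) < min_entropy:
            continue  # low-complexity: subtract
        kept.append(r)
    return kept


host = ["ACGTACGTTAGCCGATAGGCTAACG"]
index = build_host_index(host, k=5)
reads = [
    "ACGTTAGCCGATAGG",  # exact substring of the host sequence
    "AAAAAAAAAAAAAAA",  # homopolymer
    "TGCATCGGATCCAGT",  # shares no 5-mer with the host
]
survivors = filter_reads(reads, index, k=5)
```

In this toy run only the third read survives. A real dataset would need a much larger k, a memory-efficient index, and thresholds tuned against known viral reads so that genuine viral sequence is not subtracted along with the host.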

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. Reports of novel animal virus species in PubMed over the last two decades.
Fig 2. Illumina host / virus read subtraction by short read mapping algorithms (n/n = % of read / % identity).
Fig 3. Host mapping subtraction by reference set.
Fig 4. K-mer filtering of metagenomics Illumina read dataset (in brackets: % host sequence read subtraction and viral read subtraction respectively).
Fig 5. Optimal word size for viral assembly with multiple assemblers.
Fig 6. Effect of viral reference coverage of Illumina reads (red text), host mapping subtraction (Map) and k-mer filtering (K-mer) on viral contig size and reference coverage (post-assembly) using different assembly algorithms (Meta = metacortex). Each assembly algorithm used at optimal k-mer size.
Fig 7. Effect of k-mer filtering (K-mer) / mapper subtraction (Map) on post-assembly contig number using multiple optimized assemblers with the HCV 9x mean coverage Illumina read dataset.
Fig 8. Effect of viral reference coverage of Illumina reads (red text), host mapping subtraction (Map), k-mer filtering (k-mer) and low-complexity filtering (LC) on viral contig size and reference coverage (post-assembly) using CLC assembler (v.6) at optimal word size.
Fig 9. Effect of k-mer filtering (k-mer) / mapper subtraction (Map) and low-complexity filtering (LC) on post-assembly contig numbers.
Fig 10. a) Simulated dataset: effect of k-mer filtering (K-mer) & host mapping subtraction (Map) on viral contig size and reference coverage. b) Simulated dataset: effect of pre-assembly read filters on post-assembly N25-N90 (methods).
Fig 11. Human viral simulated dataset: effect of k-mer filtering (K-mer) & host mapping subtraction (Map) on post-assembly contig number.
Fig 12. Idiopathic hepatitis liver datasets: a) Read reduction following mapping subtraction and k-mer similarity filtering. b) Effect of k-mer filtering (K-mer) & host mapping subtraction (Map) on post-assembly contig number. c) Effect of k-mer filtering (K-mer) & host mapping subtraction (Map) on viral contig size and reference coverage.
Fig 13. SURPI assembled contigs comparison: a) contig coverage of viral references (artificial metagenomics viral dataset), range and mean. SURPI SD = 28, Mapper SD = 15.5, MAP+k-mer SD = 25.9. b) HCV viral infected liver tissue NGS datasets at 9x and 0.7x coverage with largest viral assembled contig (blue) and total viral reference coverage of all contigs (red).
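
Fig 10b reports post-assembly N25-N90 values, but the methods that define them are not part of this record. The sketch below assumes the conventional definition of Nx: the contig length L such that contigs of length L or longer together contain at least x% of all assembled bases. The function name `nx` is hypothetical.

```python
def nx(contig_lengths, x):
    """Return the Nx statistic for a list of contig lengths.

    Assumes the conventional definition: walk the contigs from longest
    to shortest and return the length at which the running total first
    reaches x percent of the summed contig length.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= total * x:
            return length
    return 0  # no contigs supplied


lengths = [100, 80, 60, 40, 20]  # toy assembly, 300 bases in total
n25, n50, n90 = nx(lengths, 25), nx(lengths, 50), nx(lengths, 90)
```

For the toy assembly this gives N25 = 100, N50 = 80 and N90 = 40. Removing short spurious contigs shrinks the total without touching the long ones, which is one way pre-assembly filtering can move these values.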

