Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Aug 13;10(8):1026.
doi: 10.3390/pathogens10081026.

HPV DeepSeq: An Ultra-Fast Method of NGS Data Analysis and Visualization Using Automated Workflows and a Customized Papillomavirus Database in CLC Genomics Workbench

Affiliations

HPV DeepSeq: An Ultra-Fast Method of NGS Data Analysis and Visualization Using Automated Workflows and a Customized Papillomavirus Database in CLC Genomics Workbench

Jane Shen-Gunther et al. Pathogens. .

Abstract

Next-generation sequencing (NGS) has actualized the human papillomavirus (HPV) virome profiling for in-depth investigation of viral evolution and pathogenesis. However, viral computational analysis remains a bottleneck due to semantic discrepancies between computational tools and curated reference genomes. To address this, we developed and tested automated workflows for HPV taxonomic profiling and visualization using a customized papillomavirus database in the CLC Microbial Genomics Module. HPV genomes from Papilloma Virus Episteme were customized and incorporated into CLC "ready-to-use" workflows for stepwise data processing to include: (1) Taxonomic Analysis, (2) Estimate Alpha/Beta Diversities, and (3) Map Reads to Reference. Low-grade (n = 95) and high-grade (n = 60) Pap smears were tested with ensuing collective runtimes: Taxonomic Analysis (36 min); Alpha/Beta Diversities (5 s); Map Reads (45 min). Tabular output conversion to visualizations entailed 1-2 keystrokes. Biodiversity analysis between low- (LSIL) and high-grade squamous intraepithelial lesions (HSIL) revealed loss of species richness and gain of dominance by HPV-16 in HSIL. Integrating clinically relevant, taxonomized HPV reference genomes within automated workflows proved to be an ultra-fast method of virome profiling. The entire process named "HPV DeepSeq" provides a simple, accurate and practical means of NGS data analysis for a broad range of applications in viral research.

Keywords: HPV genotyping; bioinformatics; cervical cancer; deep sequencing; human papillomavirus; metagenome; next generation sequencing; taxonomic classification; virome.

PubMed Disclaimer

Conflict of interest statement

The Defense Health Agency (DHA) of the U.S. Department of Defense has licensed the customized HPV database described herein to QIAGEN Digital Insights. The inventor of the customized taxonomy is J.S.-G. No potential conflicts of interest were disclosed by the other authors. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Figures

Figure 1
Figure 1
Taxonomic profiling results based on the HPV Reference Index. (A) Abundance of HPV genotypes found in individual LSIL and HSIL samples are shown as stacked bars. HPV type-specific carcinogenicity (carcinogenic, possibly carcinogenic, and not carcinogenic) are colorized in shades of red, blue, and green, respectively (legend). Deep sequencing of HPV E6/E7 amplicons derived from each LSIL (n = 95) or HSIL (n = 60) sample identified 32 unique HPV genotypes with the top 20 shown (legend) and quantitated their composition (%) based on abundance (n) of mapped reads to total mapped reads. (B) HPV genotype composition of samples grouped by cytological grade i.e., LSIL and HSIL. Visualization and comparison of grouped samples (stacked bars) revealed the dominant genotypes in LSIL and HSIL as HPV-39 (46%) and HPV-16 (69%), respectively with significant changes in proportional composition (Baggerley’s test, * p-value (Bonferroni) < 0.001). (C) Sunburst plots visualize hierarchical data outwardly from parent to child nodes. Here, sunburst plots reveal distinct differences in the HPV communities according to seven taxonomic ranks, specifically, the last two ranks (genus/species and genotypes) between LSIL and HSIL.
Figure 2
Figure 2
Diversity analysis of HPV genotypes in LSIL and HSIL samples. (A) Alpha diversity rarefaction curves estimate the indices for HPV genotypes in LSIL (n = 95) and HSIL (n = 60) samples. (B) Summary statistics are shown as boxplots after categorical grouping. A total of 29 unique genotypes were found in LSIL versus 22 genotypes for HSIL. Species richness measured by Simpson’s index showed a reduction from LSIL (median = 0.146) to HSIL (median = 0.062) samples (Mann-Whitney test, * p < 0.001). The respective Shannon indices for LSIL (median = 0.454) and HSIL (median = 0.215) were indicative of reduced diversity with disease progression (Mann-Whitney test, * p < 0.001).
Figure 3
Figure 3
HPV community structures between LSIL and HSIL. 3D-Principal Coordinate Analysis (PCoA) plots of samples (n = 155) in distinct colors before (A) and after (B) grouping by cytological category. After grouping by LSIL and HSIL, HPV-16, -35, -39 were identified as the three most influential genotypes (vectors) in both HPV communities. Dissimilarity between the two HPV communities was HPV-16 (PCoA 1) being the most influential genotype in HSIL versus HPV-39 for LSIL (PCoA 3). β-diversity was measured by Bray-Curtis index (PERMANOVA, p < 0.05).
Figure 4
Figure 4
Clustered heat map of HPV abundance among LSIL/HSIL samples. (A) The heat map represents two-way hierarchal clustering of HPV over LSIL/HSIL samples (n = 155) and clustering of LSIL/HSIL samples over HPV. The dissimilarity measure i.e., Pearson’s distance (1- | Pearson correlation|) quantifies the dissimilarity in the variables of interest i.e., HPV type-specific abundance between individual samples. The agglomerative clustering method identifies the originating, most similar pair of clusters (gray shade). In this dataset, single (pure) infections of HPV-16 in HSIL and HPV-39 in LSIL of high abundance (rectangle) are the closest clusters. From this point, cluster divergence toward the left part of the heat map reveals increasingly, heterogeneous HPV infections predominantly in LSIL samples. The color scale shows Pearson’s distance between 0 (black) and 2 (green) indicating similar and dissimilar correlation coefficients, respectively. (B) Aggregated heat map of LSIL (n = 95) and HSIL (n = 60) samples reveal dissimilar, groupwise HPV profiles useful for data simplification and taxonomy development. The prevailing (abundant) genotypes for LSIL or HSIL are shown in red.
Figure 5
Figure 5
Read Mapping. (A) The HPV E6/E7 (660 bp) gene segment highlighted in blue on the circular prototypical HPV-16 genome (GenBank ID: K02718) is the target used for amplicon sequencing and genotyping. The Map Reads to Reference workflow output displays the reads mapped on to the linearized HPV reference genome. Zooming in from the whole genome window (top) allows viewing of the sequences down to the nucleotide level (bottom). The color-coding legend defines the corresponding read types and nucleotide mismatches. (B) NGS paired-sequence file size for each of the 155 study samples. The bar chart reveals the extent of file size variation between samples. The median () was 58.5 MB (range, 8.9–208.5). (C) Scatterplot between NGS paired-sequence file size (MB) and merged sequences (n) for the 155 study samples showed near-perfect linear correlation (R2 = 0.9945). The regression line (merged sequences = 3415 + 3868 × file size) is shown as (). (D) Merged sequences (n) and reads mapping time (s) for the study cohort (n = 155) were highly correlated (R2 = 0.7233) as shown by the scatterplot and regression line (mapping time = 5.7 + 4.44 × 10−5 × merged sequences) (). The equations above may be used jointly or independently to estimate total mapping time based on file size or number of sequences.
Figure 6
Figure 6
CLC Workflows, Tools, and Databases. (A) Microbial Genomics Module containing ready-to-use workflows and tools for abundance analysis (left). Customized HPV databases adapted from the PapillomaVirus Espiteme (PaVE) reference genomes (n = 219) for use in CLC (right). (B) Data QC and Taxonomic Profiling workflow incorporating the HPV Reference Index (● HPV) produces QC reports and HPV abundance tables from NGS reads of clinical samples. The Abundance Table output file (formula image) is utilized as the input file for downstream diversity analysis shown in (C). (C) Merge and Estimate Alpha and Beta Diversities workflow generates diversity plots and statistical results (left). Map Reads to Reference workflow incorporating the HPV Sequence List (● HPV) generates an alignment map of the reads on the reference HPV genomes (right).

Similar articles

Cited by

References

    1. Mammas I.N., Spandidos D.A. Four historic legends in human papillomaviruses research. J. BUON. 2015;20:658–661. - PubMed
    1. Durst M., Gissmann L., Ikenberg H., Hausen H.Z. A papillomavirus DNA from a cervical carcinoma and its prevalence in cancer biopsy samples from different geographic regions. Proc. Natl. Acad. Sci. USA. 1983;80:3812–3815. doi: 10.1073/pnas.80.12.3812. - DOI - PMC - PubMed
    1. Hausen H.Z. Cancers in humans: A lifelong search for contributions of infectious agents, autobiographic notes. Annu. Rev. Virol. 2019;6:1–28. doi: 10.1146/annurev-virology-092818-015907. - DOI - PubMed
    1. Javier R.T., Butel J.S. The history of tumor virology. Cancer Res. 2008;68:7693–7706. doi: 10.1158/0008-5472.CAN-08-3301. - DOI - PMC - PubMed
    1. International Agency for Research on Cancer . Monographs on the Evaluation of Carcinogenic Risks to Humans-Human Papillomaviruses. World Health Organization; Geneva, Switzerland: 2012. pp. 255–313.