Genome Res. 2019 May;29(5):819-830.
doi: 10.1101/gr.242529.118. Epub 2019 Mar 14.

A virome-wide clonal integration analysis platform for discovering cancer viral etiology


Xun Chen et al. Genome Res. 2019 May.

Abstract

Oncoviral infection is responsible for 12%-15% of cancer in humans. Convergent evidence from epidemiology, pathology, and oncology suggests that new viral etiologies for cancers remain to be discovered. Oncoviral profiles can be obtained from cancer genome sequencing data; however, widespread viral sequence contamination and noncausal viruses complicate the process of identifying genuine oncoviruses. Here, we propose a novel strategy to address these challenges by performing virome-wide screening of early-stage clonal viral integrations. To implement this strategy, we developed VIcaller, a novel platform for identifying viral integrations that are derived from any characterized viruses and shared by a large proportion of tumor cells using whole-genome sequencing (WGS) data. The sensitivity and precision were confirmed with simulated and benchmark cancer data sets. By applying this platform to cancer WGS data sets with proven or speculated viral etiology, we newly identified or confirmed clonal integrations of hepatitis B virus (HBV), human papillomavirus (HPV), Epstein-Barr virus (EBV), and BK Virus (BKV), suggesting the involvement of these viruses in early stages of tumorigenesis in affected tumors, such as HBV in TERT and KMT2B (also known as MLL4) gene loci in liver cancer, HPV and BKV in bladder cancer, and EBV in non-Hodgkin's lymphoma. We also showed the capacity of VIcaller to identify integrations from some uncharacterized viruses. This is the first study to systematically investigate the strategy and method of virome-wide screening of clonal integrations to identify oncoviruses. Searching clonal viral integrations with our platform has the capacity to identify virus-caused cancers and discover cancer viral etiologies.


Figures

Figure 1.
Discovering oncovirus candidates through identification of clonal viral integrations. (A) Identification of clonal viral integrations to eliminate viral sequence contaminations and to prove the involvement of the identified virus in the early stages of tumorigenesis. (B) Composition of VIcaller virome-wide genome reference library. (C) The simplified analytic workflow of VIcaller.
Figure 2.
Applying VIcaller to simulation data sets. (A) Detection power and precision were measured by simulated (germline) viral integrations with depths from 1× to 150×. (B) Detection power for integrations with 5%, 25%, and 50% integration allele fractions. On average, 86 viral integrations were used for the calculation. (C) Accuracy of calculated integration allele fractions. (D) Relationship between detection power and insert sizes for paired-end sequence reads at different sequencing depths. (E) Relationship between detection power and lengths of integrated viral sequences. The viral integrations detected under different sequencing depths were combined for the calculation. Comparison of the detection power of VIcaller with existing tools for detecting 10 simulated HPV integrations (F) and 90 simulated virome-wide integrations (G). VirusSeq was only capable of detecting fewer than 20 human viruses; thus, the detection power was extremely low. It also ran out of server wall time at 60× sequencing depth. VirusFinder and Virus-Clip were not applicable for analyzing data containing the virome-wide integrations.
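The caption does not define detection power or precision. A minimal sketch, assuming the standard definitions (power as the fraction of simulated integrations that are recovered, precision as the fraction of calls that match a simulated integration); the function name and the example counts are hypothetical and are not results from the study:

```python
def power_and_precision(true_positives, false_negatives, false_positives):
    """Return (detection power, precision) from integration call counts.

    Assumed definitions (not stated in the caption):
      power     = TP / (TP + FN)  -- share of simulated integrations detected
      precision = TP / (TP + FP)  -- share of calls that are real integrations
    """
    simulated = true_positives + false_negatives
    called = true_positives + false_positives
    power = true_positives / simulated if simulated else 0.0
    precision = true_positives / called if called else 0.0
    return power, precision


# Hypothetical counts for illustration only.
power, precision = power_and_precision(
    true_positives=80, false_negatives=6, false_positives=2
)
print(f"power={power:.3f}, precision={precision:.3f}")
```

Returning 0.0 for an empty denominator is a design choice of this sketch so that a run with no simulated integrations or no calls does not raise an error.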
Figure 3.
Virome-wide integrations detected in liver and cervical cancer genome data sets. (A) Comparison of the number of integration events identified in a metastatic cervical carcinoma sample by our VIcaller approach (light blue) and the HPV-specific approach of the original study (Liang et al. 2014) (light gray). (B) Sanger sequencing result for one example of the three HPV-18 integrations, newly detected by VIcaller, that existed in the tumor but not in the paired normal tissue. A 16-bp deletion on the human genome was found at the integration breakpoint. (C) Comparison of the number of HBV-human fusion transcripts identified in three HCC cell lines by the VIcaller virome-wide approach (light blue) and the HBV-specific approach described in the original study (light gray) (Lau et al. 2014). (D) Gel images from RT-PCR validation of the six fusion transcripts newly detected by VIcaller. (E) Sanger sequencing result of an example breakpoint of the six newly identified fusion transcripts. (F) Comparison of the number of HBV integration breakpoints identified in 88 HCC samples by our VIcaller virome-wide approach (light blue) and the HBV-specific approach described in the original study (light gray) (Sung et al. 2012). (G) Sequence read alignment of an HBV integration in the HCG2032978 gene, newly identified by VIcaller, which existed in the tumor but not in the paired normal tissue. Seven chimeric and seven split reads at the upstream breakpoint and four chimeric reads at the downstream breakpoint were found for this integration event. The integrated HBV sequence is ∼808 bp in length, starting from 3170 bp to 3182 bp, and then from 1 bp to ∼796 bp on the circular HBV genome. Black and red represent reads mapped to the human (hg19) and HBV (NC_003977.2) reference genomes, respectively. 
(H) Sequence read alignment of an adeno-associated virus 6 (AAV-6; AF028704.1) integration event detected by VIcaller (sample ID: 55T) that existed in the tumor but not in the paired normal tissue. Eight chimeric and five split reads were found across the two breakpoints. This integration is 212 bp in length, from 54 bp to 266 bp on the AAV-6 genome.
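The lengths quoted in panels G and H follow from the breakpoint coordinates once the circular HBV genome is taken into account. A minimal sketch, assuming a genome length of 3182 bp for NC_003977.2 (inferred from the caption's wrap point at 3182 bp) and end-minus-start counting; the function name is hypothetical:

```python
def integrated_length(start, end, circular_genome_length=None):
    """Length of an integrated viral segment from its viral coordinates.

    For a linear genome the length is end - start. For a circular genome
    the segment may run past the last base and continue from position 1,
    so the difference is taken modulo the genome length.
    """
    if circular_genome_length is None:
        return end - start
    return (end - start) % circular_genome_length


# Panel G: HBV segment running 3170 -> 3182, then 1 -> ~796.
# 3182 bp is an assumption read off the caption, not a value given as such.
hbv_length = integrated_length(3170, 796, circular_genome_length=3182)

# Panel H: AAV-6 segment from 54 bp to 266 bp, no wrap-around.
aav_length = integrated_length(54, 266)

print(hbv_length, aav_length)
```

Under these assumptions the sketch gives 808 bp for the HBV segment and 212 bp for the AAV-6 segment, in line with the ∼808 bp and 212 bp stated in the caption.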
Figure 4.
Characteristics of HBV integrations identified in tumors. (A,B) Sites of HBV integrations in two oncogenes: (A) TERT; (B) MLL4. (C) The integrated HBV sequences in TERT. The solid red lines above the HBV genome represent the integrated sequences with both breakpoints identified, whereas the dotted lines represent those with only one breakpoint identified. The HBV genes are in gray, and the promoters and enhancers are in red. (D) Comparison of integration allele fractions among HBV integrations in TERT, MLL4, and other chromosomal regions. (E) Integration allele fraction comparison of all HBV integrations in the samples with integrations in TERT. The top shows the highest integration allele fraction in each sample. The bottom left shows all HBV integrations in each sample, including those in TERT, and other regions (except MLL4). The bottom right shows the violin plot distributions of allele fractions of integrations in TERT compared to those in other regions (except MLL4). The result for MLL4 is shown in Supplemental Figure S9.
Figure 5.
Identifying oncovirus candidates with integrations in bladder cancer, diffuse large B-cell lymphoma, and gastric adenocarcinoma samples. (A) Summary of identified integrations: The Cancer Genome Atlas (TCGA) Urothelial Bladder Carcinoma (BLCA); TCGA Stomach Adenocarcinoma (STAD); The Cancer Genome Characterization Initiative Diffuse Large B-Cell Lymphoma (DLBCL). (B) Sequence read alignment of a BKV (AB485698.1) integration event with 39% integration allele fraction detected in a bladder cancer sample TCGA-DK-A3IT. A total of 23 supporting read pairs, including 19 chimeric and four split reads, were found crossing the two breakpoints, supporting an integration, and 18 read pairs that support no integration were fully mapped to the human reference genome. (C) Sequence read alignment of an HPV-45 (EF202163.1) integration event with 30% integration allele fraction detected in a bladder cancer sample TCGA-BT-A20V. A total of 12 supporting read pairs, including 11 chimeric reads and one split read, were found crossing the two breakpoints, supporting an integration, and 14 read pairs were fully mapped to the human reference genome, supporting no integration. (D) Sequence read alignment of an EBV (AB828191.1) integration event with 18.6% integration allele fraction detected in a diffuse large B-cell lymphoma sample 09-33003. A total of 22 supporting read pairs, including 18 chimeric and four split reads, were found crossing the two breakpoints, supporting an integration, and 48 read pairs that support no integration were fully mapped to the human reference genome.
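The integration allele fractions quoted in panels B to D can be reproduced from the read-pair counts in the caption if the supporting read pairs are halved (that is, averaged over the two breakpoints) before being compared with the read pairs that support no integration. This is an inference from the numbers in the caption, not a formula stated in this text, and the function name is hypothetical:

```python
def integration_allele_fraction(supporting_pairs, non_supporting_pairs):
    """Estimate the fraction of alleles carrying the integration.

    Assumption: supporting read pairs are counted across two breakpoints,
    so they are halved to get a per-breakpoint count before being compared
    with read pairs that map fully to the human reference.
    """
    per_breakpoint = supporting_pairs / 2
    total = per_breakpoint + non_supporting_pairs
    return per_breakpoint / total if total else 0.0


# Read-pair counts taken from the caption (panels B, C, and D).
bkv = integration_allele_fraction(23, 18)    # TCGA-DK-A3IT
hpv45 = integration_allele_fraction(12, 14)  # TCGA-BT-A20V
ebv = integration_allele_fraction(22, 48)    # 09-33003

print(f"BKV {bkv:.1%}, HPV-45 {hpv45:.1%}, EBV {ebv:.1%}")
```

With the caption's counts, the sketch yields about 39%, 30%, and 18.6%, matching the fractions reported for the three samples.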
Figure 6.
Viruses and integration events detected after removing the target viral genomes from our virome-wide database. (A) Percentage of simulated integrations detected after removing the HPV-18 (left) or MCV (right) references. (B) Integration allele fraction detected after removing the HPV-18 (left) or MCV (right) references. Three detected integration events that had >50% fractions are not shown in the figure, including two among the 91 HPV-18 events, and one among the 87 MCV events.

