Nucleic Acids Res. 2018 Apr 20;46(7):3309-3325.
doi: 10.1093/nar/gky180.

ViFi: accurate detection of viral integration and mRNA fusion reveals indiscriminate and unregulated transcription in proximal genomic regions in cervical cancer

Nam-Phuong D Nguyen et al. Nucleic Acids Res. 2018.

Abstract

The integration of viral sequences into the host genome is an important driver of tumorigenesis in many virus-mediated cancers, notably cervical cancer and hepatocellular carcinoma. We present ViFi, a computational method that combines phylogenetic methods with reference-based read mapping to detect viral integrations. In contrast with read-based reference mapping approaches, ViFi is faster, and shows high precision and sensitivity on both simulated and biological data, even when the integrated virus is a novel strain or highly mutated. We applied ViFi to matched genomic and mRNA data from 68 cervical cancer samples from TCGA and found high concordance between the two. Surprisingly, viral integration resulted in a dramatic transcriptional upregulation in all proximal elements, including LINEs and LTRs that are not normally transcribed. This upregulation is highly correlated with the presence of a viral gene fused with a downstream human element. Moreover, genomic rearrangements suggest the formation of apparent circular extrachromosomal (ecDNA) human-viral structures. Our results suggest the presence of apparent small circular fusion viral/human ecDNA, which correlates with indiscriminate and unregulated expression of proximal genomic elements, potentially contributing to the pathogenesis of HPV-associated cervical cancers. ViFi is available at https://github.com/namphuon/ViFi.

Figures

Figure 1.
Overview of integration detection process. Integration detection is split into two phases. In the (A) pre-processing step, a BWA index is created from the human reference genome and input viral genomes (Hg19+viral). In addition, a multiple sequence alignment is estimated from the viral genomes, and a maximum likelihood tree is estimated from the alignment. The alignment is decomposed into an ensemble of profile Hidden Markov models. In the (B) viral detection step, the paired-end reads are mapped against the Hg19+viral index. Candidate paired-end reads are selected if (i) one end of the read maps to the human genome and the other end maps to a viral genome, or (ii) one end of the read maps to the human genome and the other end scores high against the HMM ensemble. All other reads are discarded. The integration point is then inferred from the set of candidate reads.
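The candidate selection rule in panel (B) can be sketched in Python as follows. This is a minimal illustration and not ViFi's actual code: the `Mate` record, the `hmm_score` field (assumed to be precomputed against the HMM ensemble), and the `hmm_threshold` cutoff are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Mate:
    """One end of a paired-end read (hypothetical, simplified record)."""
    ref: Optional[str]      # reference the mate mapped to, or None if unmapped
    hmm_score: float = 0.0  # assumed: best score against the HMM ensemble


def is_candidate(m1: Mate, m2: Mate, viral_refs: Set[str],
                 hmm_threshold: float) -> bool:
    """Return True if the pair satisfies rule (i) or rule (ii)."""
    for human, other in ((m1, m2), (m2, m1)):
        # The anchoring end must map to the human genome.
        if human.ref is None or human.ref in viral_refs:
            continue
        # (i) the other end maps to a known viral genome.
        if other.ref in viral_refs:
            return True
        # (ii) the other end scores high against the HMM ensemble.
        if other.hmm_score >= hmm_threshold:
            return True
    return False
```

Pairs in which both ends are human, both ends are viral, or the non-human end scores below the cutoff are discarded under this sketch.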
Figure 2.
Algorithm for generating the ensemble of HMMs. The input is an initial multiple sequence alignment and a maximum likelihood tree that has been estimated from the multiple sequence alignment. The algorithm begins by adding the HMM built on the multiple sequence alignment to the ensemble. If the multiple sequence alignment has >10 sequences, the maximum likelihood tree is decomposed into two subtrees by deleting the centroid edge (i.e. the edge that produces a maximally balanced split of the sequence set into two sets). The subtrees are used to generate induced alignments. HMMs are built for each induced alignment and added to the ensemble. The process iterates on those subtrees that meet the criterion for decomposition (subset size >10).
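The recursive decomposition can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the tree is given as a plain adjacency dictionary, leaves are identified by membership in a `taxa` set, and the function returns the sequence subsets on which HMMs would be built rather than building the HMMs themselves. All function names are hypothetical. Restricting the leaf set while reusing the full tree yields the same bipartitions as working on the induced subtree.

```python
def split_side(adj, start, blocked, taxa):
    """Taxa reachable from `start` once the edge (start, blocked) is deleted."""
    seen, stack, side = {start, blocked}, [start], set()
    while stack:
        node = stack.pop()
        if node in taxa:
            side.add(node)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return side


def centroid_split(adj, taxa):
    """Delete the edge giving the most balanced split of `taxa` into two sets."""
    best = None
    for u in adj:
        for v in adj[u]:
            side = split_side(adj, u, v, taxa)
            rest = taxa - side
            if side and rest:  # skip edges that do not separate any taxa
                imbalance = abs(len(side) - len(rest))
                if best is None or imbalance < best[0]:
                    best = (imbalance, side, rest)
    return best[1], best[2]


def ensemble_subsets(adj, taxa, max_size=10):
    """Sequence subsets that would each receive an HMM in the ensemble."""
    subsets = [frozenset(taxa)]          # HMM on the current alignment
    if len(taxa) > max_size:             # decomposition criterion
        left, right = centroid_split(adj, set(taxa))
        subsets += ensemble_subsets(adj, left, max_size)
        subsets += ensemble_subsets(adj, right, max_size)
    return subsets
```

With the caption's threshold (`max_size=10`), every subset in the returned list corresponds to one induced alignment, and only subsets with more than 10 sequences are split further.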
Figure 3.
ViFi performance on simulated datasets. Comparison of ViFi, Virus-Finder, and VERSE on simulated datasets where (A) the coverage is fixed at 25× and the number of integrations is 10, 25, 50 or 100, and (B) the number of integrations is fixed at 10 and the coverage is 5×, 10× or 25×. Each simulation has four model conditions. The first three model conditions (easy, medium, and hard) vary the percent similarity of simulated HPV16 genomes to the reference HPV16 genome, with five replicates per simulation. The last model condition uses Alouatta guariba papillomavirus 1 (AgPV1), a PV genome not included in the set of viral genomes, to simulate integration of a novel HPV. AgPV1 is 44% similar to HPV16. Random noise (drawn from a uniform distribution between -0.01 and 0.01) was added to each point because points often directly overlap each other. VERSE was unable to detect integrations or terminated early on two easy cases, one medium case, 22 hard cases, and all the AgPV1 datasets; these results are excluded from the figure. (C) The mean wall clock running time (in hours) as a function of the number of integrations (top) and as a function of the coverage (bottom). All methods were run on a machine with 24 cores for a maximum of 48 wall clock hours (1152 total core hours). Only runs that report integrations were included.
Figure 4.
Comparison of ViFi on fusion event detection. (A) Venn diagram of the overlap of the WGS integration points with a matching mRNA event within 100 kb reported by ViFi, VERSE, and the Tang et al. (2013) study on the TCGA-CESC samples with matched RNA-seq and WGS data. (B) Comparison of the fusion events detected by ViFi and The Cancer Genome Atlas Research Network 2017 study. Fusions are considered to have WGS support if ViFi detected a genomic integration within a 100 kb region of the fusion event.
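The WGS-support criterion can be illustrated with a short sketch. It assumes that "within a 100 kb region" means the integration lies on the same chromosome at most 100,000 bp from the fusion junction; the `(chromosome, position)` tuple representation and the function name are hypothetical.

```python
def has_wgs_support(fusion, integrations, window=100_000):
    """True if any genomic integration lies within `window` bp of the fusion.

    `fusion` and each entry of `integrations` are (chromosome, position)
    tuples; this representation is an assumption made for illustration.
    """
    chrom, pos = fusion
    return any(c == chrom and abs(p - pos) <= window for c, p in integrations)
```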
Figure 5.
Characterization of genomic integration sites and fusion mRNA. (A) Density plot of the distance of fusion mRNA junction to the nearest WGS integration breakpoint. (B) Number of annotated types covered by WGS or RNA-seq reads across all integration regions. The points give the number of specific functional annotations (e.g. LINE) across all 181 integration regions in the TCGA-CESC data set that are partially covered by at least three reads. Blue represents results from WGS data, and red represents results from RNA-seq data. The violin plots show the distribution of the total number of specific annotations across 1000 replicates that are partially covered by at least three reads, where each replicate is a collection of 181 randomly chosen intervals. The P-values of the observed annotation counts (Z-test) are all statistically significant for the RNA-seq data (P-value < 10^-20), but for the WGS data only the SINE elements (P-value < 10^-8) and genes (P-value < 10^-7) were enriched in a statistically significant manner.
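The Z-test against the randomized replicates can be sketched as follows. The caption does not state whether the test is one- or two-sided; this sketch assumes a one-sided (enrichment) test and a normal approximation to the null counts, and the function name is hypothetical.

```python
import math
import statistics


def enrichment_z_test(observed, null_counts):
    """Z-score and one-sided P-value of `observed` against null replicates.

    `null_counts` holds the annotation count from each random replicate
    (1000 replicates of 181 intervals in the figure).
    """
    mu = statistics.mean(null_counts)
    sd = statistics.stdev(null_counts)           # sample standard deviation
    z = (observed - mu) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))        # upper tail: 1 - Phi(z)
    return z, p
```

An observed count far above the replicate mean gives a large Z-score and a correspondingly small P-value.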
Figure 6.
Impact of viral integration on proximal transcription. For each integration, we compare the expression change in the 10 kb genomic interval around an integration in a sample to the mean expression change for the same 10 kb genomic interval for all other samples without the integration. (A) The distribution of log2-fold change in expression of human mRNA between segments with and without integrations, separated by whether the integration produces fusion mRNA or is a fusionless integration. The dashed line represents the geometric mean value of the distribution. (B) The -log(P-value) for expression change for integrations that produce fusion mRNA and fusionless integrations (see Materials and Methods for description of P-value computation). Each point on the x-axis corresponds to a distinct genomic fusion segment sorted by increasing P-value. The red dashed line denotes the threshold beyond which the samples do not show a significant change in expression (P-value > 0.05 after FDR correction).
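The log2-fold change in panel (A) can be illustrated with a small sketch that compares one sample's expression in an interval with the mean over samples lacking the integration. The pseudocount, used to avoid division by zero, is an assumption of this sketch and is not taken from the paper; the function name is hypothetical.

```python
import math


def log2_fold_change(sample_expr, other_exprs, pseudocount=1.0):
    """log2 ratio of a sample's expression to the mean of the other samples.

    `other_exprs` holds expression values for the same 10 kb interval in
    samples without the integration. The pseudocount is an assumption.
    """
    baseline = sum(other_exprs) / len(other_exprs)
    return math.log2((sample_expr + pseudocount) / (baseline + pseudocount))
```

For example, `log2_fold_change(7.0, [1.0, 1.0])` compares (7 + 1) with (1 + 1) and returns 2.0, a four-fold increase.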
Figure 7.
Expression of human segments upstream and downstream of the integrated viral gene. (A) Expression fold-change within an integration region and (B) percent of samples in which the position in the integration region has a higher FPKM than its FPKMUQ. The blue line represents an integrated virus, with the arrow representing the direction of transcription of the viral genome, and the red line represents the human genome. An integration is denoted as ‘fusionless’ when it does not contain a mapped chimeric (viral-human) mRNA; otherwise, it is denoted as ‘simple’ when it is the only integration within a 10 kb window and at least 75% of the chimeric paired-end reads supporting a fusion mRNA event are oriented in the same direction relative to the viral gene. All other regions are denoted ‘complex’. The position is reported relative to the integration point in the human genome, with negative positions being upstream of the viral gene and positive positions being downstream of the viral gene. In total, there are 68 simple integrations, 51 complex integrations, and 107 integration events with no fusion mRNA sequences. We observe a large increase in expression downstream of simple integrations and across the entire region of complex integrations, but no increase in fusionless integrations.
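The three-way labelling described in the caption can be written as a short decision function. This is a sketch of the stated rule only; the argument names are hypothetical, and the read counts are assumed to have been tallied by orientation relative to the viral gene beforehand.

```python
def classify_integration(same_dir_pairs, opposite_dir_pairs,
                         integrations_in_window):
    """Label an integration as 'fusionless', 'simple' or 'complex'.

    same_dir_pairs / opposite_dir_pairs: chimeric RNA-seq read pairs
    supporting a fusion, split by orientation relative to the viral gene.
    integrations_in_window: integrations in the 10 kb window, including
    this one.
    """
    total = same_dir_pairs + opposite_dir_pairs
    if total == 0:
        return "fusionless"          # no chimeric viral-human mRNA
    dominant = max(same_dir_pairs, opposite_dir_pairs) / total
    if integrations_in_window == 1 and dominant >= 0.75:
        return "simple"
    return "complex"
```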
Figure 8.
Proposed apparent ecDNA structure for an integration from TCGA-C5-A0TN. The joined segments are chr2:195,586,245-195,603,512, chr3:126,826,267-126,849,186, and HPV16:0-7,905. There are 235 chimeric paired-end reads between chr2 and HPV16, 149 discordant paired-end reads between chr2 and chr3, and 229 chimeric paired-end reads between chr3 and HPV16. The genomic coverage fold amplification of the region relative to the average genomic coverage of the entire genome is shown in blue, and the mRNA coverage of the region is shown in red. The FPKM fold change for the human mRNA in this region for this sample is 7200×. LINE and LTR elements are highlighted in teal and gold. The viral genes are highlighted in light green. The viral genome is not complete and has a deletion of the E2 region. The assembled fusion transcript from this region is shown in the figure.

