BMC Bioinformatics. 2017 Mar 14;18(Suppl 3):71.
doi: 10.1186/s12859-017-1470-x.

BATVI: Fast, sensitive and accurate detection of virus integrations

Chandana Tennakoon et al. BMC Bioinformatics. 2017.

Abstract

Background: The study of virus integrations in the human genome is important because virus integrations have been shown to be associated with diseases. A few methods in the literature predict virus integrations from next-generation sequencing datasets. Although they work, they are slow and not very sensitive.

Results and discussion: This paper introduces a new method, BatVI, to predict viral integrations. Our method first uses a fast screening step to select candidate chimeric reads that may contain viral integrations. Next, sensitive alignments of these candidate chimeric reads are computed with BLAST. Chimeric reads that are co-localized in the human genome are then clustered. Finally, by assembling the chimeric reads in each cluster, high-confidence virus integration sites are extracted.
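The clustering step described above can be illustrated with a minimal sketch. It is not BatVI's implementation: the `HumanHit` record, the `cluster_colocalized` function, the grouping by chromosome and strand, and the `max_gap` threshold of 500 bp are all assumptions chosen for illustration. The sketch sorts the human-side alignments of candidate chimeric reads and merges neighbours that lie within `max_gap` bases of the current cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HumanHit:
    """Human-side alignment of a candidate chimeric read (hypothetical record)."""
    read_id: str
    chrom: str
    start: int   # leftmost aligned reference position
    end: int     # rightmost aligned reference position
    strand: str  # "+" or "-"


def cluster_colocalized(hits: List[HumanHit], max_gap: int = 500) -> List[List[HumanHit]]:
    """Group hits on the same chromosome and strand that lie within max_gap bases.

    max_gap is an assumed threshold, not a value taken from the paper.
    """
    clusters: List[List[HumanHit]] = []
    cluster_end = 0  # rightmost position covered by the current cluster
    for hit in sorted(hits, key=lambda h: (h.chrom, h.strand, h.start)):
        same_group = bool(clusters) and (
            clusters[-1][0].chrom == hit.chrom
            and clusters[-1][0].strand == hit.strand
        )
        if same_group and hit.start - cluster_end <= max_gap:
            # Close enough to the current cluster: extend it.
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            # New chromosome, new strand, or too far away: start a new cluster.
            clusters.append([hit])
            cluster_end = hit.end
    return clusters


example_hits = [
    HumanHit("r1", "chr5", 100, 200, "+"),
    HumanHit("r2", "chr5", 250, 350, "+"),
    HumanHit("r3", "chr5", 2000, 2100, "+"),
    HumanHit("r4", "chr5", 120, 220, "-"),
    HumanHit("r5", "chr19", 100, 200, "+"),
]
example_clusters = cluster_colocalized(example_hits)
```

In this example, reads `r1` and `r2` fall into one cluster, while `r3` (too far away), `r4` (opposite strand) and `r5` (different chromosome) each form their own.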

Conclusion: We compared the performance of BatVI with the existing methods VirusFinder and VirusSeq using both simulated data and real-life datasets from liver cancer patients. BatVI ran an order of magnitude faster and predicted almost twice as many true positives as the other methods while maintaining a false positive rate below 1%. For the liver cancer datasets, BatVI uncovered novel integrations in two important genes, TERT and MLL4, that were missed by previous studies. Using gene expression data, we verified the correctness of these additional integrations. BatVI can be downloaded from http://biogpu.ddns.comp.nus.edu.sg/~ksung/batvi/index.html .

Keywords: Alignment; NGS; Viral integration.

Figures

Fig. 1
A pipeline to identify potential chimeric reads
Fig. 2
This figure illustrates the orientation of the chimeric reads when they map to the human genome. In all examples, the human-virus integration fragments are oriented so that the human reference is on the +ve strand. Panels a-c illustrate cases where the human sequence is on the 5’ side of the virus. In such cases, for each read R_i aligned to the human genome, either the whole read R_i or its prefix aligns to the +ve strand of the human genome, or only the prefix of R_i aligns to the -ve strand of the human genome. Panels d-f illustrate cases where the human sequence is on the 3’ side of the virus. In such cases, for every read R_i aligned to the human genome, either R_i or its suffix aligns to the -ve strand of the human genome, or only the suffix of R_i aligns to the +ve strand of the human genome
Fig. 3
The work flow showing how clusters are refined and breakpoints are predicted
Fig. 4
The figure shows how the human segment of a read may be left unaligned by BLAST. The black and gray lines indicate the human and viral reference genomes, respectively. The red segments are sequences originating from the viral genome and the blue segments originate from the human genome. The green segment indicates a random sequence, and the blue vertical lines indicate positions where the reference and the human segment match. In (1), although the human segment matches the reference, it is too short to be detected by BLAST. In (2), a random sequence is present in the integration and the human segment is too short to be detected by BLAST. In (3), there is no human segment at all. This may be due to an insertion or to a misalignment of the sequence. We attempt to rescue reads in cases (1) and (2) through local alignment
Fig. 5
The figure shows how the breakpoints are estimated from a cluster of reads. The red segments of a read align to the human genome (shown as a black line), and the blue segments belong to the viral genome (shown as a gray line). The solid arrows show properly aligned reads and the dashed arrows indicate reads that are aligned incorrectly. For a read cluster Ci+ (or Ci-), we take the 3’-most (5’-most) aligned position of the read cluster as the estimated human breakpoint. In (a), no read passes through the actual breakpoint, so the estimate can be off toward the 3’ side (or 5’ side), by as much as the maximum insert size span of the library. However, if there is a split read R_d (b), the exact human breakpoint can be recovered. The viral coordinate of the integration is found as follows. If a split read is available close to the estimated human breakpoint, the exact viral breakpoint can be determined (c). Otherwise, the viral mappings of the cluster Ci+ (or Ci-) are further subdivided into two clusters based on the strand of the mapping. The cluster containing the larger number of reads is considered correct. The viral breakpoints can then be estimated using a method similar to that used for the human breakpoints (d)
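The human-breakpoint rule in the Fig. 5 caption can be sketched as follows. This is only an illustration of the rule as stated in the caption, not the algorithm of Fig. 6: the `HumanAlignment` record, its `is_split` flag, the coordinate convention (positions on the +ve reference strand, with larger values toward the 3’ side), and the choice to prefer split reads when any are present are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HumanAlignment:
    """Human-side alignment of one read in a cluster (hypothetical record)."""
    start: int      # 5'-most aligned position on the +ve reference strand
    end: int        # 3'-most aligned position on the +ve reference strand
    is_split: bool  # True if the read itself spans the human-virus junction


def estimate_human_breakpoint(
    alignments: List[HumanAlignment], cluster_strand: str
) -> Tuple[int, bool]:
    """Return (breakpoint, is_exact) for a Ci+ ("+") or Ci- ("-") cluster.

    Ci+ clusters use the 3'-most aligned position, Ci- clusters the 5'-most.
    When split reads exist, only they are used and the result is flagged exact
    (an assumption made for this sketch).
    """
    if not alignments:
        raise ValueError("cluster contains no alignments")
    if cluster_strand not in ("+", "-"):
        raise ValueError("cluster_strand must be '+' or '-'")

    split_reads = [a for a in alignments if a.is_split]
    pool = split_reads if split_reads else alignments

    if cluster_strand == "+":
        position = max(a.end for a in pool)    # 3'-most aligned position
    else:
        position = min(a.start for a in pool)  # 5'-most aligned position
    return position, bool(split_reads)


plus_cluster = [HumanAlignment(100, 200, False), HumanAlignment(150, 240, False)]
estimate_without_split = estimate_human_breakpoint(plus_cluster, "+")
estimate_with_split = estimate_human_breakpoint(
    plus_cluster + [HumanAlignment(160, 251, True)], "+"
)
```

Without a split read the result is only an estimate and, as the caption notes, may be off by up to the maximum insert size span of the library. The boolean in the returned tuple lets a caller distinguish the two cases.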
Fig. 6
Algorithm showing how the breakpoints are found for Ci+ clusters
Fig. 7
The change in false positives and true positives with the number of reads used to predict an integration with BatVI. The x-axis is log-scaled. The results for the compared programs other than BatVI are shown as straight lines for clarity, but they are in fact points with x values not exceeding 0
Fig. 8
The graph shows the distribution of the distance between the exact breakpoint and the predicted breakpoint for different programs
Fig. 9
The Venn diagrams for the HBV integrations reported by BatVI, VirusFinder 2, VirusSeq and HIVID. a is the Venn diagram for 7 samples with insert size 170 bp. b is the Venn diagram for the same 7 samples with insert size 800 bp
Fig. 10
The violin plots on the left and on the right show the tumor/normal expression ratios of TERT and MLL4, respectively. For each plot, the 87 samples are partitioned into three violin plots. The second violin plot (NG) corresponds to the original samples where the HBV integrations were detected. The first violin plot (BatVI) corresponds to the extra samples where the HBV integrations were detected. The third violin plot (nil) corresponds to the samples with no HBV integration detected
