Genome Med. 2015 Dec 7;7:127.
doi: 10.1186/s13073-015-0251-2.

ScanIndel: a hybrid framework for indel detection via gapped alignment, split reads and de novo assembly

Rendong Yang et al. Genome Med. 2015.

Abstract

Comprehensive identification of insertions/deletions (indels) across the full size spectrum from second-generation sequencing is challenging because of the relatively short read length inherent in the technology. Different indel calling methods exist, but each is limited to specific indel sizes and varies in accuracy and resolution. We present ScanIndel, an integrated framework for detecting indels with multiple heuristics, including gapped alignment, split reads and de novo assembly. Using simulation data, we demonstrate ScanIndel's superior sensitivity and specificity relative to several state-of-the-art indel callers across various coverage levels and indel sizes. ScanIndel yields higher predictive accuracy with lower computational cost compared with existing tools for both targeted resequencing data from tumor specimens and high-coverage whole-genome sequencing data from the human NIST standard NA12878. Thus, we anticipate ScanIndel will improve indel analysis in both clinical and research settings. ScanIndel is implemented in Python and is freely available for academic use at https://github.com/cauyrd/ScanIndel.

Figures

Fig. 1
The ScanIndel workflow. ScanIndel aligns the raw FASTQ reads with a gapped NGS aligner (BWA-MEM) and detects short indels from the initial mapping results. Soft-clipped reads with supporting breakpoint evidence are extracted and re-aligned with BLAT to refine their CIGAR strings and genomic positions. These re-aligned soft-clipped reads help to identify large deletions and medium-sized insertions. In parallel, ScanIndel carries out de novo assembly of the unmapped reads and the BLAT re-aligned soft-clipped reads with the Inchworm assembler from Trinity to detect large indels. All individual call sets are merged by vcfcombine (from vcflib) into one final VCF output containing all indel predictions.
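The first two steps of the workflow (reading short indels from a gapped alignment and picking out soft-clipped reads for re-alignment) both operate on the CIGAR string of each aligned read. The following is a minimal sketch of that idea, not ScanIndel's actual code: the function names and the `min_clip` threshold of 20 bp are hypothetical, chosen only for illustration. It relies on the standard SAM CIGAR operations (`I` for insertion, `D` for deletion, `S` for soft clip, `H` for hard clip).

```python
import re

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a SAM CIGAR string into (length, op) tuples; '*' means unavailable."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything besides valid elements.
    if sum(len(str(n)) + 1 for n, _ in ops) != len(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clip lengths, ignoring outer hard clips."""
    ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def gapped_indels(cigar):
    """List the insertions and deletions reported directly by the aligner."""
    return [(op, n) for n, op in parse_cigar(cigar) if op in "ID"]


def is_realignment_candidate(cigar, min_clip=20):
    """Hypothetical filter: the read has a soft clip of at least min_clip bases."""
    return max(soft_clip_lengths(cigar)) >= min_clip
```

For example, a read with CIGAR `50M3D47M` carries a 3-bp deletion found by gapped alignment alone, whereas `30S70M` has a 30-base soft clip and would pass this sketch's filter as a candidate for re-alignment.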
Fig. 2
Effect of different strategies on indel detection. ScanIndel is run in three modes: (1) BWA-MEM alignment + soft-clipping realignment + FreeBayes indel calling (labeled as ‘scanindel_mapping_only’); (2) BWA-MEM alignment + de novo assembly + FreeBayes indel calling (labeled as ‘scanindel_assembly_only’); and (3) the complete ScanIndel procedure, that is, BWA-MEM alignment + soft-clipping realignment + de novo assembly + FreeBayes indel calling (labeled as ‘scanindel’). In addition, FreeBayes indel calling directly from the BWA-MEM alignment is tested (labeled as ‘freebayes’). Smoothed histograms (40-bp bins) show the comparison on simulated short reads of 100 bp and 200 bp in length under 10×, 20× and 50× mean coverage for detecting 1000 deletions and 1000 insertions ranging evenly in size from 1 bp to 1 kb.
Fig. 3
Performance comparison for indel detection with 100-bp simulated reads. Recall (upper panels) and precision (lower panels) are evaluated for ScanIndel, GATK HaplotypeCaller, Pindel, Platypus, Scalpel, Delly and FermiKit. Smoothed histograms (100-bp bins) show the comparison on simulated data of 10×, 20× and 50× mean coverage for detecting 1000 deletions and 1000 insertions, one of each at every size from 1 bp to 1 kb. Precision is not calculated when a method produces a zero denominator (TP + FP = 0), that is, when it makes no calls.
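The per-bin recall and precision described in this caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' evaluation script: the function names are hypothetical, the inputs are assumed to be lists of indel sizes already classified as true positives, false positives and false negatives, and no smoothing is applied. Recall is TP / (TP + FN) and precision is TP / (TP + FP); a zero denominator is reported as `None` rather than as a number.

```python
from collections import Counter


def bin_index(size, bin_width=100):
    """Map an indel size in bp to a bin: 1..100 -> 0, 101..200 -> 1, and so on."""
    return (size - 1) // bin_width


def per_bin_metrics(tp_sizes, fp_sizes, fn_sizes, bin_width=100):
    """Compute recall and precision per size bin; None marks a zero denominator."""
    tp = Counter(bin_index(s, bin_width) for s in tp_sizes)
    fp = Counter(bin_index(s, bin_width) for s in fp_sizes)
    fn = Counter(bin_index(s, bin_width) for s in fn_sizes)

    metrics = {}
    for b in sorted(set(tp) | set(fp) | set(fn)):
        truth = tp[b] + fn[b]   # simulated indels in this bin
        called = tp[b] + fp[b]  # indels the caller reported in this bin
        metrics[b] = {
            "recall": tp[b] / truth if truth else None,
            "precision": tp[b] / called if called else None,
        }
    return metrics
```

With `bin_width=40` the same function would give the binning used for the mode comparison in Fig. 2.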
Fig. 4
Performance comparison of large indel detection on NIST standard NA12878. (a) Validated large deletions (138) from the literature, with sizes from 530 to 155,154 bp, are used as the reference standard set. (b) Novel sequence insertions (105) previously identified by the 1000 Genomes Project, with sizes from 37 to 8224 bp, are used as the reference standard.
Fig. 5
Time and peak memory used by ScanIndel and Pindel on 50× WGS data from the NA12878 individual. The run time of ScanIndel is reported per module: split read re-alignment (SR), de novo assembly (AS) and variant calling (VC). All measurements refer to the programs themselves and do not include BWA-MEM alignment.
