Nat Protoc. 2016 Dec;11(12):2529-2548.

doi: 10.1038/nprot.2016.150. Epub 2016 Nov 17.

Indel variant analysis of short-read sequencing data with Scalpel

Han Fang¹ ² ³, Ewa A Bergmann⁴, Kanika Arora⁴, Vladimir Vacic⁴, Michael C Zody⁴, Ivan Iossifov¹, Jason A O'Rawe² ³, Yiyang Wu² ³, Laura T Jimenez Barron² ⁵, Julie Rosenbaum¹, Michael Ronemus¹, Yoon-Ha Lee¹, Zihua Wang¹, Esra Dikoglu², Vaidehi Jobanputra² ⁶, Gholson J Lyon² ³, Michael Wigler¹, Michael C Schatz¹ ⁷, Giuseppe Narzisi¹ ⁴

Affiliations

¹ Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.
² Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.
³ Stony Brook University, Stony Brook, New York, USA.
⁴ New York Genome Center, New York, New York, USA.
⁵ Centro de Ciencias Genomicas, Universidad Nacional Autonoma de Mexico, Cuernavaca, Mexico.
⁶ Columbia University Medical Center, New York, New York, USA.
⁷ Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA.

PMID: 27854363
PMCID: PMC5507611
DOI: 10.1038/nprot.2016.150

Han Fang et al. Nat Protoc. 2016 Dec.

Abstract

As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is an open-source software for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, but we also characterize the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in ∼5 h after read mapping.

Conflict of interest statement

The authors declare competing financial interests: details are available in the online version of the paper.

Figures

**Figure 1**
High accuracy of indel detection using Scalpel on WGS data. Scalpel was run in single mode on 30× Illumina HiSeq 2000 2 × 100 bp WGS data described in Narzisi *et al.* and later analyzed in Fang *et al.* This figure shows the size distribution of valid (green) and invalid (gray) indels that were randomly selected for validation (using targeted resequencing) from the two previous studies. The validation set includes 160 candidate variants from the WGS–WES intersection and 145 WGS-specific candidates. Of the 305 candidates in total, 274 (90%) were successfully validated. Positive predictive value (PPV) is computed as PPV = no. TP/(no. TP + no. FP), where no. TP is the number of true-positive calls and no. FP is the number of false-positive calls.
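
The PPV formula in this caption can be checked against the validation counts it reports. The snippet below is a minimal sketch; the function name is illustrative, and the false-positive count is assumed to be the candidates that failed validation (305 − 274).

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP)."""
    total_calls = true_positives + false_positives
    if total_calls == 0:
        raise ValueError("PPV is undefined when no calls were made")
    return true_positives / total_calls


# Counts taken from the caption: 305 candidates, 274 validated.
candidates = 305
validated = 274
# Assumption: every candidate that failed validation counts as a false positive.
ppv = positive_predictive_value(validated, candidates - validated)
print(f"PPV = {ppv:.3f}")  # prints PPV = 0.898
```

The result is consistent with the roughly 90% validation rate quoted in the caption.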

**Figure 2**
Main steps in the Scalpel protocol. Starting from raw sequencing data, reads are first aligned to the human genome using the BWA software package (Step 4 in PROCEDURE). Following standard practices in the field, the alignments are sorted (using SAMtools, Step 5 in PROCEDURE) and duplicates are marked (using Picard Tools, http://broadinstitute.github.io/picard/; Step 6 in PROCEDURE). Finally, indels can be called with Scalpel (Steps 8 and 9 in PROCEDURE), and statistical assessment of the variant calls can provide diagnostics of the data (Steps 10–20 in PROCEDURE). Note that because Scalpel locally reassembles the reads, this procedure does not require computationally expensive techniques such as indel realignment and base quality recalibration. The BAM files obtained after the earlier steps are the input for the Scalpel microassembly procedure. Scalpel then localizes the reads within a window, constructs a de Bruijn graph, resolves repeat structure and enumerates haplotype paths. Image adapted with permission from ref. , Nature Publishing Group.
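
The microassembly idea summarized in this caption (build a de Bruijn graph from the reads in a window, then enumerate paths through it) can be illustrated with a toy example. This is a simplified sketch and not Scalpel's implementation: it has no repeat resolution, coverage thresholds or error handling, and the sequences, read length and `k` are made up for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, source, sink, max_nodes=50):
    """Depth-first enumeration of simple source-to-sink paths, spelled as sequences."""
    sequences = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            # Spell the path: first node, then the last base of each later node.
            sequences.append(path[0] + "".join(n[-1] for n in path[1:]))
            continue
        if len(path) >= max_nodes:
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:  # simple paths only; repeats are not resolved here
                stack.append((nxt, path + [nxt]))
    return sorted(sequences)


# Made-up window: a reference haplotype and one carrying a 3-bp deletion ("ATC").
reference = "TTACGGATCCAGTCA"
deletion = "TTACGGCAGTCA"
k = 4
read_len = 8
# Simulate error-free overlapping reads tiled across both haplotypes.
reads = [seq[i:i + read_len]
         for seq in (reference, deletion)
         for i in range(len(seq) - read_len + 1)]

graph = build_de_bruijn(reads, k)
haplotypes = enumerate_paths(graph, reference[:k - 1], reference[-(k - 1):])
for hap in haplotypes:
    print(hap)
```

The graph branches where the two haplotypes diverge and merges where they rejoin, so the two enumerated paths spell the reference and the deletion-carrying sequence. Aligning each path back to the reference is what would expose the indel.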

**Figure 3**
Overview of the indel variant filtering cascade. This figure is a conceptual representation of the filtering procedure (Steps 9–12 in the PROCEDURE). It is used to report high-quality *de novo* and inherited indels within the target region, coding regions in this case. (i) Inherited and *de novo* indels are analyzed separately; (ii) only variants within the target regions are exported; (iii) low-quality indels are identified and removed based on sequence composition (e.g., STRs); and (iv) additional filters based on supporting coverage and allele balance are used to reduce the number of false positives.
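
Stages (ii) to (iv) of this cascade can be sketched as successive filters over a list of calls. Everything below is illustrative: the record fields, the thresholds `MIN_ALT_COV` and `MAX_CHI2`, and the allele-balance χ² (a comparison of reference and alternative coverage against an equal split) are assumptions made for this sketch, not values or formulas taken from the protocol.

```python
from dataclasses import dataclass


@dataclass
class IndelCall:
    name: str
    in_target: bool  # falls inside the target (e.g., coding) regions
    is_str: bool     # lies in a short tandem repeat or homopolymer run
    ref_cov: int     # reads supporting the reference allele
    alt_cov: int     # reads supporting the alternative allele


def allele_balance_chi2(ref_cov: int, alt_cov: int) -> float:
    """Chi-square of observed coverage against an assumed 50/50 allele split."""
    total = ref_cov + alt_cov
    if total == 0:
        return float("inf")
    expected = total / 2
    return ((ref_cov - expected) ** 2 + (alt_cov - expected) ** 2) / expected


MIN_ALT_COV = 5   # hypothetical threshold
MAX_CHI2 = 10.0   # hypothetical threshold


def filter_cascade(calls):
    kept = [c for c in calls if c.in_target]        # (ii) target regions only
    kept = [c for c in kept if not c.is_str]        # (iii) sequence composition
    return [c for c in kept                         # (iv) coverage and balance
            if c.alt_cov >= MIN_ALT_COV
            and allele_balance_chi2(c.ref_cov, c.alt_cov) <= MAX_CHI2]


calls = [
    IndelCall("balanced", True, False, 20, 18),
    IndelCall("off_target", False, False, 20, 18),
    IndelCall("homopolymer", True, True, 20, 18),
    IndelCall("low_support", True, False, 40, 2),
    IndelCall("skewed", True, False, 60, 10),
]
high_quality = [c.name for c in filter_cascade(calls)]
print(high_quality)  # prints ['balanced']
```

Stage (i), the separation of inherited and *de novo* calls, is omitted here; in this sketch it would amount to running the same cascade on each set independently.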

**Figure 4**
Higher coverage can improve Scalpel’s sensitivity for indel detection with WGS data. Sensitivity is assessed using the high-confidence call set shared by WGS and WES data (both Illumina HiSeq 2000 platform) from eight samples using all available coverage (70× mean coverage). We down-sampled the reads to a fraction of the original coverage and performed indel calling again. Compared with the original set at 70× mean coverage, we report the percentage of variants that could still be called at a reduced coverage. The y axis represents the percentage of the high-confidence indels recovered in a down-sampled data set. The x axis represents the mean coverage of the eight down-sampled genomes. Among the entire call set, ~61% of the indels are heterozygous and the remaining 39% are homozygous. Performance for heterozygous (blue) and homozygous (green) indel detection is shown by separate curves. Reduced coverage affected the detection of heterozygous indels more than that of homozygous ones.

**Figure 5**
Comparison of standard WGS and PCR-free data based on indel quality. Indel quality was defined with respect to alternative allele coverage and χ² score, which is used in the PROCEDURE and described in Fang *et al.* ‘Intersection’ represents the indels shared by the PCR-free and standard WGS call sets. The number reported above each call set represents the total number of indels in that subset; the two data sets shared 2,684 variants, whereas 310 and 538 were specific to standard WGS and PCR-free data, respectively. Indel calls are further categorized (side-bars) based on their sequence composition: Poly A, Poly C, Poly G, Poly T, other-STR and non-STR. Note: although Poly C and Poly G indels exist in the call set, their fractions are too small to be visible in the plot. Poly A, Poly T and non-homopolymer STRs dominate the STR indels. Poly, homopolymer.

**Figure 6**
Whole-genome mutational concordance. (a) Concordant and discordant indel mutations as a function of the Phred-scaled Fisher’s exact score cutoff between primary tumor and metastasis for a pair of highly concordant colorectal cancer samples from Branon *et al.* Increasing the Fisher’s exact score cutoff substantially reduces the number of private indels while maintaining a similar number of shared ones. This demonstrates the ability of the Fisher’s exact score to discriminate true mutations from false-positive ones. (b) Distribution of variant allele fraction (VAF) as a function of different Phred-scaled Fisher’s exact score cutoffs for the somatic indels detected in the primary tumor. Increasing the cutoff shifts the distribution toward the expected 20% VAF for these samples.
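
A Phred-scaled Fisher’s exact score of the kind used as a cutoff here can be sketched with the standard library alone. The 2 × 2 table layout (normal versus tumor, reference versus alternative read counts) is an assumption about how such a score might be set up, not a description of Scalpel's exact computation, and the read counts are invented.

```python
import math


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def table_prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_observed = table_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    p_value = sum(
        p for p in map(table_prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(p_value, 1.0)


def phred_scale(p_value: float) -> float:
    """Phred scaling: -10 * log10(p), clamped to avoid log10(0)."""
    return -10.0 * math.log10(max(p_value, 1e-300))


# Illustrative counts only (not from the paper): rows are samples,
# columns are reference-supporting and alternative-supporting reads.
normal_ref, normal_alt = 40, 0
tumor_ref, tumor_alt = 30, 10
score = phred_scale(
    fisher_exact_two_sided(normal_ref, normal_alt, tumor_ref, tumor_alt)
)
print(f"Phred-scaled Fisher's exact score: {score:.1f}")
```

A higher score means the allele counts differ more strongly between the two samples, so raising the cutoff keeps only the calls with the clearest tumor-specific support.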

**Figure 7**
Size distribution of inherited and *de novo* indels. The y axis represents the number of indels, whereas the x axis represents the size of indels in base pairs. We should expect a log-normal distribution of indels, with a majority of them being short—i.e., <5 bp in the human exonic regions. This figure was generated using the data from Step 15.

**Figure 8**
Histograms of low-quality homopolymer indels by category. The y axis represents the number of indels, whereas the x axis represents the variant allele fraction (VAF). Homopolymer A or T indels should be more abundant than C or G indels in the call set, especially indels with very low VAF. Due to the limitations of PCR amplification, homopolymer A or T runs are more likely to result in inaccurate molecules. This figure was generated using the data from Step 16.

**Figure 9**
Variant allele fractions (VAF %) of the inherited indels. Low/high-quality indels here were defined with respect to the coverage and χ² scores described in Steps 11 and 12. The VAF of high-quality inherited indels should follow an approximately normal distribution, with a mean of ~50%. In practice, because of sequencing and alignment biases, the mean of the normal distribution is usually slightly less than 50%. Low-quality indels usually have low VAF values, generally tending to be lower than 20%. This figure was generated using the data from Step 17.

**Figure 10**
Filtering cascade of inherited and *de novo* indel calls. The numbers in each box denote the expected numbers of indel calls remaining after filtering. The *de novo* indels underwent a two-pass search to reduce the number of false positives. The numbers in this figure were obtained from Steps 9–12 and 22. It is important to use a two-pass search in *de novo* indel calling, as false-positive calls can be reduced by using a more sensitive parameter setting for the parents’ data.

**Figure 11**
Frame-preserving indels are more abundant within coding sequences. This figure was generated using the output of Step 23, the set of inherited indels from the proband, NA12882. The y axis represents the number of indels, whereas the x axis shows the indel size. Stacked bar plots show insertions (red) and deletions (green). Indels with a size that is a multiple of three (frame-preserving) are more abundant than the frame-disrupting ones.
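
The frame-preserving criterion in this caption is a simple test on indel length. The sketch below tallies the two classes for a list of sizes; the sizes are made up for illustration and do not come from the protocol's data.

```python
from collections import Counter


def is_frame_preserving(indel_size: int) -> bool:
    """An indel keeps the reading frame when its length is a multiple of three."""
    return indel_size % 3 == 0


# Made-up indel sizes in base pairs.
indel_sizes = [1, 2, 3, 3, 4, 6, 9, 1]
tally = Counter(
    "frame-preserving" if is_frame_preserving(size) else "frame-disrupting"
    for size in indel_sizes
)
print(dict(tally))
```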

**Figure 12**
Screenshot of the alignment of the *de novo* deletion in the IGV browser. From the top to the bottom, the alignment is as follows: NA12877 (father), NA12878 (mother), NA12881 (sibling) and NA12882 (proband). The black lines in the alignment of NA12882 show the T deletion in the genome. It is clear from the alignment of the reads that this deletion is present only in the proband and not in any other family members.

References

1. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372:793–795.
2. Highnam G, et al. An analytical framework for optimizing variant discovery from personal genomes. Nat Commun. 2015;6:6275.
3. Watson JD, Baker TA, Gann A, Levine M, Losick R. Molecular Biology of the Gene. 7. Cold Spring Harbor Laboratory Press; 2013.
4. Nik-Zainal S, et al. Mutational processes molding the genomes of 21 breast cancers. Cell. 2012;149:979–993.
5. Zaidi S, et al. De novo mutations in histone-modifying genes in congenital heart disease. Nature. 2013;498:220–223.
