Review
Genome Med. 2020 Oct 26;12(1):91.
doi: 10.1186/s13073-020-00791-w.

Best practices for variant calling in clinical sequencing

Daniel C Koboldt. Genome Med. 2020.

Abstract

Next-generation sequencing technologies have enabled a dramatic expansion of clinical genetic testing both for inherited conditions and diseases such as cancer. Accurate variant calling in NGS data is a critical step upon which virtually all downstream analysis and interpretation processes rely. Just as NGS technologies have evolved considerably over the past 10 years, so too have the software tools and approaches for detecting sequence variants in clinical samples. In this review, I discuss the current best practices for variant calling in clinical sequencing studies, with a particular emphasis on trio sequencing for inherited disorders and somatic mutation detection in cancer patients. I describe the relative strengths and weaknesses of panel, exome, and whole-genome sequencing for variant detection. Recommended tools and strategies for calling variants of different classes are also provided, along with guidance on variant review, validation, and benchmarking to ensure optimal performance. Although NGS technologies are continually evolving, and new capabilities (such as long-read single-molecule sequencing) are emerging, the "best practice" principles in this review should be relevant to clinical variant calling in the long term.

Keywords: Best practices; Cancer sequencing; Clinical sequencing; Mutation detection; Next-generation sequencing; Variant calling.

Conflict of interest statement

The author is a co-inventor of VarScan 2 and thus receives a portion of licensing royalties from several commercial licensees.

Figures

Fig. 1
Standard pipelines for NGS analysis. (a) Alignment and pre-processing of NGS data for an individual sample. Raw sequence data in FASTQ format are aligned to the reference sequence, with the resulting alignments typically stored in binary alignment/map (BAM) file format. Marking of duplicates in the BAM file is a critical step to account for duplicate reads of the same fragment. Base quality score recalibration (BQSR) and local realignment around indels are computationally expensive steps that may marginally improve variant calls. At the conclusion of these steps, the file is ready for variant analysis. (b) Variant calling in NGS trio sequencing. In this common study design, variants are called jointly (simultaneously) in a proband and both parents, which enables the phasing of variants by parent of origin. The initial variant calls are typically filtered to remove a number of recurrent artifacts associated with short-read alignment and may be visually confirmed by manual review of the sequence alignments. Orthogonal validation may be performed to confirm the variant and its segregation within the family. De novo alterations should be aggressively filtered to remove both artifactual calls in the proband (false positives) and inherited variants that were under-called in a parent (false negatives). In addition to manual inspection of alignments, most de novo mutations are independently verified by orthogonal validation techniques, such as Sanger sequencing. (c) Somatic variant calling in matched tumor-normal pairs. Identification of somatic alterations in tumors requires specialized variant callers that consider aligned data from the tumor and normal simultaneously. Candidate somatic variants are filtered and visually reviewed to remove common alignment artifacts as well as germline variants under-called in the normal sample. The resulting variants are typically validated by orthogonal approaches, which may require specialized methods for low-frequency variants.
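The de novo filtering logic described in the Fig. 1 caption can be illustrated with a minimal sketch. It is not the review's method: the `AlleleCounts` type, the function name, and every threshold (minimum depth, minimum proband alternate reads and allele fraction, maximum parental alternate reads) are hypothetical values chosen only for illustration. The idea is to reject a candidate when the proband evidence is weak (a likely false positive) or when either parent is poorly covered or shows any alternate reads (a likely under-called inherited variant).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlleleCounts:
    """Read counts at one site for one sample (hypothetical helper type)."""
    ref: int
    alt: int

    @property
    def depth(self) -> int:
        return self.ref + self.alt

    @property
    def vaf(self) -> float:
        # Variant allele fraction; 0.0 when the site has no coverage.
        return self.alt / self.depth if self.depth else 0.0


def is_candidate_de_novo(
    proband: AlleleCounts,
    mother: AlleleCounts,
    father: AlleleCounts,
    min_depth: int = 20,           # assumed threshold, not from the review
    min_proband_alt: int = 5,      # assumed threshold, not from the review
    min_proband_vaf: float = 0.25, # assumed threshold, not from the review
    max_parent_alt: int = 0,       # assumed threshold, not from the review
) -> bool:
    """Return True if the site passes a simple de novo screen."""
    # All three samples need enough coverage to trust the genotypes;
    # low parental depth is a common source of under-called inherited variants.
    if min(proband.depth, mother.depth, father.depth) < min_depth:
        return False
    # The proband must carry convincing evidence for the alternate allele.
    if proband.alt < min_proband_alt or proband.vaf < min_proband_vaf:
        return False
    # Any alternate reads in a parent suggest inheritance rather than de novo.
    return mother.alt <= max_parent_alt and father.alt <= max_parent_alt
```

A site that passes such a screen would still go on to manual review and orthogonal validation, as the caption notes.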
Fig. 2
Common artifacts in NGS alignments that gave rise to a false-positive de novo mutation call in a family trio. Each pane is an IGV screenshot of WGS alignments for the proband (top track), mother (middle track), and father (bottom track). Each sample's track comprises two parts: a histogram of the read depth and the reads as aligned to the reference sequence. Reads are colored according to the aligned strand (red = forward strand; blue = reverse strand). (a) False positive associated with low base quality. Most reads supporting the variant have low base quality, indicated by lightly shaded non-reference bases. Four reads in the proband showed the alternate allele with good quality, triggering the variant call. (b) False positive due to misalignments near the start or end of reads. Notice that the alternate allele is only observed at the start/end of reads in the proband. In this case, the read depth histogram provides a clue as to the cause of the misalignment. As shown in the next panel, this occurs at the breakpoint of a large paternally inherited deletion. (c) The same position as in (b), but with soft-clipped bases shown in color. BLAT alignment of such reads reveals that the soft-clipped portion matches the other side of the deletion segment some 5.2 kb downstream. (d) False positive associated with strand bias. All but one of the variant-supporting reads in the proband are on the reverse strand, whereas reference-supporting reads are equally represented on both strands. (e) False positives associated with low-complexity sequences. In this case, reads erroneously showing a single-base deletion (horizontal black line) at a T-homopolymer are enriched in the proband. Reads supporting insertions (purple) are also seen. Note that this position is zoomed out compared to the other panels, a recommended practice to visualize the end of repetitive sequences. (f) False positives due to paralogous alignments of reads from regions not well represented in the reference. Alignments for the proband include reads with several substitutions relative to the reference sequence within the 41-bp viewing window. This typically occurs when reads from sequences not represented in the reference are mapped to the closest paralog.
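The strand-bias artifact in the Fig. 2 caption (panel d) can be quantified as well as inspected visually. One common approach, shown here as a hedged sketch and not as the method used in the review, is a two-sided Fisher's exact test on the 2x2 table of reference/alternate reads by forward/reverse strand. The function name, the standard-library implementation, and the example read counts are assumptions for illustration.

```python
from math import comb


def strand_bias_p_value(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher's exact test p-value for a 2x2 strand table.

    Rows are alleles (reference, alternate); columns are strands
    (forward, reverse). A small p-value means the alternate allele's
    strand distribution differs from the reference allele's.
    """
    ref_total = ref_fwd + ref_rev
    alt_total = alt_fwd + alt_rev
    fwd_total = ref_fwd + alt_fwd
    n = ref_total + alt_total
    if n == 0:
        return 1.0

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x forward-strand reference
        # reads given the fixed row and column totals.
        return comb(ref_total, x) * comb(alt_total, fwd_total - x) / comb(n, fwd_total)

    observed = prob(ref_fwd)
    lowest = max(0, fwd_total - alt_total)
    highest = min(ref_total, fwd_total)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p)


# Hypothetical counts resembling panel d: reference reads balanced across
# strands, alternate reads almost entirely on the reverse strand.
biased = strand_bias_p_value(ref_fwd=20, ref_rev=20, alt_fwd=1, alt_rev=14)
balanced = strand_bias_p_value(ref_fwd=20, ref_rev=20, alt_fwd=7, alt_rev=8)
```

In this illustration, `biased` comes out far smaller than `balanced`. Any cutoff for discarding a call would need to be tuned and benchmarked for the assay in question.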
Fig. 3
Visual review of copy number and structural variants. Each pane is an IGV screenshot of WGS for a proband (top), mother (middle), and father (bottom). The top track for each sample is a histogram of sequence depth. Reads are viewed as pairs, with discordant pair alignments highlighted in color. (a) A maternally inherited ~4-kb deletion that appears heterozygous in the proband, homozygous in the mother, and absent from the father. Note the discordant read pairs suggesting a deletion (red) and the visible change in read depth. (b) Homozygous deletion inherited from two heterozygous parents. (c) A heterozygous paternally inherited deletion whose end point is ambiguous by paired-end mapping but is resolved by visual inspection of read depth. (d) A maternally inherited tandem duplication. Note the increased read depth in the histogram and the discordant read pairs highlighted in green that span the original sequence and its tandem duplication.
Fig. 4
Detecting somatic rearrangements in cancer using NGS. Shown are whole-genome sequencing data for chromosome 1 for a tumor-normal pair. Top: Log2 values indicate copy number changes in the tumor relative to the normal. Bottom: copy gains and losses skew tumor allele frequencies for heterozygous variants, with loss of heterozygosity (red) apparent in regions of heterozygous deletions.
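The two signals in the Fig. 4 caption can be expressed as simple arithmetic. The sketch below is an illustration under stated assumptions, not the review's procedure: the function names are hypothetical, depth is normalized by each sample's mean depth, the tumor is treated as pure, and the 0.4 to 0.6 band for a balanced heterozygous site is an arbitrary illustrative choice.

```python
import math


def log2_copy_ratio(
    tumor_depth: float,
    normal_depth: float,
    tumor_mean_depth: float,
    normal_mean_depth: float,
) -> float:
    """Log2 ratio of normalized tumor depth to normalized normal depth in a region."""
    values = (tumor_depth, normal_depth, tumor_mean_depth, normal_mean_depth)
    if min(values) <= 0:
        raise ValueError("all depths must be positive")
    # Dividing by each sample's mean depth corrects for different sequencing yields.
    tumor_norm = tumor_depth / tumor_mean_depth
    normal_norm = normal_depth / normal_mean_depth
    return math.log2(tumor_norm / normal_norm)


def has_allelic_imbalance(
    tumor_vaf: float,
    balanced_band: tuple[float, float] = (0.4, 0.6),  # assumed band
) -> bool:
    """True if a germline-heterozygous site has a skewed tumor allele fraction."""
    low, high = balanced_band
    return not (low <= tumor_vaf <= high)


# In a pure tumor, a heterozygous deletion halves the depth (log2 ratio of -1),
# and a single-copy gain raises it by half (log2 ratio of about 0.585).
loss = log2_copy_ratio(15, 30, 30, 30)
gain = log2_copy_ratio(45, 30, 30, 30)
```

Real tumors are mixtures of tumor and normal cells, so observed log2 ratios and allele-fraction shifts are attenuated relative to these idealized values.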
