Nat Genet. 2014 Aug;46(8):912-918.
doi: 10.1038/ng.3036. Epub 2014 Jul 13.

Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications

Andy Rimmer et al. Nat Genet. 2014 Aug.

Abstract

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.

Figures

Figure 1
Simplified flow diagram of the integrated calling algorithm. The three stages of the algorithm are pipelined without using intermediate files or separate processes. Mapped and sorted BAM files are used as input; merging, sample demultiplexing and read deduplication are all performed by Platypus. The resulting variant calls require no post-processing, except for a Bayesian filtering stage for de novo mutations. (a) Candidate variants are obtained from read alignments, local assembly and external sources (not shown), and candidate haplotypes are formed. (b) The support of each read for any candidate haplotype is computed by alignment, and population haplotype frequencies are fitted to a diploid segregation model. (c) Variants are called by first calling haplotypes, followed by marginalization over secondary variation. Filtering on the variant and sample levels results in a final call set. See the supplementary note for full details of the algorithm.
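Stages (b) and (c) of the caption above can be illustrated with a toy calculation. The following is a minimal sketch, not the Platypus implementation: it assumes that per-read likelihoods P(read | haplotype) have already been obtained by alignment, that each read comes from either chromosome copy with probability 1/2, and that genotype priors follow Hardy-Weinberg proportions. The function names, the pseudo-count and all numbers are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def genotype_log_likelihood(read_liks, h1, h2):
    """log P(reads | genotype (h1, h2)); each read is assumed to come
    from either haplotype of the diploid genotype with probability 1/2."""
    return sum(math.log(0.5 * r[h1] + 0.5 * r[h2]) for r in read_liks)


def genotype_posteriors(read_liks, freqs):
    """Posterior over unordered diploid genotypes for one sample,
    using a Hardy-Weinberg prior built from haplotype frequencies."""
    log_post = {}
    for h1, h2 in combinations_with_replacement(range(len(freqs)), 2):
        prior = freqs[h1] * freqs[h2] * (1 if h1 == h2 else 2)
        log_post[(h1, h2)] = math.log(prior) + genotype_log_likelihood(
            read_liks, h1, h2
        )
    top = max(log_post.values())  # subtract the maximum for numerical stability
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def fit_frequencies(samples, n_haps, iters=50, pseudo=1e-3):
    """Toy EM: alternate between genotype posteriors (E-step) and
    expected haplotype counts (M-step). The pseudo-count is an assumption
    that keeps frequencies strictly positive."""
    freqs = [1.0 / n_haps] * n_haps
    for _ in range(iters):
        counts = [pseudo] * n_haps
        for read_liks in samples:
            for (h1, h2), p in genotype_posteriors(read_liks, freqs).items():
                counts[h1] += p
                counts[h2] += p
        total = sum(counts)
        freqs = [c / total for c in counts]
    return freqs


# Made-up data: two candidate haplotypes (0 = reference, 1 = variant).
# Each tuple is (P(read | hap 0), P(read | hap 1)).
het = [(0.9, 0.1)] * 5 + [(0.1, 0.9)] * 5
hom_ref = [(0.9, 0.1)] * 10
hom_alt = [(0.1, 0.9)] * 10

freqs = fit_frequencies([het, hom_ref, hom_alt], n_haps=2)
post = genotype_posteriors(het, freqs)
best = max(post, key=post.get)
print(freqs, best)
```

In this toy setting the sample with reads split evenly between the two haplotypes is called heterozygous. The marginalization over secondary variation and the filtering steps named in the caption are not modeled here.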
Figure 2
Size distribution of indel calls in the NA12878 trio. (a) Histogram of small indel calls (up to 50 bp) by size (negative size, deletion with respect to the reference sequence) for three calling algorithms. UG, UnifiedGenotyper; HC, HaplotypeCaller. (b) Smoothed histograms (10-bp bins) showing larger indels and peaks around ~300 bp corresponding to insertions and deletions of Alu transposable elements. Local assembly allows Platypus to detect insertions up to a few hundred basepairs in size and deletions of over 1 kb in size.
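The sign convention in panel (a) can be made concrete with a short sketch. It assumes calls are available as (REF, ALT) allele pairs, as in a VCF record; the helper names and the example alleles are hypothetical and not taken from the paper.

```python
from collections import Counter


def indel_size(ref, alt):
    """Signed indel size: positive for insertions, negative for
    deletions with respect to the reference sequence."""
    return len(alt) - len(ref)


def size_histogram(records, max_abs=50):
    """Count indels by signed size, keeping sizes up to max_abs bp
    and skipping same-length substitutions (size 0)."""
    sizes = (indel_size(ref, alt) for ref, alt in records)
    return Counter(s for s in sizes if s != 0 and abs(s) <= max_abs)


# Made-up (REF, ALT) pairs; ("C", "T") is a SNP and is ignored.
calls = [("A", "ATT"), ("ACGT", "A"), ("G", "GC"), ("C", "T"), ("TAA", "T")]
hist = size_histogram(calls)
print(sorted(hist.items()))
```

Binning the same signed sizes into wider intervals (for example 10 bp) would give the kind of histogram described for panel (b), before any smoothing.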
Figure 3
Genotypes of the HLA-A, HLA-B and HLA-C loci at two- and four-digit resolution. Combined genotype alignment scores relative to the maximum-scoring genotypes are shown as a heat map; gray boxes indicate correct genotypes. Correct and unique genotypes at four-digit resolution were estimated from Platypus reference and variant calls for HLA-B (HLA-B*56:01/HLA-B*08:01) and HLA-C (HLA-C*07:01/HLA-C*01:02) (heat maps for genotypes at two-digit resolution shown; see supplementary Fig. 2 for heat maps at four-digit resolution). For HLA-A, the correct genotype HLA-A*11:01/HLA-A*01:01 was among the highest scoring genotypes, but it could be resolved uniquely at two-digit resolution only (lower middle panel); ambiguities remained at four-digit resolution for both haplotypes (right; genotypes with identical scores are indicated).
