BMC Genomics. 2016 Sep 2;17(1):703.
doi: 10.1186/s12864-016-3045-z.

An analytical workflow for accurate variant discovery in highly divergent regions

Shulan Tian et al. BMC Genomics. 2016.

Abstract

Background: Current variant discovery methods often start with the mapping of short reads to a reference genome; yet, their performance deteriorates in genomic regions where the reads are highly divergent from the reference sequence. This is particularly problematic for the human leukocyte antigen (HLA) region on chromosome 6p21.3. This region is associated with over 100 diseases, but variant calling is hindered by the extreme divergence across different haplotypes.

Results: We simulated reads from chromosome 6 exonic regions over a wide range of sequence divergence and coverage depth. We systematically assessed combinations between five mappers and five callers for their performance on simulated data and exome-seq data from NA12878, a well-studied individual in which multiple public call sets have been generated. Among those combinations, the number of known SNPs differed by about 5 % in the non-HLA regions of chromosome 6 but over 20 % in the HLA region. Notably, GSNAP mapping combined with GATK UnifiedGenotyper calling identified about 20 % more known SNPs than most existing methods without a noticeable loss of specificity, with 100 % sensitivity in three highly polymorphic HLA genes examined. Much larger differences were observed among these combinations in INDEL calling from both non-HLA and HLA regions. We obtained similar results with our internal exome-seq data from a cohort of chronic lymphocytic leukemia patients.

Conclusions: We have established a workflow that enables variant detection with high sensitivity and specificity over the full spectrum of divergence seen in the human genome. Comparison with public call sets from NA12878 highlighted the overall superiority of GATK UnifiedGenotyper, followed by GATK HaplotypeCaller and SAMtools, in SNP calling, and of GATK HaplotypeCaller and Platypus in INDEL calling, particularly in regions of high sequence divergence such as the HLA region. GSNAP and Novoalign are the ideal mappers to combine with the above callers. We expect the proposed workflow to be applicable to variant discovery in other highly divergent regions.

Keywords: Alignment algorithm; Chronic lymphocytic leukemia; Exome sequencing; Human leukocyte antigen; Variant calling.
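The sensitivity figures quoted in the abstract compare each call set against a set of known variants. The following is a minimal sketch of that kind of comparison, not the authors' pipeline: the variant tuples, the toy data, and the function names `sensitivity` and `precision` are all hypothetical, and precision is used here only as a rough stand-in for the specificity the abstract mentions.

```python
from typing import Set, Tuple

# A variant is identified by (chromosome, position, ref allele, alt allele).
# This representation is an assumption made for illustration.
Variant = Tuple[str, int, str, str]


def sensitivity(called: Set[Variant], truth: Set[Variant]) -> float:
    """Fraction of truth variants recovered by the call set: TP / (TP + FN)."""
    if not truth:
        raise ValueError("truth set is empty")
    return len(called & truth) / len(truth)


def precision(called: Set[Variant], truth: Set[Variant]) -> float:
    """Fraction of called variants that are in the truth set: TP / (TP + FP)."""
    if not called:
        raise ValueError("call set is empty")
    return len(called & truth) / len(called)


# Toy data (hypothetical positions on chromosome 6).
truth_set = {
    ("chr6", 100, "A", "G"),
    ("chr6", 200, "C", "T"),
    ("chr6", 300, "G", "A"),
    ("chr6", 400, "T", "C"),
    ("chr6", 500, "A", "T"),
}
call_set = {
    ("chr6", 100, "A", "G"),
    ("chr6", 200, "C", "T"),
    ("chr6", 300, "G", "A"),
    ("chr6", 400, "T", "C"),
    ("chr6", 999, "G", "C"),  # not in the truth set: a false positive
}

sens = sensitivity(call_set, truth_set)  # 4 of 5 truth variants found
prec = precision(call_set, truth_set)    # 4 of 5 calls are true
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Matching on exact position and alleles is the simplest choice; INDEL comparisons in practice usually need normalization of variant representation first, which this sketch does not attempt.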

Figures

Fig. 1
Mapping status of simulated reads. Eight datasets were simulated to 100x per-base coverage at seven divergence levels (0.05–15 %) or without introducing sequence variation (control). After mapping, the total reads were first broken into those mapping to chromosome 6, those mapping to other chromosomes (‘Not on Chr6’) and unmapped reads (‘Unmapped’). The reads mapping to chromosome 6 were then grouped into five clusters based on the distance (0, 1–2, 3–10, 11–20 and >20 bp) from their original locations to the mapping locations reported by a mapper
Fig. 2
SNP and INDEL calling sensitivity as a function of coverage depth. The X-axis indicates coverage depth in simulation and the Y-axis denotes sensitivity. a-b SNP calling sensitivity at 5 and 10 % divergence. c-d INDEL calling sensitivity at 5 and 10 % divergence. The 20 mapper-caller combinations are color-coded, using a color gradient to differentiate the four mappers combined with the same caller. GATK HC, GATK HaplotypeCaller; GATK UG, GATK UnifiedGenotyper
Fig. 3
Known SNPs in HLA-DRB1 and HLA-DQB1 from NA12878. The HLA-DRB1 (a) and HLA-DQB1 (b) structures are shown at the top, with filled boxes representing the exons (E1 to E6 in HLA-DRB1 and E1 to E5 in HLA-DQB1) and an arrow indicating the transcription direction. The exome-seq reads from NA12878 were mapped by BWA and GSNAP, respectively, and SNPs were called by GATK UnifiedGenotyper (GATK UG). Known SNPs matching dbSNP v138 are shown as ‘circles’ and clustered into three groups. The number in parentheses indicates the number of known SNPs in each group. The two coverage plots depict BWA and GSNAP mapping coverage at base-pair resolution
Fig. 4
Variant calling sensitivity in NA12878 at three divergence settings. a SNP calling in non-HLA regions. b SNP calling in HLA region. c INDEL calling in non-HLA regions. d INDEL calling in HLA region. The three divergence levels in mapper parameter settings are 1, 5 and 10 %. Reads were aligned to the hg19 reference sequence by the three mappers at each of the divergence settings. GATK HC, GATK HaplotypeCaller; GATK UG, GATK UnifiedGenotyper
Fig. 5
Overlap of known variants in the HLA region of the CLL sample 612703. a Overlap of known SNPs. b Overlap of known INDELs. The 12 call sets were generated by three callers together with four mappers. The number of known variants is shown in parentheses. Each non-triangle box is pseudo-colored to signify the proportion of the call set on the left that is overlapped by the call set shown on the top. HC, GATK HaplotypeCaller; PY, Platypus; UG, GATK UnifiedGenotyper
