Nat Commun. 2020 Sep 22;11(1):4794.
doi: 10.1038/s41467-020-18564-9.

A diploid assembly-based benchmark for variants in the major histocompatibility complex

Chen-Shan Chin et al. Nat Commun. 2020.

Abstract

Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful: the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align the contigs to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC. The benchmark covers 94% of the MHC and contains 22,368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. It reliably identifies errors in mapping-based callsets and enables performance assessment in regions with much denser and more complex variation than the regions covered by previous benchmarks.

Conflict of interest statement

C.-S.C. and A.F. are employees of DNAnexus Inc., a company providing a cloud computing platform for processing genomic information. C.-S.C. is a co-founder and partner of Omni BioComputing, LLC, which currently develops genome assembler related technologies. Q.Z. is an employee of Laboratory Corporation of America Holdings, a company providing clinical diagnostics services. A.T.D. is a partner in Peptide Groove, LLP. A.C. is an employee of Google, a company providing a cloud computing platform. W.J.R. is an employee and shareholder of Pacific Biosciences. A.M.B. is an ex-employee and shareholder of 10x Genomics. The other authors declare no competing interests.

Figures

Fig. 1. Assembling a single contig for each haplotype.
(a) We regenotyped DeepVariant (DV) heterozygous SNVs with WhatsHap using Oxford Nanopore Technologies (ONT) and PacBio HiFi (CCS) reads to find a confident set of SNVs with concordant genotypes from DV/CCS, WhatsHap/ONT, and WhatsHap/CCS; these are our Confident HETs for phasing. We selected 10x Genomics (10X) variants with phased blocks from the 10X VCF. For phasing, we used WhatsHap to combine phased blocks from 10X with ONT reads to get a single phased block across the MHC. (b) We binned PacBio HiFi reads into two haplotypes, which are denoted as orange and blue reads, using WhatsHap. (c) We performed diploid assembly using the Peregrine Assembler with the haplotype-binned HiFi reads. (d) We generated the benchmark variant callset from the assembled haplotigs using dipcall, and defined benchmark regions excluding SVs, exceptionally divergent regions, low-quality regions in the assembly, and long homopolymers.
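The read-binning step in panel b can be illustrated with a minimal sketch. It assumes each read has already been reduced to the bases it shows at phased heterozygous SNV positions, and it assigns the read to whichever haplotype collects more matching alleles. The function name `bin_read`, the `min_margin` parameter, and the toy data are hypothetical; WhatsHap's actual tagging logic is more involved and is not reproduced here.

```python
def bin_read(read_alleles, phased_hets, min_margin=1):
    """Assign a read to haplotype 1 or 2 by allele votes (0 = unassigned).

    read_alleles: {position: base observed in the read}
    phased_hets:  {position: (haplotype-1 allele, haplotype-2 allele)}
    min_margin:   hypothetical threshold on the vote difference
    """
    votes = [0, 0]
    for pos, base in read_alleles.items():
        if pos not in phased_hets:
            continue  # not a phased heterozygous site, so uninformative
        hap1_allele, hap2_allele = phased_hets[pos]
        if base == hap1_allele:
            votes[0] += 1
        elif base == hap2_allele:
            votes[1] += 1
        # any other base is treated as a sequencing error and ignored
    if votes[0] - votes[1] >= min_margin:
        return 1
    if votes[1] - votes[0] >= min_margin:
        return 2
    return 0


# Toy phased SNVs (positions and alleles are made up for illustration).
phased_hets = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A")}

read_hap1 = {100: "A", 250: "C", 400: "G"}   # matches haplotype 1 everywhere
read_hap2 = {100: "G", 250: "T"}             # matches haplotype 2
read_ambiguous = {100: "A", 250: "T"}        # one vote each
```

Reads that tie, or that overlap no phased site, are left unassigned in this sketch; how such reads are handled before assembly is a design choice the caption does not specify.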
Fig. 2. Alignments of the two main haplotigs to the primary GRCh37 MHC region.
We compute the local divergence (estimated difference) of the HG002 MHC haplotigs from the MHC of GRCh37 by performing local alignment. The differences between the assembled contigs and the reference are computed using sequence blocks anchored with minimizers and aligned locally using an O(ND) alignment algorithm.
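As an illustration of the anchoring idea, here is a minimal sketch of window minimizers: within every window of `w` consecutive k-mers, the lexicographically smallest k-mer is kept, and k-mers selected in both sequences serve as candidate anchors between which local alignment could then be run. The function names, the values of `k` and `w`, and the toy sequences are assumptions for illustration; the authors' actual minimizer scheme and parameters are not given in the caption.

```python
def minimizers(seq, k=5, w=4):
    """Return sorted (position, kmer) pairs chosen as window minimizers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Smallest k-mer in this window; ties go to the leftmost position.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def shared_anchors(a, b, k=5, w=4):
    """Return (pos_in_a, pos_in_b, kmer) for minimizers present in both."""
    index = {}
    for pos, kmer in minimizers(a, k, w):
        index.setdefault(kmer, []).append(pos)
    return [
        (pos_a, pos_b, kmer)
        for pos_b, kmer in minimizers(b, k, w)
        for pos_a in index.get(kmer, [])
    ]


# Toy sequences: `hap` differs from `ref` by a single substitution.
ref = "ACGTTGCATGCCAGTACGGA"
hap = "ACGTTGCATGACAGTACGGA"
anchors = shared_anchors(ref, hap)
```

Because a minimizer depends only on the k-mers inside its window, stretches that are identical in both sequences tend to select the same k-mers, which is what makes them usable as anchors around divergent blocks.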
Fig. 3. Evaluation of benchmark’s ability to reliably identify FNs and FPs across technologies.
(a) Proportion of 10 randomly selected FPs and 10 randomly selected FNs from 11 callsets from Illumina (Ill), 10x Genomics (10x), PacBio HiFi (PB), and Oxford Nanopore (ONT) that were determined to be fully correct in the benchmark and incorrect or only partially correct in the query callset. (b) Breakdown of variants potentially incorrect in the benchmark or correct in the query, where curation of the benchmark determined it to be incorrect (no), correct (yes), or unclear (unsure).
Fig. 4. Example of partially called complex variant counted as both false positives and false negatives.
The CCS-DeepVariant VCF from PacBio HiFi reads incorrectly filters the 2-bp deletion and 9 of the 13 SNVs in the region (filtered variants are light gray boxes). The benchmark correctly calls this complex variant, and represents it as a 26-bp insertion of a TG tandem repeat followed by a 29-bp deletion of adjacent tandem repeats. When comparing this VCF to our MHC benchmark, the benchmark insertion and deletion variants are counted as false negatives, while the 5 SNVs called are counter-intuitively counted as false positives because the other variants are incorrectly filtered. If the CCS-DeepVariant VCF had retained the other variants instead of filtering them, all variants would be counted as true positives.
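The counting behaviour described in this caption follows from comparing callsets by the haplotype sequence they imply rather than record by record. The sketch below uses a toy reference and made-up variants (not the actual MHC locus) to show that an insertion plus a deletion and a set of SNVs can spell the same haplotype, while a partial subset of the SNVs spells a different one. The helper `apply_variants` and all coordinates are hypothetical and use 0-based positions.

```python
def apply_variants(ref, variants):
    """Apply non-overlapping (pos, ref_allele, alt_allele) edits to ref."""
    out = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("variants overlap")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"reference allele mismatch at {pos}")
        out.append(ref[cursor:pos])   # unchanged sequence before the variant
        out.append(alt_allele)        # substituted allele
        cursor = pos + len(ref_allele)
    out.append(ref[cursor:])
    return "".join(out)


REF = "GATATCCG"  # toy reference, not real MHC sequence

# Benchmark-style representation: one insertion followed by one deletion.
benchmark = [(0, "G", "GT"), (3, "AT", "A")]

# Query-style representation of the same haplotype: four SNVs.
full_query = [(1, "A", "T"), (2, "T", "A"), (3, "A", "T"), (4, "T", "A")]

# Query with two of the SNVs filtered out.
partial_query = full_query[:2]

benchmark_hap = apply_variants(REF, benchmark)
full_hap = apply_variants(REF, full_query)
partial_hap = apply_variants(REF, partial_query)
```

Here `benchmark_hap` and `full_hap` are identical, so a sequence-level comparison would treat every call as a true positive. `partial_hap` matches neither the reference nor the benchmark haplotype, which is why the remaining calls end up as false positives and the benchmark variants as false negatives.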
