. 2019 Jun 3;20(1):116.

doi: 10.1186/s13059-019-1709-0.

Haplotype-aware diplotyping from noisy long reads

Jana Ebler^{1

2

3}, Marina Haukness⁴, Trevor Pesout⁴, Tobias Marschall^{5

6}, Benedict Paten⁷

Affiliations

¹ Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.
² Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany.
³ Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, Saarbrücken, Germany.
⁴ UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA.
⁵ Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany. t.marschall@mpi-inf.mpg.de.
⁶ Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany. t.marschall@mpi-inf.mpg.de.
⁷ UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA. benedict@soe.ucsc.edu.

PMID: 31159868
PMCID: PMC6547545
DOI: 10.1186/s13059-019-1709-0

Haplotype-aware diplotyping from noisy long reads

Jana Ebler et al. Genome Biol. 2019.

. 2019 Jun 3;20(1):116.

doi: 10.1186/s13059-019-1709-0.

Authors

Jana Ebler^{1

2

3}, Marina Haukness⁴, Trevor Pesout⁴, Tobias Marschall^{5

6}, Benedict Paten⁷

Affiliations

¹ Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.
² Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany.
³ Graduate School of Computer Science, Saarland University, Saarland Informatics Campus E1.3, Saarbrücken, Germany.
⁴ UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA.
⁵ Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany. t.marschall@mpi-inf.mpg.de.
⁶ Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, Germany. t.marschall@mpi-inf.mpg.de.
⁷ UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, CA, USA. benedict@soe.ucsc.edu.

PMID: 31159868
PMCID: PMC6547545
DOI: 10.1186/s13059-019-1709-0

Abstract

Current genotyping approaches for single-nucleotide variations rely on short, accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms are rapidly becoming more widespread, yet approaches for leveraging their long but error-prone reads for genotyping are lacking. Here, we introduce a novel statistical framework for the joint inference of haplotypes and genotypes from noisy long reads, which we term diplotyping. Our technique takes full advantage of linkage information provided by long reads. We validate hundreds of thousands of candidate variants that have not yet been included in the high-confidence reference set of the Genome-in-a-Bottle effort.

Keywords: Computational genomics; Diplotypes; Genotyping; Haplotypes; Long reads; Phasing.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Motivation and overview of diplotyping. a Gray sequences illustrate the haplotypes; the reads are shown in red and blue. The red reads originate from the upper haplotype, the blue ones from the lower. Genotyping each SNV individually would lead to the conclusion that all of them are heterozygous. Using the haplotype context reveals uncertainty about the genotype of the second SNV. b Clockwise starting top left: first, sequencing reads aligned to a reference genome are given as input; second, the read alignments are used to nominate candidate variants (red vertical bars), which are characterized by the differences to the reference genome; third, a hidden Markov model (HMM) is constructed where each candidate variant gives rise to one “row” of states, representing possible ways of assigning each read to one of the two haplotypes as well as possible genotypes (see the “Methods” section for details); forth, the HMM is used to perform *diplotyping*, i.e., we infer genotypes of each candidate variant as well as how the alleles are assigned to haplotypes

**Fig. 2**
Reach of short read and long read technologies. The callable and mappable regions for NA12878 spanning various repetitive or duplicated sequences on GRCh38 are shown. Feature locations are determined based on BED tracks downloaded from the UCSC Genome Browser [48]. Other than the Gencode regions [49, 50], all features are subsets of the Repeat Masker [51] track. Four coverage statistics for long reads (shades of red) and three for short reads (shades of blue) are shown. The labels “PacBio Mappable” and “Nanopore Mappable” describe areas where at least one primary read with GQ ≥ 30 has mapped, and “Long Read Mappable” describes where this is true for at least one of the long read technologies. “Long Read Callable” describes areas where both read technologies have coverage of at least 20 and less than twice the median coverage. “GIAB High Confidence,” “GATK Callable,” and “Short Read Mappable” are the regions associated with the evaluation callsets. For the feature-specific plots, the numbers on the right detail coverage over the feature and coverage over the whole genome (parenthesized)

**Fig. 3**
Precision and recall of MarginPhase on Nanopore and WhatsHap on PacBio datasets in GIAB high confidence regions. Genotype concordance (bottom) (wrt. GIAB high confidence calls) of MarginPhase (mp, top) on Nanopore and WhatsHap (wh, middle) on PacBio (PB). Furthermore, genotype concordance for the intersection of the calls made by WhatsHap on the PacBio and MarginPhase on the Nanopore reads is shown (bottom)

**Fig. 4**
Genotyping errors (with respect to GIAB calls) as a function of coverage. The full length reads were used for genotyping (blue), and additionally, reads were cut such as to cover at most two variants (red) and one variant (yellow)

**Fig. 5**
Confirming short-read variants. We examine all distinct variants found by our method, GIAB high confidence, GATK/HC, and FreeBayes. Raw variant counts appear on top of each section, and the percentage of total variants is shown at the bottom. a All variants. b Variants in GIAB high-confidence regions. c Variants outside GIAB high-confidence regions

**Fig. 6**
Alignment matrix. Here, the alphabet of possible alleles is the set of DNA nucleotides, i.e., Σ={A,C,G,T}

**Fig. 7**
Example graph. Left—an alignment matrix. Right—the corresponding directed graph representing the bipartitions of active rows and active non-terminal rows, where the labels of the nodes indicate the partitions, e.g., “1,2 /.” is shorthand for A=({1,2},{}})

**Fig. 8**
Genotyping HMM. Colored states correspond to bipartitions of reads and allele assignments at that position. States in C₁ and C₂ correspond to bipartitions of reads covering positions 1 and 2 or 2 and 3, respectively. In order to compute genotype likelihoods after running the forward-backward algorithm, states of the same color have to be summed up in each column

**Fig. 9**
The merger of two read partitioning HMMs with the same number of columns. Top and middle: two HMMs to be merged; bottom: the merged HMM. Transition and emission probabilities not shown

See this image and copyright information in PMC

References

1. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, et al. From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinforma. 2013;43(1):11–0. - PMC - PubMed
1. 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526(7571):68. - PMC - PubMed
1. Li W, Freudenberg J. Mappability and read length. Front Genet. 2014;5:381. - PMC - PubMed
1. Altemose N, Miga KH, Maggioni M, Willard HF. Genomic characterization of large heterochromatic gaps in the human genome assembly. PLoS Comput Biol. 2014;10(5):1003628. - PMC - PubMed
1. Porubsky D, Garg S, Sanders AD, Korbel JO, Guryev V, Lansdorp PM, Marschall T. Dense and accurate whole-chromosome haplotyping of individual genomes. Nat Commun. 2017;8(1):1293. - PMC - PubMed

Publication types

Actions
Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Haplotype-aware diplotyping from noisy long reads

Affiliations

Haplotype-aware diplotyping from noisy long reads

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources