Nat Biotechnol. 2018 Oct 22:10.1038/nbt.4277.
doi: 10.1038/nbt.4277. Online ahead of print.

De novo assembly of haplotype-resolved genomes with trio binning

Sergey Koren et al. Nat Biotechnol. 2018.

Abstract

Complex allelic variation hampers the assembly of haplotype-resolved sequences from diploid genomes. We developed trio binning, an approach that simplifies haplotype assembly by resolving allelic variation before assembly. In contrast with prior approaches, the effectiveness of our method improved with increasing heterozygosity. Trio binning uses short reads from two parental genomes to first partition long reads from an offspring into haplotype-specific sets. Each haplotype is then assembled independently, resulting in a complete diploid reconstruction. We used trio binning to recover both haplotypes of a diploid human genome and identified complex structural variants missed by alternative approaches. We sequenced an F1 cross between the cattle subspecies Bos taurus taurus and Bos taurus indicus and completely assembled both parental haplotypes with NG50 haplotig sizes of >20 Mb and 99.998% accuracy, surpassing the quality of current cattle reference genomes. We suggest that trio binning improves diploid genome assembly and will facilitate new studies of haplotype variation and inheritance.
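The NG50 figure quoted above is a contiguity statistic: the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the genome size. A minimal sketch of that calculation follows; the function name `ng50` and the toy lengths are illustrative assumptions, not taken from the paper.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 for a list of contig (or haplotig) lengths.

    Contigs are sorted longest first and summed until the running total
    reaches half of the (estimated) genome size; the length of the contig
    that crosses that threshold is the NG50.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0  # the assembly covers less than half of the genome


# Toy lengths (made up) for a genome assumed to be 100 units long.
example = ng50([40, 25, 15, 10, 5], genome_size=100)
```

Unlike N50, which uses the total assembly length, NG50 uses the genome size, so assemblies of different total size can be compared on the same scale.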

Conflict of interest statement

Competing financial interests

SBK is a current employee of Pacific Biosciences. All other authors declare no competing interests.

Figures

Figure 1. Outline of trio binning and haplotype assembly.
a) Two parents constitute four haplotypes including shared sequence in both parents (solid lines) and sequence unique to one parent (dashed lines). The offspring inherits a recombined haplotype from each parent (blue, paternal; red, maternal). b) Short-read sequencing of the parents identifies unique length-k subsequences (k-mers), which can be used to infer the origin of heterozygous alleles in the offspring’s diploid genome. c) Trio binning simplifies assembly by first partitioning long reads from the offspring into paternal and maternal sets based on these k-mers. Each haplotype is then assembled separately without the interference of heterozygous variants. Unassignable reads are homozygous and can be assigned to both sets or assembled separately. d) The resulting assemblies represent genome-scale haplotypes, and accurately recover both point and structural variation.
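The partitioning step in panels b and c can be sketched in a few lines. This is a simplified illustration, not the TrioCanu implementation: it assumes error-free parental reads, ignores reverse complements and k-mer count filtering, and uses hypothetical helper names (`kmers`, `parent_specific_kmers`, `bin_read`).

```python
def kmers(seq, k):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets from parental reads."""
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign an offspring read to the parent whose specific k-mers it shares most."""
    read_kmers = kmers(read, k)
    pat_hits = len(read_kmers & pat_only)
    mat_hits = len(read_kmers & mat_only)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    return "unassigned"  # no distinguishing k-mers, e.g. a homozygous region


# Toy data (made up): the parents differ by one base.
pat_only, mat_only = parent_specific_kmers(["ACGTACGTTA"], ["ACGTACCTTA"], k=5)
labels = [bin_read(r, pat_only, mat_only, k=5)
          for r in ["GTACGTTA", "GTACCTTA", "ACGTAC"]]
```

In this toy case the first read carries only paternal-specific 5-mers, the second only maternal-specific ones, and the third lies entirely in shared sequence, so it stays unassigned.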
Figure 2. Effect of data characteristics on trio binning.
a) Diploid assembly representations shown with homozygous alleles in black and heterozygous alleles (called “bubbles”) colored by haplotype. Graphical representations typically collapse homozygous alleles into a single sequence. A pseudo-haplotype is a path through the diploid graph that separates heterozygous alleles but does not preserve phase between loci. Complete haplotypes represent all alleles and preserve phase across the entire genome. Ability to assign sequencing reads to a haplotype depends on the zygosity of the genome, the sequencing read length, and the sequencing error rate. b) Log-log plot of minimum required read length (y-axis) such that there is a 99% probability of observing at least one haplotype-specific 21-mer per read (negative binomial distribution, Methods), dependent on the sequencing error rate (labels) and fraction of haplotype-specific 21-mers in the genome (x-axis). Dotted vertical lines mark the fraction of heterozygous 21-mers for H. sapiens and the B. taurus F1 cross.
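The relationship plotted in panel b can be approximated with a back-of-the-envelope model. The sketch below is not the negative binomial calculation from the paper's Methods; it assumes, for illustration only, that each k-mer position in a read is independently haplotype-specific with probability `specific_fraction` and error-free with probability `(1 - error_rate) ** k`. The function name and parameters are hypothetical.

```python
import math


def min_read_length(specific_fraction, error_rate, k=21, target=0.99):
    """Approximate read length needed to see >= 1 usable haplotype-specific k-mer.

    Simplifying assumption: k-mer positions are treated as independent trials,
    each succeeding with probability p = specific_fraction * (1 - error_rate)**k.
    """
    p = specific_fraction * (1.0 - error_rate) ** k
    if p <= 0.0:
        raise ValueError("no usable haplotype-specific k-mers under this model")
    if p >= 1.0:
        return k  # a single k-mer is always informative
    # Smallest n with 1 - (1 - p)**n >= target.
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p))
    # A read with n k-mer positions has n + k - 1 bases.
    return n + k - 1


# Example with made-up inputs: 1% haplotype-specific 21-mers, error-free reads.
length_clean = min_read_length(0.01, 0.0)
# Same genome, but a 10% per-base error rate (an illustrative value).
length_noisy = min_read_length(0.01, 0.10)
```

Under this model the required length grows as the fraction of haplotype-specific k-mers falls or the error rate rises, which matches the qualitative trend described in the caption.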
Figure 3. Read and assembly k-mer statistics for an Arabidopsis thaliana F1 hybrid.
a) GenomeScope k-mer count distributions for the F1 PacBio data corrected by Canu, and partitioned by haplotype and corrected by TrioCanu for the b) Col-0 and c) Cvi-0 haplotypes. GenomeScope reports an estimated genome size and SNP heterozygosity based on a model fit to the histogram. The dashed lines show k-mer peaks identified by GenomeScope; from left to right, they are the 1-copy (heterozygous), 2-copy (homozygous), and 3-copy and 4-copy (repeat) peaks. The k-mer distribution for all reads shows two clear peaks, characteristic of a diploid read set. In comparison, the binned PacBio data show a normal k-mer count distribution, characteristic of a haploid read set. d) Counts of Col-0 (x-axis) and Cvi-0 (y-axis) haplotype-specific k-mers in FALCON-Unzip and e) TrioCanu contigs (colored circles). FALCON-Unzip primary contigs switch between haplotypes, resulting in a mix of k-mers from both parents, whereas the FALCON-Unzip associated haplotigs are smaller but preserve local phase information. In comparison, TrioCanu haplotigs contain sequence from only a single haplotype and are automatically sorted into two complete haplotypes.
Figure 4. Haplotype variation in a diploid human genome.
a) Counts of structural variants between NA12878 haplotypes across the entire genome as reported by Assemblytics. Canu haplotypes (top, red) showed a balance of insertions and deletions, with peaks at ~300 bp and ~6 kbp corresponding to human Alu and LINE elements, respectively. In comparison, the Supernova pseudo-haplotypes (bottom, blue) were missing these larger structural variants. b) Ribbon visualization of MHC haplotypes for human reference sample NA12878 as assembled by TrioCanu from PacBio data (top) and Supernova from 10X Genomics data (bottom). Red bands indicate >95% identity between haplotypes; yellow bands <95% identity; and unaligned in white (gaps and indels). Genes are annotated in black if matching the known truth without error. TrioCanu captured more haplotype variation than Supernova, especially in the highly variable MHC class II region, which contains a long stretch of high sequence divergence (yellow). In addition to phasing the entire region, TrioCanu perfectly reconstructed all typed MHC genes on both haplotypes, with the exception of the paternal DQB1, which contained a single base indel (Supplementary Table 4). Supernova produced an overly homozygous reconstruction that incorrectly assembled a majority of genes and introduced false gene duplications (Supplementary Table 5). FALCON-Unzip correctly reconstructed the MHC genes but with a higher edit distance than TrioCanu (Supplementary Table 6). Canu (without binning) correctly reconstructed the more heterozygous class II genes but collapsed the class I genes (Supplementary Table 7).
Figure 5. Diploid assembly of a Bos taurus F1 hybrid.
Stacked k-mer histograms from KAT comparing a) TrioCanu and b) FALCON-Unzip k-mer counts to an independent Illumina dataset of the same individual. The x-axis bins are k-mer coverage in the Illumina dataset, and the y-axis is the frequency of those k-mers in the Illumina set colored by copy number in the assembly. The FALCON-Unzip distribution has more k-mers that do not appear in the Illumina data (arrows), a longer tail of 1-copy k-mers (red, collapsed haplotype), and slightly more 3-copy k-mers (green, duplicated haplotype). c) Alignment dotplot of the TrioCanu Angus and Brahman haplotypes in a highly heterozygous region containing multiple guanylate binding protein (GBP) genes. Relative to Brahman, the Angus haplotype is missing a ~140 kbp region containing GBP2, previously reported to be associated with muscularity (light green). The Angus haplotype also has a duplicated GBP6-like sequence (light blue) in a region associated with conformation score (genes marked in grey are highly divergent from known transcripts). The FALCON-Unzip assembly confirms the TrioCanu structure but is split into five primary contigs and four associated haplotigs of mixed origin (Supplementary Fig. 9).
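The kind of table behind the stacked histograms in panels a and b can be sketched as follows. This is a toy illustration of the idea of comparing read k-mer coverage with assembly copy number, not KAT's implementation or interface; the function names are hypothetical, and reverse complements are ignored.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every length-k substring across a collection of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def copy_number_spectrum(reads, contigs, k):
    """Tally distinct k-mers by (coverage in reads, copy number in assembly)."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(contigs, k)
    spectrum = Counter()
    for kmer, cov in read_counts.items():
        # Copy number 0 means the k-mer is missing from the assembly.
        spectrum[(cov, asm_counts.get(kmer, 0))] += 1
    for kmer, copies in asm_counts.items():
        if kmer not in read_counts:
            # Coverage 0 means the k-mer is absent from the read set.
            spectrum[(0, copies)] += 1
    return spectrum


# Toy data (made up), k = 4.
spectrum = copy_number_spectrum(["ACGTA", "ACGTA", "CGTAG"], ["ACGTAT"], k=4)
```

Each key is a pair of read coverage (the x-axis bin) and assembly copy number (the color), and each value is the number of distinct k-mers in that cell. Entries with coverage 0 correspond to assembly k-mers that do not appear in the read data.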
