Nat Commun. 2021 Apr 28;12(1):1935.
doi: 10.1038/s41467-020-20536-y.

Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C

Zev N Kronenberg et al. Nat Commun.

Abstract

Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without parental data, and it performs better in samples with higher heterozygosity: accuracy is 97% for cow and zebra finch, compared to 80-91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.

Conflict of interest statement

E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. [and was an SAB member of Pacific Biosciences, Inc. (2009–2013)]. S.B.K., Z.N.K., P.P., G.T.C., and R.J.H. are employees and shareholders of Pacific Biosciences, a company developing single-molecule sequencing technologies. S.T.S. and I.L. are employees and shareholders, and Z.N.K. and K.A.M. are shareholders, of Phase Genomics, a provider of services and products for Hi-C and other proximity-ligation methods. The remaining authors declare no competing interests.

Figures

Fig. 1. Overview of FALCON-Phase method.
a Partially phased long-read assembly consists of primary contigs (blue) and shorter alternate haplotigs (red). The region where a haplotig overlaps a primary contig is a phase block and is referred to as being unzipped because two haplotypes are resolved. Regions of the primary contig without associated haplotigs are referred to as collapsed because the haplotypes have low or no heterozygosity. b A haplotig placement file specifies primary contig coordinates where the haplotigs align. c This placement file is used to mince the primary contigs at the haplotig alignment start and end coordinates. Mincing defines the phase blocks (A–B haplotype pairs, blue and red) and collapsed haplotypes (gray). d Hi-C read pairs are mapped to the minced contigs and alignments are filtered to retain haplotype-specific mapping. e Phase blocks are assigned to state 0 or 1 using the phasing algorithm. f The output of FALCON-Phase is two full-length pseudo-haplotypes for phase 0 and 1. These sequences are of similar length to the original primary contig and the unzipped haplotypes are in phase with each other.
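The mincing step in panel c can be illustrated with a minimal Python sketch. It is not the FALCON-Phase implementation: the `Placement` and `Segment` types, the `mince` function, and the 0-based half-open coordinates are all assumptions made for illustration, and overlapping haplotig placements are simply rejected rather than resolved.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    """Hypothetical record for one line of a haplotig placement file."""
    haplotig: str
    primary: str
    start: int  # assumed 0-based, inclusive
    end: int    # assumed 0-based, exclusive


class Segment(NamedTuple):
    """One minced piece of a primary contig."""
    primary: str
    start: int
    end: int
    kind: str      # "phase_block" (unzipped) or "collapsed"
    haplotig: str  # paired haplotig name, or "" for collapsed regions


def mince(primary: str, length: int, placements: List[Placement]) -> List[Segment]:
    """Cut a primary contig at haplotig alignment start and end coordinates."""
    segments: List[Segment] = []
    cursor = 0  # first base of the primary contig not yet assigned to a segment
    own = sorted((p for p in placements if p.primary == primary),
                 key=lambda p: p.start)
    for p in own:
        if p.start < cursor:
            # Simplification of this sketch: overlaps are not resolved.
            raise ValueError(f"overlapping placement for {p.haplotig}")
        if p.start > cursor:
            # Gap between haplotigs: no alternate haplotype, so collapsed.
            segments.append(Segment(primary, cursor, p.start, "collapsed", ""))
        # Region covered by a haplotig: an A-B haplotype pair (phase block).
        segments.append(Segment(primary, p.start, p.end, "phase_block", p.haplotig))
        cursor = p.end
    if cursor < length:
        segments.append(Segment(primary, cursor, length, "collapsed", ""))
    return segments


# Toy example with made-up contig names and coordinates.
example = mince(
    "ctg1", 1000,
    [Placement("hap2", "ctg1", 600, 900), Placement("hap1", "ctg1", 100, 400)],
)
```

In this toy example the 1,000-base primary contig is cut into five pieces: three collapsed regions interleaved with two phase blocks, one per haplotig.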
Fig. 2. Phasing accuracy of contigs before (left) and after applying FALCON-Phase (right) to the contigs.
Parent-specific k-mer count from mother is on the x-axis and father on the y-axis. Contig size is indicated by size of the data point and well-phased contigs lie along the axes. Unphased primary contigs (blue) are large but contain a mixture of k-mer markers from mother and father. Haplotigs are mostly phased but shorter in length. After phasing by FALCON-Phase, phase 0 and phase 1 contigs are of similar length to the FALCON-Unzip primary contigs and have less mixing of parental markers within contigs. a Zebra finch contigs before phasing; b zebra finch contigs after phasing; c cow contigs before phasing; d cow contigs after phasing; e HG00733 contigs before phasing; f HG00733 contigs after phasing; g mHomSap4 contigs before phasing; h mHomSap3 contigs after phasing.
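The idea that well-phased contigs lie along the axes can be expressed numerically. The Python sketch below is an illustrative assumption, not the accuracy metric reported in the paper: `parental_purity` and `weighted_purity` are hypothetical helper names, and the inputs are assumed to be per-contig counts of mother-specific and father-specific k-mers.

```python
import math
from typing import Iterable, Tuple


def parental_purity(maternal_kmers: int, paternal_kmers: int) -> float:
    """Fraction of parent-specific k-mers that come from the majority parent.

    1.0 means the contig sits on one axis of the plot (fully phased);
    0.5 means an even mixture of maternal and paternal markers.
    Returns NaN when the contig carries no parent-specific k-mers.
    """
    total = maternal_kmers + paternal_kmers
    if total == 0:
        return math.nan
    return max(maternal_kmers, paternal_kmers) / total


def weighted_purity(contigs: Iterable[Tuple[int, int, int]]) -> float:
    """Length-weighted mean purity over (length, maternal, paternal) tuples.

    Contigs without any parent-specific k-mers are skipped.
    """
    weighted_sum = 0.0
    total_length = 0
    for length, maternal, paternal in contigs:
        purity = parental_purity(maternal, paternal)
        if math.isnan(purity):
            continue
        weighted_sum += length * purity
        total_length += length
    return weighted_sum / total_length if total_length else math.nan


# Toy values: one mixed contig and one contig with paternal markers only.
summary = weighted_purity([(100, 90, 10), (300, 0, 50)])
```

Weighting by contig length mirrors the figure's use of point size, so that a few large, mixed primary contigs are not outweighed by many short, well-phased haplotigs.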
Fig. 3. Phasing accuracy of scaffolds before (left) and after applying FALCON-Phase (right).
Parent-specific k-mers from mother are on the x-axis and father on the y-axis. Scaffold size is indicated by size of the data point and well-phased contigs lie along the axes. Only the phase 0 contigs from FALCON-Phase were scaffolded. Scaffolds after a second round of phasing by FALCON-Phase show greater separation, indicating each scaffold contains a higher proportion of markers from one parent. a Zebra finch scaffolds before phasing; b zebra finch scaffolds after phasing; c cow scaffolds before phasing; d cow scaffolds after phasing; e HG00733 scaffolds before phasing; f HG00733 scaffolds after phasing; g mHomSap4 scaffolds before phasing; h mHomSap3 scaffolds after phasing.
