Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Mar;39(3):302-308.
doi: 10.1038/s41587-020-0719-5. Epub 2020 Dec 7.

Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

Affiliations

Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

David Porubsky et al. Nat Biotechnol. 2021 Mar.

Abstract

Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing1,2 with continuous long-read or high-fidelity3 sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.

PubMed Disclaimer

Conflict of interest statement

E.E.E. was on the scientific advisory board of DNAnexus (2012–2020).

Figures

Fig. 1
Fig. 1. Overview of the genome assembly pipeline.
a, In a single Strand-seq library, only the template DNA strand (solid line) is sequenced for each parental homologous chromosome. b, Template strands of each homologue in a given diploid cell are randomly inherited by daughter cells (‘+’ positive strand, teal—Crick and ‘−’ negative strand, orange—Watson), resulting in three possible template strand states for homologous chromosomes (height of bars plotted along each chromosome represents the number of ‘+’ and ‘−’ reads mapped in each genomic bin): WC, one Crick and one Watson strand represented for given homologues; WW, only Watson template strands represented; or CC, only Crick template strands represented. c, Unassigned contigs follow the same pattern of template strand state inheritance based on the homologue they belong to. d, Contig order can be inferred based on low-frequency changes in a template strand state resulting from sister chromatid exchange (SCE) events in the parental cell: contigs that are closer to each other tend to share the same template strand state more often than more distant contigs. e, Regions with WC strand state are haplotype informative and can be assembled into continuous haplotypes. f, Haplotypes can then be used to split long reads into their respective homologues. g, Generation of long-read (HiFi/CLR/ONT)-based assemblies: 1) producing squashed assemblies; 2) assigning contigs to clusters using Strand-seq (StrandS); 3) phasing clustered assemblies using the combination of Strand-seq and long PacBio reads; and 4) partitioning and reassembling of haplotype-specific PacBio reads and polishing of the final diploid assemblies.
Fig. 2
Fig. 2. Reference-free scaffolding and phasing of squashed assembly for HG00733.
a, Each contig represents a range based on mapping coordinates on GRCh38. Contigs are colored based on cluster identity determined by SaaRclust. In an ideal scenario, there is a single color for each chromosome. b, The size of the longest haplotype block per chromosome is shown as red bars, with the remaining haplotype blocks of negligible length. The size of the point at the bottom of each bar reflects the number of haplotype blocks in each cluster. For perspective, the real size of each chromosome for GRCh38 is plotted as a horizontal solid line. c, The percentage of PacBio reads successfully assigned to either H1 (teal) or H2 (yellow). Reads that could not be assigned to either haplotype are shown in red.
Fig. 3
Fig. 3. Phased assembly analysis and common assembly breaks.
a, Each 1-Mbp block of phased contigs (Freeze 1.1; ‘Data availability’) are assigned to one of the parental genomes using SNV data from the parents: maternal segments (HG00732) are shown in blue; paternal segments (HG00731) are shown in yellow. b, Genome-wide summary of SNV density counted in 500-kbp genomic bins sliding by 10 kbp. The HLA locus on chromosome 6 is labeled as ‘HLA’. c, An ideogram shows aligned contigs separately for H1 and H2. Subsequent contigs are plotted as discontiguous rectangles along each chromosome. Positions of common breaks (n = 222) between Flye (CLR reads) and Peregrine (HiFi reads) assemblies are highlighted by horizontal lines and their overlap with various genomic features, such as SDs, is marked by colored dots. Note: owing to the difficulty of aligning contigs continuously over the centromeres, we flag these regions as unresolved. Inset, a bar plot summarizing the total counts for each genomic feature across all 222 assembly breaks. Unannot, unannotated assembly breaks. d, An ideogram shows genomic positions of 154 common assembly breaks shared by multiple assembles. Gray rectangles represent centromeric positions, whereas white rectangles point to genome gaps. e, Effect of coverage and read length on assembly contiguity. Points connected by lines represent the N50s of Peregrine assemblies for CHM libraries as a function of coverage (blue, CHM13, 10.9 kbp; orange, CHM1, 11.9 kbp; purple, CHM13, 14.2 kbp; brown, CHM13; 17.8 kbp). These assemblies show what contiguity is attainable with Peregrine given different read lengths and coverages in a genome with only one haplotype. Highlighted in red and green are the two Peregrine assemblies of the haplotypes of HG00733 (red, H1, 13.5 kbp; green, H2, 13.5 kbp).

References

    1. Falconer E, et al. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat. Methods. 2012;9:1107–1112. doi: 10.1038/nmeth.2206. - DOI - PMC - PubMed
    1. Sanders AD, Falconer E, Hills M, Spierings DCJ, Lansdorp PM. Single-cell template strand sequencing by Strand-seq enables the characterization of individual homologs. Nat. Protoc. 2017;12:1151–1176. doi: 10.1038/nprot.2017.029. - DOI - PubMed
    1. Wenger AM, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 2019;37:1155–1162. doi: 10.1038/s41587-019-0217-9. - DOI - PMC - PubMed
    1. Levy S, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254. - DOI - PMC - PubMed
    1. Schneider VA, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27:849–864. doi: 10.1101/gr.213611.116. - DOI - PMC - PubMed

Publication types