Single haplotype assembly of the human genome from a hydatidiform mole

Affiliations

¹ The Genome Institute at Washington University, St. Louis, Missouri 63108, USA;
² National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
³ Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA; Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA;
⁴ Department of Pathology and Human Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA;
⁵ Personalis, Inc., Menlo Park, California 94025, USA.

PMID: 25373144
PMCID: PMC4248323
DOI: 10.1101/gr.180893.114

Karyn Meltz Steinberg et al. Genome Res. 2014 Dec.

Genome Res. 2014 Dec;24(12):2066-76.

doi: 10.1101/gr.180893.114. Epub 2014 Nov 4.

Abstract

A complete reference assembly is essential for accurately interpreting individual genomes and associating variation with phenotypes. While the current human reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can improve assembly, even the longest available reads do not resolve all regions. To overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We used several resources derived from this DNA, including a set of end-sequenced and indexed BAC clones and 100× Illumina whole-genome shotgun (WGS) sequence coverage. We used the WGS sequence and the GRCh37 reference assembly to create an assembly of the CHM1 genome. We subsequently incorporated 382 finished BAC clone sequences to generate a draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity. However, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions. Most of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future. This publicly available assembly will be integrated into the Genome Reference Consortium curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

Figures

**Figure 1.**
Comparison of contig count and contig N50 between CHM1_1.1 (GenBank GCA_000306695.2) and the HuRef (J. Craig Venter assembly; GenBank GCA_000002125.2), ALLPATHS (GenBank AEKP00000000.1), and YH_2.0 (GenBank GCA_000004845.2) WGS assemblies. CHM1_1.1 has only 10%–20% as many contigs as the other assemblies, and its contig N50 is 1.5 to six times larger.
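
For readers unfamiliar with the metric, contig N50 is the contig length at which contigs of that length or longer account for at least half of the assembled bases. The following is a minimal sketch of that standard definition, not the pipeline used to produce the figure; the function name `contig_n50` and the toy contig lengths are hypothetical and are not values from the paper.

```python
def contig_n50(lengths):
    """Return the contig N50 for a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (not data from the paper): 300 bases in total.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_lengths))
```

Because the metric depends only on the sorted length distribution, an assembly with fewer, longer contigs yields a higher N50 even when the total assembled length is similar.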

**Figure 2.**
(A) WGS assembly from the first pass (CHM1_1.0; GCF_000306695.1, bronze line) on Chromosome 1p12 (NC_018912.1: 121,050,000–121,400,000) demonstrated a gap (gray box) in the assembly (assembly name: AMYH010000980.1, green lines). Using MEGABLAST, two CH17 clones (AC247039.2 and AC253572.3, red lines) aligned to the region and appeared to span the gap. (B) By incorporating these BAC sequences into the assembly, the gap was subsequently resolved in CHM1_1.1 (NC_018912.2: 121,050,000–121,650,000). The tiling path components, FP325311.11, AC241952.2, AC247039.2, AC253572.3, and AC241377.3, indicate the clone names used to resolve the gap. The clones from A are indicated in red while the other clones are in purple. The final assembly–assembly alignment is indicated in purple, showing the gap resolution.

**Figure 3.**
(A) Comparison, by chromosome, of segmental duplications (SDs) in the GRCh37 and CHM1_1.1 assemblies as predicted by WGAC analysis. The duplication content is comparable between the GRCh37 and CHM1_1.1 assemblies, indicating good assembly quality. (B) Venn diagram of SDs in the GRCh37 and CHM1_1.1 assemblies, showing that most duplications are shared between the assemblies.

**Figure 4.**
Functional consequences of CHM1 heterozygous variants not in repetitive sequence (HNR variants). Approximately 97% of HNR variants are intergenic or intronic. Of the remaining 3% of variants, ∼48% are in the 3′ or 5′ UTR, 17% are silent, and 35% are coding (missense, nonsense, essential splice site).

**Figure 5.**
Overview of the Chr 11 (NC_018922.2) 1.9-Mb region, exhibiting three alignment bins with a large number of PacBio “cliff” reads where the alignment coverage dropped off sharply. WGS component (light green lines) boundaries flanked by such reads are marked with red dashed lines. The ends of each component at the boundary are labeled with letters to show orientation. Pairs of alignments corresponding to three different PacBio reads are marked in yellow, green, and dark blue. These alignments overlap by < 10% on each of the reads. The split alignments for these three reads suggest that the two WGS components marked in purple should be inverted and translocated as indicated by the arrow at the *top* of the image. The other PacBio reads in these bins exhibit the same pattern of split alignments, which supports the proposed reordering and orientation of the WGS components. The *bottom* light green lines show a proposed tiling path with the orientation corrected; the letters indicate where each end of the initial tiling path components should be placed.
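
The "< 10% overlap" criterion in this caption can be illustrated with a small calculation. The sketch below assumes each alignment is described by a half-open interval in read coordinates and that the overlap is measured relative to the full read length; both conventions, the function names, and the example coordinates are assumptions made for illustration and are not taken from the paper.

```python
def read_overlap_fraction(aln_a, aln_b, read_length):
    """Fraction of a read covered by both alignments.

    aln_a and aln_b are (start, end) half-open intervals in read
    coordinates; read_length is the full length of the read.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    start = max(aln_a[0], aln_b[0])
    end = min(aln_a[1], aln_b[1])
    overlap = max(0, end - start)  # zero when the intervals are disjoint
    return overlap / read_length


def is_split_pair(aln_a, aln_b, read_length, max_overlap=0.10):
    """True when two alignments of one read overlap by less than max_overlap."""
    return read_overlap_fraction(aln_a, aln_b, read_length) < max_overlap


# Hypothetical 10-kb read whose two alignments share 200 bases.
print(read_overlap_fraction((0, 5000), (4800, 10000), 10000))
print(is_split_pair((0, 5000), (4800, 10000), 10000))
```

Pairs that pass this check place the two halves of a single read at different positions in the assembly, which is the kind of evidence the caption uses to propose reordering the WGS components.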

Grants and funding

P01 HG004120/HG/NHGRI NIH HHS/United States
