Direct determination of diploid genome sequences

Neil I Weisenfeld¹, Vijay Kumar¹, Preyas Shah¹, Deanna M Church¹, David B Jaffe¹

PMID: 28381613
PMCID: PMC5411770
DOI: 10.1101/gr.214874.116

Neil I Weisenfeld et al. Genome Res. 2017 May.

Genome Res. 2017 May;27(5):757-767.

doi: 10.1101/gr.214874.116. Epub 2017 Apr 5.

Affiliation

¹ 10x Genomics, Pleasanton, California 94566, USA.

Erratum in

Corrigendum: Direct determination of diploid genome sequences.
Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Genome Res. 2018 Apr;28(4):606.1. doi: 10.1101/gr.235812.118. PMID: 29610250.

Abstract

Determining the genome sequence of an organism is challenging, yet fundamental to understanding its biology. Over the past decade, thousands of human genomes have been sequenced, contributing deeply to biomedical research. In the vast majority of cases, these have been analyzed by aligning sequence reads to a single reference genome, biasing the resulting analyses, and in general, failing to capture sequences novel to a given genome. Some de novo assemblies have been constructed free of reference bias, but nearly all were constructed by merging homologous loci into single "consensus" sequences, generally absent from nature. These assemblies do not correctly represent the diploid biology of an individual. In exactly two cases, true diploid de novo assemblies have been made, at great expense. One was generated using Sanger sequencing, and one using thousands of clone pools. Here, we demonstrate a straightforward and low-cost method for creating true diploid de novo assemblies. We make a single library from ∼1 ng of high molecular weight DNA, using the 10x Genomics microfluidic platform to partition the genome. We applied this technique to seven human samples, generating low-cost HiSeq X data, then assembled these using a new "pushbutton" algorithm, Supernova. Each computation took 2 d on a single server. Each yielded contigs longer than 100 kb, phase blocks longer than 2.5 Mb, and scaffolds longer than 15 Mb. Our method provides a scalable capability for determining the actual diploid genome sequence in a sample, opening the door to new approaches in genomic biology and medicine.

Figures

**Figure 1.**
Lines in an assembly graph. Each edge represents a DNA sequence. (A) Blue portion describes a line in an assembly graph, which is an acyclic graph part bounded on both ends by single edges. The line alternates between five common segments and four bubbles, three of which have two branches. The third bubble is more complicated. The entire graph may be partitioned so that each of its edges lies in a unique line (allowing for degenerate cases, including single edge lines, and circles). (B) The same line, but now each bubble has been replaced by a bubble consisting of all its paths. After this change, each bubble consists only of parallel edges.
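The bubble replacement described in panel B can be illustrated with a minimal sketch. It assumes a bubble is given as a small acyclic set of edges between one entry vertex and one exit vertex; the tuple layout, the function name `bubble_paths`, and the example sequences are hypothetical and are not taken from Supernova. Edge sequences are simply concatenated, which ignores the overlaps a real assembly graph would carry.

```python
from collections import defaultdict


def bubble_paths(edges, source, sink):
    """Enumerate every source-to-sink path through an acyclic bubble.

    edges: list of (from_vertex, to_vertex, sequence) tuples (hypothetical layout).
    Returns one concatenated sequence per path; these play the role of the
    parallel edges that replace the bubble in panel B.
    """
    outgoing = defaultdict(list)
    for u, v, seq in edges:
        outgoing[u].append((v, seq))

    paths = []
    stack = [(source, "")]  # depth-first walk; terminates because the bubble is acyclic
    while stack:
        vertex, seq_so_far = stack.pop()
        if vertex == sink:
            paths.append(seq_so_far)
            continue
        for nxt, seq in outgoing[vertex]:
            stack.append((nxt, seq_so_far + seq))
    return sorted(paths)


# Illustrative bubble that is not a simple two-branch case:
# one branch splits again before rejoining at the exit vertex.
complex_bubble = [
    ("s", "a", "AC"),
    ("s", "b", "AG"),
    ("a", "t", "TT"),
    ("a", "t", "TA"),
    ("b", "t", "TT"),
]
parallel_edges = bubble_paths(complex_bubble, "s", "t")
```

Here the bubble has three distinct paths, so it would be replaced by three parallel edges.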

**Figure 2.**
Supernova assemblies encode diploid genome architecture. Each edge represents a sequence. Blue represents one parental allele, and gold represents the other. Megabubble arms represent alternative parental alleles at a given locus, whereas sequences between megabubbles are homozygous (or appear so to Supernova). Successive megabubbles are not phased relative to each other. Smaller scale features appear as gaps and bubbles.

**Figure 3.**
Representation of Supernova assemblies as FASTA. Several styles are depicted. (A) The raw style represents every edge in the assembly as a FASTA record (red segments). These include microbubble arms and also gaps (printed as records comprising 100 Ns for gaps bridged by read pairs, or a larger number, the estimated gap size) (Supplemental Note 5). Unresolved cycles are replaced by a path through the cycle, followed by 10 Ns. Bubbles and gaps generally appear once per 10–20 kb; consequently, FASTA records from A are much shorter (∼100 times) than those from *B, C,* and D. For each edge in the raw graph, there is also an edge written to the FASTA file representing the reverse complement sequence. For the remaining output styles, we flatten each microbubble by selecting the branch having highest coverage, merge gaps with adjacent sequences (leaving Ns), and drop reverse complement edges. (B) In this style each megabubble arm corresponds to a FASTA record, as does each intervening sequence. (C) The pseudohap style generates a single record per scaffold. As compared to the megabubble style, in the example, seven red edges are seen on *top* (corresponding to seven FASTA records) that are combined into a single FASTA record in the pseudohap style. Megabubble arms are chosen arbitrarily so many records will mix maternal and paternal alleles. (D) This style is like the pseudohap option, except that for each scaffold, two “parallel” pseudohaplotypes are created and placed in separate FASTA files.
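The flattening rules in this caption can be sketched in a few lines. The scaffold representation below (a list of `(kind, payload)` pairs), the function names, and the toy sequences are assumptions made for illustration, not Supernova's data model or output code. The sketch also assumes that megabubble arms contain no nested microbubbles or gaps.

```python
def flatten_microbubble(branches):
    """Select the branch with the highest coverage.

    branches: list of (sequence, coverage) pairs (hypothetical layout).
    """
    return max(branches, key=lambda branch: branch[1])[0]


def gap_sequence(estimated_size=None):
    """Return 100 Ns for a gap bridged by read pairs (no size estimate),
    otherwise as many Ns as the estimated gap size."""
    return "N" * (100 if estimated_size is None else estimated_size)


def two_pseudohaplotypes(scaffold):
    """Build two parallel pseudohaplotype sequences for one scaffold.

    scaffold: list of (kind, payload) pairs, where kind is one of
    "seq", "microbubble", "gap", or "megabubble" (hypothetical layout).
    Successive megabubbles are not phased relative to each other, so
    which arm lands in which output is arbitrary.
    """
    hap1, hap2 = [], []
    for kind, payload in scaffold:
        if kind == "megabubble":
            arm1, arm2 = payload
            hap1.append(arm1)
            hap2.append(arm2)
            continue
        if kind == "seq":
            piece = payload
        elif kind == "microbubble":
            piece = flatten_microbubble(payload)
        elif kind == "gap":
            piece = gap_sequence(payload)
        else:
            raise ValueError(f"unknown element kind: {kind}")
        hap1.append(piece)
        hap2.append(piece)
    return "".join(hap1), "".join(hap2)


toy_scaffold = [
    ("seq", "ACGT"),
    ("microbubble", [("A", 12), ("G", 30)]),
    ("gap", None),
    ("megabubble", ("TTT", "TCT")),
    ("seq", "GG"),
]
hap_a, hap_b = two_pseudohaplotypes(toy_scaffold)
```

Taking only `hap_a` corresponds to the single-record pseudohap style in panel C, while writing `hap_a` and `hap_b` to separate files corresponds to panel D.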

**Figure 4.**
Alignment of Supernova assembly to finished sequence from the same sample. GenBank sequence AC004551.1 for finished clone RPCI1-71H24 has length 162,346 bases, and its reverse complement perfectly matches GRCh37. The clone encompasses a region of Neandertal origin (Mendez et al. 2013). Both the clone and assembly F (Table 1) represent DNA from the same HGP donor. The clone matches a region of which 96% is between two megabubbles in the assembly, thus represented as homozygous. The alignment of the assembly to the clone region on GRCh37 is shown. Each line pair shows the assembly on *top* and the reference on the *bottom*. (Yellow) abbreviated, perfectly matching stretches; (green) mismatched bases; (blue) indels; (cyan) indels, but not present in comparison to raw graph; (red) captured gap: signified by 34 Ns (actual number in assembly is 100); assembly region also has two cycles, each suffixed by 10 Ns in output, not shown. In these cases the flattened sequence for the cycle exactly matches the reference.

References

1. The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74.
1. Adey A, Kitzman JO, Burton JN, Daza R, Kumar A, Christiansen L, Ronaghi M, Amini S, Gunderson KL, Steemers FJ, et al. 2014. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res 24: 2041–2049.
1. Anantharaman T, Mishra B. 2001. False positives in genome map assembly and sequence validation. In Algorithms in bioinformatics (ed. Gascuel O, Moret BM), pp. 27–40. Springer, Berlin, Germany.
1. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE. 2002. Recent segmental duplications in the human genome. Science 297: 1003–1007.
1. Bovee D, Zhou Y, Haugen E, Wu Z, Hayden HS, Gillett W, Tuzun E, Cooper GM, Sampas N, Phelps K, et al. 2008. Closing gaps in the human genome with fosmid resources generated from multiple individuals. Nat Genet 40: 96–101.
