In vitro, long-range sequence information for de novo genome assembly via transposase contiguity

Affiliations

¹ Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA;
² Illumina, Inc., Advanced Research Group, San Diego, California 92122, USA.
³ Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA; shendure@uw.edu.

PMID: 25327137
PMCID: PMC4248320
DOI: 10.1101/gr.178319.114

Andrew Adey et al. Genome Res. 2014 Dec;24(12):2041-9. doi: 10.1101/gr.178319.114. Epub 2014 Oct 19.


Abstract

We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries composed of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to >1 megabase. These pools are "subhaploid," in that the lengths of the fragments contained in each pool sum to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data are mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate "joins" are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences.
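
The abstract reports gains in scaffold N50. As a reminder of what that statistic measures, here is a minimal sketch of the standard N50 calculation. The function name `n50` and the toy sequence lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together hold at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly: five contigs before scaffolding, and the
# same bases after two joins have merged them into three scaffolds.
contig_lengths = [100, 80, 60, 40, 20]
scaffold_lengths = [180, 100, 20]

before = n50(contig_lengths)
after = n50(scaffold_lengths)
print(before, after, after / before)
```

In this toy case the total number of bases is unchanged, so the N50 rises only because joins concentrate more sequence into fewer, longer scaffolds.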

Figures

**Figure 1.**
CPT-seq method and performance. (A) High molecular weight (HMW) genomic DNA reacted with hyperactive Tn5 transposase loaded with indexed adaptors. After the transposase complex fragments the DNA and appends the indexed adaptors, the enzyme remains tightly bound to the DNA, such that library molecules derived from the same HMW genomic DNA molecule remain physically linked. Once the transposase is removed by denaturation, PCR amplification of viable templates (gray boxes) can be performed. (B) Schematic of two tier indexing. A 96-plex indexed tagmentation is performed (but without removing the transposase), followed by pooling, mixing, and redistribution to 96 wells. These new pools are subjected to removal of the transposase, 96-plex indexed PCR and then pooling to a single sequencing library. Individual molecules within the final library have indices corresponding to both the pool in which their originating HMW genomic DNA fragment was present during tagmentation (96 indices) as well as during PCR (96 indices), such that there are effectively 96 × 96 = 9216 compartments. (C) Representation of coverage profiles for indexed fragment pools, i.e., compartments (*top*) and trimodal distribution of adjacently aligning reads within individual compartments. The first peak (∼100 bp; red) corresponds to simple read pairs; the second peak (∼3.2 kbp; green) corresponds to reads originating from the same HMW genomic DNA fragment; the third peak (∼1 Mbp; blue) corresponds to reads originating from different HMW genomic DNA fragments. (D) Distribution of estimated HMW genomic DNA fragment lengths for CPT-seq of GM12878. The mean fragment size is 33.9 kbp, but it is a broad distribution and nearly 1M fragments are >100 kbp.
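
The two-tier indexing in panel B can be pictured as mapping each read's pair of indices to one of 96 × 96 = 9216 compartments. The sketch below shows one possible encoding; the zero-based numbering and the names `compartment_id` and `split_compartment` are assumptions made for illustration, not the authors' conventions.

```python
N_TAGMENTATION = 96  # indices added during tagmentation (first tier)
N_PCR = 96           # indices added during PCR (second tier)


def compartment_id(tag_index, pcr_index):
    """Combine the two zero-based indices into one compartment number."""
    if not (0 <= tag_index < N_TAGMENTATION and 0 <= pcr_index < N_PCR):
        raise ValueError("index out of range")
    return tag_index * N_PCR + pcr_index


def split_compartment(cid):
    """Recover (tag_index, pcr_index) from a compartment number."""
    return divmod(cid, N_PCR)


total_compartments = N_TAGMENTATION * N_PCR
print(total_compartments)            # number of virtual compartments
print(compartment_id(3, 7))          # one example compartment
print(split_compartment(compartment_id(3, 7)))
```

Reads that carry the same pair of indices are treated as coming from the same compartment, which is what lets reads be grouped by originating pool after sequencing.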

**Figure 2.**
*fragScaff* assembly method. (A) The ends of contigs in a de novo genome assembly (gray boxes) are defined as nodes, and the subsets of the 9216 CPT-seq compartments, i.e., indexed pools, containing reads that align to each node are identified. The fraction of shared compartments between every possible pair of nodes is calculated. Pairs of nodes that are truly adjacent to one another in the genome are expected to exhibit excess sharing with respect to CPT-seq compartments as a result of HMW genomic DNA fragments that bridge the gap in the de novo genome assembly. Nonadjacent pairs of nodes will co-occur in a small fraction of compartments by chance, as each contains HMW genomic fragments that cover ∼10% of the genome. (B) The fraction of shared compartments is calculated for all possible pairs of nodes, and distributions are generated for each node. Outlier nodes in each distribution are identified assuming normality and using a P-value cutoff. If a link is reciprocated, i.e., if two nodes are each outliers in the other’s distribution, it is stored as an edge. (C) Subgraphs are reduced to their minimum spanning tree (MST), and the longest path (Trunk) is found. Branches (light nodes) are then placed to produce the final output scaffold. (D) Size distribution of gaps between properly linked contigs. Boxes indicate joins spanning gaps just beyond the 2.5-kbp mate-pair library (red), ∼6 kbp L1 repeat elements (green), and joins longer than 35 kbp, which cannot be achieved via fosmid mate-pair libraries (blue; n = 664).
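
The steps in panels A to C can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the fragScaff implementation: the shared fraction is taken as intersection over union, a one-sided z-score cutoff stands in for the P-value cutoff, edge weights for the spanning tree are derived from the shared fraction, and the trunk is the path with the most nodes. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def shared_fraction(a, b):
    """Shared compartments as |A & B| / |A | B| (normalisation assumed)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reciprocal_edges(node_pools, z_cutoff=2.0):
    """Keep node pairs that are each high outliers in the other's distribution."""
    nodes = list(node_pools)
    frac = {n: {} for n in nodes}
    for u, v in combinations(nodes, 2):
        frac[u][v] = frac[v][u] = shared_fraction(node_pools[u], node_pools[v])

    def outliers(n):
        values = list(frac[n].values())
        if len(values) < 2:
            return set()
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            return set()
        # Assumption: a one-sided z-score test replaces the P-value cutoff.
        return {m for m, f in frac[n].items() if (f - mu) / sd > z_cutoff}

    hits = {n: outliers(n) for n in nodes}
    return {(u, v): frac[u][v] for u, v in combinations(nodes, 2)
            if v in hits[u] and u in hits[v]}


def farthest_path(adj, start):
    """Breadth-first search in a tree; path from a farthest node back to start."""
    parent = {start: None}
    queue = [start]
    for node in queue:  # the list grows while we iterate over it
        for neighbour in adj[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    path, node = [], queue[-1]
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def mst_trunk(edges):
    """Kruskal spanning tree of one connected subgraph, then its longest path.
    Strong links are preferred (equivalent to minimising 1 - shared fraction)."""
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    adj = {}
    for (u, v), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:  # adding this edge does not create a cycle
            root[ru] = rv
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    one_end = farthest_path(adj, next(iter(adj)))[0]
    trunk = farthest_path(adj, one_end)
    branches = [n for n in adj if n not in trunk]
    return trunk, branches


# Hypothetical nodes: A/B and C/D share most compartments, E-H share none.
pools = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {10, 11, 12, 13},
         "D": {10, 11, 12, 14}, "E": {20, 21}, "F": {30, 31},
         "G": {40, 41}, "H": {50, 51}}
edges = reciprocal_edges(pools)

# Hypothetical subgraph with one cycle (a-c) and one side branch (x).
demo_links = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "d"): 0.7,
              ("d", "e"): 0.6, ("a", "c"): 0.3, ("c", "x"): 0.2}
trunk, branches = mst_trunk(demo_links)
print(edges, trunk, branches)
```

In the toy subgraph the weak a-c link closes a cycle and is dropped, leaving a chain with x as a branch node to be placed relative to the trunk afterward.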

**Figure 3.**
Misassembly detection using CPT-seq. (A) Three regions of assembled scaffolds representing various misassembly detections are shown. For each region, the set of CPT-seq indexed pools that cover each 5-kbp window (x-axis) was determined. The shared fraction of indexed pools between immediately adjacent windows (blue) and for windows one apart (purple) is plotted. Subregions for which both shared fraction values were in the bottom fifth percentile overall were called as potential misassemblies. False positive misassembly calls (i.e., no misassembly is actually present) overwhelmingly consisted of an isolated window (green shading), whereas multiple consecutive windows with low shared fraction values corresponded to misassemblies by *fragScaff* (yellow shading) or in the initial input assembly (red shading). (B) Breakdown of regions called as potentially misassembled by this approach (*left*) versus a randomly selected set of windows for comparison (*right*).
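
The window-based screen in panel A can be sketched as follows. This is a rough illustration rather than the authors' code: the shared fraction is again assumed to be intersection over union, the percentile uses the nearest-rank definition, and the names `call_breaks` and `low_cutoff` and the toy windows are hypothetical.

```python
from math import ceil


def shared_fraction(a, b):
    """Shared indexed pools as |A & B| / |A | B| (normalisation assumed)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def low_cutoff(values, pct):
    """Nearest-rank percentile, one of several common definitions."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def call_breaks(window_pools, pct=5.0):
    """Flag window i when both its adjacent (i, i+1) and one-apart (i, i+2)
    shared fractions fall at or below the bottom-pct cutoff, then merge
    consecutive flagged windows into (first, last) runs."""
    n = len(window_pools)
    adjacent = [shared_fraction(window_pools[i], window_pools[i + 1])
                for i in range(n - 1)]
    one_apart = [shared_fraction(window_pools[i], window_pools[i + 2])
                 for i in range(n - 2)]
    cut_adjacent = low_cutoff(adjacent, pct)
    cut_one_apart = low_cutoff(one_apart, pct)
    flagged = [i for i in range(n - 2)
               if adjacent[i] <= cut_adjacent and one_apart[i] <= cut_one_apart]
    runs = []
    for i in flagged:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


# Hypothetical scaffold of 40 windows whose pool sets overlap smoothly,
# except windows 20 and 21, which carry unrelated pools (a misjoin).
windows = [set(range(i, i + 10)) for i in range(40)]
windows[20] = set(range(1000, 1010))
windows[21] = set(range(2000, 2010))

runs = call_breaks(windows)
print(runs)
```

Following the caption, a run covering a single window would be treated with suspicion as a likely false positive, whereas a run spanning several consecutive windows is a stronger misassembly candidate.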
