Genome Res. 2019 May;29(5):798-808.
doi: 10.1101/gr.245126.118. Epub 2019 Apr 2.

Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly

Ou Wang et al. Genome Res. 2019 May.

Abstract

Here, we describe single-tube long fragment read (stLFR), a technology that enables sequencing of data from long DNA molecules using economical second-generation sequencing technology. It is based on adding the same barcode sequence to subfragments of the original long DNA molecule (DNA cobarcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically nonredundant cobarcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique cobarcoding of more than 8 million 20- to 300-kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high-quality variant calling and phase block lengths up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole-genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
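As a rough plausibility check on "practically nonredundant cobarcoding", the sketch below assumes that fragments land on barcodes independently and uniformly at random. This is a simple Poisson occupancy model, not the authors' analysis, and the function name is hypothetical. The inputs (about 8 million fragments, 50 million barcodes per sample) are taken from the abstract.

```python
import math


def fraction_uniquely_barcoded(n_fragments: float, n_barcodes: float) -> float:
    """Expected fraction of fragments that share their barcode with no other fragment.

    Assumption (not from the paper): each fragment picks one of n_barcodes
    uniformly and independently, so the number of *other* fragments on the
    same barcode is approximately Poisson with mean n_fragments / n_barcodes.
    """
    if n_fragments <= 0 or n_barcodes <= 0:
        raise ValueError("both counts must be positive")
    return math.exp(-n_fragments / n_barcodes)


# Figures quoted in the abstract: 8 million fragments, 50 million barcodes.
frac = fraction_uniquely_barcoded(8e6, 50e6)
print(f"expected uniquely cobarcoded fraction: {frac:.3f}")
```

Under these assumptions the expected fraction is about 0.85, which is in line with the "more than 80%" reported for the 1-ng libraries in Figure 1C. Real libraries can deviate from the model, for example through uneven bead loading.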

Figures

Figure 1.
Overview of stLFR. (A) The first step of stLFR involves inserting a hybridization sequence approximately every 200–1000 bp on long genomic DNA molecules. This is achieved using transposons. The transposon-integrated DNA is then mixed with beads that each contain ∼400,000 copies of an adapter sequence that contains a unique barcode shared by all adapters on the bead, a common PCR primer site, and a common capture sequence that is complementary to the sequence on the integrated transposons. After the genomic DNA is captured on the beads, the transposons are ligated to the barcode adapters. There are a few additional library processing steps and then the cobarcoded subfragments are sequenced on a BGISEQ-500 or equivalent sequencer. (B) Mapping read data by barcode results in clustering of reads within 10- to 350-kb regions of the genome. Total coverage and barcode coverage from four barcodes are shown for the 1-ng stLFR-1 library across a small region on Chromosome 11. Most barcodes are associated with only one read cluster in the genome. (C) The number of original long DNA fragments per barcode is plotted for the 1-ng libraries stLFR-1 (blue) and stLFR-2 (orange) and the 10-ng libraries stLFR-3 (yellow) and stLFR-4 (gray). More than 80% of the fragments from the 1-ng stLFR libraries are cobarcoded by a single unique barcode. (D) The fractions of nonoverlapping sequence reads (blue) and captured subfragments (orange) covering each original long DNA fragment are plotted for the 1-ng stLFR-1 library.
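The read clustering described in panel B can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input format (barcode, chromosome, position), the function name, and the `max_gap` threshold of 50 kb are all assumptions chosen for the example.

```python
from collections import defaultdict


def cluster_reads_by_barcode(reads, max_gap=50_000):
    """Group mapped reads into per-barcode clusters.

    reads:   iterable of (barcode, chrom, pos) tuples (hypothetical format).
    max_gap: hypothetical threshold in bp; two consecutive reads of the same
             barcode that are further apart than this start a new cluster.
    Returns a dict mapping barcode -> list of (chrom, start, end, n_reads).
    """
    positions_by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        positions_by_key[(barcode, chrom)].append(pos)

    clusters = defaultdict(list)
    for (barcode, chrom), positions in sorted(positions_by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current cluster and open a new one.
                clusters[barcode].append((chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        clusters[barcode].append((chrom, start, prev, count))
    return dict(clusters)


example_reads = [
    ("BC1", "chr11", 1_000),
    ("BC1", "chr11", 21_000),
    ("BC1", "chr11", 900_000),
    ("BC2", "chr11", 5_000),
]
result = cluster_reads_by_barcode(example_reads)
print(result)
```

In this toy input, BC1 yields two clusters, which would suggest two long fragments sharing one barcode, while BC2 yields one. Counting clusters per barcode in this way is one plausible route to a plot like panel C.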
Figure 2.
stLFR-1 phasing performance. The 221 phased blocks from the stLFR-1 library are depicted on chromosomes as alternating colors of gray and purple. Unphased regions are depicted in white. The inset table shows the performance of phasing with different sequence read coverage levels.
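For readers unfamiliar with the N50 statistic used to summarize phase block lengths in the abstract, a minimal sketch of the standard definition follows. The function name and the toy block lengths are illustrative and are not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths.

    N50 is the length L such that blocks of length >= L together
    cover at least half of the total length of all blocks.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (arbitrary units): total is 28, so half is 14.
# The two longest blocks (10 + 8 = 18) reach that threshold.
toy_blocks = [10, 8, 5, 3, 2]
print(n50(toy_blocks))
```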
Figure 3.
SV detection. (A) Previously reported deletions in NA12878 were also found using stLFR data. Heat maps of barcode sharing for each deletion can be found in Supplemental Figure S3. (B) A heat map of barcode sharing within windows of 2 kb for a region with a ∼150 kb heterozygous deletion on Chromosome 8 was plotted using a Jaccard index as previously described (Zhang et al. 2017). Regions of high overlap are depicted in dark red; those with no overlap are depicted in beige. Arrows indicate how regions that are spatially distant from each other on Chromosome 8 have increased overlap, marking the locations of the deletion. (C) Cobarcoded reads are separated by haplotype and plotted by unique barcode on the y-axis and Chromosome 8 position on the x-axis. The heterozygous deletion is found in a single haplotype. (D) A heat map of overlapping barcodes between Chromosomes 5 and 12 for a patient cell line with a known translocation (Dong et al. 2016). (E) A heat map of overlapping barcodes for GM20759, a cell line with a known transversion in Chromosome 2 (Dong et al. 2017).
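The barcode-sharing heat maps rest on the Jaccard index between the sets of barcodes observed in two genomic windows. The sketch below shows the standard index and a pairwise matrix over windows. The exact procedure of Zhang et al. 2017 is not given here, so the function names, the input layout, and the toy barcode sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two barcode collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def barcode_sharing_matrix(window_barcodes):
    """Pairwise Jaccard indices for a list of per-window barcode sets.

    window_barcodes[i] holds the barcodes seen in window i (for example,
    consecutive 2-kb windows along a chromosome).
    """
    sets = [set(w) for w in window_barcodes]
    n = len(sets)
    return [[jaccard(sets[i], sets[j]) for j in range(n)] for i in range(n)]


# Toy windows: 0 and 3 are far apart in index but share barcodes,
# the kind of off-diagonal signal the arrows in panel B point to.
windows = [{"A", "B", "C"}, {"B", "C", "D"}, {"X"}, {"A", "B"}]
matrix = barcode_sharing_matrix(windows)
for row in matrix:
    print([round(v, 2) for v in row])
```

Adjacent windows normally share many barcodes because they sit on the same long fragments. Elevated sharing between distant windows is what suggests that those regions are adjacent on the sampled molecule, as in a deletion or translocation.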
Figure 4.
Dot plots of de novo–assembled NA12878. The scaffolds from the de novo assemblies of stLFR-1 (A) and stLFR-2 (B) were compared against chromosomes from GRCh38 using dot plots.

References

    1. Amini S, Pushkarev D, Christiansen L, Kostem E, Royce T, Turk C, Pignatelli N, Adey A, Kitzman JO, Vijayan K, et al. 2014. Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet 46: 1343–1349. doi: 10.1038/ng.3119
    2. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. 2013. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10: 1213–1218. doi: 10.1038/nmeth.2688
    3. Chen T, Guestrin C. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, San Francisco, CA.
    4. Cheng X, Wu M, Chin R, Lam H, Chen D, Wang L, Fan F, Zou Y, Chen A, Zhang W, et al. 2018. A simple bead-based method for generating cost-effective co-barcoded sequence reads. Protoc Exch. doi: 10.1038/protex.2018.116
    5. Cleary JG, Braithwaite R, Gaastra K, Hilbush BS, Inglis S, Irvine SA, Jackson A, Littin R, Rathod M, Ware D, et al. 2015. Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines. bioRxiv. doi: 10.1101/023754
