Nat Commun. 2019 Jul 3;10(1):2933.
doi: 10.1038/s41467-019-10978-4.

DNA assembly for nanopore data storage readout

Randolph Lopez et al. Nat Commun. 2019.

Abstract

Synthetic DNA is becoming an attractive substrate for digital data storage due to its density, durability, and relevance in biological research. A major challenge in making DNA data storage a reality is that reading DNA back into data using sequencing by synthesis remains a laborious, slow and expensive process. Here, we demonstrate successful decoding of 1.67 megabytes of information stored in short fragments of synthetic DNA using a portable nanopore sequencing platform. We design and validate an assembly strategy for DNA storage that drastically increases the throughput of nanopore sequencing. Importantly, this assembly strategy is generalizable to any application that requires nanopore sequencing of small DNA amplicons.

Conflict of interest statement

Y-J. C., S.D.A., and K.S. are Microsoft employees. The remaining authors declare no competing interests.

Figures

Fig. 1
Overview of the DNA data storage workflow. a The encoding process starts with mapping multiple digital files into 150-nucleotide DNA sequences and sending them for synthesis. Each file has unique sequence addresses at the 5′ and 3′ ends of each oligonucleotide for random access retrieval. Using PCR primers containing complementary overhang sequences, a specific file can be amplified and concatenated into long double-stranded DNA molecules suited for ONT Nanopore sequencing. Upon sequencing, a subset of high-accuracy reads is used to decode the selected file. b Our assembly and decoding strategy enabled successful decoding of 1.67 MB of digital information stored in DNA using nanopore sequencing. Our work represents a two-orders-of-magnitude improvement in demonstrated decoding ability using nanopore sequencing for DNA storage. c Sequence-until diagram. Nanopore sequencing enables real-time coverage estimation for decoding of digital files stored in DNA. This enables the user to generate reads until coverage is sufficient for successful decoding. Upon decoding, a different file can be sequenced in the same flowcell, or the sequencing run can be stopped and resumed later. d Four different files encoded in DNA were amplified, assembled and sequenced using the ONT MinION platform. We implemented overlap-extension PCR and Gibson Assembly to build assemblies of 6, 10, or 24 fragments for each file.
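The sequence-until idea in panel c can be sketched as a loop that tallies per-payload coverage as reads arrive and stops once every payload has reached a target depth. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `sequence_until`, the `min_coverage` threshold, and the representation of each read as a list of payload indices are all hypothetical, and a real workflow would derive those indices from alignment.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def sequence_until(read_stream: Iterable[Sequence[int]],
                   n_payloads: int,
                   min_coverage: int) -> Tuple[int, Counter]:
    """Consume reads until every payload has been seen min_coverage times.

    Each item of read_stream is the list of payload indices found in one
    read (hypothetical input; a real pipeline would obtain it by alignment).
    Returns (number of reads consumed, per-payload coverage counts).
    """
    coverage: Counter = Counter()
    covered = 0      # number of payloads that have reached min_coverage
    reads_used = 0
    for payload_ids in read_stream:
        reads_used += 1
        for pid in payload_ids:
            if not 0 <= pid < n_payloads:
                continue  # ignore indices outside the selected file
            coverage[pid] += 1
            if coverage[pid] == min_coverage:
                covered += 1
        if covered == n_payloads:
            break  # coverage target met: stop generating reads
    return reads_used, coverage
```

Because the loop exits as soon as the target is met, the same function could be called again with a different file's read stream, mirroring the caption's point that another file can then be sequenced on the same flowcell.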
Fig. 2
Random access and sequential Gibson strategy for DNA storage. a Our sequential Gibson Assembly strategy enables random access and concatenation of any particular file from a pool of DNA oligonucleotides. We demonstrated up to 24-fragment assembly by performing two sequential assemblies of 6 and 4 fragments, respectively. b First, a given file is PCR-amplified using primers specific to its address (AD1 and AD2) and containing overlapping overhang sequences (Xn). For a three-fragment assembly, three separate PCR amplification reactions are carried out, each containing primer pairs with different overhangs based on a sequential assembly design (X1 and X2*, X2 and X3*, and X3 and X4*). Upon amplification, products are purified and combined by Gibson Assembly. The assembly product can consist of any combination of oligonucleotides in a given file pool separated by ordered overhangs with a consistent directionality (e.g., X1-AD1-Payload1-AD2-X2-AD1-Payload2-AD2-X3-AD1-Payload3-AD2-X4). This product is then amplified using primers specific to the ends of the assembly (X1 and X4*) and can be used as the starting material for a second assembly. c Gel electrophoresis size distribution corresponding to a PCR amplification of a 6-fragment Gibson Assembly. The expected band size was 1110 bp. d Gel electrophoresis size distribution corresponding to a PCR amplification of a 24-fragment second Gibson Assembly. The expected band size was 4590 bp. The correct fragment was gel extracted and used for nanopore sequencing.
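The overhang scheme in panel b follows a simple pattern that can be written down mechanically. The sketch below works only with the symbolic labels used in the caption (X1, X2*, AD1, AD2); it contains no real primer sequences, and the function names `overhang_pairs` and `assembly_layout` are hypothetical. The `Payload{i}` labels mark positions in the assembly, since any oligonucleotide from the file pool may occupy each slot.

```python
from typing import List, Tuple


def overhang_pairs(n_fragments: int) -> List[Tuple[str, str]]:
    """Return (forward, reverse) overhang labels for each PCR reaction.

    Fragment i (1-based) gets forward overhang X{i} and reverse overhang
    X{i+1}*, so each pair of adjacent fragments shares one overhang.
    """
    if n_fragments < 1:
        raise ValueError("n_fragments must be at least 1")
    return [(f"X{i}", f"X{i + 1}*") for i in range(1, n_fragments + 1)]


def assembly_layout(n_fragments: int) -> str:
    """Symbolic layout of the assembled product, in assembly order."""
    if n_fragments < 1:
        raise ValueError("n_fragments must be at least 1")
    parts = ["X1"]
    for i in range(1, n_fragments + 1):
        # Each fragment is address-payload-address, followed by the
        # overhang it shares with the next fragment.
        parts += ["AD1", f"Payload{i}", "AD2", f"X{i + 1}"]
    return "-".join(parts)
```

For three fragments, `overhang_pairs(3)` yields the pairs listed in the caption, and `assembly_layout(3)` reproduces the example product string. The end primers for re-amplification are then the first forward label and the last reverse label.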
Fig. 3
Random access and OE-PCR strategy for DNA storage. a First, a digital file is encoded using a Reed–Solomon code to increase robustness to errors. The binary data are then broken into fixed-size payloads and mapped to 110-bp DNA sequences with addressing information (Addr). These DNA fragments are split into multiple groups, and each group is given unique primers as its group ID (FGID front group ID, BGID back group ID). b To retrieve a particular file, each group is amplified with primers containing overhangs that overlap with the primers of each adjacent group. Subsequently, all the groups can be combined and amplified using primers corresponding to the ends (e.g., FGID in group 1 and BGID in group 3) to form the assembly product using overlap-extension PCR. c Bioanalyzer traces corresponding to the assembly of groups 1–2–3 (lane 1), groups 4–5–6 (lane 2), groups 7–8–9 (lane 3), and groups 1 through 10 (lane 4). We found no observable by-products in the assembly of up to 10 groups, demonstrating the scalability of this approach.
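As a rough illustration of the payload step in panel a, the sketch below breaks a byte string into fixed-size chunks, tags each with a numeric address, and maps it to bases. It deliberately omits the Reed–Solomon layer and is not the encoding used in the paper: the two-bits-per-base mapping (00→A, 01→C, 10→G, 11→T), the zero padding, and the function names are all assumptions made for the example.

```python
from typing import List, Tuple

BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four bases, two bits per base, high bits first."""
    return "".join(BASES[(byte >> shift) & 0b11]
                   for byte in data
                   for shift in (6, 4, 2, 0))


def split_into_payloads(data: bytes, payload_bytes: int) -> List[Tuple[int, str]]:
    """Break data into fixed-size chunks; return (address, DNA payload) pairs.

    The final chunk is zero-padded so that all payloads have equal length.
    """
    if payload_bytes < 1:
        raise ValueError("payload_bytes must be at least 1")
    payloads = []
    for addr, start in enumerate(range(0, len(data), payload_bytes)):
        chunk = data[start:start + payload_bytes].ljust(payload_bytes, b"\x00")
        payloads.append((addr, bytes_to_dna(chunk)))
    return payloads
```

A practical encoder would also need to store the original length so the padding can be removed on decoding, and would typically constrain the output to avoid sequences that are hard to synthesize or sequence; neither is handled here.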
Fig. 4
Nanopore sequencing analysis for the 1.5 MB Apollo file. Two MinION flowcells generated 267,152 1D2 reads of the 24-fragment Gibson Assembly of the Apollo file. a The length of the sequencing reads closely matches the assembly size of 4590 bp. b We aligned each reference payload sequence to the sequencing reads. Each sequencing read resulted in an average of 17.99 alignments to different payloads. Ideally, each read should have 24 alignments. c We found an average sequencing coverage of 43x. Without any assembly, the same number of reads would have resulted in an average sequencing coverage of 2.6x. d We estimated raw sequencing quality by aggregating the average Phred quality score in each read. e Based on the reads that aligned to payloads, we calculated the average percent error for each base for insertions, deletions, and substitutions. f Substitution comparison across different bases revealed a strong bias between purines and pyrimidines.
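For panel d, a per-read average Phred score can be computed directly from a FASTQ quality string. The sketch below assumes the common Phred+33 ASCII encoding and takes a plain arithmetic mean of the scores; the caption does not say how the authors averaged, so that choice and the function names are assumptions. The conversion between a Phred score Q and an error probability p is the standard definition, Q = -10 log10(p).

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into per-base Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the per-base Phred scores in one read."""
    scores = phred_scores(quality, offset)
    if not scores:
        raise ValueError("quality string is empty")
    return sum(scores) / len(scores)


def error_probability(q: float) -> float:
    """Convert a Phred score to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)
```

Note that averaging Phred scores is not the same as averaging error probabilities: because the scale is logarithmic, a mean of scores tends to understate the error rate of a read with a few very poor bases.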
Fig. 5
Comparison between nanopore sequencing runs with different DNA fragment sizes. We compared five nanopore sequencing runs of equivalent sequencing quality to understand how input DNA size affects sequencing throughput. a Number of 1D2 sequencing reads generated in different sequencing runs. We found a modest decrease in the number of 1D2 reads as the input DNA size increased. b 1D2 sequencing yield (number of bases) generated in different sequencing runs. We found that larger assembly sizes resulted in higher sequencing yield and higher coverage.

