Nat Commun. 2019 Jul 3;10(1):2933.

doi: 10.1038/s41467-019-10978-4.

DNA assembly for nanopore data storage readout

Randolph Lopez¹,², Yuan-Jyue Chen³, Siena Dumas Ang³, Sergey Yekhanin³, Konstantin Makarychev³, Miklos Z Racz³, Georg Seelig²,⁴,⁵, Karin Strauss³, Luis Ceze⁶

Affiliations

¹ Department of Bioengineering, University of Washington, Seattle, WA, 98105, USA.
² Molecular Engineering & Sciences Institute, University of Washington, Seattle, WA, 98195, USA.
³ Microsoft Research, Redmond, WA, 98052, USA.
⁴ Department of Electrical & Computer Engineering, University of Washington, Seattle, WA, 98195, USA.
⁵ Paul G. Allen School for Computer Science & Engineering, University of Washington, Seattle, WA, 98195, USA.
⁶ Paul G. Allen School for Computer Science & Engineering, University of Washington, Seattle, WA, 98195, USA. luisceze@cs.washington.edu.

PMID: 31270330
PMCID: PMC6610119
DOI: 10.1038/s41467-019-10978-4

Randolph Lopez et al. Nat Commun. 2019.

Abstract

Synthetic DNA is becoming an attractive substrate for digital data storage due to its density, durability, and relevance in biological research. A major challenge in making DNA data storage a reality is that reading DNA back into data using sequencing by synthesis remains a laborious, slow and expensive process. Here, we demonstrate successful decoding of 1.67 megabytes of information stored in short fragments of synthetic DNA using a portable nanopore sequencing platform. We design and validate an assembly strategy for DNA storage that drastically increases the throughput of nanopore sequencing. Importantly, this assembly strategy is generalizable to any application that requires nanopore sequencing of small DNA amplicons.

Conflict of interest statement

Y-J. C., S.D.A., and K.S. are Microsoft employees. The remaining authors declare no competing interests.

Figures

**Fig. 1**
Overview of the DNA data storage workflow. (a) The encoding process starts with mapping multiple digital files into 150-nucleotide DNA sequences and sending them for synthesis. Each file has unique address sequences at the 5′ and 3′ ends of each oligonucleotide for random-access retrieval. Using PCR primers containing complementary overhang sequences, a specific file can be amplified and concatenated into long double-stranded DNA molecules suited for ONT nanopore sequencing. Upon sequencing, a subset of high-accuracy reads is used to decode the selected file. (b) Our assembly and decoding strategy enabled successful decoding of 1.67 MB of digital information stored in DNA using nanopore sequencing. Our work represents a two-orders-of-magnitude improvement in demonstrated decoding ability using nanopore sequencing for DNA storage. (c) Sequence-until diagram. Nanopore sequencing enables real-time coverage estimation for decoding of digital files stored in DNA. This enables the user to generate reads until coverage is sufficient for successful decoding. Upon decoding, a different file can be sequenced in the same flowcell, or the sequencing run can be stopped and resumed later. (d) Four different files encoded in DNA were amplified, assembled, and sequenced using the ONT MinION platform. We implemented overlap-extension PCR and Gibson Assembly to build assemblies of 6, 10, or 24 fragments for each file.
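The sequence-until idea in panel (c) can be illustrated with a minimal sketch. It assumes each read has already been reduced to a list of payload indices; the function name `sequence_until`, the `min_coverage` threshold, and the toy read stream are all hypothetical and do not come from the paper or from ONT software.

```python
from typing import Iterable, List


def sequence_until(reads: Iterable[List[int]], n_payloads: int, min_coverage: int) -> int:
    """Consume reads until every payload has been seen at least min_coverage times.

    Each read is a list of payload indices recovered from one concatenated molecule.
    Returns the number of reads consumed, or -1 if the stream ends first.
    """
    coverage = [0] * n_payloads
    below_target = n_payloads  # payloads still under the threshold
    for n_reads, read in enumerate(reads, start=1):
        for idx in read:
            coverage[idx] += 1
            if coverage[idx] == min_coverage:
                below_target -= 1
        if below_target == 0:
            return n_reads  # enough coverage: sequencing could stop here
    return -1


# Toy stream (hypothetical): 4 payloads, each read carries 2 of them.
toy_reads = [[0, 1], [2, 3], [0, 1], [2, 3], [0, 2]]
reads_needed = sequence_until(toy_reads, n_payloads=4, min_coverage=2)
print(reads_needed)
```

Because each read of an assembled molecule contributes several payloads at once, the loop reaches the threshold after fewer reads than it would with one payload per read.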

**Fig. 2**
Random access and sequential Gibson strategy for DNA storage. (a) Our sequential Gibson Assembly strategy enables random access and concatenation of any particular file from a pool of DNA oligonucleotides. We demonstrated up to 24-fragment assembly by performing two sequential assemblies of 6 and 4 fragments, respectively. (b) First, a given file is PCR-amplified using primers that are specific to its address (AD1 and AD2) and contain overlapping overhang sequences (Xn). For a three-fragment assembly, three separate PCR amplification reactions are carried out, each containing primer pairs with different overhangs based on a sequential assembly design (X1 and X2*, X2 and X3*, and X3 and X4*). Upon amplification, products are purified and combined by Gibson Assembly. The assembly product can consist of any combination of oligonucleotides in a given file pool, separated by ordered overhangs with a consistent directionality (e.g., X1-AD1-Payload1-AD2-X2-AD1-Payload2-AD2-X3-AD1-Payload3-AD2-X4). This product is then amplified using primers specific to the ends of the assembly (X1 and X4*) and can be used as the starting material for a second assembly. (c) Gel electrophoresis size distribution corresponding to a PCR amplification of a 6-fragment Gibson Assembly. The expected band size was 1110 bp. (d) Gel electrophoresis size distribution corresponding to a PCR amplification of a 24-fragment second Gibson Assembly. The expected band size was 4590 bp. The correct fragment was gel-extracted and used for nanopore sequencing.
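The overhang scheme and the expected band sizes in this caption can be worked through in a short sketch. The 30-bp overhang length is an assumption (the caption does not state it), as is treating the 150-nucleotide length from Fig. 1 as including the address sequences; under those assumptions the arithmetic reproduces the stated 1110 bp and 4590 bp. The helper names are hypothetical.

```python
from typing import List, Tuple


def overhang_pairs(n_fragments: int, prefix: str = "X") -> List[Tuple[str, str]]:
    """Forward/reverse overhang labels for each PCR reaction in a sequential design.

    Reaction i uses (X_i, X_{i+1}*), so adjacent products share one overhang.
    """
    return [(f"{prefix}{i}", f"{prefix}{i + 1}*") for i in range(1, n_fragments + 1)]


def assembly_length(n_fragments: int, fragment_bp: int, overhang_bp: int) -> int:
    """Length of X1-frag-X2-...-frag-X(n+1): n fragments plus n + 1 overhangs."""
    return n_fragments * fragment_bp + (n_fragments + 1) * overhang_bp


OLIGO_BP = 150    # from Fig. 1; assumed here to include both address sequences
OVERHANG_BP = 30  # ASSUMPTION: overhang length is not given in the caption

print(overhang_pairs(3))  # the three-fragment example from panel (b)

# First assembly: 6 amplified oligonucleotides joined by 7 overhangs.
first = assembly_length(6, OLIGO_BP, OVERHANG_BP)
# Second assembly (assumed): 4 first-round products joined by 5 new overhangs.
second = assembly_length(4, first, OVERHANG_BP)
print(first, second)
```

Other combinations of fragment and overhang lengths could give the same totals, so the agreement with the caption supports the assumed values without confirming them.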

**Fig. 3**
Random access and OE-PCR strategy for DNA storage. (a) First, a digital file is encoded using a Reed–Solomon code to increase robustness to errors. The binary data are then broken into fixed-size payloads and mapped to 110-bp DNA sequences with addressing information (Addr). These DNA fragments are split into multiple groups, and each group is given unique primers as a group ID (FGID, front group ID; BGID, back group ID). (b) To retrieve a particular file, each group is amplified with primers containing overhangs that overlap with the primers of each adjacent group. Subsequently, all the groups can be combined and amplified using primers corresponding to the ends (e.g., the FGID of group 1 and the BGID of group 3) to form the assembly product by overlap-extension PCR. (c) Bioanalyzer traces corresponding to the assembly of groups 1–2–3 (lane 1), groups 4–5–6 (lane 2), groups 7–8–9 (lane 3), and groups 1 through 10 (lane 4). We found no observable by-products in the assembly of up to 10 groups, demonstrating the scalability of this approach.
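As a rough illustration of panel (a), the sketch below splits a byte string into fixed-size payloads, prepends an address, maps the result to bases, and assigns each fragment to a group. It is not the paper's encoder: the two-bits-per-base mapping, the one-byte address, the zero padding, and the contiguous grouping are all assumptions made for the example, and the Reed–Solomon step and the group primers are omitted.

```python
from typing import List, Tuple

BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Naive stand-in mapping: two bits per base, most significant bits first."""
    return "".join(BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0))


def make_fragments(data: bytes, payload_bytes: int, group_size: int) -> List[Tuple[int, int, str]]:
    """Split data into fixed-size payloads; return (group_id, address, sequence) tuples.

    The address is the payload index stored as one leading byte (hypothetical layout).
    """
    n_payloads = -(-len(data) // payload_bytes)  # ceiling division
    if n_payloads > 256:
        raise ValueError("one-byte address supports at most 256 payloads")
    fragments = []
    for addr in range(n_payloads):
        chunk = data[addr * payload_bytes:(addr + 1) * payload_bytes]
        chunk = chunk.ljust(payload_bytes, b"\x00")  # zero-pad the last payload
        seq = bytes_to_dna(bytes([addr]) + chunk)
        fragments.append((addr // group_size, addr, seq))  # contiguous groups
    return fragments


frags = make_fragments(b"Apollo", payload_bytes=2, group_size=2)
for group_id, addr, seq in frags:
    print(group_id, addr, seq)
```

Every fragment has the same length because the address and the padded payload are fixed-size, which is the property that lets all fragments in a group be amplified with one primer pair.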

**Fig. 4**
Nanopore sequencing analysis for the 1.5 MB Apollo file. Two MinION flowcells generated 267,152 1D² reads of the 24-fragment Gibson Assembly of the Apollo file. (a) The size of the sequencing reads (in base pairs) closely matches the assembly size of 4590 bp. (b) We aligned each reference payload sequence to the sequencing reads. Each sequencing read resulted in an average of 17.99 alignments to different payloads; ideally, each read would have 24 alignments. (c) We found an average sequencing coverage of 43x. Without any assembly, the same number of reads would have resulted in an average sequencing coverage of 2.6x. (d) We estimated raw sequencing quality by aggregating the average Phred quality score of each read. (e) Based on the reads that aligned to payloads, we calculated the average percent error per base for insertions, deletions, and substitutions. (f) Comparison of substitutions across different bases revealed a strong bias between purines and pyrimidines.
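The figures quoted in this caption imply a coverage gain of roughly 16.5-fold from assembly and about 75% of the ideal 24 alignments per read. The short calculation below derives both from the stated numbers only; the variable names are illustrative.

```python
READS = 267_152              # 1D² reads from two MinION flowcells
COVERAGE_ASSEMBLED = 43.0    # average coverage with the 24-fragment assembly
COVERAGE_UNASSEMBLED = 2.6   # stated coverage for the same read count without assembly
ALIGNMENTS_PER_READ = 17.99  # average payload alignments per read
FRAGMENTS_PER_ASSEMBLY = 24  # ideal number of alignments per read

# Ratio of the two stated coverage values.
gain = COVERAGE_ASSEMBLED / COVERAGE_UNASSEMBLED
# Share of the ideal 24 payloads that an average read actually yields.
recovered = ALIGNMENTS_PER_READ / FRAGMENTS_PER_ASSEMBLY
# Total payload alignments implied by the read count and the per-read average.
total_alignments = READS * ALIGNMENTS_PER_READ

print(f"coverage gain from assembly: {gain:.1f}x")
print(f"fraction of ideal alignments per read: {recovered:.0%}")
print(f"total payload alignments: {total_alignments:,.0f}")
```

The gain is lower than the average of 17.99 alignments per read; the caption does not explain the difference, so the sketch reports the two ratios separately rather than reconciling them.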

**Fig. 5**
Comparison between nanopore sequencing runs with different DNA fragment sizes. We compared five nanopore sequencing runs of equivalent sequencing quality to understand how input DNA size affects sequencing throughput. (a) Number of 1D² sequencing reads generated in different sequencing runs. We found a modest decrease in the number of 1D² reads as the input DNA size increased. (b) 1D² sequencing yield (number of bases) generated in different sequencing runs. We found that larger assembly sizes resulted in higher sequencing yield and higher coverage.

Grants and funding

Molecular Informatics/United States Department of Defense | Defense Advanced Research Projects Agency (DARPA)/International
