Terminator-free template-independent enzymatic DNA synthesis for digital information storage

Henry H Lee¹ ², Reza Kalhor³ ⁴, Naveen Goela⁵, Jean Bolot⁵, George M Church⁶ ⁷

Affiliations

¹ Department of Genetics, Harvard Medical School, Boston, 02115, MA, USA. hhlee@genetics.med.harvard.edu.
² Wyss Institute for Biologically Inspired Engineering at Harvard University, Boston, 02115, MA, USA. hhlee@genetics.med.harvard.edu.
³ Department of Genetics, Harvard Medical School, Boston, 02115, MA, USA.
⁴ Wyss Institute for Biologically Inspired Engineering at Harvard University, Boston, 02115, MA, USA.
⁵ Technicolor Research & Innovation Lab, Palo Alto, 94306, CA, USA.
⁶ Department of Genetics, Harvard Medical School, Boston, 02115, MA, USA. gchurch@genetics.med.harvard.edu.
⁷ Wyss Institute for Biologically Inspired Engineering at Harvard University, Boston, 02115, MA, USA. gchurch@genetics.med.harvard.edu.

PMID: 31160595
PMCID: PMC6546792
DOI: 10.1038/s41467-019-10258-1

Nat Commun. 2019 Jun 3;10(1):2383. doi: 10.1038/s41467-019-10258-1.

Abstract

DNA is an emerging medium for digital data and its adoption can be accelerated by synthesis processes specialized for storage applications. Here, we describe a de novo enzymatic synthesis strategy designed for data storage which harnesses the template-independent polymerase terminal deoxynucleotidyl transferase (TdT) in kinetically controlled conditions. Information is stored in transitions between non-identical nucleotides of DNA strands. To produce strands representing user-defined content, nucleotide substrates are added iteratively, yielding short homopolymeric extensions whose lengths are controlled by apyrase-mediated substrate degradation. With this scheme, we synthesize DNA strands carrying 144 bits, including addressing, and demonstrate retrieval with streaming nanopore sequencing. We further devise a digital codec to reduce requirements for synthesis accuracy and sequencing coverage, and experimentally show robust data retrieval from imperfectly synthesized strands. This work provides distributive enzymatic synthesis and information-theoretic approaches to advance digital information storage in DNA.

Conflict of interest statement

H.H.L., R.K., and G.M.C. have filed patents covering the synthesis process (WO 2017/176541) and the encoding/decoding process (PCT/US18/56900). N.G. and J.B. have filed a patent for the use of synchronization markers for the codec (WO 2018/148260).

Figures

**Fig. 1**
An enzymatic synthesis strategy for storing information in DNA. a Schematic depiction of a series of enzymatic synthesis reactions consisting of an oligonucleotide initiator (N, gray), terminal deoxynucleotidyl transferase (TdT) and apyrase (AP). The initiator is tethered to a solid support. In each cycle, TdT catalyzes the addition of a given nucleoside triphosphate to the 3′-end of all initiators, whereas apyrase degrades the added substrate to limit net polymerization. A wash can be performed at the end of each cycle to remove reaction byproducts or to facilitate downstream processes. b DNA strands synthesized in each of eight consecutive synthesis cycles, as shown on a 15% TBE-urea gel. The initiators were not tethered to a solid support and no wash was performed between cycles. The first lane is a single-stranded DNA size marker, which includes the 24-nucleotide-long initiator oligonucleotide. c A schema for interconversion of DNA and information. Raw strands (strands^R) represent enzymatically synthesized DNA. A compressed strand (strand^C) represents a sequence of transitions between non-identical nucleotides. Transitions between nucleotides, starting with the last nucleotide of the initiator (as an example N = “a”, gray), are mapped from the compressed strand to digital data in trits. If a strand^C is equivalent to the template sequence, all desired transitions are present and the information stored in DNA is retrieved
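The interconversion described in panel c can be illustrated with a minimal Python sketch. It collapses a raw strand into a compressed strand and then reads each transition as a trit. The function names and the example strand are hypothetical, and the transition-to-trit table used here (rank of the new base among the three bases that differ from the previous one, in alphabetical order) is an assumption made for illustration; the actual mapping is the one defined in Fig. 1c, which is not reproduced in this text.

```python
from itertools import groupby

NUCLEOTIDES = "ACGT"


def compress(raw_strand: str) -> str:
    """Collapse every homopolymer run to a single base (strand^R -> strand^C)."""
    return "".join(base for base, _ in groupby(raw_strand.upper()))


def transitions_to_trits(compressed: str, initiator_last: str) -> list[int]:
    """Read each transition between non-identical nucleotides as one trit.

    ASSUMPTION: the trit is the alphabetical rank of the new base among the
    three bases that differ from the previous base. This is a stand-in for
    the mapping of Fig. 1c, not a reproduction of it.
    """
    trits = []
    previous = initiator_last.upper()
    for base in compressed.upper():
        if base == previous:
            raise ValueError("a compressed strand cannot repeat a base")
        choices = [n for n in NUCLEOTIDES if n != previous]
        trits.append(choices.index(base))
        previous = base
    return trits


# Hypothetical raw strand: only the bases added after the initiator.
raw = "CCCTTGAAAC"
compressed = compress(raw)
trits = transitions_to_trits(compressed, initiator_last="A")
```

Because only the order of transitions carries information, raw strands with different homopolymer lengths (for example `CTGAC` and `CCCTTGAAAC`) decode to the same trits.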

**Fig. 2**
Demonstration of information storage in DNA using enzymatic synthesis. a The message “hello world!” was encoded in 12 template sequences, H01–H12, each representing one character. Transitions between nucleotides start with the last base of the initiator, which is labeled ‘g’. A header index (shaded gray) denotes strand order. Only results from H01–H05 are shown (see Supplementary Fig. 9). To encode each character, its respective ASCII decimal value, prefixed with an address, is represented in base 2 (binary) or in base 3 (ternary) (see Supplementary Table 2) and mapped to transitions (see Fig. 1c), resulting in template sequences with nucleotides to be synthesized (capitalized). b Extension lengths for each base from a are shown as a letter-value plot with median. Only perfect strands^R, those whose strand^C is equivalent to a template sequence, are presented. Synthesis was performed with initiators tethered to beads and sequencing was performed on the Illumina platform. c Distribution of extension lengths for each nucleotide transition, combined across all positions from all perfect strands, is shown as a letter-value plot with median. d Stepwise increases in strand^R length with increasing strand^C length for all synthesized strands of H01–H12 are shown as a letter-value plot with median. e Distribution of all strand^R lengths. Distributions are derived via kernel density estimation for all synthesized strands (‘all’, gray shading) and a subpopulation of strands that contain all desired transitions (‘perfect’, dotted line). f Bulk error analysis for all synthesized strands of H01–H12. All strands^C were aligned, by Needleman–Wunsch, to their respective template sequences, and the number of mismatches, insertions, and missing nucleotides were tabulated. g Information retrieval with in silico filtering. Fraction of perfect strands^C is shown before (triangles) or after filtering (circles). Fraction of perfect strands^C is shown for all sequences (white) or only the top three most-abundant sequences (black). h Information retrieval by different sequencing platforms. Streaming nanopore sequencing (Oxford, filled diamonds) was compared with batch sequencing-by-synthesis (Illumina, open circles). Each dot indicates the fraction of the sequencing run at which each strand is robustly retrieved (100% correct with 99.99% probability). Arrows denote the fraction of the sequencing run at which all data are robustly retrieved using each platform. Source data for b–h are provided in the Source Data file package
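The encoding step in panel a (an address followed by the ASCII value of a character, written in base 2 or base 3) can be sketched as follows. This is a minimal illustration, not the authors' encoder: the helper names, the address numbering, and the digit widths (3 address trits and 5 data trits) are assumptions chosen for the example; the widths actually used are given in Supplementary Table 2, which is not part of this text.

```python
def to_base(value: int, base: int, width: int) -> list[int]:
    """Return `value` as exactly `width` digits in `base`, most significant first."""
    if value < 0 or base < 2:
        raise ValueError("value must be non-negative and base at least 2")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, base)
        digits.append(digit)
    if value:
        raise ValueError("value does not fit in the requested width")
    return digits[::-1]


def encode_character(address: int, char: str,
                     address_width: int = 3, data_width: int = 5) -> list[int]:
    """Prefix the ternary ASCII value of `char` with a ternary address.

    ASSUMPTION: the widths and the address numbering are illustrative only.
    """
    return to_base(address, 3, address_width) + to_base(ord(char), 3, data_width)


# 'h' has ASCII decimal value 104.
binary_h = to_base(ord("h"), 2, 8)
ternary_h = to_base(ord("h"), 3, 5)
template_trits = encode_character(1, "h")
```

Each resulting trit would then be mapped to one transition between non-identical nucleotides, as in Fig. 1c.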

**Fig. 3**
Coded strand architecture for sequence reconstruction. a A DNA information storage channel. Data are converted to template sequences, synthesized (yielding strands^R), and can be stored in vitro. Retrieval starts with sequencing, then transitions of non-identical nucleotides are extracted in silico to form strands^C. Data retrieval occurs when the template sequence and reconstructed sequence are equivalent. Errors that occur in the synthesis and sequencing steps can be modeled as a communications channel. b A coded strand architecture, “scaffold”, enables data retrieval from strands^C that are missing nucleotides, whereas an “unguided” reconstruction results in multiple possible solutions. Synchronization nucleotides (dark gray boxes) localize errors to yield a single reconstructed sequence. c A 16-base transition sequence, E0, is synthesized and sequenced with Illumina. Examples of diverse strands^C produced by synthesis of E0. Strands^C are aligned, by Needleman–Wunsch, to the template. Ambiguous alignments can exist depending on the location and number of missing nucleotides within a strand^C. d Error analysis for purified strands of E0. Synthesized strands were purified in silico, by filtering for strands^R between 32 and 48 bases in length, and corresponding strands^C were aligned by Needleman–Wunsch to the E0 template. For each alignment, the number of mismatches, insertions, and missing nucleotides were tabulated. e Evaluating the diversity of synthesized strands. The number of sequencing reads for each length of strand^C was tabulated. Diversity was evaluated as the number of unique variants at each length of strand^C and the Levenshtein edit distance was computed with respect to the E0 template. The set of 802 purified strands contains two perfect strands. Source data for c and d are provided in the Source Data file package
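Panel e scores each strand^C variant by its Levenshtein edit distance to the E0 template. The sketch below is a standard dynamic-programming implementation of that distance, written for illustration; the template and variant strings are made up, since the E0 sequence is not given in this text.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical template and strand^C variants (not the real E0 sequence).
template = "CTGACGTA"
variants = ["CTGACGTA", "CTGCGTA", "CTGATGTA"]
distances = [levenshtein(v, template) for v in variants]
```

A variant with one missing nucleotide and a variant with one mismatch both sit at distance 1 from the template, which is why the caption tabulates mismatches, insertions, and missing nucleotides separately from the alignment.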

**Fig. 4**
Coded strand architecture for robust information storage. a The message “Eureka!” was encoded and partitioned into four template sequences, E1–E4. Each sequence stores a 2-bit address and 14 bits of data. These bits are mapped to a template sequence of 16 nucleotides, which includes four synchronization nucleotides (dark gray). Synthesis was performed with initiators tethered to beads and sequencing was performed on the Illumina platform. b Retrieving information from E1 to E4. Synthesized strands^R were sequenced using the Illumina sequencing-by-synthesis (SBS) platform and purified in silico based on a raw length of 32–48 nucleotides (Methods). The decoding accuracy for each sequence is defined as the probability of 100% correct data retrieval for a given number of reads, estimated over 500 decoding trials. Each trial is based on a randomly drawn set of purified strand^C variants. A 90% decoding accuracy (gray band) is considered sufficient for robust data retrieval, and this accuracy could be further reinforced by other codec modules. c Decoding of E3. A set of 10 DNA strands^C is decoded as two sets of five strands^C. The decoder uses MAP estimation and a scaffold to determine the probability for each of the four nucleotides at every position. The decoded sequence is a probabilistic consensus of the reconstructed sequences from MAP estimation and successfully retrieves the data stored in E3. Source data for b are provided in the Source Data file package

**Fig. 5**
A roadmap for scaling DNA storage systems. a Efficiency of storage for experimental and simulated systems. Experimental systems (black) include storing 12 bits in 8-nucleotide template sequences, and 16 bits in 16-nucleotide template sequences. Simulated maximum storage systems (white circles) include a gigabyte-scale system that stores 36 bits in a 74-nucleotide template sequence, and a petabyte-scale system that stores 57 bits in a 152-nucleotide template sequence. The number of bits stored per sequence depends on the amount of error-correction coding (ECC) that is applied. Reducing ECC increases the efficiency of storage. The upper bound theoretical limit represents a maximum efficiency of storage of ~1.58 bits per transition between non-identical nucleotides (Supplementary Note 2). The lower bound theoretical limit represents the minimum number of bits per template sequence that must be stored for addressing alone (Supplementary Note 4). See all tested storage systems in Supplementary Table 8. b Flexible-write storage is enabled by a codec, which harnesses diversely synthesized strands. The decoding pipeline supports robust data retrieval from synthesized strands with a significant percentage of errors. Inset: with ten strand^C variants, each with ~30% missing nucleotides, the correct decoded sequence can be reconstructed for both gigabyte- and petabyte-scale maximum storage capacities. c A system architecture for storing information in enzymatically synthesized DNA. A bitstream is partitioned into rows, each augmented with an address to delineate its order for reassembly. An ECC such as a Bose–Chaudhuri–Hocquenghem (BCH) code can be applied to each row, or an ECC such as a Reed–Solomon (RS) code can be applied across multiple rows, to protect data from errors (Supplementary Note 2). Modulation consists of mapping sequences of bits to template sequences, which include synchronization nucleotides. Enzymatic synthesis then produces multiple diverse strands^C per template sequence. The resulting strands^C are used for sequence reconstruction based on MAP estimation and probabilistic consensus. Subsequently, the reconstructed sequence is demodulated into bits. Error correction is applied to ensure data retrieval. Source data for b are provided in the Source Data file package
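The efficiency figures in panel a can be checked with a short calculation. Each transition selects one of the three nucleotides that differ from the previous one, so it can carry at most log2(3) ≈ 1.58 bits. The sketch below compares that bound with the bits per template nucleotide of the four systems named in the caption; the dictionary labels are informal names introduced here, not the paper's.

```python
import math

# One of three possible next bases per transition gives log2(3) bits at most.
UPPER_BOUND_BITS_PER_TRANSITION = math.log2(3)

# (bits stored, template sequence length in nucleotides), from the caption.
SYSTEMS = {
    "experimental, 8 nt": (12, 8),
    "experimental, 16 nt": (16, 16),
    "simulated, gigabyte scale": (36, 74),
    "simulated, petabyte scale": (57, 152),
}


def bits_per_nucleotide(bits: int, nucleotides: int) -> float:
    """Storage efficiency as bits per template nucleotide."""
    return bits / nucleotides


efficiency = {name: bits_per_nucleotide(*spec) for name, spec in SYSTEMS.items()}

for name, value in efficiency.items():
    share = value / UPPER_BOUND_BITS_PER_TRANSITION
    print(f"{name}: {value:.3f} bits/nt ({share:.0%} of the upper bound)")
```

The 8-nucleotide experimental system, at 1.5 bits per nucleotide, sits close to the bound, while the simulated large-scale systems trade efficiency for addressing, synchronization, and error correction.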
