Nat Commun. 2019 Jun 3;10(1):2383.
doi: 10.1038/s41467-019-10258-1.

Terminator-free template-independent enzymatic DNA synthesis for digital information storage

Henry H Lee et al. Nat Commun. 2019.

Abstract

DNA is an emerging medium for digital data and its adoption can be accelerated by synthesis processes specialized for storage applications. Here, we describe a de novo enzymatic synthesis strategy designed for data storage which harnesses the template-independent polymerase terminal deoxynucleotidyl transferase (TdT) in kinetically controlled conditions. Information is stored in transitions between non-identical nucleotides of DNA strands. To produce strands representing user-defined content, nucleotide substrates are added iteratively, yielding short homopolymeric extensions whose lengths are controlled by apyrase-mediated substrate degradation. With this scheme, we synthesize DNA strands carrying 144 bits, including addressing, and demonstrate retrieval with streaming nanopore sequencing. We further devise a digital codec to reduce requirements for synthesis accuracy and sequencing coverage, and experimentally show robust data retrieval from imperfectly synthesized strands. This work provides distributive enzymatic synthesis and information-theoretic approaches to advance digital information storage in DNA.

Conflict of interest statement

H.H.L., R.K., and G.M.C. have filed patents covering the synthesis process (WO 2017/176541) and the encoding/decoding process (PCT/US18/56900). N.G. and J.B. have filed a patent for the use of synchronization markers for the codec (WO 2018/148260).

Figures

Fig. 1
An enzymatic synthesis strategy for storing information in DNA. a Schematic depiction of a series of enzymatic synthesis reactions consisting of an oligonucleotide initiator (N, gray), terminal deoxynucleotidyl transferase (TdT) and apyrase (AP). The initiator is tethered to a solid support. In each cycle, TdT catalyzes the addition of a given nucleoside triphosphate to the 3′-end of all initiators, whereas apyrase degrades the added substrate to limit net polymerization. A wash can be performed at the end of each cycle to remove reaction byproducts or to facilitate downstream processes. b DNA strands synthesized in each of eight consecutive synthesis cycles, as shown on a 15% TBE-urea gel. The initiators were not tethered to a solid support and no wash was performed between cycles. The first lane is a single-stranded DNA size marker, which includes the 24-nucleotide-long initiator oligonucleotide. c A schema for interconversion of DNA and information. Raw strands (strandsR) represent enzymatically synthesized DNA. A compressed strand (strandC) represents a sequence of transitions between non-identical nucleotides. Transitions between nucleotides, starting with the last nucleotide of the initiator (as an example, N = “a”, gray), are mapped from the compressed strand to digital data in trits. If a strandC is equivalent to the template sequence, all desired transitions are present and the information stored in DNA is retrieved.
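The interconversion described in panel c can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' codec: the function names are hypothetical, and the trit assigned to each transition (the index of the new base among the three bases that differ from the previous one, taken in A, C, G, T order) is an assumed convention, since the actual mapping is defined only in the figure itself.

```python
from itertools import groupby

BASES = "ACGT"


def compress(strand_r):
    """Collapse every homopolymer run to one base (strandR -> strandC)."""
    return "".join(base for base, _ in groupby(strand_r.upper()))


def transitions_to_trits(strand_c, initiator_last):
    """Read a strandC as trits, starting from the last base of the initiator.

    ASSUMPTION: the trit is the position of the new base among the three
    bases that differ from the previous base, in A, C, G, T order.
    """
    trits = []
    prev = initiator_last.upper()
    for base in strand_c.upper():
        if base == prev:
            raise ValueError("a strandC cannot repeat the previous base")
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


def trits_to_template(trits, initiator_last):
    """Inverse of transitions_to_trits under the same assumed ordering."""
    prev = initiator_last.upper()
    template = []
    for trit in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[trit]
        template.append(prev)
    return "".join(template)


if __name__ == "__main__":
    strand_c = compress("CCCTTGGGGA")
    print(strand_c, transitions_to_trits(strand_c, "A"))
```

Because only the order of transitions matters, the homopolymer lengths in the raw strand (for example `CCC` versus `C`) carry no information in this scheme, which is why variable extension lengths do not by themselves corrupt the data.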
Fig. 2
Demonstration of information storage in DNA using enzymatic synthesis. a The message “hello world!” was encoded in 12 template sequences, H01–H12, each representing one character. Transitions between nucleotides start with the last base of the initiator, which is labeled ‘g’. A header index (shaded gray) denotes strand order. Only results from H01–H05 are shown (see Supplementary Fig. 9). To encode each character, its ASCII decimal value, prefixed with an address, is represented in base 2 (binary) or in base 3 (ternary) (see Supplementary Table 2) and mapped to transitions (see Fig. 1c), resulting in template sequences with nucleotides to be synthesized (capitalized). b Extension lengths for each base from a are shown as a letter-value plot with median. Only perfect strandsR, those whose strandC is equivalent to a template sequence, are presented. Synthesis was performed with initiators tethered to beads and sequencing was performed on the Illumina platform. c Distribution of extension lengths for each nucleotide transition, combined across all positions from all perfect strands, is shown as a letter-value plot with median. d Stepwise increases in strandR length with increasing strandC length for all synthesized strands of H01–H12 are shown as a letter-value plot with median. e Distribution of all strandR lengths. Distributions are derived via kernel density estimation for all synthesized strands (‘all’, gray shading) and a subpopulation of strands that contain all desired transitions (‘perfect’, dotted line). f Bulk error analysis for all synthesized strands of H01–H12. All strandsC were aligned, by Needleman–Wunsch, to their respective template sequences, and the numbers of mismatches, insertions, and missing nucleotides were tabulated. g Information retrieval with in silico filtering. The fraction of perfect strandsC is shown before (triangles) or after filtering (circles), for all sequences (white) or only the top three most-abundant sequences (black). h Information retrieval by different sequencing platforms. Streaming nanopore sequencing (Oxford, filled diamonds) was compared with batch sequencing-by-synthesis (Illumina, open circles). Each dot indicates the fraction of the sequencing run at which each strand is robustly retrieved (100% correct with 99.99% probability). Arrows denote the fraction of the sequencing run at which all data are robustly retrieved using each platform. Source data for b–h are provided in the Source Data file package.
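The character encoding in panel a, an address followed by an ASCII value written in base 2 or base 3, can be sketched as follows. The layout used here (3 address trits followed by 5 data trits, 8 trits in total) is an assumption chosen so that 12 addresses and 7-bit ASCII both fit; the layout actually used is given in Supplementary Table 2 and may differ. The function names are hypothetical.

```python
def to_base(value, base, width):
    """Return `value` as a fixed-width list of digits, most significant first."""
    if not 0 <= value < base ** width:
        raise ValueError(f"{value} does not fit in {width} base-{base} digits")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits[::-1]


def encode_char(address, char, addr_trits=3, data_trits=5):
    """Address trits followed by the ASCII code in trits.

    ASSUMPTION: the 3 + 5 trit split is illustrative, not taken from the paper.
    """
    return to_base(address, 3, addr_trits) + to_base(ord(char), 3, data_trits)


if __name__ == "__main__":
    # 'h' is ASCII 104 = 1*81 + 0*27 + 2*9 + 1*3 + 2*1
    print(encode_char(1, "h"))   # ternary form, address 1 (hypothetical)
    print(to_base(104, 2, 8))    # binary form of the same ASCII value
```

Each resulting trit then selects one of the three possible transitions away from the current nucleotide, so an 8-trit word corresponds to an 8-nucleotide template sequence.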
Fig. 3
Coded strand architecture for sequence reconstruction. a A DNA information storage channel. Data are converted to template sequences, synthesized (yielding strandsR), and can be stored in vitro. Retrieval starts with sequencing; transitions of non-identical nucleotides are then extracted in silico to form strandsC. Data retrieval occurs when the template sequence and reconstructed sequence are equivalent. Errors that occur in the synthesis and sequencing steps can be modeled as a communications channel. b A coded strand architecture, “scaffold”, enables data retrieval from strandsC that are missing nucleotides, whereas an “unguided” reconstruction results in multiple possible solutions. Synchronization nucleotides (dark gray boxes) localize errors to yield a single reconstructed sequence. c A 16-base transition sequence, E0, is synthesized and sequenced with Illumina. Examples of diverse strandsC produced by synthesis of E0. StrandsC are aligned, by Needleman–Wunsch, to the template. Ambiguous alignments can exist depending on the location and number of missing nucleotides within a strandC. d Error analysis for purified strands of E0. Synthesized strands were purified in silico, by filtering for strandsR between 32 and 48 bases in length, and corresponding strandsC were aligned by Needleman–Wunsch to the E0 template. For each alignment, the numbers of mismatches, insertions, and missing nucleotides were tabulated. e Evaluating the diversity of synthesized strands. The number of sequencing reads for each length of strandC was tabulated. Diversity was evaluated as the number of unique variants at each length of strandC, and the Levenshtein edit distance was computed with respect to the E0 template. The set of 802 purified strands contains two perfect strands. Source data for c and d are provided in the Source Data file package.
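Two of the in silico steps named in panels d and e, length-based purification and the Levenshtein edit distance to the template, can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, the length bounds are treated as inclusive (an assumption), and the Needleman–Wunsch alignment that the authors also use is not reproduced here.

```python
def purify(strands_r, min_len=32, max_len=48):
    """Keep raw strands whose length lies in [min_len, max_len].

    ASSUMPTION: both bounds are inclusive.
    """
    return [s for s in strands_r if min_len <= len(s) <= max_len]


def levenshtein(a, b):
    """Minimum number of single-base insertions, deletions, or substitutions
    needed to turn `a` into `b` (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        current = [i]
        for j, base_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete base_a
                current[j - 1] + 1,                    # insert base_b
                previous[j - 1] + (base_a != base_b),  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    template = "ACGTACGT"  # hypothetical template, not E0
    for strand_c in ["ACGTACGT", "ACGACGT", "ACGTTCGT"]:
        print(strand_c, levenshtein(strand_c, template))
```

A distance of zero identifies a perfect strandC; a strandC with one missing nucleotide has distance one, although the distance alone does not say where the nucleotide is missing, which is the ambiguity that the synchronization nucleotides are meant to resolve.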
Fig. 4
Coded strand architecture for robust information storage. a The message “Eureka!” was encoded and partitioned into four template sequences, E1–E4. Each sequence stores a 2-bit address and 14 bits of data. These bits are mapped to a template sequence of 16 nucleotides, which includes four synchronization nucleotides (dark gray). Synthesis was performed with initiators tethered to beads and sequencing was performed on the Illumina platform. b Retrieving information from E1 to E4. Synthesized strandsR were sequenced using the Illumina sequencing-by-synthesis (SBS) platform and purified in silico based on a raw length of 32–48 nucleotides (Methods). The decoding accuracy for each sequence is defined as the probability of 100% correct data retrieval for a given number of reads, estimated over 500 decoding trials. Each trial is based on a randomly drawn set of purified strandC variants. A 90% decoding accuracy (gray band) is considered sufficient for robust data retrieval, and this accuracy could be further reinforced by other codec modules. c Decoding of E3. A set of 10 DNA strandsC is decoded as two sets of five strandsC. The decoder uses MAP estimation and a scaffold to determine the probability for each of the four nucleotides at every position. The decoded sequence is a probabilistic consensus of the reconstructed sequences from MAP estimation and successfully retrieves the data stored in E3. Source data for b are provided in the Source Data file package.
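The decoding-accuracy estimate in panel b, the fraction of trials in which a random draw of reads decodes to exactly the template, can be sketched as a small Monte Carlo loop. Several parts are assumptions: the function names are hypothetical, reads are drawn with replacement, and `majority_decode` is a deliberately naive stand-in that returns the most frequent read. It is not the MAP and scaffold decoder described in panel c.

```python
import random
from collections import Counter


def majority_decode(reads):
    """Placeholder decoder: the most frequent read wins.

    ASSUMPTION: a stand-in only; the paper's decoder uses MAP estimation
    with a scaffold of synchronization nucleotides.
    """
    return Counter(reads).most_common(1)[0][0]


def decoding_accuracy(variants, template, n_reads,
                      decode=majority_decode, trials=500, seed=0):
    """Fraction of trials in which `decode` recovers `template` exactly."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    successes = 0
    for _ in range(trials):
        # ASSUMPTION: each trial samples reads with replacement.
        sample = rng.choices(variants, k=n_reads)
        if decode(sample) == template:
            successes += 1
    return successes / trials


if __name__ == "__main__":
    # Hypothetical pool: mostly perfect reads plus two defective variants.
    pool = ["ACGTACGT"] * 6 + ["ACGACGT"] * 2 + ["ACGTCGT"] * 2
    for n in (1, 3, 5, 10):
        print(n, decoding_accuracy(pool, "ACGTACGT", n))
```

Plotting the returned value against `n_reads` gives a curve of the kind the caption describes, and the 90% threshold can then be read off as the smallest number of reads at which the curve crosses 0.9.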
Fig. 5
A roadmap for scaling DNA storage systems. a Efficiency of storage for experimental and simulated systems. Experimental systems (black) include storing 12 bits in 8-nucleotide template sequences, and 16 bits in 16-nucleotide template sequences. Simulated maximum storage systems (white circles) include a gigabyte-scale system that stores 36 bits in a 74-nucleotide template sequence, and a petabyte-scale system that stores 57 bits in a 152-nucleotide template sequence. The number of bits stored per sequence depends on the amount of error-correction coding (ECC) that is applied. Reducing ECC increases the efficiency of storage. The upper bound theoretical limit represents a maximum efficiency of storage of ~1.58 bits per transition between non-identical nucleotides (Supplementary Note 2). The lower bound theoretical limit represents the minimum number of bits per template sequence that must be stored for only addressing (Supplementary Note 4). See all tested storage systems in Supplementary Table 8. b Flexible-write storage is enabled by a codec, which harnesses diversely synthesized strands. The decoding pipeline supports robust data retrieval from synthesized strands with a significant percentage of errors. Inset: with ten strandC variants, each with ~30% missing nucleotides, the correct decoded sequence can be reconstructed for both gigabyte- and petabyte-scale maximum storage capacities. c A system architecture for storing information in enzymatically synthesized DNA. A bitstream is partitioned into rows, each augmented with an address to delineate its order for reassembly. An ECC such as a Bose–Chaudhuri–Hocquenghem (BCH) code can be applied to each row, or an ECC such as a Reed–Solomon (RS) code can be applied across multiple rows, to protect data from errors (Supplementary Note 2). Modulation consists of mapping sequences of bits to template sequences, which include synchronization nucleotides. Enzymatic synthesis then produces multiple diverse strandsC per template sequence. The resulting strandsC are used for sequence reconstruction based on MAP estimation and probabilistic consensus. Subsequently, the reconstructed sequence is demodulated into bits. Error correction is applied to ensure data retrieval. Source data for b are provided in the Source Data file package.
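The efficiency figures in panel a can be checked with a few lines of arithmetic. The upper bound of about 1.58 bits per transition is consistent with log2(3), since each transition selects one of the three nucleotides that differ from the previous one; that reading is an inference from the caption, and the derivation itself is in Supplementary Note 2. The dictionary labels below are hypothetical names for the four systems listed in the caption.

```python
import math

# Each transition picks one of 3 non-identical nucleotides: log2(3) bits.
BITS_PER_TRANSITION = math.log2(3)

# (bits stored, template length in nucleotides), values from the caption.
SYSTEMS = {
    "experimental, 8 nt": (12, 8),
    "experimental, 16 nt": (16, 16),
    "simulated, gigabyte scale": (36, 74),
    "simulated, petabyte scale": (57, 152),
}


def efficiency(bits, nucleotides):
    """Bits stored per template nucleotide."""
    return bits / nucleotides


if __name__ == "__main__":
    print(f"upper bound: {BITS_PER_TRANSITION:.3f} bits per transition")
    for name, (bits, nucleotides) in SYSTEMS.items():
        print(f"{name}: {efficiency(bits, nucleotides):.3f} bits per nucleotide")
```

The 8-nucleotide system reaches 1.5 bits per nucleotide, close to the bound, while the scaled systems trade efficiency for the addressing, synchronization, and error-correction overhead described in the caption.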
