A highly parallel strategy for storage of digital information in living cells

Azat Akhmetov et al. BMC Biotechnol. 2018 Oct 17;18(1):64. doi: 10.1186/s12896-018-0476-4.

Abstract

Background: Encoding arbitrary digital information in DNA has attracted attention as a potential avenue for large-scale and long-term data storage. However, enabling DNA data storage technologies requires improvements in storage fidelity (tolerance to mutation), in the ease of writing and reading the data (biases and systematic errors arising from synthesis and sequencing), and in overall scalability.

Results: To this end, we have developed and implemented an encoding scheme suitable for detecting and correcting errors that may arise during storage, writing, and reading, such as nucleotide substitutions, insertions, and deletions. We propose a scheme for parallelized long-term storage of encoded sequences that relies on overlaps rather than the address blocks found in previously published work. Using computer simulations, we illustrate the encoding, sequencing, decoding, and recovery of encoded information, ultimately demonstrating a successful round-trip read/write. These demonstrations show that precise control over error tolerance is possible in principle. Even after simulated degradation of DNA, the original data can be recovered owing to the error correction capabilities built into the encoding strategy. A secondary advantage of our method is that the statistical characteristics (such as repetitiveness and GC composition) of encoded sequences can be tailored without sacrificing the overall ability to store large amounts of data. Finally, the combination of overlap-based partitioning of data with the LZMA compression integral to the encoding means that the entire sequence must be present for successful decoding. This feature enables exceptionally strong encryption. As a potential application, an encrypted pathogen genome could be distributed and carried by cells without danger of being expressed, and could not even be read out in the absence of the entire DNA consortium.
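The all-or-nothing behaviour attributed to LZMA here can be seen directly with Python's standard lzma module. This is only an illustration of the general property of the compression format, not of the authors' pipeline:

```python
import lzma

# LZMA decompression fails outright if any part of the compressed
# stream is missing, which is why the entire assembled sequence must
# be present for decoding to succeed.
data = b"digital information stored in DNA" * 100
compressed = lzma.compress(data)

# The complete stream round-trips exactly.
assert lzma.decompress(compressed) == data

# A truncated stream cannot be decoded at all.
try:
    lzma.decompress(compressed[:-4])
except lzma.LZMAError:
    print("truncated stream: decoding failed as expected")
```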

Conclusions: We have developed a method for DNA encoding, using a significantly different fundamental approach from existing work, which often performs better than alternatives and allows for a great deal of freedom and flexibility of application.


Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
A diagram of the encoding and decoding process. The input data is first wrapped in a tar archive, to ensure a uniform input format as well as combining multiple files into a single contiguous data stream with a well-established method. The digital information is encoded using a pre-generated codebook, producing one long single sequence of DNA. This sequence is split into overlapping packets, each up to 200 bp long, which are then synthesized as a complex pool of oligonucleotides. These can be cloned into plasmids and transformed into cells, where they can be maintained reliably for a very long time. To recover the information, the population of cells (or alternatively plasmids or lyophilized oligonucleotides) can be sequenced with NextGen sequencing technology, and de novo assembly of the resulting reads is performed. During assembly, some errors can be corrected by simply considering the consensus of the contig, whereas systematic errors (such as those arising during synthesis) can be corrected in silico using the error correcting code. Finally, the codebook is used to decode the resulting contig and recover the digital files
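The codebook step of the pipeline above can be sketched with a minimal fixed mapping. The paper's codebook is pre-generated and tunable (for GC content and repetitiveness), so this static two-bits-per-base table is only an illustrative stand-in, not the authors' scheme:

```python
# Hypothetical minimal codebook: 2 bits per nucleotide, 4 bases per byte.
CODEBOOK = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
REVERSE = {v: k for k, v in CODEBOOK.items()}

def encode(data: bytes) -> str:
    """Map each byte to four nucleotides, most significant bits first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(CODEBOOK[(byte >> shift) & 0b11])
    return "".join(bases)

def decode(dna: str) -> bytes:
    """Invert the mapping: every group of four bases becomes one byte."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | REVERSE[base]
        out.append(byte)
    return bytes(out)

assert decode(encode(b"cat")) == b"cat"
```

A real codebook would be chosen so that the emitted sequence meets the statistical constraints discussed below (balanced composition, low repetitiveness).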
Fig. 2
Digital data used for in silico experiments. Left: A 300 × 200 pixel color photo of a cat, encoded with the JPEG algorithm to produce a 10,387-byte file
Fig. 3
Overall self-similarity of the encoded Hamming image. Dot plot of the encoded Hamming image generated with dottup, using word size 20 as the parameter. Positions where 20 bp of the sequence are self-similar are marked with blue. Identical regions longer than 100 bp are marked with red. The plot shows a lack of long stretches of repetition that could interfere with assembly
Fig. 4
Self-similarity at corners. Dot plots of the same encoded DNA, showing only the ends of the sequence, generated with word size 10. Short blocks of repetitive sequence are visible as blue blocks; these result from header and terminator information used by the LZMA algorithm, which is less variable than the compressed data stream itself. Top left: Sequence head vs. itself. Top right and bottom left: Head vs. tail. Bottom right: Sequence tail vs. itself
Fig. 5
Self-similarity of flat file. Dot plot of the entire encoded flat file, generated with dottup with word size 10, showing self-similarity within the entire encoded DNA sequence
Fig. 6
Total nucleotide composition of encoded DNA. Bars show the relative fraction of each nucleotide within DNA obtained by encoding the given digital data
Fig. 7
Local composition. Nucleotide composition in sliding 100 bp window for each sequence of encoded DNA
Fig. 8
Total nucleotide composition error. Total deviation of nucleotide composition from the expected 25% proportion. Shown here is the sum of error within each 100 bp window tiled along the encoded sequence, divided by the sequence length
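The composition-error metric described in this caption can be sketched as follows. The exact tiling and normalisation are assumptions reconstructed from the caption, not the paper's code:

```python
from collections import Counter

def composition_error(seq: str, window: int = 100) -> float:
    """Sum |observed fraction - 0.25| over the four nucleotides in each
    non-overlapping 100 bp window, then divide by the sequence length
    (tiling and normalisation assumed from the Fig. 8 caption)."""
    total = 0.0
    for start in range(0, len(seq) - window + 1, window):
        counts = Counter(seq[start:start + window])
        total += sum(abs(counts.get(b, 0) / window - 0.25) for b in "ACGT")
    return total / len(seq)

# A perfectly balanced sequence has zero deviation in every window.
assert composition_error("ACGT" * 50) == 0.0
```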
Fig. 9
Spurious ORFs in encoded sequence. Top: Histogram showing the distribution of spurious ORFs observed in the DNA sequence for the encoded Hamming image. Middle: Violin plot showing the length distribution of spurious ORFs grouped by reading frame. Frames are marked with a minus (−) if they are on the negative strand (i.e. detected in the reverse complement of the sequence). Bottom: Distribution of spurious ORF start (grey) and stop (black) positions along the sequence
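A spurious-ORF count of this kind can be obtained with a naive six-frame scan. The minimum-length threshold and the ATG-to-first-stop convention below are assumptions for illustration, not the paper's exact criteria:

```python
STOP = {"TAA", "TAG", "TGA"}

def revcomp(seq: str) -> str:
    """Reverse complement, for scanning the negative strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def spurious_orfs(seq: str, min_len: int = 30):
    """Scan all six reading frames for ORFs (ATG to first in-frame
    stop codon) of at least min_len nucleotides. Returns a list of
    (strand, frame, length) tuples."""
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            codons = [s[i:i + 3] for i in range(frame, len(s) - 2, 3)]
            start = None
            for idx, codon in enumerate(codons):
                if codon == "ATG" and start is None:
                    start = idx
                elif codon in STOP and start is not None:
                    length = (idx - start + 1) * 3
                    if length >= min_len:
                        orfs.append((strand, frame, length))
                    start = None
    return orfs
```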
Fig. 10
Mutation buffering by error correcting code. Light gray line shows the effect of applying error correction on sequences mutated to varying degrees. The mutation rate is shown in average mutations per block, given that each block is 4 bp long. Shown here is the mean of 10 simulations (middle line) with ±2.58σ band (shaded area), representing 99% confidence interval. Bottom plot: Subset of the data corresponding to only lower mutation rates
Fig. 11
Distribution of oligos along the encoded DNA. Black bars indicate 200 bp oligos, produced so as to tile the encoded sequence with 175 bp overlaps between successive oligos. The blue graph shows coverage of the DNA by oligos (uniformly 8× virtually everywhere). Only the beginning and end of the sequence are shown; the regions shown are representative of the entire sequence
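The tiling described here (200 bp oligos overlapping by 175 bp, hence a 25 bp step and roughly 200/25 = 8× coverage away from the ends) can be sketched as follows; the handling of the final oligo is an assumption:

```python
def tile_oligos(seq: str, length: int = 200, overlap: int = 175) -> list[str]:
    """Split an encoded sequence into oligos that tile it with the
    stated overlap. The step between successive oligo starts is
    length - overlap, so each interior position is covered by
    length / step oligos."""
    step = length - overlap
    return [seq[i:i + length]
            for i in range(0, max(len(seq) - overlap, 1), step)]

oligos = tile_oligos("ACGT" * 250)  # 1000 bp sequence, 25 bp step
```

Because decoding requires the full sequence, losing any oligo that is not covered by its neighbours breaks recovery; the deep overlap is what makes the pool robust to dropout.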
Fig. 12
Likelihood of successful assembly at varying read and sampling depths. Left: Fraction of base pairs in the longest assembled contig that matched the original sequence after mapping. Right: Fraction of the original sequence that was present in the longest assembled contig
Fig. 13
Monte Carlo simulations of library construction from pools of oligos. A series of simulated random draw experiments were performed for pools of 100 and 500 unique sequences; the number of draws ranged from 1 to 30 times the number of unique sequences. For each combination of parameters, 20 repeat experiments were performed, shown here as lines of identical color
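The random-draw experiment can be sketched as a sampling-with-replacement simulation. The pool sizes and draw multiples follow the caption; the implementation details are assumptions:

```python
import random

def fraction_recovered(pool_size: int, draws: int, seed: int = 0) -> float:
    """Draw uniformly (with replacement) from a pool of unique oligos
    and return the fraction of distinct sequences seen at least once,
    mimicking library construction from a complex oligo pool."""
    rng = random.Random(seed)
    seen = {rng.randrange(pool_size) for _ in range(draws)}
    return len(seen) / pool_size

# At 30x the pool size, the expected missing fraction is about
# (1 - 1/N)**(30*N), roughly e**-30, so recovery is near-complete.
print(fraction_recovered(100, 3000))
```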
Fig. 14
Recovery of original sequence after simulated sampling of the packet pool and simulated sequencing. Each bar shows the fraction of 20 simulated read-write experiments in which the data after decoding matched the encoded data exactly. For black bars, the error correction capacity built into the sequence was ignored, and the assembled contig was decoded as-is. For grey bars, error correction was applied before decoding was attempted
