Review

Nucleic Acids Res. 2021 Jun 4;49(10):5451-5469.

doi: 10.1093/nar/gkab230.

Uncertainties in synthetic DNA-based data storage

Chengtao Xu¹, Chao Zhao¹, Biao Ma¹, Hong Liu¹

PMID: 33836076
PMCID: PMC8191772
DOI: 10.1093/nar/gkab230

Chengtao Xu et al. Nucleic Acids Res. 2021.

Affiliation

¹ State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, Jiangsu 210096, China.

Abstract

Deoxyribonucleic acid (DNA) has evolved to be a naturally selected, robust biomacromolecule for gene information storage, and biological evolution and various diseases can find their origin in uncertainties in DNA-related processes (e.g. replication and expression). Recently, synthetic DNA has emerged as a compelling molecular medium for digital data storage, and it is superior to conventional electronic memory devices in theoretical retention time, power consumption, storage density, and so forth. However, uncertainties in in vitro DNA synthesis and sequencing, along with its conjugation chemistry and preservation conditions, can lead to severe errors and data loss, which limit its practical application. To maintain data integrity, complicated error correction algorithms and substantial data redundancy are usually required, which can significantly limit the efficiency and scale-up of the technology. Herein, we summarize the general procedures of state-of-the-art DNA-based digital data storage methods (e.g. write, read, and preservation), highlighting the uncertainties involved in each step as well as potential approaches to correct them. We also discuss challenges yet to be overcome and research trends in the promising field of DNA-based data storage.

Figures

**Figure 1.**
Schematic diagram of DNA-based data storage and its operation. (1) The original data was transformed into binary data. (2) Binary data was encoded into corresponding DNA sequences. One common strategy was to split the binary data into chunks, each of which was converted into a specific DNA base sequence. Additional base sequences for addressing, error correction, and DNA amplification were also attached to the data sequence (16). Another approach was to split the data into chunks and convert them into sequences with overlapping DNA fragments, which served as both address information and data redundancy for error correction (13). (3) All DNA strands with the specific sequences were synthesized, either chemically or enzymatically. The chemical synthesis included column-based and array-based phosphoramidite methods, while the enzymatic synthesis relied on polymerases such as template-independent terminal transferase. (4) The synthesized DNA samples were stored in aqueous solution or encapsulated in silica beads for preservation. (5) Data was retrieved by the selective extraction or amplification of relevant DNA strands in the pool. This could be achieved by using magnetic beads with specific ligands or by polymerase chain reaction. (6) The relevant DNA sequences were read by a DNA sequencer, which was usually based on a sequencing-by-synthesis method or a nanopore. The DNA base sequences were then decoded to recover the original binary data.
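
Steps (2) and (6) of the caption above can be illustrated with a minimal sketch. It assumes a naive mapping of two bits per base and a one-byte address prefix per chunk; the function names, the mapping, and the chunk size are hypothetical and are not taken from the cited works, which add error correction, adapters, and sequence constraints.

```python
# Illustrative only: a naive 2-bits-per-base code (assumption, not the
# scheme of refs 13 or 16). A=00, C=01, G=10, T=11.
BASES = "ACGT"

def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of 4."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

def encode_strands(data: bytes, chunk_size: int = 4) -> list:
    """Split data into chunks and prefix each with a one-byte address.

    The one-byte address (hypothetical choice) limits this sketch to
    256 chunks.
    """
    strands = []
    for addr, start in enumerate(range(0, len(data), chunk_size)):
        if addr > 255:
            raise ValueError("too many chunks for a one-byte address")
        chunk = data[start:start + chunk_size]
        strands.append(bytes_to_dna(bytes([addr]) + chunk))
    return strands

def decode_strands(strands: list) -> bytes:
    """Decode strands in any order and reassemble the payload by address."""
    decoded = sorted((dna_to_bytes(s) for s in strands), key=lambda d: d[0])
    return b"".join(d[1:] for d in decoded)
```

Because each strand carries its own address, the pool can be read back in any order: `decode_strands(list(reversed(encode_strands(b"DNA storage"))))` reassembles the original bytes.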

**Figure 2.**
Encoding for DNA-based data storage. (A) Schematics of a DNA strand with the encoded address information, error-correction information (RS codes), data (payload), and adapters (ID primer). The address represented the relative location of the payload in the original data, which was required for data reassembly. The error-correction segment contained additional information that ensured error-free data recovery. Adapter sequences at both ends provided binding sites for complementary primers, enabling PCR amplification and random access to specific data (16). (B) Schematics of the composite encoding strategy based on degenerate bases. Instead of directly mapping binary data onto the four bases, the information was encoded by the ratio of A, T, C, and G at the same site on different DNA strands. For example, the letter R could be represented by A and G in a 1:1 ratio. This composite strategy improved the theoretical information density for DNA-based storage (54). (C) Schematics of an encoding method based on transitions of bases (e.g. C to G), which was applied in enzymatic synthesis having low synthesis accuracy but high speed. The data was represented by the transition of one nucleotide to another in the sequence. For instance, the transition of bases from C to G represented ‘1’ and A to T represented ‘2’ (61). (D) Schematics of an encoding method for a nanopore-based DNA database. The data was encoded by oligonucleotide hairpins of two lengths so that two types of electronic signals could be recorded when they passed through the nanopore. By monitoring the current signals as the DNA passed through the nanopore, binary data could be obtained (80).
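
The density gain claimed for the composite strategy in panel (B) can be made concrete with a short counting sketch. It assumes that each composite letter is a ratio of the four bases written as non-negative integers summing to a resolution `k`; the parameter `k` and the function names are illustrative assumptions, and the figures are an upper bound that ignores synthesis and sequencing noise.

```python
from itertools import product
from math import comb, log2

def composite_letters(k: int) -> list:
    """All ratios (A, C, G, T) of non-negative integers that sum to k."""
    return [r for r in product(range(k + 1), repeat=4) if sum(r) == k]

def bits_per_position(k: int) -> float:
    """Upper bound on bits per strand position at resolution k.

    The number of letters is the number of ways to split k into four
    non-negative parts, C(k + 3, 3).
    """
    return log2(comb(k + 3, 3))

# Resolution 1 recovers the ordinary four-letter alphabet (2 bits).
# Resolution 2 adds 1:1 mixtures such as R = A and G, i.e. (1, 0, 1, 0).
```

At `k = 1` the alphabet is the standard four bases, while at `k = 2` there are ten letters, so each position can in principle carry more than two bits.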

**Figure 3.**
Error correction methods for the DNA database. (A) Reed-Solomon codes multiplied the data block by a pre-calculated encoding matrix, and the redundant information was appended to the end of the result (108). (B) Fountain codes split the data into many segments, selected them according to a special distribution function and packaged them into many ‘droplets’. Droplets that failed the screening procedure were excluded, and only droplets of good quality were used for oligonucleotide synthesis (16). (C) Overlapping was a simple but effective method to avoid errors. Repeated pieces of the sequence, arranged in a shifting pattern, were included in different oligonucleotide strands to provide redundant copies (for example, four-fold redundancy in the figure) (13). (D) An exclusive-or operation between any two information strands generated a third strand, and any two of the three could restore the remaining one (26).
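
The exclusive-or redundancy in panel (D) can be sketched in a few lines. The sketch assumes each base is treated as a two-bit value (A=0, C=1, G=2, T=3) and that strands have equal length; this mapping and the function name are illustrative assumptions rather than the encoding used in the cited work (26).

```python
# Illustrative base-to-bits mapping (assumption): A=0, C=1, G=2, T=3.
BASES = "ACGT"

def xor_strands(s1: str, s2: str) -> str:
    """Return the base-wise XOR of two equal-length strands."""
    if len(s1) != len(s2):
        raise ValueError("strands must have equal length")
    return "".join(
        BASES[BASES.index(x) ^ BASES.index(y)] for x, y in zip(s1, s2)
    )

strand_a = "ACGT"
strand_b = "TTAA"
parity = xor_strands(strand_a, strand_b)  # third, redundant strand

# If strand_a is lost, it can be rebuilt from the two survivors.
recovered_a = xor_strands(parity, strand_b)
```

Since XOR is its own inverse, the same function both generates the parity strand and restores whichever of the three strands is missing.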

**Figure 4.**
Schematics of various DNA immobilization methods. (A) Immobilization of DNA on the substrate through electrostatic adsorption between polycations on the surface and the negatively charged phosphate groups of DNA molecules. (B) Immobilization of thiol-modified DNA on gold nanoparticles via Au–S bonding. (C) Immobilization of aminated DNA strands on the surface of an aldehyde group-modified substrate. The reaction between the amine group and the aldehyde group formed a Schiff base. (D) Immobilization of biotinylated DNA strands on a streptavidin-modified substrate.

**Figure 5.**
Schematics for long-term DNA preservation. (A) Preservation of layer-by-layer assembled DNA, in which cationic polymer and anionic DNA were organized into a layer-by-layer structure on the surface of microparticles through electrostatic interaction (116). (B) DNA preservation by encapsulation in silica beads, which were then dispersed in polycaprolactone fibres. The DNA strands were encoded with data files that were used for 3D printing of a rabbit model. DNA molecules could be further extracted, amplified, and sequenced to recover the encoded files (179). Reproduced with permission (179). Copyright 2020, Springer Nature.

Cited by

Overcoming the High Error Rate of Composite DNA Letters-Based Digital Storage through Soft-Decision Decoding.
Xu Y, Ding L, Wu S, Ruan J. Adv Sci (Weinh). 2024 Aug;11(30):e2402951. doi: 10.1002/advs.202402951. Epub 2024 Jun 14. PMID: 38874370. Free PMC article.
Robust data storage in DNA by de Bruijn graph-based de novo strand assembly.
Song L, Geng F, Gong ZY, Chen X, Tang J, Gong C, Zhou L, Xia R, Han MZ, Xu JY, Li BZ, Yuan YJ. Nat Commun. 2022 Sep 12;13(1):5361. doi: 10.1038/s41467-022-33046-w. PMID: 36097016. Free PMC article.
Controlled enzymatic synthesis of oligonucleotides.
Pichon M, Hollenstein M. Commun Chem. 2024 Jun 18;7(1):138. doi: 10.1038/s42004-024-01216-0. PMID: 38890393. Free PMC article. Review.
DNA Stability in Biodosimetry, Pharmacy and DNA Based Data-Storage: Optimal Storage and Handling Conditions.
Cordsmeier L, Hahn MB. Chembiochem. 2022 Oct 19;23(20):e202200391. doi: 10.1002/cbic.202200391. Epub 2022 Sep 14. PMID: 35972228. Free PMC article.
DNA-DISK: Automated end-to-end data storage via enzymatic single-nucleotide DNA synthesis and sequencing on digital microfluidics.
Li K, Lu X, Liao J, Chen H, Lin W, Zhao Y, Tang D, Li C, Tian Z, Zhu Z, Jiang H, Sun J, Zhang H, Yang C. Proc Natl Acad Sci U S A. 2024 Aug 20;121(34):e2410164121. doi: 10.1073/pnas.2410164121. Epub 2024 Aug 15. PMID: 39145927. Free PMC article.

References

1. Goda K., Kitsuregawa M. The history of storage systems. Proc. IEEE. 2012; 100:1433–1440.
2. Hilbert M., Lopez P. The world's technological capacity to store, communicate, and compute information. Science. 2011; 332:60–65.
3. Reisel D., Gantz J., Rydning J. Data age 2025: the digitization of the world from edge to core. Seagate. 2018; https://www.seagate.com/in/en/our-story/data-age-2025/.
4. Xu Z.-W. Cloud-sea computing systems: Towards thousand-fold improvement in performance per watt for the coming zettabyte era. J. Comput. Sci. Technol. 2014; 29:177–181.
5. Extance A. Could the molecule known for storing genetic information also store the world's data. Nature. 2016; 537:22–24.
