Review
Nucleic Acids Res. 2021 Jun 4;49(10):5451-5469.
doi: 10.1093/nar/gkab230.

Uncertainties in synthetic DNA-based data storage

Chengtao Xu et al. Nucleic Acids Res.

Abstract

Deoxyribonucleic acid (DNA) has evolved to be a naturally selected, robust biomacromolecule for storing genetic information, and both biological evolution and various diseases can be traced to uncertainties in DNA-related processes (e.g. replication and expression). Recently, synthetic DNA has emerged as a compelling molecular medium for digital data storage; it is superior to conventional electronic memory devices in theoretical retention time, power consumption, and storage density, among other properties. However, uncertainties in in vitro DNA synthesis and sequencing, along with the conjugation chemistry and preservation conditions, can lead to severe errors and data loss, which limit its practical application. To maintain data integrity, complicated error-correction algorithms and substantial data redundancy are usually required, which can significantly limit the efficiency and scale-up of the technology. Herein, we summarize the general procedures of state-of-the-art DNA-based digital data storage methods (e.g. write, read, and preservation), highlighting the uncertainties involved in each step as well as potential approaches to correct them. We also discuss the challenges yet to be overcome and the research trends in the promising field of DNA-based data storage.

Figures

Figure 1.
Schematic diagram of DNA-based data storage and its operation. (1) The original data was transformed into binary data. (2) The binary data was encoded into corresponding DNA sequences. One common strategy was to split the binary data into chunks, each of which was converted into a specific DNA base sequence. Additional base sequences carrying address information, error-correction information, and primer sites for DNA amplification were also attached to the data sequence (16). Another approach was to split the data into chunks and encode them as sequences with overlapping DNA fragments, which served as both address information and data redundancy for error correction (13). (3) All DNA strands with the specified sequences were synthesized, either chemically or enzymatically. Chemical synthesis included column-based and array-based phosphoramidite methods, while enzymatic synthesis relied on polymerases such as the template-independent terminal transferase. (4) The synthesized DNA samples were stored in aqueous solution or encapsulated in silica beads for preservation. (5) Data was retrieved by selectively extracting or amplifying the relevant DNA strands in the pool. This could be achieved by using magnetic beads with specific ligands or by polymerase chain reaction. (6) The relevant DNA sequences were read by a DNA sequencer, which was usually based on a sequencing-by-synthesis method or a nanopore. The DNA base sequences were then decoded to recover the original binary data.
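Steps (1), (2), and (6) of Figure 1 can be illustrated with a minimal Python sketch. It assumes a fixed mapping of two bits per base and a one-byte address prefix per chunk; the mapping table, the chunk size, and all function names are illustrative choices, not the scheme used in the cited works, and error-correction and adapter sequences are omitted.

```python
# Assumption: 2 bits per nucleotide with this arbitrary table (illustrative only).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data):
    """Convert bytes to a base sequence, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(seq):
    """Inverse of bytes_to_dna (sequence length must be a multiple of 4)."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def split_into_strands(data, chunk_size=4):
    """Split data into chunks and prefix each with a 1-byte address.

    Assumption: at most 256 chunks, so the address fits in one byte.
    """
    strands = []
    for start in range(0, len(data), chunk_size):
        address = bytes([start // chunk_size])
        strands.append(bytes_to_dna(address + data[start:start + chunk_size]))
    return strands


def reassemble(strands):
    """Decode strands in any order and rebuild the data using the addresses."""
    decoded = sorted((dna_to_bytes(s) for s in strands), key=lambda c: c[0])
    return b"".join(chunk[1:] for chunk in decoded)


strands = split_into_strands(b"hello DNA")
# Strands in a pool are unordered, so decode them in reverse to show
# that the address prefix is enough to restore the original order.
restored = reassemble(list(reversed(strands)))
```

The address prefix is what allows reassembly from an unordered pool; a practical design would also constrain GC content and homopolymer runs, which this sketch ignores.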
Figure 2.
Encoding for DNA-based data storage. (A) Schematic of a DNA strand with the encoded address information, error-correction information (RS codes), data (payload), and adapters (ID primer). The address represented the relative location of the payload in the original data, which was required for data reassembly. The error-correction segment contained additional information that ensured error-free data recovery. Adapter sequences at both ends provided binding sites for complementary primers, enabling PCR amplification and random access to specific data (16). (B) Schematic of the composite encoding strategy based on degenerate bases. Instead of directly mapping binary data onto the four bases, the information was encoded by the ratio of A, T, C, and G at the same site on different DNA strands. For example, the letter R could be represented by A and G in a 1:1 ratio. This composite strategy improved the theoretical information density of DNA-based storage (54). (C) Schematic of an encoding method based on transitions between bases (e.g. C to G), which was applied to enzymatic synthesis with low accuracy but high speed. The data was represented by the transition from one nucleotide to another in the sequence. For instance, the transition from C to G represented '1' and the transition from A to T represented '2' (61). (D) Schematic of an encoding method for a nanopore-based DNA database. The data was encoded by oligonucleotide hairpins of two lengths, so that two types of electronic signals could be recorded when they passed through the nanopore. By monitoring the current signal as the DNA passed through the nanopore, the binary data could be obtained (80).
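The transition-based idea in panel (C) can be sketched in Python as follows. This is a hedged illustration, not the cited method: it assumes that each ternary digit selects one of the three bases that differ from the previous base, taken in alphabetical order (a convention chosen here because it reproduces the two examples in the caption), and that the dominant synthesis error is an unintended repeat of the same base. The function names and the `start` parameter are hypothetical.

```python
from itertools import groupby

BASES = "ACGT"


def encode_trits(trits, start="A"):
    """Encode ternary digits (0, 1, 2) as base transitions.

    Assumption: digit t picks the t-th base, in alphabetical order,
    among the three bases that differ from the previous base.
    """
    previous, out = start, []
    for trit in trits:
        choices = [base for base in BASES if base != previous]
        previous = choices[trit]
        out.append(previous)
    return "".join(out)


def collapse_runs(read):
    """Collapse repeated bases, since only transitions carry data."""
    return "".join(base for base, _ in groupby(read))


def decode_trits(read, start="A"):
    """Recover the ternary digits from a read with arbitrary run lengths.

    Raises ValueError if the first collapsed base equals `start`.
    """
    previous, trits = start, []
    for base in collapse_runs(read):
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    return trits


encoded = encode_trits([1, 2], start="C")    # "GT": C to G is 1, G to T is 2
decoded = decode_trits("GGGTT", start="C")   # [1, 2] despite repeated bases
```

Because repeated bases are collapsed before decoding, the number of times the enzyme adds each nucleotide does not affect the recovered digits, which is one plausible reason such a code suits fast but imprecise synthesis.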
Figure 3.
Error-correction methods for the DNA database. (A) Reed-Solomon codes multiplied the data block by a pre-calculated encoding matrix and appended the resulting redundant information to the end of the block (108). (B) Fountain codes separated the data into many segments, selected segments using a special distribution function, and packaged them into many 'droplets'. Sequences that failed the screening step were excluded, and droplets of good quality were used for oligonucleotide synthesis (16). (C) Overlapping was a simple but effective method to avoid errors. Repeated pieces of the sequence, arranged in a shifting pattern, were included in different oligonucleotide strands so that multiple copies were available (for example, the figure shows four-fold redundancy) (13). (D) An exclusive-or operation between two information strands generated a third strand. Any two of the three strands could restore the remaining one (26).
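The exclusive-or redundancy in panel (D) can be demonstrated with a short Python sketch. It assumes the two information strands are represented as equal-length byte payloads before they are mapped to bases; the payload values and the helper name `xor_bytes` are illustrative only.

```python
def xor_bytes(a, b):
    """Byte-wise exclusive-or of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Two information payloads (illustrative values) and their XOR strand.
strand_a = b"DNA!"
strand_b = b"data"
strand_x = xor_bytes(strand_a, strand_b)

# If any one strand is lost, the other two restore it.
recovered_a = xor_bytes(strand_x, strand_b)
recovered_b = xor_bytes(strand_a, strand_x)
```

Recovery works because XOR is its own inverse: (a XOR b) XOR b equals a. The cost is one extra strand for every two information strands, and the scheme tolerates the loss of only one strand per group of three.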
Figure 4.
Schematics of various DNA immobilization methods. (A) DNA immobilized on a substrate by electrostatic adsorption between polycations on the surface and the negatively charged phosphate groups of the DNA molecules. (B) Immobilization of thiol-modified DNA on gold nanoparticles via Au–S bonding. (C) Immobilization of aminated DNA strands on the surface of an aldehyde-modified substrate. The reaction between the amine group and the aldehyde group formed a Schiff base. (D) Immobilization of biotinylated DNA strands on a streptavidin-modified substrate.
Figure 5.
Schematics of long-term DNA preservation. (A) Preservation of layer-by-layer assembled DNA, in which a cationic polymer and anionic DNA were organized into a layer-by-layer structure on the surface of microparticles through electrostatic interaction (116). (B) DNA preservation by encapsulation in silica beads that were then dispersed in polycaprolactone fibres. The DNA strands encoded the data files that were used for 3D printing of a rabbit model. The DNA molecules could later be extracted, amplified, and sequenced to retrieve the encoded files (179). Reproduced with permission (179). Copyright 2020, Springer Nature.
