Review
ACS Nano. 2022 Nov 22;16(11):17552-17571.
doi: 10.1021/acsnano.2c06748. Epub 2022 Oct 18.

Emerging Approaches to DNA Data Storage: Challenges and Prospects

Andrea Doricchi et al. ACS Nano.

Abstract

With the total amount of worldwide data skyrocketing, the global data storage demand is predicted to grow to 1.75 × 10¹⁴ GB by 2025. Traditional storage methods have difficulties keeping pace given that current storage media have a maximum density of 10³ GB/mm³. As such, data production will far exceed the capacity of currently available storage methods. The costs of maintaining and transferring data, as well as the limited lifespans and significant data losses associated with current technologies also demand advanced solutions for information storage. Nature offers a powerful alternative through the storage of information that defines living organisms in unique orders of four bases (A, T, C, G) located in molecules called deoxyribonucleic acid (DNA). DNA molecules as information carriers have many advantages over traditional storage media. Their high storage density, potentially low maintenance cost, ease of synthesis, and chemical modification make them an ideal alternative for information storage. To this end, rapid progress has been made over the past decade by exploiting user-defined DNA materials to encode information. In this review, we discuss the most recent advances of DNA-based data storage with a major focus on the challenges that remain in this promising field, including the current intrinsic low speed in data writing and reading and the high cost per byte stored. Alternatively, data storage relying on DNA nanostructures (as opposed to DNA sequence) as well as on other combinations of nanomaterials and biomolecules is proposed with promising technological and economic advantages. In summarizing the advances that have been made and underlining the challenges that remain, we provide a roadmap for the ongoing research in this rapidly growing field, which will enable the development of technological solutions to the global demand for superior storage methodologies.

Keywords: DNA; DNA nanostructure; DNA preservation; costs; data storage; decoding; error correction; random access; reading; sequencing.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
General strategy for DNA data storage, wherein the data is stored directly in the sequence of the oligonucleotides. The six main steps—encoding, writing, storage, access, reading, and decoding—are depicted.
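The encoding and decoding steps of this pipeline can be illustrated with a minimal sketch. It assumes the simplest possible mapping of two bits per nucleotide (A=00, C=01, G=10, T=11); this mapping and the function names are illustrative assumptions, not the scheme used in the review, and practical encoders also handle constraints such as homopolymer runs and add error correction.

```python
# Assumed mapping for illustration only: A=00, C=01, G=10, T=11.
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Translate bytes into a nucleotide string, 4 bases per byte."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)  # most significant bit pair first
    )


def decode(sequence: str) -> bytes:
    """Invert encode(); expects a length that is a multiple of 4."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)          # CAGACGGC
    print(decode(strand))  # b'Hi'
```

The physical steps between the two functions (writing, storage, access, reading) are where errors and sequence loss arise, which is why real schemes wrap such a mapping in error-correcting codes.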
Figure 2
Comparison of the main differences between sequence-based (A,B) and structure-based DNA data storage (C,D), as has been presented in the literature to date. (A,B) Sequence-based storage relies on the de novo synthesis of DNA strands and the subsequent sequencing of these entities is performed using next-generation methods. Image adapted with permission from ref (12). Copyright 2019 Springer Nature. (C) By contrast, structure-based methods utilize self-assembly, which means that the information is encoded into their three-dimensional shape. Images adapted with permission: ref (21), copyright 2016 Springer Nature; ref (22), under a Creative Commons Attribution 4.0 License (CC BY), copyright 2021 Springer Nature. (D) These shapes can then be read off using single-molecule methods, including fluorescence, atomic force microscopy, and nanopore techniques. Image adapted from ref (23). Copyright 2019 American Chemical Society.
Figure 3
An overview of chemical and enzymatic strategies to synthesize custom DNA sequences. (A) Phosphoramidite synthesis—the most widely used chemical strategy for the synthesis of DNA—involves the sequential addition of nucleotides to a growing chain anchored on a solid support. Protecting groups are employed to ensure that no more than one nucleotide is added at each step and are then subsequently removed via chemical deblocking. (B) Deblocking can also be performed by electrochemistry. Reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY-NC) from ref (31). Copyright 2021 AAAS. (C) Enzymatic methods relying on T4rnl ligase or TdT can also be used to specifically add bases to a growing oligonucleotide in aqueous environments, which eliminates the need for organic solvents. Image reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (32). Copyright 2021 Elsevier B.V.
Figure 4
Overview of random access strategies to select a subpool of sequences, usually a file, from a large pool. PCR-based addressing methods leverage the high specificity of primers and the exponential amplification of PCR to enrich target sequences by using either a single or multiple PCR runs. Methods using physical separation as a tool to select sequences also rely on the high specificity of short primers or barcode sequences, but remove the desired sequences using magnetic bead extraction or fluorescence-activated sorting. Images adapted from ref (71) and reproduced with permission from ref (75). Copyright 2019 American Chemical Society and copyright 2021 Springer Nature, respectively.
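The addressing idea behind primer-based random access can be sketched in software. The example below is a heavily simplified in silico analogy: it assumes each stored strand is laid out as forward primer + payload + reverse primer site, and it selects strands by exact string matching. The function name, the layout, and the sequences are hypothetical; real PCR involves reverse-complement binding, tolerance to mismatches, and exponential amplification, none of which are modeled here.

```python
def random_access(pool, forward_primer, reverse_primer):
    """Return payloads of strands flanked by the requested primer pair.

    Assumes (hypothetically) that every strand is stored as
    forward_primer + payload + reverse_primer, all read 5' to 3'.
    """
    flank = len(forward_primer) + len(reverse_primer)
    selected = []
    for strand in pool:
        if (
            len(strand) >= flank
            and strand.startswith(forward_primer)
            and strand.endswith(reverse_primer)
        ):
            selected.append(strand[len(forward_primer):len(strand) - len(reverse_primer)])
    return selected


if __name__ == "__main__":
    # Made-up sequences: two "files", each addressed by its own primer pair.
    pool = [
        "ACGTAC" + "GGGATTTCCC" + "TTGACA",  # file 1
        "ACGTAC" + "CCCTAAAGGG" + "TTGACA",  # file 1
        "CATGCA" + "AAAACCCCGG" + "GGATCC",  # file 2
    ]
    print(random_access(pool, "ACGTAC", "TTGACA"))
```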
Figure 5
Overview of next-generation sequencing technologies presently used in DNA data storage. (A) Illumina sequencing generates clusters of identical single-stranded oligonucleotides. As the complement is synthesized using spectrally distinct, fluorescently tagged nucleotides, the identity of each base along the strand can be determined through the color of emission. (B) Oxford Nanopore measurements do not require fluorescent dye molecules. As the oligonucleotide passes through the protein pore, the three-dimensional shape of each base will modulate the ionic current, which results in a current–time trace that corresponds to the specific sequence. Images adapted with permission from ref (85). Copyright 2016 Springer Nature.
Figure 6
Inner–Outer Code. Encoding. The original information is first encoded with an outer code that introduces redundancy and protects against the loss of sequences. In Grass et al. the original information was first grouped into blocks of multiple sequences (light blue). Then, each row was encoded with a Reed–Solomon code that adds redundancy (yellow). The columns correspond to single DNA sequences. These are labeled with a unique index (purple). Each column is then encoded with an inner code that adds logical redundancy on the level of each sequence (green). In general, the inner and outer codes need not add the redundancy separate from the original data, but instead return a modified longer word. Decoding. The original information from the set of noisy sequences (errors marked in red) is retrieved by first decoding the inner code. This removes most errors within the sequences. For large error rates dominated by insertions and deletions, this step may be preceded by a clustering and alignment step that generates sequences with fewer errors from multiple noisy copies. The sequences are ordered by their index. The ordered sequences are then decoded by the outer code. Here, lost sequences correspond to erasures and erroneous sequences to substitutions. These are corrected by the outer code.
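The layering described in this caption can be made concrete with a toy sketch. It deliberately swaps in much weaker codes than the Reed–Solomon codes mentioned above: the outer code here is a single XOR parity sequence (able to restore one lost sequence per block), and the inner code is a one-byte checksum that only detects errors, so a corrupted sequence is discarded and treated as an erasure. All function names and the byte layout (index, payload, checksum) are assumptions made for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def checksum(body: bytes) -> int:
    return sum(body) % 256


def encode_block(chunks: list[bytes]) -> list[bytes]:
    """Encode equal-length chunks (fewer than 255) into indexed sequences."""
    parity = reduce(xor_bytes, chunks)          # outer code: one redundant sequence
    encoded = []
    for index, payload in enumerate(chunks + [parity]):
        body = bytes([index]) + payload         # unique index per sequence
        encoded.append(body + bytes([checksum(body)]))  # inner code: checksum
    return encoded


def decode_block(received: list[bytes], n_chunks: int) -> list[bytes]:
    """Recover the original chunks from an unordered, possibly damaged set."""
    columns = {}
    for seq in received:
        if len(seq) < 2:
            continue
        body, check = seq[:-1], seq[-1]
        # Inner decoding: drop sequences that fail the check (become erasures).
        if checksum(body) == check and body[0] <= n_chunks:
            columns[body[0]] = body[1:]
    # Sequences are ordered by index; missing indices are erasures.
    missing = [i for i in range(n_chunks + 1) if i not in columns]
    if len(missing) > 1:
        raise ValueError("more erasures than one parity sequence can repair")
    if missing and missing[0] < n_chunks:
        # Outer decoding: XOR of all surviving sequences restores the lost one.
        columns[missing[0]] = reduce(xor_bytes, columns.values())
    return [columns[i] for i in range(n_chunks)]


if __name__ == "__main__":
    data = [b"DNA ", b"data", b"keep"]
    sequences = encode_block(data)
    survivors = [sequences[3], sequences[0], sequences[2]]  # sequence 1 lost
    print(decode_block(survivors, n_chunks=3))
```

A Reed–Solomon outer code follows the same pattern but tolerates many erasures and substitutions per block, at the price of more redundancy.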
Figure 7
(A) Cost trend of hard disk drives (HDD), NAND flash-based storage devices, linear tape-open tape cartridges (LTO tape), and optical Blu-ray (BD-RE). Image has been reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (99). Copyright 2018 AIP Publishing LLC. (B) Cost comparison between DNA synthesis for data storage and LTO tape storage. (C–E) Comparison of different DNA synthesis platforms and their characteristic traits. (C) Printing technology is primarily used by Twist and Agilent. (D) Electrochemical synthesis is employed by Custom Array. (E) Antkowiak et al. used light-directed synthesis. (C–E) Images reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (42). Copyright 2020 Springer Nature.
Figure 8
DNA nanostructures as data storage architectures. (A) DNA origami leverages the specific base-pairing motifs of DNA to create arbitrary structures. When a long scaffold strand (several thousand nucleotides in length) is combined with hundreds of short “staple” strands, complementary regions on the different strands will hybridize, thereby folding the scaffold into a desired conformation. These structures can then be examined using (B) atomic force microscopy or (C) electron microscopy, for example. (D) Data can be written onto DNA origami sheets through the site-specific addition of proteins; the data may be read using AFM. (E) Nanoparticles can also be controllably positioned on DNA origami with nanometer-scale resolution, which enables data writing with cryo-EM readout. (A) Image reproduced with permission from ref (108). Copyright Springer Nature 2021. (B) Image reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (109). Copyright 2019 AAAS. (C) Image reproduced with permission from ref (110). Copyright 2020 Springer Nature. (D) Image reproduced with permission from ref (111). Copyright 2010 Springer Nature. (E) Image reproduced with permission from ref (112). Copyright 2010 Wiley-VCH.
Figure 9
Examples of DNA nanostructures for digital information storage. (A) The folding of DNA origami into loop structures upon binding of a biomolecule target generates a shift in the assembly’s electrophoretic mobility. Image adapted with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (114). Copyright 2017 Oxford University Press. (B) The association of different DNA sequences to carbon nanotubes produces an array of morphologies and, therefore, can be used to produce barcodes. Image adapted from ref (116). Copyright 2019 American Chemical Society. (C) Data strings based on regions of varying fluorescence intensities along a DNA nanotube can be read out using single-molecule fluorescence microscopy. Image adapted from ref (117). Copyright 2021 American Chemical Society.
Figure 10
DNA data storage structures relying on nanopore readout. (A) An encrypted “DNA hard drive,” wherein readout may only occur once the correct molecular “keywords” have been added. Streptavidin molecules (gray circle in inset) partially block the nanopore as they translocate, which causes a momentary decrease in the current. Image reproduced from ref (25). Copyright 2020 American Chemical Society. (B) Multilevel barcoding is achievable by exploiting DNA junctions with different sizes, which create current drops of variable magnitude. Image reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (102). Copyright 2021 Wiley-VCH. (C) A DNA barcode with “structural colors” can also be formed by closely packing structural units, which therefore read as one protrusion. These units may be based on either monovalent streptavidin or a DNA cuboid. (D) A nanopore microscope can be used to detect up to 10 structural colors within the same DNA data string. The correct identification of the “color” was verified using fluorescence microscopy, wherein fluorescently labeled (5′-fluorescein) structural units were used. (C,D) Images reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (130). Copyright 2022 Springer Nature.
Figure 11
Tile-based computations and algorithmic self-assembly. (A) Self-assembly by SSTs. From a seed, tiles attach to the frontier of a growing SST lattice according to interaction rules determined by their exposed recognition sequences. (B) An iterated Boolean circuit mimicking the function of a computation to determine whether or not a binary number is a multiple of 3. A long enough lattice will settle into one or another fixed pattern corresponding to the calculation result. (C) The result of four “multiple of 3” tilings. The numbers at the left mark the experiment number. The tilings correctly determine which input numbers have a factor of 3. (A–C) Images adapted with permission from ref (144). Copyright 2019 Springer Nature. (D) A Sierpinski triangle created by a cumulative XOR computation performed by DNA tiles. Sierpinski’s triangle is a fractal pattern, and the self-assembly rule that creates it is Turing complete. Images reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (145). Copyright 2004 PLoS Biology.
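The two computations named in this caption have simple software analogues, sketched below. Neither function reproduces the published tile sets or circuit layout: the first is an assumed remainder-tracking state machine that answers the same "multiple of 3" question for a binary string, and the second builds rows in which each cell is the XOR of its two upper neighbors, which is one common way to generate the Sierpinski pattern. Function names are hypothetical.

```python
def is_multiple_of_3(bits: str) -> bool:
    """Scan a binary string left to right, tracking the remainder modulo 3."""
    remainder = 0
    for bit in bits:
        remainder = (2 * remainder + int(bit)) % 3
    return remainder == 0


def sierpinski_rows(n_rows: int) -> list[list[int]]:
    """Build rows where each cell is the XOR of the two cells above it."""
    if n_rows <= 0:
        return []
    rows = [[1]]
    for _ in range(n_rows - 1):
        padded = [0] + rows[-1] + [0]
        rows.append([padded[i] ^ padded[i + 1] for i in range(len(padded) - 1)])
    return rows


if __name__ == "__main__":
    for number in ("110", "111", "1001"):
        print(number, is_multiple_of_3(number))
    for row in sierpinski_rows(8):
        print("".join("#" if cell else "." for cell in row).center(16))
```

In the tile-based version, each XOR or state update is carried out by the attachment of a tile whose exposed sequences match its neighbors, so the lattice itself records the computation.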

References

    1. Gu M.; Li X.; Cao Y. Optical Storage Arrays: A Perspective for Future Big Data Storage. Light Sci. Appl. 2014, 3 (5), e177. doi: 10.1038/lsa.2014.58.
    2. Carmean D.; Ceze L.; Seelig G.; Stewart K.; Strauss K.; Willsey M. DNA Data Storage and Hybrid Molecular-Electronic Computing. Proceedings of the IEEE 2019, 107 (1), 63–72. doi: 10.1109/JPROC.2018.2875386.
    3. Hilbert M.; López P. The World’s Technological Capacity to Store, Communicate, and Compute Information. Science 2011, 332 (6025), 60–65. doi: 10.1126/science.1200970.
    4. Grass R. N.; Heckel R.; Puddu M.; Paunescu D.; Stark W. J. Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes. Angew. Chem., Int. Ed. 2015, 54 (8), 2552–2555. doi: 10.1002/anie.201411378.
    5. Dabney J.; Knapp M.; Glocke I.; Gansauge M. T.; Weihmann A.; Nickel B.; Valdiosera C.; García N.; Pääbo S.; Arsuaga J. L.; Meyer M. Complete Mitochondrial Genome Sequence of a Middle Pleistocene Cave Bear Reconstructed from Ultrashort DNA Fragments. Proc. Natl. Acad. Sci. U. S. A. 2013, 110 (39), 15758–15763. doi: 10.1073/pnas.1314445110.