Review

ACS Nano. 2022 Nov 22;16(11):17552-17571.

doi: 10.1021/acsnano.2c06748. Epub 2022 Oct 18.

Emerging Approaches to DNA Data Storage: Challenges and Prospects

Andrea Doricchi¹,², Casey M Platnich³, Andreas Gimpel⁴, Friederikee Horn⁵, Max Earle³, German Lanzavecchia¹,⁶, Aitziber L Cortajarena⁷,⁸, Luis M Liz-Marzán⁷,⁸,⁹, Na Liu¹⁰,¹¹, Reinhard Heckel⁵, Robert N Grass⁴, Roman Krahne¹, Ulrich F Keyser³, Denis Garoli¹

Affiliations

¹ Istituto Italiano di Tecnologia, via Morego 30, I-16163 Genova, Italy.
² Dipartimento di Chimica e Chimica Industriale, Università di Genova, via Dodecaneso 31, 16146 Genova, Italy.
³ Cavendish Laboratory, University of Cambridge, JJ Thomson Avenue, Cambridge CB3 0HE, U.K.
⁴ Institute for Chemical and Bioengineering, ETH Zurich, Vladimir-Prelog-Weg 1, 8093 Zurich, Switzerland.
⁵ Technical University of Munich, Department of Electrical and Computer Engineering Munchen, Bayern, DE 80333, Germany.
⁶ Dipartimento di Fisica, Università di Genova, via Dodecaneso 33, 16146 Genova, Italy.
⁷ Center for Cooperative Research in Biomaterials (CICbiomaGUNE), Basque Research and Technology Alliance (BRTA), Paseo de Miramón 194, 20014 Donostia-San Sebastián, Spain.
⁸ Ikerbasque, Basque Foundation for Science, 48009 Bilbao, Spain.
⁹ Biomedical Research Networking Center in Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), Av. Monforte de Lemos, 3-5. Pabellón 11. Planta 0, 28029 Madrid, Spain.
¹⁰ Second Physics Institute, University of Stuttgart, 70569 Stuttgart, Germany.
¹¹ Max Planck Institute for Solid State Research, 70569 Stuttgart, Germany.

PMID: 36256971
PMCID: PMC9706676
DOI: 10.1021/acsnano.2c06748

Andrea Doricchi et al. ACS Nano. 2022.

Abstract

With the total amount of worldwide data skyrocketing, the global data storage demand is predicted to grow to 1.75 × 10¹⁴ GB by 2025. Traditional storage methods have difficulties keeping pace given that current storage media have a maximum density of 10³ GB/mm³. As such, data production will far exceed the capacity of currently available storage methods. The costs of maintaining and transferring data, as well as the limited lifespans and significant data losses associated with current technologies, also demand advanced solutions for information storage. Nature offers a powerful alternative through the storage of information that defines living organisms in unique orders of four bases (A, T, C, G) located in molecules called deoxyribonucleic acid (DNA). DNA molecules as information carriers have many advantages over traditional storage media. Their high storage density, potentially low maintenance cost, ease of synthesis, and chemical modification make them an ideal alternative for information storage. To this end, rapid progress has been made over the past decade by exploiting user-defined DNA materials to encode information. In this review, we discuss the most recent advances of DNA-based data storage with a major focus on the challenges that remain in this promising field, including the current intrinsic low speed in data writing and reading and the high cost per byte stored. Alternatively, data storage relying on DNA nanostructures (as opposed to DNA sequence), as well as on other combinations of nanomaterials and biomolecules, is proposed with promising technological and economic advantages. In summarizing the advances that have been made and underlining the challenges that remain, we provide a roadmap for the ongoing research in this rapidly growing field, which will enable the development of technological solutions to the global demand for superior storage methodologies.

Keywords: DNA; DNA nanostructure; DNA preservation; costs; data storage; decoding; error correction; random access; reading; sequencing.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

**Figure 1**
General strategy for DNA data storage, wherein the data is stored directly in the sequence of the oligonucleotides. The six main steps—encoding, writing, storage, access, reading, and decoding—are depicted.
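
The encoding and decoding steps of this pipeline can be illustrated with a minimal Python sketch that maps each pair of bits to one base and back. The mapping table and the function names `encode` and `decode` are assumptions made for illustration, not the scheme used in the review; practical encoders also add error correction and sequence constraints.

```python
# Hypothetical 2-bits-per-base mapping (an assumption for illustration only).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Encode bytes as a DNA string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Invert encode(): turn bases back into bits, then bits into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

With this table, `encode(b"Hi")` gives `"CAGACGGC"`, and `decode` returns the original bytes. The writing, storage, access, and reading steps are physical processes and are not modeled here.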

**Figure 2**
Comparison of the main differences between sequence-based (A,B) and structure-based DNA data storage (C,D), as has been presented in the literature to date. (A,B) Sequence-based storage relies on the *de novo* synthesis of DNA strands and the subsequent sequencing of these entities is performed using next-generation methods. Image adapted with permission from ref (12). Copyright 2019 Springer Nature. (C) By contrast, structure-based methods utilize self-assembly, which means that the information is encoded into their three-dimensional shape. Images adapted with permission: ref (21), copyright 2016 Springer Nature; ref (22), under a Creative Commons Attribution 4.0 License (CC BY), copyright 2021 Springer Nature. (D) These shapes can then be read off using single-molecule methods, including fluorescence, atomic force microscopy, and nanopore techniques. Image adapted from ref (23). Copyright 2019 American Chemical Society.

**Figure 3**
An overview of chemical and enzymatic strategies to synthesize custom DNA sequences. (A) Phosphoramidite synthesis—the most widely used chemical strategy for the synthesis of DNA—involves the sequential addition of nucleotides to a growing chain anchored on a solid support. Protecting groups are employed to ensure that no more than one nucleotide is added at each step and are then subsequently removed via chemical deblocking. (B) Deblocking can also be performed by electrochemistry. Reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY-NC) from ref (31). Copyright 2021 AAAS. (C) Enzymatic methods relying on T4rnl ligase or TdT can also be used to specifically add bases to a growing oligonucleotide in aqueous environments, which eliminates the need for organic solvents. Image reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (32). Copyright 2021 Elsevier B.V.

**Figure 4**
Overview of random access strategies to select a subpool of sequences, usually a file, from a large pool. PCR-based addressing methods leverage the high specificity of primers and the exponential amplification of PCR to enrich target sequences by using either a single or multiple PCR runs. Methods using physical separation as a tool to select sequences also rely on the high specificity of short primers or barcode sequences, but remove the desired sequences using magnetic bead extraction or fluorescence-activated sorting. Images adapted from ref (71) and reproduced with permission from ref (75). Copyright 2019 American Chemical Society and copyright 2021 Springer Nature, respectively.
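
As a rough in-silico analogy for primer-based addressing, the sketch below selects the strands of one file from a pool by exact matching of a flanking primer pair. The function names, the exact-match rule, and the toy sequences are assumptions; real PCR selection depends on hybridization kinetics and tolerates mismatches, which this sketch ignores.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def select_file(pool: list[str], fwd_primer: str, rev_primer: str) -> list[str]:
    """Return the payloads of strands flanked by the given primer pair.

    A strand matches if it starts with the forward primer and ends with
    the reverse complement of the reverse primer (idealized, exact match).
    """
    tail = reverse_complement(rev_primer)
    selected = []
    for strand in pool:
        long_enough = len(strand) >= len(fwd_primer) + len(tail)
        if long_enough and strand.startswith(fwd_primer) and strand.endswith(tail):
            selected.append(strand[len(fwd_primer):len(strand) - len(tail)])
    return selected
```

Each file in the pool would be assigned its own primer pair, so calling `select_file` with a different pair retrieves a different subpool.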

**Figure 5**
Overview of next-generation sequencing technologies presently used in DNA data storage. (A) Illumina sequencing generates clusters of identical single-stranded oligonucleotides. As the complement is synthesized using spectrally distinct, fluorescently tagged nucleotides, the identity of each base along the strand can be determined through the color of emission. (B) Oxford Nanopore measurements do not require fluorescent dye molecules. As the oligonucleotide passes through the protein pore, the three-dimensional shape of each base will modulate the ionic current, which results in a current–time trace that corresponds to the specific sequence. Images adapted with permission from ref (85). Copyright 2016 Springer Nature.

**Figure 6**
Inner–Outer Code. **Encoding**. The original information is first encoded with an outer code that introduces redundancy and protects against the loss of sequences. In Grass et al. the original information was first grouped into blocks of multiple sequences (light blue). Then, each row was encoded with a Reed–Solomon code that adds redundancy (yellow). The columns correspond to single DNA sequences. These are labeled with a unique index (purple). Each column is then encoded with an inner code that adds logical redundancy on the level of each sequence (green). In general, the inner and outer codes need not add the redundancy separate from the original data, but instead return a modified longer word. **Decoding**. The original information from the set of noisy sequences (errors marked in red) is retrieved by first decoding the inner code. This removes most errors within the sequences. For large error rates dominated by insertions and deletions, this step may be preceded by a clustering and alignment step that generates sequences with fewer errors from multiple noisy copies. The sequences are ordered by their index. The ordered sequences are then decoded by the outer code. Here, lost sequences correspond to erasures and erroneous sequences to substitutions. These are corrected by the outer code.
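
The layered structure described in this caption can be sketched in Python under strong simplifications. Here a single XOR parity sequence stands in for the Reed–Solomon outer code, and a one-byte checksum stands in for the inner code; the checksum only detects errors, so a corrupted sequence is treated as an erasure rather than corrected. All function names are hypothetical, and the sketch works on bytes rather than bases.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_block(chunks: list[bytes]) -> list[bytes]:
    """Turn equal-length payloads into indexed, checksummed sequences."""
    parity = reduce(xor_bytes, chunks)        # outer code: one parity sequence
    sequences = []
    for index, payload in enumerate(chunks + [parity]):
        body = bytes([index]) + payload       # unique index per sequence
        checksum = sum(body) % 256            # inner code: detection only
        sequences.append(body + bytes([checksum]))
    return sequences


def decode_block(received: list[bytes], n_chunks: int) -> list[bytes]:
    """Recover the payloads if at most one sequence is lost or corrupted."""
    good = {}
    for seq in received:
        if len(seq) < 2:
            continue
        body, checksum = seq[:-1], seq[-1]
        if sum(body) % 256 == checksum:       # inner check passed
            good[body[0]] = body[1:]          # order sequences by index
    missing = [i for i in range(n_chunks + 1) if i not in good]
    if len(missing) > 1:
        raise ValueError("too many erasures for a single parity sequence")
    if missing:
        # XOR of all surviving sequences reproduces the missing one.
        good[missing[0]] = reduce(xor_bytes, good.values())
    return [good[i] for i in range(n_chunks)]
```

A Reed–Solomon outer code, as used in Grass et al., can tolerate many more erasures and substitutions than this single-parity stand-in.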

**Figure 7**
(A) Cost trend of hard disk drives (HDD), NAND flash-based storage devices, linear tape-open tape cartridges (LTO tape), and optical Blu-ray (BD-RE). Image has been reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (99). Copyright 2018 AIP Publishing LLC. (B) Cost comparison between DNA synthesis for data storage and LTO tape storage. (C–E) Comparison of different DNA synthesis platforms and their characteristic traits. (C) Printing technology is primarily used by Twist and Agilent. (D) Electrochemical synthesis is employed by Custom Array. (E) Antkowiak et al. used light-directed synthesis. (C–E) Images reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (42). Copyright 2020 Springer Nature.

**Figure 8**
DNA nanostructures as data storage architectures. (A) DNA origami leverages the specific base-pairing motifs of DNA to create arbitrary structures. When a long scaffold strand (several thousand nucleotides in length) is combined with hundreds of short “staple” strands, complementary regions on the different strands will hybridize, thereby folding the scaffold into a desired conformation. These structures can then be examined using (B) atomic force microscopy or (C) electron microscopy, for example. (D) Data can be written onto DNA origami sheets through the site-specific addition of proteins; the data may be read using AFM. (E) Nanoparticles can also be controllably positioned on DNA origami with nanometer-scale resolution, which enables data writing with cryo-EM readout. (A) Image reproduced with permission from ref (108). Copyright Springer Nature 2021. (B) Image reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (109). Copyright 2019 AAAS. (C) Image reproduced with permission from ref (110). Copyright 2020 Springer Nature. (D) Image reproduced with permission from ref (111). Copyright 2010 Springer Nature. (E) Image reproduced with permission from ref (112). Copyright 2010 Wiley-VCH.

**Figure 9**
Examples of DNA nanostructures for digital information storage. (A) The folding of DNA origami into loop structures upon binding of a biomolecule target generates a shift in the assembly’s electrophoretic mobility. Image adapted with permission under a Creative Commons Attribution 4.0 license (CC BY) from ref (114). Copyright 2017 Oxford University Press. (B) The association of different DNA sequences to carbon nanotubes produces an array of morphologies and, therefore, can be used to produce barcodes. Image adapted from ref (116). Copyright 2019 American Chemical Society. (C) Data strings based on regions of varying fluorescence intensities along a DNA nanotube can be read out using single-molecule fluorescence microscopy. Image adapted from ref (117). Copyright 2021 American Chemical Society.

**Figure 10**
DNA data storage structures relying on nanopore readout. (A) An encrypted “DNA hard drive,” wherein readout may only occur once the correct molecular “keywords” have been added. Streptavidin molecules (gray circle in inset) partially block the nanopore as they translocate, which causes a momentary decrease in the current. Image reproduced from ref (25). Copyright 2020 American Chemical Society. (B) Multilevel barcoding is achievable by exploiting DNA junctions with different sizes, which create current drops of variable magnitude. Image reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (102). Copyright 2021 Wiley-VCH. (C) A DNA barcode with “structural colors” can also be formed by closely packing structural units, which therefore read as one protrusion. These units may be based on either monovalent streptavidin or a DNA cuboid. (D) A nanopore microscope can be used to detect up to 10 structural colors within the same DNA data string. The correct identification of the “color” was verified using fluorescence microscopy, wherein fluorescently labeled (5′-fluorescein) structural units were used. (C,D) Images reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (130). Copyright 2022 Springer Nature.

**Figure 11**
Tile-based computations and algorithmic self-assembly. (A) Self-assembly by SSTs. From a seed, tiles attach to the frontier of a growing SST lattice according to interaction rules determined by their exposed recognition sequences. (B) An iterated Boolean circuit mimicking the function of a computation to determine whether or not a binary number is a multiple of 3. A long enough lattice will settle into one or another fixed pattern corresponding to the calculation result. (C) The result of four “multiple of 3” tilings. The numbers at the left mark the experiment number. The tilings correctly determine which input numbers have a factor of 3. (A–C) Images adapted with permission from ref (144). Copyright 2019 Springer Nature. (D) A Sierpinski triangle created by a cumulative XOR computation performed by DNA tiles. Sierpinski’s triangle is a fractal pattern, and the self-assembly rule that creates it is Turing complete. Images reproduced with permission under a Creative Commons Attribution 4.0 License (CC BY) from ref (145). Copyright 2004 PLoS Biology.
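
The two computations named in this caption have simple software analogues, sketched below in Python. The functions are illustrative assumptions and do not model tile chemistry: `sierpinski_rows` applies the XOR rule row by row, and `is_multiple_of_3` uses a standard remainder-tracking state machine rather than the iterated Boolean circuit of ref (144).

```python
def sierpinski_rows(n_rows: int) -> list[list[int]]:
    """Build rows where each cell is the XOR of its two upper neighbors."""
    if n_rows <= 0:
        return []
    rows = [[1]]                               # the seed
    for _ in range(n_rows - 1):
        padded = [0] + rows[-1] + [0]          # empty sites beyond the edges
        rows.append([padded[i] ^ padded[i + 1] for i in range(len(padded) - 1)])
    return rows


def is_multiple_of_3(bits: str) -> bool:
    """Check divisibility by 3 of a binary string, most significant bit first."""
    remainder = 0
    for bit in bits:
        remainder = (2 * remainder + int(bit)) % 3
    return remainder == 0


if __name__ == "__main__":
    # Print a small triangle; 1s are drawn as '#', 0s as spaces.
    for row in sierpinski_rows(8):
        print("".join("#" if cell else " " for cell in row))
```

For example, `is_multiple_of_3("110")` is true (6) and `is_multiple_of_3("111")` is false (7).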

Comment in

Comment on "Emerging Approaches to DNA Data Storage: Challenges and Prospects".
Armeni P, Gatti A. ACS Nano. 2022 Dec 27;16(12):19612-19613. doi: 10.1021/acsnano.2c11397. PMID: 36598758. No abstract available.

References

1. Gu M.; Li X.; Cao Y. Optical Storage Arrays: A Perspective for Future Big Data Storage. Light Sci. Appl. 2014, 3 (5), e177. 10.1038/lsa.2014.58.
2. Carmean D.; Ceze L.; Seelig G.; Stewart K.; Strauss K.; Willsey M. DNA Data Storage and Hybrid Molecular-Electronic Computing. Proceedings of the IEEE 2019, 107 (1), 63–72. 10.1109/JPROC.2018.2875386.
3. Hilbert M.; López P. The World’s Technological Capacity to Store, Communicate, and Compute Information. Science 2011, 332 (6025), 60–65. 10.1126/science.1200970.
4. Grass R. N.; Heckel R.; Puddu M.; Paunescu D.; Stark W. J. Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes. Angew. Chem., Int. Ed. 2015, 54 (8), 2552–2555. 10.1002/anie.201411378.
5. Dabney J.; Knapp M.; Glocke I.; Gansauge M. T.; Weihmann A.; Nickel B.; Valdiosera C.; García N.; Pääbo S.; Arsuaga J. L.; Meyer M. Complete Mitochondrial Genome Sequence of a Middle Pleistocene Cave Bear Reconstructed from Ultrashort DNA Fragments. Proc. Natl. Acad. Sci. U. S. A. 2013, 110 (39), 15758–15763. 10.1073/pnas.1314445110.
