Nature. 2013 Feb 7;494(7435):77-80.
doi: 10.1038/nature11875. Epub 2013 Jan 23.

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA


Nick Goldman et al. Nature. 2013.

Abstract

Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage because of its capacity for high-density information encoding, longevity under easily achieved conditions and proven track record as an information bearer. Previous DNA-based information storage approaches encoded only trivial amounts of information or were not amenable to scaling up, used no robust error correction, and did not examine their cost-efficiency for large-scale information archival. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage, with an estimated Shannon information of 5.2 × 10^6 bits, into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.


Figures

Figure 1. Digital information encoding in DNA
Digital information (a, in blue), here binary digits holding the ASCII codes for part of Shakespeare’s sonnet 18, was converted to base-3 (b, red) using a Huffman code that replaces each byte with five or six base-3 digits (trits). This in turn was converted in silico to our DNA code (c, green) by replacement of each trit with one of the three nucleotides different from the previous one used, ensuring no homopolymers were generated. This formed the basis for a large number of overlapping segments of length 100 bases with overlap of 75 bases, creating fourfold redundancy (d, green and, with alternate segments reverse complemented for added data security, violet). Indexing DNA codes were added (yellow), also encoded as non-repeating DNA nucleotides. See Supplementary Information for further details.
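
The trit-to-nucleotide step and the overlapping-segment step described in the caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the rotation rule used to pick "one of the three nucleotides different from the previous one", the starting nucleotide `A`, and all function names are assumptions made for the example. The Huffman byte-to-trit step and the indexing codes are omitted.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trits_to_dna(trits, prev="A"):
    """Map each trit (0, 1, 2) to a base that differs from the previous base.

    Assumed rule: step 1, 2 or 3 positions around the cycle A -> C -> G -> T,
    so the chosen base can never equal the previous one (no homopolymers).
    """
    out = []
    for t in trits:
        if t not in (0, 1, 2):
            raise ValueError(f"not a trit: {t!r}")
        prev = BASES[(BASES.index(prev) + 1 + t) % 4]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna under the same assumed rule and starting base."""
    trits = []
    for base in dna:
        t = (BASES.index(base) - BASES.index(prev) - 1) % 4
        if t == 3:  # only happens when base == prev, i.e. a repeated base
            raise ValueError("repeated base: not a valid encoding")
        trits.append(t)
        prev = base
    return trits


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def overlapping_segments(dna, length=100, step=25):
    """Cut dna into segments of `length` bases starting every `step` bases.

    With length=100 and step=25, neighbouring segments overlap by 75 bases,
    so each interior base is covered by four segments. Every second segment
    is reverse complemented, as in the caption.
    """
    segments = []
    starts = range(0, len(dna) - length + 1, step)
    for i, start in enumerate(starts):
        seg = dna[start:start + length]
        if i % 2 == 1:
            seg = reverse_complement(seg)
        segments.append(seg)
    return segments
```

Because every base depends only on the previous base and the current trit, decoding is a single left-to-right pass, and any repeated base in a read is immediately detectable as an error.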
Figure 2. Scaling properties and robustness of DNA-storage
a, Encoding efficiency and costs change as the amount of stored information increases. The x-axis (logarithmic scale) represents the total amount of information to be encoded. Common data scales are indicated, including the 3 ZB (3 × 10^21 bytes) global data estimate. The black line (y-axis scale to left) indicates encoding efficiency, measured as the proportion of synthesised bases available for data encoding. The blue curves (y-axis scale to right) indicate the corresponding effect on encoding costs, both at current synthesis cost levels (solid line) and in the case of a two-order of magnitude reduction (dashed line). b, Per-recovered-base error rate (y-axis) as a function of sequencing coverage, represented by the percentage of the original 79.6M read-pairs sampled (x-axis; logarithmic scale). The blue curve represents the four files recovered without human intervention: the error is zero when 2% of the original reads are used. The grey curve is derived from our theoretical error rate model. The orange curve represents the file that required manual correction: the minimum possible error rate is 0.0036%. c, Timescales for which DNA-storage is cost-effective. The blue curve indicates the relationship between break-even time beyond which DNA-storage is less expensive than magnetic tape (x-axis) and relative cost of DNA-storage synthesis and tape transfer fixed costs (y-axis), assuming the tape archive has to be read and re-written every 5 years. The orange curve corresponds to tape transfers every 10 years; broken curves correspond to other transfer periods as indicated. In the green-shaded region, DNA-storage is cost-effective when transfers occur more frequently than every 10 years; in the yellow-shaded region, DNA-storage is cost-effective when transfers occur from 5- to 10-yearly; in the red-shaded region tape is less expensive when transfers occur less frequently than every 5 years. Highlighted ranges of relative costs of DNA synthesis to tape transfer are 125–500 (current costs for 1 MB of data), 12.5–50 (achieved if DNA synthesis costs reduce by one order of magnitude) and 1.25–5 (costs reduced by two orders of magnitude). Note the logarithmic scales on both axes. See Supplementary Information for further details.
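
As a rough reading of panel c, the break-even time can be approximated by assuming that DNA synthesis is a one-off cost and that each tape transfer incurs the same fixed cost, so DNA breaks even once the accumulated transfer costs match the synthesis cost. This linear model is an assumption made here for illustration and is not the paper's model, which is described in its Supplementary Information; the function name and parameters are hypothetical.

```python
def break_even_years(cost_ratio, transfer_period_years):
    """Approximate years until DNA storage is cheaper than tape.

    cost_ratio: one-off DNA synthesis cost divided by the fixed cost of a
        single tape transfer (the y-axis quantity in panel c).
    transfer_period_years: how often the tape archive is re-written.

    Assumption: tape costs grow by one transfer cost per period and all
    other costs are ignored, so break-even is ratio * period.
    """
    if cost_ratio <= 0 or transfer_period_years <= 0:
        raise ValueError("both arguments must be positive")
    return cost_ratio * transfer_period_years


# Cost-ratio ranges quoted in the caption, evaluated for 5- and 10-year transfers.
RATIO_RANGES = {
    "current costs": (125, 500),
    "one order of magnitude cheaper": (12.5, 50),
    "two orders of magnitude cheaper": (1.25, 5),
}

for label, (low, high) in RATIO_RANGES.items():
    for period in (5, 10):
        lo_years = break_even_years(low, period)
        hi_years = break_even_years(high, period)
        print(f"{label}, transfers every {period} y: {lo_years} to {hi_years} years")
```

Under this simplification, a two-order-of-magnitude cost reduction with 10-year transfers gives break-even times of at most 50 years, which is in line with the abstract's reference to sub-50-year archiving.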


