Nat Chem Biol. 2021 Mar;17(3):246-253.

doi: 10.1038/s41589-020-00711-4. Epub 2021 Jan 11.

Robust direct digital-to-biological data storage in living cells

Sung Sun Yim¹, Ross M McBee¹,², Alan M Song¹, Yiming Huang¹,³, Ravi U Sheth¹,³, Harris H Wang⁴,⁵

Affiliations

¹ Department of Systems Biology, Columbia University, New York, NY, USA.
² Department of Biological Sciences, Columbia University, New York, NY, USA.
³ Integrated Program in Cellular, Molecular, and Biomedical Studies, Columbia University, New York, NY, USA.
⁴ Department of Systems Biology, Columbia University, New York, NY, USA. hw2429@columbia.edu.
⁵ Department of Pathology and Cell Biology, Columbia University, New York, NY, USA. hw2429@columbia.edu.

PMID: 33432236
PMCID: PMC7904632
DOI: 10.1038/s41589-020-00711-4

Sung Sun Yim et al. Nat Chem Biol. 2021 Mar.

Abstract

DNA has been the predominant information storage medium for biology and holds great promise as a next-generation high-density data medium in the digital era. Currently, the vast majority of DNA-based data storage approaches rely on in vitro DNA synthesis. As such, there are limited methods to encode digital data into the chromosomes of living cells in a single step. Here, we describe a new electrogenetic framework for direct storage of digital data in living cells. Using an engineered redox-responsive CRISPR adaptation system, we encoded binary data in 3-bit units into CRISPR arrays of bacterial cells by electrical stimulation. We demonstrate multiplex data encoding into barcoded cell populations to yield meaningful information storage and capacity up to 72 bits, which can be maintained over many generations in natural open environments. This work establishes a direct digital-to-biological data storage framework and advances our capacity for information exchange between silicon- and carbon-based entities.

Conflict of interest statement

H.H.W. is a scientific advisor to SNIPR Biome. The authors declare no additional competing interests.

Figures

**Extended Data Fig. 1. Development of a redox-sensing DNA-based cellular recorder for direct digital-to-biological data storage.**
This system is composed of two distinct modules: (i) a ‘sensing module’ that converts a desired biological signal into a change in copy number of a trigger plasmid (pTrig), and (ii) a ‘writing module’ that overexpresses Cas1-Cas2 from a recording plasmid (pRec) to unidirectionally expand genomic CRISPR arrays with novel ~33 bp spacers acquired from genomic or plasmid DNA sources in the cell. In the presence of the desired signal, cells experience a shift in their intracellular DNA pool, driven by an increase in pTrig copy number, which results in an acquisition bias for pTrig-derived spacers amongst expanding CRISPR arrays. **(a)** The *lacI* gene in the previous pRec was replaced with the *soxR* gene from *E. coli*, and the *lac* promoter in the previous pTrig was replaced with the *soxS* promoter from *E. coli*. The P1 replication system is inactive in the absence of oxidative stress, and a mini-F origin keeps the pTrig plasmid copy number low. Upon induction with oxidative stress, SoxR detaches from the *soxS* promoter and activates the P1 replication system to increase the copy number of the plasmid. **(b)** pTrig copy number in the presence of various concentrations of phenazine methosulfate (PMS) under aerobic conditions. pRec (with an additional copy of the *soxR* gene) yields a higher fold-change in pTrig copy number through more efficient repression in the absence of the inducer. **(c)** pTrig copy numbers in the presence of pRec and various concentrations of PMS, and FCN(R) or FCN(O), under anaerobic conditions. Fold changes of the pTrig copy numbers at the given concentrations of FCN(R) or FCN(O) are plotted. **(d)** Various aTc concentrations and **(e)** induction times for the expression of the *cas1* and *cas2* genes were tested for CRISPR array expansion. **(f)** Various FCN(R) and FCN(O) concentrations were tested for pTrig copy number induction and **(g)** pTrig-derived spacer incorporation. The proportions of pTrig-derived spacers among all newly incorporated spacers are displayed.
All measurements are based on three biological replicates. Error bars represent standard deviation of three biological replicates.

**Extended Data Fig. 2. Construction of a multi-channel electrochemical redox controller.**
**(a)** In an anaerobic chamber, a Raspberry Pi controls three 8-channel relay modules (24 relays in total), which switch the electrical signal from a power supply to each chamber pair on or off, based on a Python script running on a wirelessly connected PC. **(b)** A pair of working and counter chambers is connected by an agar salt bridge. In the working chamber, cells are incubated in M9 minimal medium supplemented with antibiotics, aTc, FCN(R) and PMS. The counter chamber is filled with M9 minimal medium supplemented with FCN(O) and PMS. **(c)** A photograph of the multi-channel electrochemical redox controller in an anaerobic chamber. **(d)** Changes in electrochemical redox states of FCN(R) in a working chamber (left) and FCN(O) in a counter chamber (right) measured by absorbance at 420 nm with (0.5 V) and without (0.0 V) electronic signals. All measurements are based on three replicates. Error bars represent standard deviation of three biological replicates.

**Extended Data Fig. 3. Encoding of 3-bit binary data profiles.**
**(a)** Schematic diagram of experimental steps for multi-round encoding. After each round of electrical stimulation, the cell population was recovered aerobically in rich medium (LB) so that the induced or uninduced plasmid copy number from the previous encoding round could be diluted out and reset to a low level. **(b)** To determine the recovery condition, anaerobic and aerobic conditions were compared. **(c)** Overlaid distributions of the plasmid copy numbers with/without signals at each round over the course of the multi-round encoding (Figure 2b). **(d)** CRISPR array expansion over the course of the experiment. **(e)** The 3-bit binary data profiles are grouped by the number of electronic signals, and the proportions of pTrig-derived spacers among all newly incorporated spacers are displayed. **(f)** To enrich the sequencing reads for expanded arrays with more new spacers (longer arrays), magnetic bead-based size enrichment was performed. Frequencies of arrays of different lengths (unexpanded and L1-L4) with and without size enrichment are plotted. **(g)** Principal component analysis on the array-type frequency profiles for the 3-bit digital data profiles. All 9 independent biological replicates are shown for each 3-bit digital data profile. The first three independent datasets used for training of the Random Forest classifier are highlighted. All measurements are based on two or more biological replicates. Error bars represent standard deviation of three or more biological replicates.

**Extended Data Fig. 4. Performance of a Random Forest classifier for data reconstruction.**
**(a)** Confusion matrix from 10 rounds of cross-validation of the Random Forest classifier, each trained on 2 randomly selected datasets for each 3-bit digital data profile from the 3 independent experiments and tested on the left-out dataset. **(b)** Importance of features (array-types) for the Random Forest classifier in Figure 2f. **(c)** Classification performance as a function of the number of CRISPR arrays. CRISPR arrays with new uniquely mapping spacers were randomly subsampled to various numbers for the 3-bit digital data profiles and classifications were performed. Recall accuracies for distinguishing 8 different types of 3-bit digital data profiles are displayed as a function of the number of expanded arrays with uniquely mapping spacers (grey: all arrays, red: L2/L3 arrays). The number of sequencing reads corresponding to the number of expanded arrays with uniquely mapping spacers (grey: all arrays) is also provided as an additional x-axis. Shaded regions represent 95% confidence interval of 10 iterations of subsampling and classification. **(d)** Recall accuracies for distinguishing 8 different types of 3-bit digital data profiles with varying proportions of randomly selected training datasets for each 3-bit digital data profile. Shaded regions represent 95% confidence interval of 100 iterations of subsampling and classification.

**Extended Data Fig. 5. Barcoding CRISPR arrays for multiplexed encoding.**
**(a)** CRISPR arrays can be barcoded with 8-bp unique sequences either downstream of the first spacer region or within the direct repeat (DR) region. **(b)** CRISPR array expansion rates (relative to the wild-type array) of 69 DR-barcoded CRISPR arrays and 24 spacer-barcoded CRISPR arrays. **(c)** The distribution of array expansion rates of spacer-barcoded CRISPR arrays is much more uniform and consistent than that of DR-barcoded CRISPR arrays. A DR variant (d1) that was more efficient than the wild-type DR sequence in the initial 96-well plate-based test is highlighted. **(d)** The d1 DR variant was tested again in tube culture. Under this condition, however, the DR variant did not show significantly higher activity than the wild-type DR sequence. **(e)** Comparison of CRISPR array expansion rates measured individually or in pool. Shaded region represents 95% confidence interval for linear regression (dashed grey line). Sample sizes (n) and Pearson correlation coefficient (r) are shown. All measurements are based on three biological replicates. Error bars represent standard deviation of three biological replicates.

**Extended Data Fig. 6. Projections on the scale of DRIVES.**
**(a)** Data storage capacity (‘n’ bits of information or ‘n’ rounds of encoding) per cell population is estimated as a function of Cas1-Cas2 activity (‘X’, the proportion of the cell population with arrays expanded by a new spacer after a single round of encoding). Here, an ‘Xⁿ’ proportion of the cell population would have expanded arrays in every round, resulting in ‘n’ new spacers (Ln arrays) after ‘n’ rounds of encoding, and we assumed that the sampling capacity for the Ln array population governs the data storage capacity. We considered various sampling depths ‘D’, where a ‘D’ proportion of the cell population can be sufficiently sampled. This ‘D’ could be affected by many factors including the sequencing depth and size enrichment efficiency. We assumed that if ‘Xⁿ’ is equal to or higher than the given sampling depth constraint ‘D’, ‘n’ bits can be stored and reliably decoded. For example, when 0.001 of the cell population can be sufficiently sampled (D=0.001), the maximum data storage capacity would be 3 bits (n=3) with the current Cas1-Cas2 activity level (X=0.1) as in our current experimental dataset (highlighted in red in the plot). And when 0.0001 of the cell population can be sufficiently sampled (D=0.0001), the maximum data storage capacity would be 4 bits (n=4) with the current Cas1-Cas2 activity level (X=0.1). Although the Illumina MiSeq v2 300 cycles kit used in this study can read only up to 5 new spacers, we assumed that sequencing read length is not the limiting factor in this projection as other long read sequencing technologies could be employed. **(b)** Estimated total data storage capacity across barcoded cell populations as a function of Cas1-Cas2 activity and the number of parallel channels in the culture platform at two different sampling depths (D=0.001 and D=0.00001).
Storing more data per cell population would require more rounds of encoding, which takes longer, and a larger number of parallel channels would require more barcoded cell populations and a more sophisticated design of the culture platform. The current capacity of the system with 24 channels in the culture platform is highlighted in blue in the plot.
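
The projection rule in panel (a) can be written as a short calculation. The following Python is a minimal sketch, not the authors' code: the function names are hypothetical, and multiplying the per-population capacity by the number of channels to obtain the total in panel (b) is an assumption consistent with the 24-channel, 72-bit figure quoted in the abstract.

```python
import math


def max_bits_per_population(x: float, d: float) -> int:
    """Largest n such that x**n >= d (the decoding criterion in the caption).

    x: proportion of cells that gain a new spacer per round (Cas1-Cas2 activity)
    d: proportion of the population that can be sufficiently sampled
    """
    if not (0.0 < x < 1.0) or not (0.0 < d <= 1.0):
        raise ValueError("expected 0 < x < 1 and 0 < d <= 1")
    n = 0
    # Add rounds while the fraction of L(n+1) arrays is still samplable.
    # isclose() guards against float noise when x**(n+1) equals d exactly.
    while x ** (n + 1) >= d or math.isclose(x ** (n + 1), d, rel_tol=1e-9):
        n += 1
    return n


def total_capacity(x: float, d: float, channels: int) -> int:
    # ASSUMPTION: total capacity = bits per population * parallel channels.
    return max_bits_per_population(x, d) * channels


if __name__ == "__main__":
    print(max_bits_per_population(0.1, 0.001))   # caption example: n = 3
    print(max_bits_per_population(0.1, 0.0001))  # caption example: n = 4
    print(total_capacity(0.1, 0.001, 24))
```

With X=0.1 and D=0.001 the sketch returns the 3 bits per population stated in the caption, and 24 channels then give the 72-bit capacity reported for the current system.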

**Extended Data Fig. 7. Design of 6-bit encoding tables for text messages.**
**(a)** Probability of correct classification for each of the 3-bit digital data profiles by the Random Forest classifier on the newly generated independent datasets, calculated based on the result in Figure 2f. **(b)** DEC and OPT encoding tables with estimated probabilities of correct classification for the 64 characters. The OPT 6-bit encoding table was designed by considering the correct classification probability and the usage frequency of the characters (https://mdickens.me/typing/letter_frequency.html). **(c)** Probability of correct decoding for the 64 characters (ordered by usage) with the DEC and OPT 6-bit encoding tables. **(d)** Comparison of predicted probabilities of correct decoding for various text messages based on the two encoding tables. The predicted probability of correct decoding for each character or text message was calculated by multiplying the correct decoding probabilities of its 3-bit digital data profile units.

**Extended Data Fig. 8. Reading ‘hello world!’ from subsampled sequencing reads.**
Sequencing reads from each barcode in the ‘hello world!’-encoded cell population using OPT table were randomly subsampled to the various numbers and classifications were performed. Recall accuracies for **(a)** distinguishing 3-bit digital data profiles for 24 barcoded populations or for **(b)** calling correct bits out of 72 bits were displayed as a function of the number of expanded arrays with uniquely mapping spacers (grey: all arrays, red: L2/L3 arrays). The number of sequencing reads corresponding to the number of expanded arrays with uniquely mapping spacers (grey: all arrays) is also provided as an additional x-axis. Shaded regions represent 95% confidence interval of 10 iterations of subsampling and classification.

**Extended Data Fig. 9. Improving data reconstruction with error correction.**
**(a)** By using every sixth bit as a check point (checksum) for the first 5 bits, errors in data reconstruction can be detected and corrected for the selected 32 combinations of 6-bit digital data profiles based on the classifier’s confusion probability in Figure 2f and Extended Data Fig. 9b. For example, a digital input ‘011110’ could be classified as ‘011110’, ‘011010’, ‘001110’, or ‘001010’ with probabilities of 69%, 14%, 14%, or 3%, respectively. Of these 4 possible initial classifications, the last 3 are wrong, and the 2 wrong classifications with a single-bit error can be detected by the check point values and fixed. However, the classification result with a 2-bit error cannot be detected by the check point value and therefore cannot be fixed. For all 32 combinations of 6-bit digital data profiles, possible classification results, their probabilities, and whether they can be fixed or not are summarized in Supplementary Table 2. **(b)** Confusion probability for each of the 3-bit digital data profiles based on Figure 2f. **(c)** The check point values for each combination of eight 3-bit and four 2-bit digital data profiles. **(d)** OPT2 encoding table with the estimated probabilities of correct classification for the 32 characters. **(e)** Probability of correct decoding for the 32 characters (ordered by usage) for OPT and OPT2 6-bit encoding tables. **(f)** ‘synbio@cu’ encoded in the genomes of barcoded *E. coli* populations using the OPT2 error correction strategy. Two errors from the initial classification were detected using the check points and successfully corrected as described in the figure. For classification of each barcoded cell population, an average of 492,289 total sequencing reads with 268,066 reads of expanded arrays (or 106,242 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. Frequencies of array-types are in log₁₀ scale.
All measurements are based on a single experimental study.
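
The detection step in panel (a) can be illustrated with a small sketch. The caption does not state how the check point value is computed; the Python below assumes an even-parity bit over the first 5 bits, which is an assumption that happens to reproduce the ‘011110’ example. The function names are hypothetical, and the correction step, which relies on the confusion probabilities in panel (b), is not attempted.

```python
def check_bit(first_five: str) -> str:
    # ASSUMPTION: the check point is an even-parity bit over the first 5 bits.
    return str(first_five.count("1") % 2)


def passes_check(word: str) -> bool:
    """True if the sixth bit agrees with the assumed parity of the first five."""
    if len(word) != 6 or set(word) - {"0", "1"}:
        raise ValueError("expected a 6-character string of 0s and 1s")
    return check_bit(word[:5]) == word[5]


if __name__ == "__main__":
    # The four possible classifications of the input '011110' from the caption.
    for candidate in ["011110", "011010", "001110", "001010"]:
        status = "passes" if passes_check(candidate) else "flagged"
        print(candidate, status)
```

Under this assumption the two single-bit errors (‘011010’ and ‘001110’) are flagged, while the 2-bit error (‘001010’) passes the check and goes undetected, matching the behaviour described in the caption.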

**Extended Data Fig. 10. Data stability in replicating cells.**
A mixed pool of 24 barcoded cell populations encoded with the 72-bit text message ‘hello world!’ in Figure 3 was subsequently diluted 1:100 every 24 hours into 3 mL fresh LB media with antibiotic for a total of 16 days (~106 generations, ~6.6 generations per day). **(a)** Data stability in the propagating cell population over 100 generations. Accuracy indicates the proportion of bits that are correctly classified. >90% of the 72 bits could be correctly retrieved up to ~80 generations. Shaded region represents standard deviation of three biological replicates. For classification of each barcoded cell population, an average of 82,860 total sequencing reads with 40,502 reads of expanded arrays (or 17,139 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. **(b)** Gradual changes in the relative abundance of the 24 barcoded cell populations over time suggest adaptive mutations with fitness effects arising in some of the subpopulations. Samples were collected at the time points indicated by arrows (day 0, 4, 6, 8, 12, and 16). All measurements are based on three biological replicates.
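
The generation counts quoted above follow from the dilution factor. This is a minimal sketch of that arithmetic, assuming the culture regrows to its pre-dilution density before each passage (an assumption; the caption gives only the dilution schedule). The function name is hypothetical.

```python
import math


def generations_per_passage(dilution_factor: float) -> float:
    # ASSUMPTION: the culture regrows to its pre-dilution density each passage,
    # so the number of doublings is log2 of the dilution factor.
    return math.log2(dilution_factor)


if __name__ == "__main__":
    per_day = generations_per_passage(100)  # 1:100 dilution every 24 hours
    print(round(per_day, 1))                # generations per day
    print(round(per_day * 16))              # generations over 16 days
```

On the same assumption, the ~80 generations over which more than 90% of bits were retrievable correspond to roughly 12 daily passages.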

**Figure 1. Direct digital-to-biological data storage into CRISPR arrays.**
**(a)** Digital information can be directly encoded into CRISPR arrays of a bacterial population using electronic signals. The cell population can then be archived for long-term storage, propagated for data amplification, and sequenced for data retrieval. **(b)** Overexpression of the Cas1-Cas2 complex results in constant incorporation of new spacers into CRISPR arrays of a cell population. Electronic signals induce a change in abundance of a copy number inducible plasmid (pTrig) and thus the proportion of pTrig-derived spacers. **(c)** In the 0 state, the electrical signal is not applied (0.0 V), which keeps FCN(R) and PMS reduced and pTrig copy number low. In the 1 state, the electrical signal (0.5 V) oxidizes FCN(R) and PMS, which activates the *soxS* promoter to increase pTrig copy number. FCN(R), ferrocyanide; FCN(O), ferricyanide; PMS, phenazine methosulfate. **(d)** Relative copy number of pTrig, **(e)** proportion of expanded CRISPR arrays and source of the new spacers without (0.0 V) and with (0.5 V) electrical signal for 14 hours. Ref, genome- and pRec-derived spacers; pTrig, pTrig-derived spacers. All measurements are based on three biological replicates. Error bars represent standard deviation of three biological replicates.

**Figure 2. Encoding 3-bit binary data into *E. coli* populations.**
**(a)** Cells were subjected to electrical signals over three sequential rounds, constituting all 8 possible 3-bit binary data profiles. **(b)** pTrig copy number profiles for each round of the 3-bit binary data profiles. **(c)** CRISPR array populations can be described as a frequency distribution consisting of all permutations of reference spacers (R, grey) derived from genome or pRec and trigger spacers (T, red) derived from pTrig for a given array length (L). **(d)** Frequencies of array-types in log₁₀ scale for each array length for the 3-bit data encoded CRISPR array populations. **(e)** Clustering CRISPR arrays based on their array-type frequency profiles normalized to Z-score across all 3-bit binary profiles. **(f)** Performance of a Random Forest classifier trained on data from 3 independent experiments and tested on data from 6 subsequent independent experiments. For classification of each sample, an average of 172,788 total sequencing reads with 89,928 reads of expanded arrays (or 38,295 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays (see Methods). All measurements are based on three or more biological replicates. Error bars represent standard deviation of three biological replicates.
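
The array-types in panel (c) can be enumerated directly. The following Python is an illustrative sketch rather than the authors' pipeline: the function names are hypothetical, arrays are represented as strings of ‘R’ and ‘T’, and normalizing frequencies within each array length is an assumption about how the distribution is computed.

```python
from collections import Counter
from itertools import product


def array_types(length: int) -> list[str]:
    """All R/T spacer orderings for arrays with `length` new spacers."""
    return ["".join(p) for p in product("RT", repeat=length)]


def frequency_profile(arrays: list[str], max_length: int = 3) -> dict[str, float]:
    """Frequency of each array-type among observed arrays of the same length.

    ASSUMPTION: frequencies are normalized within each array length (L1, L2, ...).
    """
    counts = Counter(arrays)
    profile = {}
    for length in range(1, max_length + 1):
        types = array_types(length)
        total = sum(counts[t] for t in types)
        for t in types:
            profile[t] = counts[t] / total if total else 0.0
    return profile


if __name__ == "__main__":
    print(array_types(2))
    # Made-up observations, for illustration only.
    observed = ["R", "R", "T", "RT", "TT", "RRT"]
    print(frequency_profile(observed))
```

For arrays with up to three new spacers this gives 2 + 4 + 8 = 14 array-types, so each sample is summarized by a 14-value profile in this sketch.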

**Figure 3. Writing the text message ‘hello world!’ containing 72 bits into barcoded *E. coli* cells.**
**(a)** Uniquely barcoded cell populations in each chamber on the multi-channel electrochemical redox controller can receive and store, in parallel, 3-bit binary profiles split from the original data. The 3-bit encoded cells in each chamber can be pooled and stored. Data can be retrieved by sequencing and demultiplexing barcode sequences for data reconstruction. **(b)** The OPT 6-bit character table that leverages letter usage frequency and retrieval bias is shown. The 6-bit binary data for each character is split into two barcoded cell populations. An example of encoding ‘h’ is shown. **(c)** Array-type frequencies (in log₁₀ scale) from a ‘hello world!’ encoded cell population are shown. For classification of each barcoded cell population, an average of 443,051 total sequencing reads with 271,725 reads of expanded arrays (or 179,174 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays (see Methods). All measurements are based on a single encoding experiment.
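
The splitting scheme in panels (a) and (b) can be sketched as follows. The actual OPT table is not reproduced in this caption, so the Python below uses a placeholder 64-character alphabet indexed in plain binary; that table, `ALPHABET`, and the function names are all hypothetical and serve only to show how 12 characters become 24 three-bit units, one per barcoded population.

```python
import string

# HYPOTHETICAL 64-character table (NOT the paper's OPT table):
# 26 lowercase letters + 10 digits + 28 punctuation characters.
ALPHABET = string.ascii_lowercase + string.digits + " .,!?@#$%&*()-_+=:;'\"/<>[]{}"
assert len(ALPHABET) == 64 and len(set(ALPHABET)) == 64


def encode(message: str) -> list[str]:
    """Split each character's 6-bit code into two 3-bit units."""
    units = []
    for ch in message:
        bits = format(ALPHABET.index(ch), "06b")
        units.extend([bits[:3], bits[3:]])
    return units


def decode(units: list[str]) -> str:
    """Rejoin consecutive pairs of 3-bit units into characters."""
    chars = []
    for i in range(0, len(units), 2):
        chars.append(ALPHABET[int(units[i] + units[i + 1], 2)])
    return "".join(chars)


if __name__ == "__main__":
    units = encode("hello world!")
    print(len(units), "populations,", sum(len(u) for u in units), "bits")
    print(decode(units))
```

Each 3-bit unit then corresponds to one of the 8 binary data profiles written over three rounds of stimulation; the OPT table differs from this placeholder in which 6-bit code is assigned to which character.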

**Figure 4. Cell envelope as a physical barrier to protect data.**
**(a)** A data-encoded cell population or naked genomic DNA extracted from the same amount of the data-encoded population was exposed to a natural soil environment. **(b)** Retrieval of a text message (‘synbio@cu’) from a mixed microbial community of data-encoded *E. coli* cells and natural soil microbiota with and without selective growth enrichment. Accuracy is defined as the proportion of bits that are correctly classified. For classification of each barcoded cell population, an average of 41,740 total sequencing reads with 20,811 reads of expanded arrays (or 9,821 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. **(c)** Retrieval of the text message ‘synbio@cu’ stored in naked DNA or in encoded cells after exposure to soil for 0 or 6 days. Accuracy is defined as the proportion of bits that are correctly classified. The plot displays the mean values and whiskers span the highest and lowest points. For classification of each barcoded cell population, an average of 20,542 total sequencing reads with 7,868 reads of expanded arrays (or 2,692 of L2/L3 arrays) that uniquely map spacers were used. Bead-based size enrichment was performed to enrich for expanded arrays and deplete unexpanded arrays. **(d)** Comparison of microbial compositions of a natural soil community with and without hidden data-encoded *E. coli* cells (the *Escherichia/Shigella* genus is highlighted in red, 4% spiked-in). OTUs (n) and Pearson correlation coefficient (r) are shown. Dashed line represents y=x. All measurements are based on two biological replicates.

Comment in

One-step data storage in cellular DNA.
Bhattarai-Kline S, Lear SK, Shipman SL. Nat Chem Biol. 2021 Mar;17(3):232-233. doi: 10.1038/s41589-021-00737-2. PMID: 33500580. No abstract available.

Grants and funding

R01 AI132403/AI/NIAID NIH HHS/United States
