Sci Rep. 2021 Oct 15;11(1):20517. doi: 10.1038/s41598-021-97238-y

Scaling up DNA digital data storage by efficiently predicting DNA hybridisation using deep learning

David Buterez
Abstract

Deoxyribonucleic acid (DNA) has shown great promise in enabling computational applications, most notably in the fields of DNA digital data storage and DNA computing. Information is encoded as DNA strands, which will naturally bind in solution, thus enabling search and pattern-matching capabilities. Being able to control and predict the process of DNA hybridisation is crucial for the ambitious future of Hybrid Molecular-Electronic Computing. Current tools are, however, limited in terms of throughput and applicability to large-scale problems. We present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation. For this purpose, we introduce an in silico-generated hybridisation dataset of over 2.5 million data points, enabling the use of deep learning. Depending on hardware, we achieve a reduction in inference time ranging from one to over two orders of magnitude compared to the state-of-the-art, while retaining high fidelity. We then discuss the integration of our methods in modern, scalable workflows.
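As a point of orientation for the phrase "information is encoded as DNA strands", the sketch below shows the simplest possible mapping of digital data to nucleotides (2 bits per base). This is only an illustration of the general principle, not the encoding scheme used in the paper; practical schemes add constraints such as GC-content balancing, homopolymer limits and error correction.

```python
# Minimal illustration of DNA digital data storage: map every 2 bits of a
# byte stream to one of the four nucleotides. Not the paper's encoding;
# real schemes add sequence constraints and error correction.

BASE_FOR_BITS = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BITS_FOR_BASE = {v: k for k, v in BASE_FOR_BITS.items()}


def bytes_to_dna(data: bytes) -> str:
    """Encode a byte string as a DNA sequence, 4 bases per byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):          # most significant bit pair first
            bases.append(BASE_FOR_BITS[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_bytes(seq: str) -> bytes:
    """Decode a DNA sequence produced by bytes_to_dna."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i : i + 4]:
            byte = (byte << 2) | BITS_FOR_BASE[base]
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    strand = bytes_to_dna(b"DNA")
    print(strand)                            # "CACACATGCAAC"
    assert dna_to_bytes(strand) == b"DNA"
```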


Conflict of interest statement

The author declares no competing interests.

Figures

Figure 1
A high-level overview of how to integrate hybridisation prediction into DNA storage workflows. A trained machine learning model can be used as a standalone tool to assemble orthogonal or similar libraries of DNA sequences (left half of the figure). Alternatively, the neural network can be seamlessly integrated as a subcomponent of a larger machine learning model. The example presented is content-based search in a DNA database: document features are extracted by a neural network (a CNN for images, although text, video or audio inputs are conceivable) in a pairwise manner, another neural component generates appropriate encodings (usually one-hot), and the hybridisation predictor outputs the expected yield of the pair. Such a model is trained to associate similar documents with similar single-stranded DNA sequences that form stable duplexes with the query sequence (right half of the figure). A minimal code sketch of this pipeline follows the caption.
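The following PyTorch sketch mirrors the structure described in the Figure 1 caption: a document feature extractor, a sequence encoder producing (relaxed) one-hot DNA encodings, and a hybridisation yield predictor. All module names, layer sizes and input shapes are hypothetical placeholders; only the overall composition comes from the figure.

```python
# Hypothetical sketch of the content-based search pipeline in Figure 1.
# Only the structure (feature extractor -> sequence encoder -> yield
# predictor) follows the caption; every dimension here is an assumption.

import torch
import torch.nn as nn


class SequenceEncoder(nn.Module):
    """Maps document features to a soft one-hot DNA encoding of shape (L, 4)."""

    def __init__(self, feat_dim: int = 512, strand_len: int = 80):
        super().__init__()
        self.strand_len = strand_len
        self.proj = nn.Linear(feat_dim, strand_len * 4)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        logits = self.proj(feats).view(-1, self.strand_len, 4)
        # Softmax relaxation of a one-hot encoding keeps the model differentiable.
        return torch.softmax(logits, dim=-1)


class YieldPredictor(nn.Module):
    """Predicts the hybridisation yield of a pair of encoded strands."""

    def __init__(self, strand_len: int = 80):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * strand_len * 4, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([s1.flatten(1), s2.flatten(1)], dim=-1)
        return self.mlp(pair)


class ContentSearchModel(nn.Module):
    """End-to-end: two documents in, expected hybridisation yield out."""

    def __init__(self):
        super().__init__()
        # Stand-in CNN feature extractor (images assumed to be 3x64x64 tensors).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 512),
        )
        self.encoder = SequenceEncoder(feat_dim=512)
        self.yield_predictor = YieldPredictor()

    def forward(self, doc_a: torch.Tensor, doc_b: torch.Tensor) -> torch.Tensor:
        enc_a = self.encoder(self.cnn(doc_a))
        enc_b = self.encoder(self.cnn(doc_b))
        return self.yield_predictor(enc_a, enc_b)


if __name__ == "__main__":
    model = ContentSearchModel()
    a, b = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    print(model(a, b).shape)  # torch.Size([2, 1])
```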
Figure 2
Visual summary of the yield distribution at 57 °C (the temperature used throughout the paper). For a discussion of the behaviour of the yield at different temperatures, see Supplementary Information 14. The yield is binned into 10 groups, each spanning a 0.1 interval. Brighter colours correspond to higher yield and are shared between the two subfigures. (a) Low and high values are the most numerous, which is expected considering our generative procedure. (b) The highest density is achieved at the extremes of 0 and 1, respectively. Given how sensitive the molecules are to even a 1-base change, we count the entire intermediate range of yields [0.1, 0.9) as one entity when considering how balanced the dataset is. In this regard, 1,058,364 pairs achieve low yields (<0.1), 769,750 achieve very high yields (≥0.9) and 728,862 are in between. It is important that samples with extremely low or high yields are well represented. In particular, there are virtually endless combinations of base pairs resulting in minimum yield. A small binning sketch follows the caption.
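To make the binning and the three-way balance check concrete, here is a small NumPy sketch on synthetic data. The Beta-distributed toy yields are only a stand-in for the real dataset; the actual class counts are the ones quoted in the caption.

```python
# Sketch of the 10-bin yield histogram and the low / intermediate / very-high
# split from Figure 2. The yields below are synthetic stand-ins.

import numpy as np

rng = np.random.default_rng(0)
yields = rng.beta(0.3, 0.3, size=100_000)   # toy data, bimodal near 0 and 1

# Ten bins, each spanning a 0.1 interval of the [0, 1] yield range.
bin_edges = np.linspace(0.0, 1.0, 11)
counts, _ = np.histogram(yields, bins=bin_edges)
for lo, hi, c in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{lo:.1f}, {hi:.1f}): {c}")

# Three-way split used when discussing dataset balance.
low = int(np.sum(yields < 0.1))
high = int(np.sum(yields >= 0.9))
mid = len(yields) - low - high
print(f"low: {low}, intermediate: {mid}, very high: {high}")
```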
Figure 3
Alignment and thermodynamic properties for various DNA sequences as reported by NUPACK. (a,b) Duplex structure predicted by NUPACK (see parasail alignments in Supplementary Information 1). The parasail alignment is not in full agreement with the predicted binding. (c,d) For single-stranded DNA, a low MFE indicates a high probability that the molecule develops self-complementarity or knots. (c) A sequence that is expected to be stable (AGTACAAGTAGGACAGGAAGATA). (d) A sequence that is expected to be more problematic in hybridisation reactions (TTTCGCACGGACGAGGACGTCCGTTA). (e,f) A sequence can be similar enough to its reverse complement that it is more probable to find it in duplex formations with different instances of itself than in the normal single-stranded state. Illustrated is the sequence CCATGGAGGCGCGCCTTT in a complex of size 2, each strand initially present in solution at concentration 1 μmol. The duplex formation of this sequence (f) is more than 5 times as abundant as the single-stranded conformation (e).
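The snippet below is a crude, alignment-free heuristic for how similar a strand is to its own reverse complement, applied to two of the sequences named in the caption. It is only an illustrative proxy for the self-dimerisation tendency that NUPACK quantifies thermodynamically, not a substitute for that analysis.

```python
# Rough heuristic for self-complementarity, as an illustration of the
# phenomenon shown in Figure 3 (e,f). Not a replacement for NUPACK.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def self_similarity(seq: str) -> float:
    """Fraction of positions at which a strand matches its reverse complement."""
    rc = reverse_complement(seq)
    matches = sum(a == b for a, b in zip(seq, rc))
    return matches / len(seq)


if __name__ == "__main__":
    for s in ("AGTACAAGTAGGACAGGAAGATA",     # the 'stable' example in (c)
              "CCATGGAGGCGCGCCTTT"):         # the self-dimerising example in (e,f)
        print(s, reverse_complement(s), f"{self_similarity(s):.2f}")
```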
Figure 4
(a,b) Simplified representations of the deep learning architectures. (a) Convolutional Neural Network overview. Each of the columns in the 2-channel grid (image) corresponds to a one-hot encoding of the four nucleobases for that particular strand position, while each channel represents an entire strand. 2D convolutions on the 2-channel one-hot encoded DNA strands are followed by 1D convolutions (only 3 channels shown) and fully-connected layers. (b) Recurrent Neural Network overview. LSTM layers are widely used and recognised for their performance in language modelling tasks, as well as other sequence-based tasks. For readability, we represent bi-directional interactions with two-headed arrows between the sequence elements. (c–e) Classification and regression results for the evaluated machine learning models. The choice of metrics is explained in the “Methods” section. (c,d) Graphical summary of precision, recall and F1 score for the two classes of machine learning models. The Low class is represented by dark grey and the High class corresponds to the four bright colours (one colour for each method eases readability). (c) The four baseline ML algorithms. Exact numerical values are provided in Supplementary Table 1. (d) Classification metrics for the four deep learning models, after yield binarisation (with a threshold of 0.2). The numerical values are provided in Supplementary Table 2. (e) The AUROC and MCC summarise the classification performance of all evaluated machine learning techniques; additionally, the MSE (Mean Squared Error) is reported for the four deep learning models.
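For concreteness, the sketch below shows the input representation described in Figure 4a: each strand of a pair is one-hot encoded as a (length, 4) grid and the two strands are stacked as two channels, which are then processed by 2D convolutions, 1D convolutions and fully-connected layers. The layer sizes, kernel sizes and padding length are illustrative assumptions, not the paper's exact architecture; the 0.2 threshold used for binarisation comes from the caption.

```python
# Minimal sketch of the 2-channel one-hot input and CNN stages from Figure 4a.
# All hyperparameters below are assumptions for illustration only.

import torch
import torch.nn as nn

BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}


def one_hot_pair(s1: str, s2: str, length: int = 64) -> torch.Tensor:
    """Encode a strand pair as a (2, length, 4) tensor, zero-padded to `length`."""
    x = torch.zeros(2, length, 4)
    for ch, seq in enumerate((s1, s2)):
        for pos, base in enumerate(seq[:length]):
            x[ch, pos, BASE_INDEX[base]] = 1.0
    return x


class HybridisationCNN(nn.Module):
    """2D convolutions over the 2-channel grid, then 1D convolutions and FC layers."""

    def __init__(self):
        super().__init__()
        self.conv2d = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=(5, 4)), nn.ReLU(),   # collapses the base axis
        )
        self.conv1d = nn.Sequential(
            nn.Conv1d(16, 32, kernel_size=5), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 8, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),                    # yield in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv2d(x).squeeze(-1)   # (N, 16, L - 4)
        h = self.conv1d(h)               # (N, 32, 8)
        return self.fc(h)


if __name__ == "__main__":
    pair = one_hot_pair("AGTACAAGTAGGACAGGAAGATA", "TATCTTCCTGTCCTACTTGTACT")
    model = HybridisationCNN()
    yield_pred = model(pair.unsqueeze(0))
    # Binarisation used for the classification metrics in Figure 4d.
    label = (yield_pred >= 0.2).long()
    print(yield_pred.item(), label.item())
```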

