. 2021 Aug 6;12(1):4764.

doi: 10.1038/s41467-021-24991-z.

Molecular-level similarity search brings computing to DNA data storage

Callista Bee¹, Yuan-Jyue Chen², Melissa Queen¹, David Ward¹, Xiaomeng Liu¹, Lee Organick¹, Georg Seelig^{1

3}, Karin Strauss⁴, Luis Ceze⁵

Affiliations

¹ Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA.
² Microsoft Research, Redmond, WA, USA.
³ Department of Electrical & Computer Engineering, University of Washington, Seattle, WA, USA.
⁴ Microsoft Research, Redmond, WA, USA. kstrauss@microsoft.com.
⁵ Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA. luisceze@cs.washington.edu.

PMID: 34362913
PMCID: PMC8346626
DOI: 10.1038/s41467-021-24991-z

Molecular-level similarity search brings computing to DNA data storage

Callista Bee et al. Nat Commun. 2021.

. 2021 Aug 6;12(1):4764.

doi: 10.1038/s41467-021-24991-z.

Authors

Callista Bee¹, Yuan-Jyue Chen², Melissa Queen¹, David Ward¹, Xiaomeng Liu¹, Lee Organick¹, Georg Seelig^{1

3}, Karin Strauss⁴, Luis Ceze⁵

Affiliations

¹ Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA.
² Microsoft Research, Redmond, WA, USA.
³ Department of Electrical & Computer Engineering, University of Washington, Seattle, WA, USA.
⁴ Microsoft Research, Redmond, WA, USA. kstrauss@microsoft.com.
⁵ Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA. luisceze@cs.washington.edu.

PMID: 34362913
PMCID: PMC8346626
DOI: 10.1038/s41467-021-24991-z

Abstract

As global demand for digital storage capacity grows, storage technologies based on synthetic DNA have emerged as a dense and durable alternative to traditional media. Existing approaches leverage robust error correcting codes and precise molecular mechanisms to reliably retrieve specific files from large databases. Typically, files are retrieved using a pre-specified key, analogous to a filename. However, these approaches lack the ability to perform more complex computations over the stored data, such as similarity search: e.g., finding images that look similar to an image of interest without prior knowledge of their file names. Here we demonstrate a technique for executing similarity search over a DNA-based database of 1.6 million images. Queries are implemented as hybridization probes, and a key step in our approach was to learn an image-to-sequence encoding ensuring that queries preferentially bind to targets representing visually similar images. Experimental results show that our molecular implementation performs comparably to state-of-the-art in silico algorithms for similarity search.

PubMed Disclaimer

Conflict of interest statement

The following authors C.B., Y.C., G.S., K.S., and L.C. declare the following competing interests: they filed a patent application on the core idea. The patent applicant is University of Washington, the application number is PCT/US2020/027545, application is being evaluated, and it covers the system and method for similarity search, including the encoding method. The following authors: K.S. and Y.C. declare the following competing interests: employed by Microsoft. The remaining authors declare no competing interests.

Figures

**Fig. 1. Overview DNA-based similarity search.**
A Illustration of a feature space where neighboring documents are subjectively similar. B A similarity-preserving DNA encoding is one where the reverse complement of a query document’s sequence hybridizes with a neighboring target document’s sequence, but not with a distant target’s. Note that the query is color-coded green and the targets with other colors. **C–H** The retrieval process. A database (C) is encoded and synthesized to produce an DNA-based index (D). Arrowheads indicate 3’ ends of DNA. An encoded query (E) is annealed with a sample of the database (F), which is filtered with magnetic beads (G). The filtered database is sequenced to reveal the IDs of retrieved documents, which are used to look them up in the original database (H).

**Fig. 2. Overview of our neural network architectures and training process.**
**A, B** The neural network architectures for the image-to-sequence encoder, and the hybridization predictor. Boxes represent layers, and trapezoids represent operations to go from one layer to the next. Only the colored operations have parameters that change during training. C The training loop for the neural networks. Lines indicate data flow; dashed lines indicate parameter gradients calculated using backpropagation. Green indicates operations only performed during encoder training, while pink indicates operations used only during yield predictor training. All other operations are used in both training phases. D Simulated performance of an untrained model, evaluated on n = 1.6 million random pairs of independent images. Each violin depicts the distribution of simulated hybridizations for pairs whose feature vectors’ Euclidean distance lie within a certain range. E Simulated performance of a trained model, evaluated on the same set of random pairs (i.e., n = 1.6 million random pairs of independent images).

**Fig. 3. Experimental results for three different query images.**
Janelle, the cat (top), a building with fireworks (center), and Lego pieces assembled in the shape of sushi (bottom). A Distribution of Euclidean distances to the query image, among sets of images with sequencing read depth above a certain threshold. n = 1.6 million independent images. B The proportion of the entire dataset that must be retrieved (y-axis) to retrieve a certain proportion of the 100 most similar images (x-axis). Each point represents a threshold for which images with read depth above that threshold are considered “retrieved”. The dashed line indicates chance performance, while the dashed-and-dotted line indicates perfect performance. Colored triangles indicate the thresholds depicted in the other subfigures. C The top five closest images to the query from result sets where images above a certain read depth threshold are considered “retrieved”.

**Fig. 4. Comparison of our technique (“primo”, shown in blue) with state-of-the-art algorithms for in silico similarity search.**
Dashed gray and dashed-and-dotted gray lines represent chance performance and perfect performance, respectively. Not all of the algorithms could produce results towards the lower-left (low recall and low proportion retrieved). We assume these algorithms could be stopped early to produce fewer results with a linear decrease in recall; dashed continuations represent these linear interpolations.

**Fig. 5. Results of a NUPACK simulation of using a query image of a cat with a larger database consisting of 5.5 million images.**
A Distributions of simulated yields for each target image, categorized by their feature vectors’ Euclidean distances to the query feature vector. The widths of each violin are normalized to be equal. B The top 20 images in the database, sorted by the NUPACK-simulated yield of the reaction between their encoded DNA representation and the reverse complement of the query’s DNA representation.

See this image and copyright information in PMC

References

1. Benenson Y, Gil B, Ben-Dor U, Adar R, Shapiro E. An autonomous molecular computer for logical control of gene expression. Nature. 2004;429:423–429. doi: 10.1038/nature02551. - DOI - PubMed
1. Lopez R, Wang R, Seelig G. A molecular multi-gene classifier for disease diagnostics. Nat. Chem. 2018;10:746–754. doi: 10.1038/s41557-018-0056-1. - DOI - PubMed
1. Xie Z, Wroblewska L, Prochazka L, Weiss R, Benenson Y. Multi-input RNAi-based logic circuit for identification of specific cancer cells. Science. 2011;333:1307–1311. doi: 10.1126/science.1205527. - DOI - PubMed
1. Zhang C, et al. Cancer diagnosis with DNA molecular computation. Nat. Nanotechnol. 2020;15:709–715. doi: 10.1038/s41565-020-0699-0. - DOI - PubMed
1. Adleman LM. Computing with DNA. Sci. Am. 1998;279:54–61. doi: 10.1038/scientificamerican0898-54. - DOI

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Molecular-level similarity search brings computing to DNA data storage

Affiliations

Molecular-level similarity search brings computing to DNA data storage

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Substances

LinkOut - more resources

Full Text Sources

Other Literature Sources