Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Aug 6;12(1):4764.
doi: 10.1038/s41467-021-24991-z.

Molecular-level similarity search brings computing to DNA data storage

Affiliations

Molecular-level similarity search brings computing to DNA data storage

Callista Bee et al. Nat Commun. .

Abstract

As global demand for digital storage capacity grows, storage technologies based on synthetic DNA have emerged as a dense and durable alternative to traditional media. Existing approaches leverage robust error correcting codes and precise molecular mechanisms to reliably retrieve specific files from large databases. Typically, files are retrieved using a pre-specified key, analogous to a filename. However, these approaches lack the ability to perform more complex computations over the stored data, such as similarity search: e.g., finding images that look similar to an image of interest without prior knowledge of their file names. Here we demonstrate a technique for executing similarity search over a DNA-based database of 1.6 million images. Queries are implemented as hybridization probes, and a key step in our approach was to learn an image-to-sequence encoding ensuring that queries preferentially bind to targets representing visually similar images. Experimental results show that our molecular implementation performs comparably to state-of-the-art in silico algorithms for similarity search.

PubMed Disclaimer

Conflict of interest statement

The following authors C.B., Y.C., G.S., K.S., and L.C. declare the following competing interests: they filed a patent application on the core idea. The patent applicant is University of Washington, the application number is PCT/US2020/027545, application is being evaluated, and it covers the system and method for similarity search, including the encoding method. The following authors: K.S. and Y.C. declare the following competing interests: employed by Microsoft. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview DNA-based similarity search.
A Illustration of a feature space where neighboring documents are subjectively similar. B A similarity-preserving DNA encoding is one where the reverse complement of a query document’s sequence hybridizes with a neighboring target document’s sequence, but not with a distant target’s. Note that the query is color-coded green and the targets with other colors. C–H The retrieval process. A database (C) is encoded and synthesized to produce an DNA-based index (D). Arrowheads indicate 3’ ends of DNA. An encoded query (E) is annealed with a sample of the database (F), which is filtered with magnetic beads (G). The filtered database is sequenced to reveal the IDs of retrieved documents, which are used to look them up in the original database (H).
Fig. 2
Fig. 2. Overview of our neural network architectures and training process.
A, B The neural network architectures for the image-to-sequence encoder, and the hybridization predictor. Boxes represent layers, and trapezoids represent operations to go from one layer to the next. Only the colored operations have parameters that change during training. C The training loop for the neural networks. Lines indicate data flow; dashed lines indicate parameter gradients calculated using backpropagation. Green indicates operations only performed during encoder training, while pink indicates operations used only during yield predictor training. All other operations are used in both training phases. D Simulated performance of an untrained model, evaluated on n = 1.6 million random pairs of independent images. Each violin depicts the distribution of simulated hybridizations for pairs whose feature vectors’ Euclidean distance lie within a certain range. E Simulated performance of a trained model, evaluated on the same set of random pairs (i.e., n = 1.6 million random pairs of independent images).
Fig. 3
Fig. 3. Experimental results for three different query images.
Janelle, the cat (top), a building with fireworks (center), and Lego pieces assembled in the shape of sushi (bottom). A Distribution of Euclidean distances to the query image, among sets of images with sequencing read depth above a certain threshold. n = 1.6 million independent images. B The proportion of the entire dataset that must be retrieved (y-axis) to retrieve a certain proportion of the 100 most similar images (x-axis). Each point represents a threshold for which images with read depth above that threshold are considered “retrieved”. The dashed line indicates chance performance, while the dashed-and-dotted line indicates perfect performance. Colored triangles indicate the thresholds depicted in the other subfigures. C The top five closest images to the query from result sets where images above a certain read depth threshold are considered “retrieved”.
Fig. 4
Fig. 4. Comparison of our technique (“primo”, shown in blue) with state-of-the-art algorithms for in silico similarity search.
Dashed gray and dashed-and-dotted gray lines represent chance performance and perfect performance, respectively. Not all of the algorithms could produce results towards the lower-left (low recall and low proportion retrieved). We assume these algorithms could be stopped early to produce fewer results with a linear decrease in recall; dashed continuations represent these linear interpolations.
Fig. 5
Fig. 5. Results of a NUPACK simulation of using a query image of a cat with a larger database consisting of 5.5 million images.
A Distributions of simulated yields for each target image, categorized by their feature vectors’ Euclidean distances to the query feature vector. The widths of each violin are normalized to be equal. B The top 20 images in the database, sorted by the NUPACK-simulated yield of the reaction between their encoded DNA representation and the reverse complement of the query’s DNA representation.

References

    1. Benenson Y, Gil B, Ben-Dor U, Adar R, Shapiro E. An autonomous molecular computer for logical control of gene expression. Nature. 2004;429:423–429. doi: 10.1038/nature02551. - DOI - PubMed
    1. Lopez R, Wang R, Seelig G. A molecular multi-gene classifier for disease diagnostics. Nat. Chem. 2018;10:746–754. doi: 10.1038/s41557-018-0056-1. - DOI - PubMed
    1. Xie Z, Wroblewska L, Prochazka L, Weiss R, Benenson Y. Multi-input RNAi-based logic circuit for identification of specific cancer cells. Science. 2011;333:1307–1311. doi: 10.1126/science.1205527. - DOI - PubMed
    1. Zhang C, et al. Cancer diagnosis with DNA molecular computation. Nat. Nanotechnol. 2020;15:709–715. doi: 10.1038/s41565-020-0699-0. - DOI - PubMed
    1. Adleman LM. Computing with DNA. Sci. Am. 1998;279:54–61. doi: 10.1038/scientificamerican0898-54. - DOI

Publication types