Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Jul 25;25(5):bbae363.
doi: 10.1093/bib/bbae363.

Explorer: efficient DNA coding by De Bruijn graph toward arbitrary local and global biochemical constraints

Affiliations

Explorer: efficient DNA coding by De Bruijn graph toward arbitrary local and global biochemical constraints

Chang Dou et al. Brief Bioinform. .

Abstract

With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by ¿10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.

Keywords: DNA storage; De Bruijn graph; biochemical constraint; transformer.

PubMed Disclaimer

Figures

Figure 1
Figure 1
The complete process of DNA storage pipeline and the Explorer encoding–decoding procedure; (A) the complete process of DNA storage pipeline; (B) global demonstration of the Explorer; the algorithm employs binary information as input and takes into account specific biochemical constraints to encode DNA sequences that satisfy said constraints, utilizing the De Bruijn graph and its vertices; (C) in the Explorer encoding process, the available adjacent vertices are selected from the vertices in the De Bruijn graph according to the set biochemical constraints; (D) application example of the Explorer encoding algorithm.
Figure 2
Figure 2
Decoding and error correction in the case of the Explorer.
Figure 3
Figure 3
Decoding and error correction in the case of the Codeformer.
Figure 4
Figure 4
Variation of coding efficiency affected by different conditions; (A) the variation of coding efficiency under different constraint lengths under seven kinds of constraints; (B) the variation of coding efficiency at different sequence lengths and different constraint lengths.
Figure 5
Figure 5
(A) Comparison of Fountain, Hedges, Explorer, Yin-Yang, SPIDER-WEB, Explorer-8 on the code efficiency; (B) comparison of Fountain, Hedges, Explorer, Yin-Yang, SPIDER-WEB, Explorer-8 on the code rate.
Figure 6
Figure 6
Comparison between Explorer, SPIDER-WEB, and Yin-Yang; (A) the coding efficiency comparison between Explorer, SPIDER-WEB, and Yin-Yang; the scatter point represents the results of individual experiments, while the histogram heights are constructed from the means of seven experiments; (B) the number of the parameter comparison between Explorer and SPIDER-WEB.
Figure 7
Figure 7
The time and space complexity with respect to different sizes of data; (A) the time complexity of Explorer with different input data sizes; (B) the space complexity of Explorer with different input data sizes.
Figure 8
Figure 8
The evaluation of decoding accuracy; (A) the correction rate of Explorer under different error rates; (B) the minimum reads number required to support that the sequence with the highest frequency is the correct sequence.
Figure 9
Figure 9
(A) Comparison of code efficiency of Codeformer, Explorer, and models based on LSTM and GRU; (B) code efficiency of the Codeformer under different search widths.
Figure 10
Figure 10
Decoding accuracy of the Codeformer and the number of chains corrected by the RS code.

References

    1. Cao B, Zhang X, Cui S. et al. . Adaptive coding for dna storage with high storage density and low coverage. NPJ Syst Biol Appl 2022;8:23. 10.1038/s41540-022-00233-w. - DOI - PMC - PubMed
    1. Xu C, Ma B, Gao Z. et al. . Electrochemical dna synthesis and sequencing on a single electrode with scalability for integrated data storage. Sci Adv 2021;7:eabk0100. 10.1126/sciadv.abk0100. - DOI - PMC - PubMed
    1. Nguyen BH, Takahashi CN, Gupta G. et al. . Scaling dna data storage with nanoscale electrode wells. Sci Adv 2021;7:eabi6714. 10.1126/sciadv.abi6714. - DOI - PMC - PubMed
    1. Guanjin Q, Yan Z, Huaming W. Clover: tree structure-based efficient dna clustering for dna-based data storage. Brief Bioinform 2022;23:bbac336. - PubMed
    1. Zhang S, Huang B, Song X. et al. . A high storage density strategy for digital information based on synthetic dna. 3 Biotech 2019;9:342–10. 10.1007/s13205-019-1868-4. - DOI - PMC - PubMed

LinkOut - more resources