Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Aug;11(30):e2402951.
doi: 10.1002/advs.202402951. Epub 2024 Jun 14.

Overcoming the High Error Rate of Composite DNA Letters-Based Digital Storage through Soft-Decision Decoding

Affiliations

Overcoming the High Error Rate of Composite DNA Letters-Based Digital Storage through Soft-Decision Decoding

Yaping Xu et al. Adv Sci (Weinh). 2024 Aug.

Abstract

Composite DNA letters, by merging all four DNA nucleotides in specified ratios, offer a pathway to substantially increase the logical density of DNA digital storage (DDS) systems. However, these letters are susceptible to nucleotide errors and sampling bias, leading to a high letter error rate, which complicates precise data retrieval and augments reading expenses. To address this, Derrick-cp is introduced as an innovative soft-decision decoding algorithm tailored for DDS utilizing composite letters. Derrick-cp capitalizes on the distinctive error sensitivities among letters to accurately predict and rectify letter errors, thus enhancing the error-correcting performance of Reed-Solomon codes beyond traditional hard-decision decoding limits. Through comparative analyses in the existing dataset and simulated experiments, Derrick-cp's superiority is validated, notably halving the sequencing depth requirement and slashing costs by up to 22% against conventional hard-decision strategies. This advancement signals Derrick-cp's significant role in elevating both the precision and cost-efficiency of composite letter-based DDS.

Keywords: DNA digital storage (DDS); composite DNA letter; error‐correcting code (ECC); soft‐decision decoding.

PubMed Disclaimer

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Figure 1
The core principles of Derrick‐cp decoding algorithm. A) A graphical representation of the letter transition library for k=2 and a sampling depth of 10. The concentric circle consists of ten sectors. Each sector describes the composite letter uncertainty, with three tracks. The middle track is the main track, representing the letter error rate. The inner track represents a transition outward, while the outer track represents a transition inward. Notably, for a given letter in the middle track, the corresponding letters in both the inner and outer tracks are almost identical. B) The schematic diagram (a) and corresponding example (b–e) of letter inference and transition library constructing strategies. In the schematic diagram, (b) the central orange point σi represents an original letter, and the first set of arrows from the orange point σi indicate the process of synthesis, storage, and sequencing, (c) resulting in observed frequencies at a specific sampling depth N, represented by the blue points Xseq(N). Assuming a sampling depth of N, there are CN+3N different combinations of observed frequencies, each with a distinct probability of occurrence, which is reflected in the varying sizes of the blue points in the blue layer. (d) The second set of arrows from the blue layer illustrates the process of letter inference, clustered by the closest letter represented by green points σj*. (e) By calculating the proportion of each cluster, the transition probability from σi to σj in k is obtained. Overall, beginning with the central orange point σi, the process culminates in the outermost green point σj, representing the completion of the transition library construction. The size of the green points reflects the probability of transition from σi to σj.
Figure 2
Figure 2
The decoding processes with an example of decoding an RS (45, 41) block. The soft decision decoder operates through four steps. A) Employs MAP probability to infer the composite letter based on the observed base frequencies. B) It assesses the reliability of the inferred letter by measuring the Euclidean distance between the observed frequencies and the inferred letter. C) It generates an alternative letterset for amending the previously inferred letter. D) It iterates the predicted error positions and corresponding values for soft decision decoding. The iterating process begins by attempting to correct a single position. In this example, there are four potential error positions, each with two alternative values. The iteration follows a specific order determined by reliability sorting, I>II>III>IV>V. Subsequently, the algorithm proceeds to correct two positions and the number of candidate error positions increased by one in the algorithm, when correcting one position decoding fails. When correcting two positions, there are six possible combinations of positions. The figure showcases the decoding paths and their sequential order when positions 4 and 41 are for correction, represented as: ①>②>③>④. The remaining errors are corrected by the RS code, which is within the correction capability of the RS code.
Figure 3
Figure 3
Decoding performance of Derrick‐cp algorithm in simulation tests. A) The accuracy of letter inference utilizing the MAP probability model. In a composite DNA‐based storage system using different alphabets, within the sequencing depth that the Derrick‐cp can decode with 100% accuracy, the letters Inference accuracy can reach 98%. B) Comparison of Derrick‐cp and hard‐decision decoding by the number of failed matrices with different alphabet tests. The horizontal axis represents the alphabet size, while the vertical axis represents the sequencing depth. The blue blocks indicate the failed matrices by utilizing the hard‐decision decoding strategy, while the orange blocks represent the adoption of the soft‐decision decoding strategy. Notably, the horizontal line labeled “490×” indicates that with the implementation of soft‐decision decoding, data encoded with an alphabet size of 256 can still be recovered with 100% accuracy. Conversely, with hard‐decision decoding, only data encoded with a reduced alphabet size of 64 can be recovered with the same level of accuracy. C) Comparison of overall costs of DDS systems employing different alphabets and decoding strategies. The horizontal axis represents the alphabet size, while the vertical axis represents the normalized cost. In this context, the costs are standardized by normalizing them to the total cost of a standard DDS system that utilizes four standard nucleotides and employs hard‐decision decoding.

Similar articles

Cited by

References

    1. a) Dong Y., Sun F., Ping Z., Ouyang Q., Qian L., Nat. Sci. Rev. 2020, 7, 1092; - PMC - PubMed
    2. b) Meiser L. C., Antkowiak P. L., Koch J., Chen W. D., Kohll A. X., Stark W. J., Heckel R., Grass R. N., Nat. Protoc. 2020, 15, 86; - PubMed
    3. c) Ping Z., Ma D., Huang X., Chen S., Liu L., Guo F., Zhu S. J., Shen Y., GigaScience 2019, 8, giz075; - PMC - PubMed
    4. d) Zhirnov V., Zadegan R. M., Sandhu G. S., Church G. M., Hughes W. L., Nat. Mater. 2016, 15, 366. - PMC - PubMed
    1. a) Nguyen T. T., Cai K., Immink K. A. S., Kiah H. M., in IEEE International Symposium on Information Theory (ISIT) 2020, 694;
    2. b) Wang Y., Noor‐A‐Rahim M., Gunawan E., Guan Y. L., Poh C. L., IEEE Communications Letters 2019, 23, 963;
    3. c) Ceze L., Nivala J., Strauss K., Nat. Rev. Genet. 2019, 20, 456. - PubMed
    1. Anavy L., Vaknin I., Atar O., Amit R., Yakhini Z., Nat. Biotechnol. 2019, 37, 1229. - PubMed
    1. a) Aird D., Ross M. G., Chen W.‐S., Danielsson M., Fennell T., Russ C., Jaffe D. B., Nusbaum C., Gnirke A., Genome Biol. 2011, 12, R18; - PMC - PubMed
    2. b) Heckel R., Mikutis G., Grass R. N., Sci. Rep. 2019, 9, 9663; - PMC - PubMed
    3. c) Jagers P., Klebaner F., J. Theor. Biol. 2003, 224, 299. - PubMed
    1. a) Chen Y.‐J., Takahashi C. N., Organick L., Bee C., Ang S. D., Weiss P., Peck B., Seelig G., Ceze L., Strauss K., Nat. Commun. 2020, 11, 3264; - PMC - PubMed
    2. b) Organick L., Ang S. D., Chen Y.‐J., Lopez R., Yekhanin S., Makarychev K., Racz M. Z., Kamath G., Gopalan P., Nguyen B., Takahashi C. N., Newman S., Parker H.‐Y., Rashtchian C., Stewart K., Gupta G., Carlson R., Mulligan J., Carmean D., Seelig G., Ceze L., Strauss K., Nat. Biotechnol. 2018, 36, 242. - PubMed

MeSH terms

LinkOut - more resources