. 2024 Aug;11(30):e2402951.

doi: 10.1002/advs.202402951. Epub 2024 Jun 14.

Overcoming the High Error Rate of Composite DNA Letters-Based Digital Storage through Soft-Decision Decoding

Yaping Xu¹, Lulu Ding^{1

2}, Shigang Wu¹, Jue Ruan¹

Affiliations

¹ Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, 7 Pengfei Street Dapeng New District, Shenzhen, 518120, P. R. China.
² National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, P. R. China.

PMID: 38874370
PMCID: PMC11321706
DOI: 10.1002/advs.202402951

Overcoming the High Error Rate of Composite DNA Letters-Based Digital Storage through Soft-Decision Decoding

Yaping Xu et al. Adv Sci (Weinh). 2024 Aug.

. 2024 Aug;11(30):e2402951.

doi: 10.1002/advs.202402951. Epub 2024 Jun 14.

Authors

Yaping Xu¹, Lulu Ding^{1

2}, Shigang Wu¹, Jue Ruan¹

Affiliations

¹ Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, 7 Pengfei Street Dapeng New District, Shenzhen, 518120, P. R. China.
² National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, P. R. China.

PMID: 38874370
PMCID: PMC11321706
DOI: 10.1002/advs.202402951

Abstract

Composite DNA letters, by merging all four DNA nucleotides in specified ratios, offer a pathway to substantially increase the logical density of DNA digital storage (DDS) systems. However, these letters are susceptible to nucleotide errors and sampling bias, leading to a high letter error rate, which complicates precise data retrieval and augments reading expenses. To address this, Derrick-cp is introduced as an innovative soft-decision decoding algorithm tailored for DDS utilizing composite letters. Derrick-cp capitalizes on the distinctive error sensitivities among letters to accurately predict and rectify letter errors, thus enhancing the error-correcting performance of Reed-Solomon codes beyond traditional hard-decision decoding limits. Through comparative analyses in the existing dataset and simulated experiments, Derrick-cp's superiority is validated, notably halving the sequencing depth requirement and slashing costs by up to 22% against conventional hard-decision strategies. This advancement signals Derrick-cp's significant role in elevating both the precision and cost-efficiency of composite letter-based DDS.

Keywords: DNA digital storage (DDS); composite DNA letter; error‐correcting code (ECC); soft‐decision decoding.

PubMed Disclaimer

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Figure 1**
The core principles of Derrick‐cp decoding algorithm. A) A graphical representation of the letter transition library for $k = 2$ and a sampling depth of 10. The concentric circle consists of ten sectors. Each sector describes the composite letter uncertainty, with three tracks. The middle track is the main track, representing the letter error rate. The inner track represents a transition outward, while the outer track represents a transition inward. Notably, for a given letter in the middle track, the corresponding letters in both the inner and outer tracks are almost identical. B) The schematic diagram (a) and corresponding example (b–e) of letter inference and transition library constructing strategies. In the schematic diagram, (b) the central orange point $σ_{i}$ represents an original letter, and the first set of arrows from the orange point $σ_{i}$ indicate the process of synthesis, storage, and sequencing, (c) resulting in observed frequencies at a specific sampling depth $N$ , represented by the blue points $X_{seq}^{(N)}$ . Assuming a sampling depth of $N$ , there are $C_{N + 3}^{N}$ different combinations of observed frequencies, each with a distinct probability of occurrence, which is reflected in the varying sizes of the blue points in the blue layer. (d) The second set of arrows from the blue layer illustrates the process of letter inference, clustered by the closest letter represented by green points $σ_{j}^{*}$ . (e) By calculating the proportion of each cluster, the transition probability from $σ_{i}$ to $σ_{j}^{*}$ in $k$ is obtained. Overall, beginning with the central orange point $σ_{i}$ , the process culminates in the outermost green point $σ_{j}^{*}$ , representing the completion of the transition library construction. The size of the green points reflects the probability of transition from $σ_{i}$ to $σ_{j}^{*}$ .

**Figure 2**
The decoding processes with an example of decoding an RS (45, 41) block. The soft decision decoder operates through four steps. A) Employs MAP probability to infer the composite letter based on the observed base frequencies. B) It assesses the reliability of the inferred letter by measuring the Euclidean distance between the observed frequencies and the inferred letter. C) It generates an alternative letterset for amending the previously inferred letter. D) It iterates the predicted error positions and corresponding values for soft decision decoding. The iterating process begins by attempting to correct a single position. In this example, there are four potential error positions, each with two alternative values. The iteration follows a specific order determined by reliability sorting, I>II>III>IV>V. Subsequently, the algorithm proceeds to correct two positions and the number of candidate error positions increased by one in the algorithm, when correcting one position decoding fails. When correcting two positions, there are six possible combinations of positions. The figure showcases the decoding paths and their sequential order when positions 4 and 41 are for correction, represented as: ①>②>③>④. The remaining errors are corrected by the RS code, which is within the correction capability of the RS code.

**Figure 3**
Decoding performance of Derrick‐cp algorithm in simulation tests. A) The accuracy of letter inference utilizing the MAP probability model. In a composite DNA‐based storage system using different alphabets, within the sequencing depth that the Derrick‐cp can decode with 100% accuracy, the letters Inference accuracy can reach 98%. B) Comparison of Derrick‐cp and hard‐decision decoding by the number of failed matrices with different alphabet tests. The horizontal axis represents the alphabet size, while the vertical axis represents the sequencing depth. The blue blocks indicate the failed matrices by utilizing the hard‐decision decoding strategy, while the orange blocks represent the adoption of the soft‐decision decoding strategy. Notably, the horizontal line labeled “490×” indicates that with the implementation of soft‐decision decoding, data encoded with an alphabet size of 256 can still be recovered with 100% accuracy. Conversely, with hard‐decision decoding, only data encoded with a reduced alphabet size of 64 can be recovered with the same level of accuracy. C) Comparison of overall costs of DDS systems employing different alphabets and decoding strategies. The horizontal axis represents the alphabet size, while the vertical axis represents the normalized cost. In this context, the costs are standardized by normalizing them to the total cost of a standard DDS system that utilizes four standard nucleotides and employs hard‐decision decoding.

See this image and copyright information in PMC

Cited by

Pragmatic soft-decision data readout of encoded large DNA.
Ge Q, Qin R, Liu S, Guo Q, Han C, Chen W. Ge Q, et al. Brief Bioinform. 2025 Mar 4;26(2):bbaf102. doi: 10.1093/bib/bbaf102. Brief Bioinform. 2025. PMID: 40091194 Free PMC article.

References

1. a) Dong Y., Sun F., Ping Z., Ouyang Q., Qian L., Nat. Sci. Rev. 2020, 7, 1092; - PMC - PubMed
2. b) Meiser L. C., Antkowiak P. L., Koch J., Chen W. D., Kohll A. X., Stark W. J., Heckel R., Grass R. N., Nat. Protoc. 2020, 15, 86; - PubMed
3. c) Ping Z., Ma D., Huang X., Chen S., Liu L., Guo F., Zhu S. J., Shen Y., GigaScience 2019, 8, giz075; - PMC - PubMed
4. d) Zhirnov V., Zadegan R. M., Sandhu G. S., Church G. M., Hughes W. L., Nat. Mater. 2016, 15, 366. - PMC - PubMed
1. a) Nguyen T. T., Cai K., Immink K. A. S., Kiah H. M., in IEEE International Symposium on Information Theory (ISIT) 2020, 694;
2. b) Wang Y., Noor‐A‐Rahim M., Gunawan E., Guan Y. L., Poh C. L., IEEE Communications Letters 2019, 23, 963;
3. c) Ceze L., Nivala J., Strauss K., Nat. Rev. Genet. 2019, 20, 456. - PubMed
1. Anavy L., Vaknin I., Atar O., Amit R., Yakhini Z., Nat. Biotechnol. 2019, 37, 1229. - PubMed
1. a) Aird D., Ross M. G., Chen W.‐S., Danielsson M., Fennell T., Russ C., Jaffe D. B., Nusbaum C., Gnirke A., Genome Biol. 2011, 12, R18; - PMC - PubMed
2. b) Heckel R., Mikutis G., Grass R. N., Sci. Rep. 2019, 9, 9663; - PMC - PubMed
3. c) Jagers P., Klebaner F., J. Theor. Biol. 2003, 224, 299. - PubMed
1. a) Chen Y.‐J., Takahashi C. N., Organick L., Bee C., Ang S. D., Weiss P., Peck B., Seelig G., Ceze L., Strauss K., Nat. Commun. 2020, 11, 3264; - PMC - PubMed
2. b) Organick L., Ang S. D., Chen Y.‐J., Lopez R., Yekhanin S., Makarychev K., Racz M. Z., Kamath G., Gopalan P., Nguyen B., Takahashi C. N., Newman S., Parker H.‐Y., Rashtchian C., Stewart K., Gupta G., Carlson R., Mulligan J., Carmean D., Seelig G., Ceze L., Strauss K., Nat. Biotechnol. 2018, 36, 242. - PubMed

MeSH terms

Actions
Actions
Actions

Substances

Actions

Grants and funding

LinkOut - more resources

Full Text Sources
- PubMed Central
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Overcoming the High Error Rate of Composite DNA Letters-Based Digital Storage through Soft-Decision Decoding

Affiliations

Overcoming the High Error Rate of Composite DNA Letters-Based Digital Storage through Soft-Decision Decoding

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

Similar articles

Cited by

References

MeSH terms

Substances

Grants and funding

LinkOut - more resources

Full Text Sources

Miscellaneous

Abstract

Conflict of interest statement

Figures

Similar articles

Cited by

References

MeSH terms

Substances

Related information

Grants and funding

LinkOut - more resources

Full Text Sources

Miscellaneous