. 2013 Apr 9:14:121.

doi: 10.1186/1471-2105-14-121.

BioCode: two biologically compatible Algorithms for embedding data in non-coding and coding regions of DNA

David Haughton¹, Félix Balado

Affiliations

PMID: 23570444
PMCID: PMC3698116
DOI: 10.1186/1471-2105-14-121

BioCode: two biologically compatible Algorithms for embedding data in non-coding and coding regions of DNA

David Haughton et al. BMC Bioinformatics. 2013.

. 2013 Apr 9:14:121.

doi: 10.1186/1471-2105-14-121.

Authors

David Haughton¹, Félix Balado

Affiliation

¹ School of Computer Science and Informatics, University College Dublin, Belfield, Co, Dublin, Ireland. david.haughton@ucdconnect.ie

PMID: 23570444
PMCID: PMC3698116
DOI: 10.1186/1471-2105-14-121

Abstract

Background: In recent times, the application of deoxyribonucleic acid (DNA) has diversified with the emergence of fields such as DNA computing and DNA data embedding. DNA data embedding, also known as DNA watermarking or DNA steganography, aims to develop robust algorithms for encoding non-genetic information in DNA. Inherently DNA is a digital medium whereby the nucleotide bases act as digital symbols, a fact which underpins all bioinformatics techniques, and which also makes trivial information encoding using DNA straightforward. However, the situation is more complex in methods which aim at embedding information in the genomes of living organisms. DNA is susceptible to mutations, which act as a noisy channel from the point of view of information encoded using DNA. This means that the DNA data embedding field is closely related to digital communications. Moreover it is a particularly unique digital communications area, because important biological constraints must be observed by all methods. Many DNA data embedding algorithms have been presented to date, all of which operate in one of two regions: non-coding DNA (ncDNA) or protein-coding DNA (pcDNA).

Results: This paper proposes two novel DNA data embedding algorithms jointly called BioCode, which operate in ncDNA and pcDNA, respectively, and which comply fully with stricter biological restrictions. Existing methods comply with some elementary biological constraints, such as preserving protein translation in pcDNA. However there exist further biological restrictions which no DNA data embedding methods to date account for. Observing these constraints is key to increasing the biocompatibility and in turn, the robustness of information encoded in DNA.

Conclusion: The algorithms encode information in near optimal ways from a coding point of view, as we demonstrate by means of theoretical and empirical (in silico) analyses. Also, they are shown to encode information in a robust way, such that mutations have isolated effects. Furthermore, the preservation of codon statistics, while achieving a near-optimum embedding rate, implies that BioCode pcDNA is also a near-optimum first-order steganographic method.

PubMed Disclaimer

Figures

**Figure 1**
**Typical communications channel model.** An embedding function f (·) encodes a message m in a DNA sequence to produce y. If necessary this is done so with a host DNA sequence x and key k. y is transmitted through a channel to produce z, which is decoded using d (·).

**Figure 2**
**A schematic of the BioCode ncDNA algorithm.** The input message m, in conjunction with the trailing dinucleotide sequence [y_i−2,y_i−1] is used to perform a lookup of Table 1.

**Figure 3**
**A schematic of the BioCode pcDNA algorithm.** The message m and host DNA sequence $\bar{x}$ are inputs to the algorithm. The encoded sequence $\bar{y}$ is output, guaranteeing that the *codon bias preservation* and the *primary structure preservation* constraints are adhered to.

**Figure 4**
**Markov chain representing the probability of transition between trailing dinucleotide states.** $X^{2} ∖ D$ in this diagram represents all the dinucleotide sequences excluding those which may create start codons.

**Figure 5**
**Empirical results of BioCode ncDNA.** Shown is the probability of bit error using the Hamming distance ( $P_{b}^{H}$ ), for BioCode ncDNA (blue). Also shown is $P_{b}^{H}$ for BioCode ncDNA, first encoded with a watermark code (purple).

**Figure 6**
**Empirical results of BioCode ncDNA.** This is a log-log plot of the mutual information content of BioCode ncDNA compared to an optimal bound. Also shown is BioCode ncDNA encoded with the watermark code. Information content is given in bits/base.

**Figure 7**
**Empirical analysis of BioCode-pcDNA for different genes.** Shown is the probability of bit error using the Hamming distance ( $P_{b}^{H}$ ). BioCode pcDNA was used for encoding the data. Two of the genes have been used for encoding data in prior works [4,5].

**Figure 8**
**Empirical analysis of BioCode-pcDNA for different genes.** The mutual information content for three genes encoded with BioCode pcDNA is shown. It is given in bits/codon.

**Figure 9**
**BioCode pcDNA versus optimal bound.** The mutual information content for BioCode pcDNA and the optimal bound. The gene used for encoding and in computing the bound was the “ftsZ” gene.

**Figure 10**
**Empirical analysis of BioCode-pcDNA using resynchronisation error correction.** Shown is the probability of bit error using the Hamming distance for BioCode pcDNA alone (blue), BioCode pcDNA using a marker code (light blue) and BioCode pcDNA using a watermark code (purple). The gene used was the “ftsZ” gene.

**Figure 11**
**Empirical analysis of BioCode-pcDNA using resynchronisation error correction.** This plot shows the mutual information content for BioCode pcDNA alone, with a marker code and with a watermark code.

**Figure 12**
**Empirical analysis of BioCode-pcDNA using resynchronisation error correction.** This is a log-log plot of Figure 11 from 10⁴ to 10⁸ generations, showing the mutual information content for BioCode pcDNA alone, with a marker code and with a watermark code.

**Figure 13**
**Empirical analysis of BCE, Arita’s algorithm and DNA-Crypt.** This probability of bit error plot compares binary codon equivalency (BCE), Arita and Ohashi’s algorithm and DNA-Crypt. Arita and Ohashi’s algorithm requires that the original DNA sequence be available for decoding. BCE is a particular instance of BioCode pcDNA when the *codon bias preservation* constraint is not applied.

**Figure 14**
**Empirical analysis of BCE, Arita’s algorithm and DNA-Crypt.** This plot shows the mutual information content of BCE, Arita and Ohashi’s algorithm and DNA-Crypt.

See this image and copyright information in PMC

References

1. Clelland CT, Risca V, Bancroft C. Hiding messages in DNA microdots. Nature. 1999;399(6736):533–534. doi: 10.1038/21092. - DOI - PubMed
1. Wong PC, Wong K, Foote H. Organic data memory using the DNA, approach. Comms ACM. 2003;46:95–98.
1. Shimanovsky B, Feng J, Potkonjak M. Procs. of the 5th Intl. Workshop in Information Hiding Noordwijkerhout. The Netherlands; 2002. Hiding data in DNA; pp. 373–386.
1. Arita M, Ohashi Y. Secret signatures inside genomic DNA. Biotechnol Prog. 2004;20(5):1605–1607. doi: 10.1021/bp049917i. - DOI - PubMed
1. Heider D, Barnekow A. DNA-based Watermarks using the DNA-Crypt Algorithm. BMC Bioinformatics. 2007;8(176) - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions
Actions
Actions
Actions

Associated data

Actions
- Search in PubMed
- Search in Nucleotide
Actions
- Search in PubMed
- Search in Nucleotide
Actions
- Search in PubMed
- Search in Nucleotide

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

BioCode: two biologically compatible Algorithms for embedding data in non-coding and coding regions of DNA

Affiliation

BioCode: two biologically compatible Algorithms for embedding data in non-coding and coding regions of DNA

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Substances

Associated data

LinkOut - more resources

Full Text Sources

Other Literature Sources