Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Apr 9:14:121.
doi: 10.1186/1471-2105-14-121.

BioCode: two biologically compatible Algorithms for embedding data in non-coding and coding regions of DNA

Affiliations

BioCode: two biologically compatible Algorithms for embedding data in non-coding and coding regions of DNA

David Haughton et al. BMC Bioinformatics. .

Abstract

Background: In recent times, the application of deoxyribonucleic acid (DNA) has diversified with the emergence of fields such as DNA computing and DNA data embedding. DNA data embedding, also known as DNA watermarking or DNA steganography, aims to develop robust algorithms for encoding non-genetic information in DNA. Inherently DNA is a digital medium whereby the nucleotide bases act as digital symbols, a fact which underpins all bioinformatics techniques, and which also makes trivial information encoding using DNA straightforward. However, the situation is more complex in methods which aim at embedding information in the genomes of living organisms. DNA is susceptible to mutations, which act as a noisy channel from the point of view of information encoded using DNA. This means that the DNA data embedding field is closely related to digital communications. Moreover it is a particularly unique digital communications area, because important biological constraints must be observed by all methods. Many DNA data embedding algorithms have been presented to date, all of which operate in one of two regions: non-coding DNA (ncDNA) or protein-coding DNA (pcDNA).

Results: This paper proposes two novel DNA data embedding algorithms jointly called BioCode, which operate in ncDNA and pcDNA, respectively, and which comply fully with stricter biological restrictions. Existing methods comply with some elementary biological constraints, such as preserving protein translation in pcDNA. However there exist further biological restrictions which no DNA data embedding methods to date account for. Observing these constraints is key to increasing the biocompatibility and in turn, the robustness of information encoded in DNA.

Conclusion: The algorithms encode information in near optimal ways from a coding point of view, as we demonstrate by means of theoretical and empirical (in silico) analyses. Also, they are shown to encode information in a robust way, such that mutations have isolated effects. Furthermore, the preservation of codon statistics, while achieving a near-optimum embedding rate, implies that BioCode pcDNA is also a near-optimum first-order steganographic method.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Typical communications channel model. An embedding function f (·) encodes a message m in a DNA sequence to produce y. If necessary this is done so with a host DNA sequence x and key k. y is transmitted through a channel to produce z, which is decoded using d (·).
Figure 2
Figure 2
A schematic of the BioCode ncDNA algorithm. The input message m, in conjunction with the trailing dinucleotide sequence [yi−2,yi−1] is used to perform a lookup of Table 1.
Figure 3
Figure 3
A schematic of the BioCode pcDNA algorithm. The message m and host DNA sequence x¯ are inputs to the algorithm. The encoded sequence y¯ is output, guaranteeing that the codon bias preservation and the primary structure preservation constraints are adhered to.
Figure 4
Figure 4
Markov chain representing the probability of transition between trailing dinucleotide states.X2D in this diagram represents all the dinucleotide sequences excluding those which may create start codons.
Figure 5
Figure 5
Empirical results of BioCode ncDNA. Shown is the probability of bit error using the Hamming distance (PbH), for BioCode ncDNA (blue). Also shown is PbH for BioCode ncDNA, first encoded with a watermark code (purple).
Figure 6
Figure 6
Empirical results of BioCode ncDNA. This is a log-log plot of the mutual information content of BioCode ncDNA compared to an optimal bound. Also shown is BioCode ncDNA encoded with the watermark code. Information content is given in bits/base.
Figure 7
Figure 7
Empirical analysis of BioCode-pcDNA for different genes. Shown is the probability of bit error using the Hamming distance (PbH). BioCode pcDNA was used for encoding the data. Two of the genes have been used for encoding data in prior works [4,5].
Figure 8
Figure 8
Empirical analysis of BioCode-pcDNA for different genes. The mutual information content for three genes encoded with BioCode pcDNA is shown. It is given in bits/codon.
Figure 9
Figure 9
BioCode pcDNA versus optimal bound. The mutual information content for BioCode pcDNA and the optimal bound. The gene used for encoding and in computing the bound was the “ftsZ” gene.
Figure 10
Figure 10
Empirical analysis of BioCode-pcDNA using resynchronisation error correction. Shown is the probability of bit error using the Hamming distance for BioCode pcDNA alone (blue), BioCode pcDNA using a marker code (light blue) and BioCode pcDNA using a watermark code (purple). The gene used was the “ftsZ” gene.
Figure 11
Figure 11
Empirical analysis of BioCode-pcDNA using resynchronisation error correction. This plot shows the mutual information content for BioCode pcDNA alone, with a marker code and with a watermark code.
Figure 12
Figure 12
Empirical analysis of BioCode-pcDNA using resynchronisation error correction. This is a log-log plot of Figure 11 from 104 to 108 generations, showing the mutual information content for BioCode pcDNA alone, with a marker code and with a watermark code.
Figure 13
Figure 13
Empirical analysis of BCE, Arita’s algorithm and DNA-Crypt. This probability of bit error plot compares binary codon equivalency (BCE), Arita and Ohashi’s algorithm and DNA-Crypt. Arita and Ohashi’s algorithm requires that the original DNA sequence be available for decoding. BCE is a particular instance of BioCode pcDNA when the codon bias preservation constraint is not applied.
Figure 14
Figure 14
Empirical analysis of BCE, Arita’s algorithm and DNA-Crypt. This plot shows the mutual information content of BCE, Arita and Ohashi’s algorithm and DNA-Crypt.

References

    1. Clelland CT, Risca V, Bancroft C. Hiding messages in DNA microdots. Nature. 1999;399(6736):533–534. doi: 10.1038/21092. - DOI - PubMed
    1. Wong PC, Wong K, Foote H. Organic data memory using the DNA, approach. Comms ACM. 2003;46:95–98.
    1. Shimanovsky B, Feng J, Potkonjak M. Procs. of the 5th Intl. Workshop in Information Hiding Noordwijkerhout. The Netherlands; 2002. Hiding data in DNA; pp. 373–386.
    1. Arita M, Ohashi Y. Secret signatures inside genomic DNA. Biotechnol Prog. 2004;20(5):1605–1607. doi: 10.1021/bp049917i. - DOI - PubMed
    1. Heider D, Barnekow A. DNA-based Watermarks using the DNA-Crypt Algorithm. BMC Bioinformatics. 2007;8(176) - PMC - PubMed

Publication types

Associated data