Nucleic Acids Res. 2012 Feb;40(4):e27.
doi: 10.1093/nar/gkr1124. Epub 2011 Dec 1.

GReEn: a tool for efficient compression of genome resequencing data

Armando J Pinho et al. Nucleic Acids Res. 2012 Feb.

Abstract

Research in the genomic sciences is confronted with a volume of sequencing and resequencing data that is growing faster than data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, storing sequencing and resequencing data efficiently is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS: it can compress sequences that GRS cannot handle, it runs faster, and it achieves compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz.

Figures

Figure 1.
The copy model. In this example, the copy model was restarted at position 341 587 of the reference sequence, corresponding to position 327 829 of the target sequence. Since then, it has correctly predicted 5 characters, if the case is considered, and a total of 11 characters if the case is ignored. The dashed arrow indicates a failed prediction. According to this example, the next character to be predicted is ‘G’.
Figure 2.
Data organized in a hash table.
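
The copy model described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not GReEn's implementation: the class `CopyModel`, the helper `run_copy_model`, and the toy sequences are hypothetical names and data chosen for illustration. The sketch assumes only what the caption states, namely that a pointer into the reference predicts the next target character, and that correct predictions can be counted with or without regard to case.

```python
from dataclasses import dataclass


@dataclass
class CopyModel:
    """Toy copy model: a pointer into the reference predicts the next target character."""
    reference: str
    ref_pos: int          # index in the reference of the next predicted character
    hits_exact: int = 0   # correct predictions when the case is considered
    hits_nocase: int = 0  # correct predictions when the case is ignored
    misses: int = 0       # failed predictions (wrong even when the case is ignored)

    def predict(self):
        """Return the character the model predicts next, or None past the reference end."""
        if self.ref_pos < len(self.reference):
            return self.reference[self.ref_pos]
        return None

    def update(self, actual):
        """Compare the prediction with the actual target character, then advance."""
        predicted = self.predict()
        if predicted is None:
            return
        if predicted == actual:
            self.hits_exact += 1
        if predicted.upper() == actual.upper():
            self.hits_nocase += 1
        else:
            self.misses += 1
        self.ref_pos += 1


def run_copy_model(reference, target, ref_start, tgt_start, steps):
    """(Re)start the model at ref_start/tgt_start and feed it `steps` target characters."""
    model = CopyModel(reference, ref_start)
    for ch in target[tgt_start:tgt_start + steps]:
        model.update(ch)
    return model


if __name__ == "__main__":
    # Toy data (assumed): the reference differs from the target in case and in one base.
    m = run_copy_model("ttACGTacgtAG", "ACGTACGTCA", ref_start=2, tgt_start=0, steps=9)
    print(m.hits_exact, m.hits_nocase, m.misses, m.predict())
```

In this toy run, the two hit counters diverge wherever the reference and target differ only in case, which mirrors the distinction between the two counts in the caption. How GReEn decides when to restart the model, and how it turns these counts into coding probabilities, is not specified in this record and is left out of the sketch.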
