Algorithms Mol Biol. 2013 Nov 18;8(1):25.
doi: 10.1186/1748-7188-8-25.

Data compression for sequencing data

Sebastian Deorowicz et al. Algorithms Mol Biol.

Abstract

Post-Sanger sequencing methods produce tons of data, and there is general agreement that the challenge of storing and processing them must be addressed with data compression. In this review we first answer the question "why compression" in a quantitative manner. We then answer the questions "what" and "how" by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we return to the question "why compression" and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.

Figures

Figure 1
Trends in storage, transfer, and sequencing costs. The historic costs of low-end hard disk drives were taken from http://www.jcmit.com/diskprice.htm. They had been halving every 12 months in the 1990s and around 2000–2004. Then the halving time lengthened suddenly, to about 25 months. The real costs of sequencing, taken from the NHGRI Web page [5], reflect not only reagent costs, as in some studies, but also labor, administration, amortization of sequencing instruments, submission of data to a public database, etc. The significant change in sequencing costs around 2008 was caused by the popularization of second-generation technologies. The prices of Amazon storage and transfer reflect real market offers from the top data centers. Interestingly, storage costs at data centers drop very slowly, mainly because the cost of blank hard disks is only a part of the total cost of maintenance. The curves were not corrected for inflation.
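To make the two halving times in the caption concrete, the following minimal sketch compares how a cost falls when it halves every 12 months versus every 25 months. The function names and the normalized starting cost of 1.0 are assumptions made for illustration; only the 12-month and 25-month figures come from the caption, and the code does not reproduce the data behind Figure 1.

```python
def cost_after(initial_cost: float, months: float, halving_months: float) -> float:
    """Cost after `months`, assuming it halves every `halving_months` months."""
    return initial_cost * 0.5 ** (months / halving_months)


def annual_decline(halving_months: float) -> float:
    """Fraction by which the cost falls over 12 months for a given halving time."""
    return 1.0 - 0.5 ** (12.0 / halving_months)


if __name__ == "__main__":
    # Hypothetical normalized starting cost of 1.0, followed for five years.
    for halving in (12, 25):
        remaining = cost_after(1.0, 60, halving)
        print(f"halving every {halving} months: "
              f"{remaining:.4f} of the initial cost remains after 60 months, "
              f"annual decline {annual_decline(halving):.1%}")
```

Under these assumptions, a 12-month halving time corresponds to a 50% price drop per year, while a 25-month halving time corresponds to a noticeably smaller yearly drop, which is the slowdown the caption describes.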

References

    1. Metzker ML. Sequencing technologies–the next generation. Nat Rev Genet. 2010;11:31–46.
    2. Kahn SD. On the future of genomic data. Science. 2011;331:728–729.
    3. Roberts JP. Million veterans sequenced. Nat Biotechnol. 2013;31(6):470.
    4. Hall N. After the gold rush. Genome Biol. 2013;14(5):115.
    5. National Human Genome Research Institute, DNA Sequencing Costs. [http://www.genome.gov/sequencingcosts/] (accessed February 14, 2013)