A new efficient data structure for storage and retrieval of multiple biosequences

Sascha Steinbiss¹, Stefan Kurtz

PMID: 22084150
DOI: 10.1109/TCBB.2011.146

Sascha Steinbiss et al. IEEE/ACM Trans Comput Biol Bioinform. 2012;9(2):330-44.

doi: 10.1109/TCBB.2011.146. Epub 2011 Nov 10.

Affiliation

¹ University of Hamburg, Hamburg.

Abstract

Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only 2 + 8 × 10^-6 bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.
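
The abstract does not describe the internal layout, so the following is only a minimal sketch of one common way to get close to 2 bits per character for DNA: pack each of A, C, G, T into 2 bits and record runs of wildcards in a separate sorted table. All names here (`PackedSeq`, `WildcardRun`, `packedseq_new`, `packedseq_get`, `packedseq_free`) are hypothetical and are not the GtEncseq API; the sketch also reports every wildcard as `N` and ignores case, which the real library may handle differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration; not the GtEncseq interface. */
typedef struct {
    size_t start;   /* position of the first wildcard in the run */
    size_t length;  /* number of consecutive wildcards */
} WildcardRun;

typedef struct {
    uint8_t *packed;    /* four characters per byte, 2 bits each */
    WildcardRun *runs;  /* sorted by start, non-overlapping */
    size_t num_runs;
    size_t length;      /* number of characters stored */
} PackedSeq;

/* Map a nucleotide to its 2-bit code, or -1 for anything else. */
static int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

void packedseq_free(PackedSeq *ps) {
    if (ps) { free(ps->packed); free(ps->runs); free(ps); }
}

PackedSeq *packedseq_new(const char *seq) {
    size_t n = strlen(seq), count = 0, r = 0;
    int in_run = 0;
    /* First pass: count wildcard runs so the table is sized exactly. */
    for (size_t i = 0; i < n; i++) {
        int wild = base_code(seq[i]) < 0;
        if (wild && !in_run) count++;
        in_run = wild;
    }
    PackedSeq *ps = malloc(sizeof *ps);
    if (!ps) return NULL;
    ps->length = n;
    ps->num_runs = count;
    ps->packed = calloc(n / 4 + 1, 1);
    ps->runs = malloc((count + 1) * sizeof *ps->runs);
    if (!ps->packed || !ps->runs) { packedseq_free(ps); return NULL; }
    /* Second pass: pack codes; wildcards get a dummy code 0 in the bit
       array and are remembered only through the run table. */
    for (size_t i = 0; i < n; i++) {
        int code = base_code(seq[i]);
        if (code < 0) {
            if (r > 0 && ps->runs[r - 1].start + ps->runs[r - 1].length == i) {
                ps->runs[r - 1].length++;
            } else {
                ps->runs[r].start = i;
                ps->runs[r].length = 1;
                r++;
            }
            code = 0;
        }
        ps->packed[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
    return ps;
}

/* Random access: binary-search the run table, then read 2 bits. */
char packedseq_get(const PackedSeq *ps, size_t pos) {
    size_t lo = 0, hi = ps->num_runs;
    while (lo < hi) {  /* find the first run that ends after pos */
        size_t mid = lo + (hi - lo) / 2;
        if (ps->runs[mid].start + ps->runs[mid].length <= pos) lo = mid + 1;
        else hi = mid;
    }
    if (lo < ps->num_runs && ps->runs[lo].start <= pos) return 'N';
    return "ACGT"[(ps->packed[pos / 4] >> (2 * (pos % 4))) & 3];
}
```

In this sketch the overhead beyond 2 bits per character is the run table alone, so it stays small whenever wildcards occur in a few long runs rather than scattered singly. This is in the spirit of the abstract's remark about representations optimized for different wildcard distributions, though the actual representations used by GtEncseq are not given here.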

Cited by

Data compression for sequencing data.
Deorowicz S, Grabowski S. Algorithms Mol Biol. 2013 Nov 18;8(1):25. doi: 10.1186/1748-7188-8-25. PMID: 24252160 Free PMC article.
An automated real-time integration and interoperability framework for bioinformatics.
Lopes P, Oliveira JL. BMC Bioinformatics. 2015 Oct 13;16:328. doi: 10.1186/s12859-015-0761-3. PMID: 26464306 Free PMC article.
Readjoiner: a fast and memory efficient string graph-based sequence assembler.
Gonnella G, Kurtz S. BMC Bioinformatics. 2012 May 6;13:82. doi: 10.1186/1471-2105-13-82. PMID: 22559072 Free PMC article.
LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons.
Steinbiss S, Kastens S, Kurtz S. Mob DNA. 2012 Nov 7;3(1):18. doi: 10.1186/1759-8753-3-18. PMID: 23131050 Free PMC article.
Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.
Kredens KV, Martins JV, Dordal OB, Ferrandin M, Herai RH, Scalabrin EE, Ávila BC. PLoS One. 2020 May 26;15(5):e0232942. doi: 10.1371/journal.pone.0232942. eCollection 2020. PMID: 32453750 Free PMC article.
