A new efficient data structure for storage and retrieval of multiple biosequences
- PMID: 22084150
- DOI: 10.1109/TCBB.2011.146
Abstract
Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only 2 + 8 × 10^-6 bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.
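To make the idea of "about 2 bits per character plus a small wildcard overhead" concrete, the following is a minimal Python sketch of one naive way such a store could work: nucleotides are packed four to a byte, and runs of wildcards are kept in a separate sorted range table. This is an illustration only, not the GtEncseq implementation or its API; the class `TwoBitSeq`, its methods, and the 64-bit cost assumed per range entry are all hypothetical, and the abstract itself mentions several internal representations that are not modelled here.

```python
from bisect import bisect_right

ALPHABET = "ACGT"                       # 4 symbols, so 2 bits per character
CODE = {c: i for i, c in enumerate(ALPHABET)}
WILDCARD = "N"                          # every wildcard is reported as "N"


class TwoBitSeq:
    """Toy 2-bit sequence store with wildcard ranges (hypothetical, not GtEncseq)."""

    def __init__(self, seq):
        self.length = len(seq)
        self.packed = bytearray((self.length + 3) // 4)  # 4 characters per byte
        self.range_starts = []   # start of each wildcard run (sorted)
        self.range_ends = []     # end of each wildcard run (exclusive)
        for pos, ch in enumerate(seq.upper()):
            if ch in CODE:
                self.packed[pos // 4] |= CODE[ch] << (2 * (pos % 4))
            elif self.range_ends and self.range_ends[-1] == pos:
                self.range_ends[-1] = pos + 1            # extend the current run
            else:
                self.range_starts.append(pos)            # open a new run
                self.range_ends.append(pos + 1)

    def is_wildcard(self, pos):
        # Binary search for the last run starting at or before pos.
        i = bisect_right(self.range_starts, pos) - 1
        return i >= 0 and pos < self.range_ends[i]

    def __getitem__(self, pos):
        # Random access to a single character.
        if not 0 <= pos < self.length:
            raise IndexError(pos)
        if self.is_wildcard(pos):
            return WILDCARD
        return ALPHABET[(self.packed[pos // 4] >> (2 * (pos % 4))) & 3]

    def substring(self, start, end, reverse=False):
        # Sequential access to [start, end), optionally read right to left.
        chars = [self[p] for p in range(start, end)]
        return "".join(reversed(chars)) if reverse else "".join(chars)

    def bits_per_char(self, bits_per_range_entry=64):
        # 2 bits per character plus the assumed cost of the range table.
        if self.length == 0:
            return 0.0
        overhead = len(self.range_starts) * bits_per_range_entry
        return (2 * self.length + overhead) / self.length


seq = TwoBitSeq("ACGTNNNNacgtN")
print(seq.substring(0, 13))                # forward read
print(seq.substring(0, 4, reverse=True))   # same region, reverse direction
print(seq.bits_per_char())
```

The overhead term depends on the number of wildcard runs, not on the number of wildcard characters, which is why long runs are cheap to store. Taking the abstract's figures at face value, 2 bits for each of 3.1 × 10^9 characters comes to roughly 775 MB, while the additional 8 × 10^-6 bits per character amounts to only about 3 kB for the whole genome.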
Cited by
- Data compression for sequencing data. Algorithms Mol Biol. 2013 Nov 18;8(1):25. doi: 10.1186/1748-7188-8-25. PMID: 24252160. Free PMC article.
- An automated real-time integration and interoperability framework for bioinformatics. BMC Bioinformatics. 2015 Oct 13;16:328. doi: 10.1186/s12859-015-0761-3. PMID: 26464306. Free PMC article.
- Readjoiner: a fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics. 2012 May 6;13:82. doi: 10.1186/1471-2105-13-82. PMID: 22559072. Free PMC article.
- LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons. Mob DNA. 2012 Nov 7;3(1):18. doi: 10.1186/1759-8753-3-18. PMID: 23131050. Free PMC article.
- Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review. PLoS One. 2020 May 26;15(5):e0232942. doi: 10.1371/journal.pone.0232942. eCollection 2020. PMID: 32453750. Free PMC article.