Compression for population genetic data through finite-state entropy
- PMID: 34590992
- DOI: 10.1142/S0219720021500268
Abstract
We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of samples in population-level genetic data. We identify conditional exchangeability in these data and recommend finite-state entropy algorithms as an arithmetic code naturally suited to compressing population genetic data. In computation and decompression tasks, we show speed and size improvements of between [Formula: see text] and [Formula: see text] over modern dictionary compression methods often used for population genetic data, such as Zstd and Zlib. We provide open-source prototype software for multi-phenotype GWAS with finite-state entropy compression, demonstrating significant space savings and speed comparable to the state of the art.
Keywords: Statistical genetics; big data; genome-wide association study; genotype compression; multi-phenotype analysis.
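The contrast the abstract draws between entropy coding and dictionary compression can be illustrated with a small sketch. The code below is not the authors' software and does not implement finite-state entropy itself. It only computes the zero-order empirical entropy of a simulated genotype vector, which is the size an entropy coder aims to approach, and compares it with the output of Python's standard `zlib` module. The simulation (independent samples, genotypes drawn under Hardy-Weinberg proportions, a minor allele frequency of 0.05) and all function names are assumptions made for illustration.

```python
import math
import random
import zlib
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Zero-order empirical Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def simulate_genotypes(n_samples, maf, seed=0):
    """Hypothetical simulation: each genotype is the count (0, 1 or 2)
    of minor alleles from two independent draws with frequency `maf`."""
    rng = random.Random(seed)
    return [
        int(rng.random() < maf) + int(rng.random() < maf)
        for _ in range(n_samples)
    ]


# One variant across 100,000 simulated samples (assumed values).
genotypes = simulate_genotypes(100_000, maf=0.05)

# Size an ideal zero-order entropy coder would approach, in bytes.
h = entropy_bits_per_symbol(genotypes)
entropy_bound_bytes = math.ceil(h * len(genotypes) / 8)

# Dictionary-based baseline: one byte per genotype, zlib at maximum level.
raw = bytes(genotypes)
zlib_bytes = len(zlib.compress(raw, 9))

print(f"entropy: {h:.3f} bits per genotype")
print(f"entropy bound: {entropy_bound_bytes} bytes")
print(f"zlib output: {zlib_bytes} bytes")
```

Because the simulated samples are exchangeable, their order carries no information that a dictionary method could exploit, so the symbol frequencies alone determine the achievable size. Real genotype data may differ from this simplified model.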