Bioinformatics. 2017 Aug 1;33(15):2251-2257.
doi: 10.1093/bioinformatics/btx145.

SeqArray-a storage-efficient high-performance data format for WGS variant calls

Xiuwen Zheng et al. Bioinformatics. 2017.

Abstract

Motivation: Whole-genome sequencing (WGS) data are being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here we introduce a new WGS variant data format, implemented in the R/Bioconductor package 'SeqArray', for storing variant calls in an array-oriented manner. It provides the same capabilities as VCF, but with multiple high-compression options and data access using high-performance parallel computing.

Results: Benchmarks using 1000 Genomes Phase 3 data show file sizes of 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF), 3.5 Gb (BGT) and 2.6 Gb (SeqArray), respectively. Reading genotypes with the SeqArray package is two to three times faster than with the htslib C library using BCF files. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users with a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data.
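
To make the allele frequency benchmark concrete, the following is a minimal Python sketch of that calculation on an array-oriented genotype layout. It is not SeqArray's implementation or API (the package itself targets R and C++). The function name `ref_allele_freq`, the `genotypes[variant][sample][copy]` layout and the use of `None` for missing calls are assumptions made purely for illustration.

```python
from typing import List, Optional

# Hypothetical layout: genotypes[variant][sample] is a list of allele
# indices, one per chromosome copy (0 = reference, 1+ = alternate,
# None = missing call).
Genotypes = List[List[List[Optional[int]]]]


def ref_allele_freq(genotypes: Genotypes) -> List[Optional[float]]:
    """Return the reference allele frequency for each variant.

    Missing calls (None) are excluded from the denominator. A variant
    with no called alleles yields None.
    """
    freqs: List[Optional[float]] = []
    for variant in genotypes:
        ref_count = 0
        called = 0
        for sample in variant:
            for allele in sample:
                if allele is None:
                    continue  # skip missing calls
                called += 1
                if allele == 0:
                    ref_count += 1
        freqs.append(ref_count / called if called else None)
    return freqs


# Three variants, two diploid samples (made-up data).
example: Genotypes = [
    [[0, 0], [0, 1]],              # 3 of 4 called alleles are reference
    [[1, 1], [None, 0]],           # 1 of 3 called alleles is reference
    [[None, None], [None, None]],  # nothing called
]
print(ref_allele_freq(example))
```

Because each variant's genotypes sit together in one array, the loop reads one variant at a time and never parses text, which is the general idea behind array-oriented access; the speed-ups reported above come from the package's own implementation, not from this sketch.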

Availability and implementation: http://www.bioconductor.org/packages/SeqArray.

Contact: zhengx@u.washington.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1
SeqArray framework and ecosystem. The SeqArray file format is built on top of the GDS format, a generic data container with hierarchical structure for storing multiple array-oriented data sets. Access to GDS is either through an efficient C++ library or a high-level R interface. The SeqArray package creates GDS files and offers functionality specific to WGS variant data. At a minimum, a SeqArray file contains sample and variant identifiers, position, chromosome, and reference and alternate alleles for each variant. The functionality of the SeqArray package is extended by other R/Bioconductor packages such as SeqVarTools, SNPRelate and GENESIS, which provide an ecosystem for WGS analyses on top of the SeqArray file format.
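
The minimum contents listed in the caption can be pictured as a set of parallel arrays, one per field. The Python sketch below is only an illustration of that idea: the class `MinimalVariantStore`, its field and method names, and the comma-separated allele string are hypothetical and do not reproduce the GDS or SeqArray interfaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinimalVariantStore:
    """Hypothetical stand-in for the minimum fields named in the caption.

    Each per-variant field is a separate array of equal length, so one
    field can be scanned without reading the others.
    """
    sample_id: List[str]
    variant_id: List[int]
    position: List[int]
    chromosome: List[str]
    allele: List[str]  # assumed encoding: reference then alternates, e.g. "A,G"

    def validate(self) -> None:
        """Check that all per-variant arrays have the same length."""
        n = len(self.variant_id)
        for name in ("position", "chromosome", "allele"):
            if len(getattr(self, name)) != n:
                raise ValueError(f"{name} has length != {n}")

    def positions_on(self, chrom: str) -> List[int]:
        """Return positions of variants on one chromosome."""
        return [p for c, p in zip(self.chromosome, self.position) if c == chrom]


# Made-up example: two samples, three variants.
store = MinimalVariantStore(
    sample_id=["S1", "S2"],
    variant_id=[1, 2, 3],
    position=[10177, 10352, 16050075],
    chromosome=["1", "1", "22"],
    allele=["A,AC", "T,TA", "A,G"],
)
store.validate()
print(store.positions_on("1"))
```

Keeping each field in its own array means a query such as `positions_on` touches only the chromosome and position arrays, which is the kind of selective access a hierarchical, array-oriented container is designed to allow.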
