Bioinformatics. 2017 Dec 1;33(23):3709-3715.
doi: 10.1093/bioinformatics/btx468.

Cloud-based interactive analytics for terabytes of genomic variants data

Cuiping Pan et al. Bioinformatics.

Abstract

Motivation: Large-scale genomic sequencing is now widely used to address questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. Given the quantity and diversity of these data, a robust and scalable solution for data handling and analysis is needed.

Results: We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate that such Big Data computing paradigms can provide orders-of-magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information.
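
To make two of the quantities named above concrete, the following is a minimal Python sketch of per-variant call rate and alternate allele frequency. It is not the authors' BigQuery code: the toy data, the variant identifiers, and the function names call_rate and alt_allele_frequency are all hypothetical, and the sketch assumes diploid genotypes stored as tuples of allele indices, with None marking a missing call.

```python
# Hypothetical toy data: sample -> {variant_id: genotype}.
# A genotype is a tuple of allele indices (0 = reference, >0 = alternate);
# None means the position was not called in that sample.
GENOTYPES = {
    "sample_a": {"chr1:100:A:G": (0, 1), "chr1:200:C:T": (1, 1)},
    "sample_b": {"chr1:100:A:G": (0, 0), "chr1:200:C:T": None},
    "sample_c": {"chr1:100:A:G": (1, 1), "chr1:200:C:T": (0, 1)},
}


def call_rate(genotypes, variant):
    """Fraction of samples that have a non-missing call at the variant."""
    calls = [sample_calls.get(variant) for sample_calls in genotypes.values()]
    called = sum(1 for call in calls if call is not None)
    return called / len(calls)


def alt_allele_frequency(genotypes, variant):
    """Alternate allele count divided by all called alleles at the variant.

    Returns None when no sample has a call, so the caller can tell
    "no data" apart from a frequency of zero.
    """
    alt_alleles = 0
    total_alleles = 0
    for sample_calls in genotypes.values():
        call = sample_calls.get(variant)
        if call is None:
            continue  # missing calls do not contribute to the denominator
        alt_alleles += sum(1 for allele in call if allele > 0)
        total_alleles += len(call)
    if total_alleles == 0:
        return None
    return alt_alleles / total_alleles


if __name__ == "__main__":
    for variant in ("chr1:100:A:G", "chr1:200:C:T"):
        print(variant, call_rate(GENOTYPES, variant),
              alt_allele_frequency(GENOTYPES, variant))
```

In a columnar database the same aggregation would typically be written as a grouped query over a variants table; the in-memory version here only illustrates the arithmetic.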

Availability and implementation: Our analysis framework is implemented in Google Cloud Platform and BigQuery. Code is available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs.

Contact: cuiping@stanford.edu or ptsao@stanford.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
The computational paradigm of cloud platform-based data processing and Dremel-based interactive analytics for large-scale genomic data. Shown here is the Google Cloud Platform-enabled solution. Variant calling from raw reads to genotypes is performed by GATK via Google Genomics API in Compute Engine Virtual Machines (VMs). Genomic data is represented in a Dremel database to enable interactive data QC and analytics.
Fig. 2.
Callability assessment. (A) Diagram showing the overall categories of genomic positions by callability and quality. (B) Percentage of detected genomic positions in each chromosome for each genome in this WGS study, based on all calls reported by GATK. (C) Uncalled regions (URs) and their length distribution: (left) number of URs per chromosome in each genome, categorized by length group; (middle) number of URs per chromosome in each genome, normalized by chromosome length; (right) commonality of URs across all genomes. (D) Numbers of SNVs and INDELs passing different QC levels. No_QC: all calls by GATK without any filtering. Seq_QC: calls passing the VQSR filtering in GATK. Full_QC: calls passing all levels of QC. (E) Percentage of genomic positions called as reference bases, SNVs and INDELs.
Fig. 3.
Saturation call rate for SNVs and INDELs. (A) Number of unique SNVs and INDELs by increasing number of genomes. (B) Distribution of allele frequencies for SNVs. (C) Distribution of allele frequencies for INDELs.
Fig. 4.
Dremel query performance assessment on real and simulated genomic data. Shown here are the query run times and cost for five representative queries on tables containing real genomic data, ranging from 5 to 461 genomes (A and C), and simulated genomic data, ranging from 1000 to 5000 genomes (B and D).
