2023 Jun 13:2023:baad043.
doi: 10.1093/database/baad043.

GeniePool: genomic database with corresponding annotated samples based on a cloud data lake architecture

Noam Hadar et al. Database (Oxford).

Abstract

In recent years, there has been a huge influx of genomic data and a growing need to correlate it with phenotypes, yet existing genomic databases do not allow easy storage of, and access to, combined phenotypic-genotypic information. Freely accessible allele frequency (AF) databases, such as gnomAD, are crucial for evaluating variants but lack correlated phenotype data. The Sequence Read Archive (SRA) accumulates hundreds of thousands of next-generation sequencing (NGS) samples annotated with their submitters and various attributes. However, samples are stored in large raw-format files that are inaccessible to the typical user. To make thousands of NGS samples and their corresponding attributes easily available to clinicians and researchers, we built a pipeline that continuously downloads raw human NGS data uploaded to SRA using SRAtoolkit and preprocesses them using the GATK pipeline. Data are then stored efficiently in a cloud data lake and can be accessed via a representational state transfer application programming interface (REST API) and a user-friendly website. We thus generated GeniePool, a simple and intuitive web service and API for querying NGS data from SRA with direct access to information related to each sample and related studies, providing significant advantages over existing databases for both clinical and research use. Utilizing data lake infrastructure, we were able to generate a multi-purpose tool that can serve many clinical and research use cases. We expect users to explore the metadata served via GeniePool both in daily clinical practice and in versatile research endeavours. Database URL: https://geniepool.link.
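The abstract describes querying by genomic coordinates over a REST API. As a rough illustration of what a client for such an API might do, the following Python sketch validates a region string and builds a request URL. Only the base URL comes from the abstract; the path layout (`/api/variants/<assembly>/<region>`), the `hg38` default and all function names are hypothetical and should be checked against GeniePool's own API documentation before use.

```python
import re
from typing import NamedTuple

BASE_URL = "https://geniepool.link"  # database URL given in the abstract


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


# Accepts "chr1:100-200", "1:100-200", "X:5-10", "chrM:1-16569", and so on.
_REGION_RE = re.compile(r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):(\d+)-(\d+)$", re.IGNORECASE)


def parse_region(text: str) -> Region:
    """Parse a region string into (chrom, start, end), rejecting malformed input."""
    match = _REGION_RE.match(text.replace(",", "").strip())
    if match is None:
        raise ValueError(f"not a valid region: {text!r}")
    chrom = match.group(1).upper()
    start, end = int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in region: {text!r}")
    return Region(chrom, start, end)


def build_query_url(region: Region, assembly: str = "hg38") -> str:
    """Build a request URL. The path layout is an ASSUMPTION, not the documented API."""
    return f"{BASE_URL}/api/variants/{assembly}/{region.chrom}:{region.start}-{region.end}"
```

Validating the region before building the URL keeps malformed coordinates from ever reaching the server, which matters when queries are generated programmatically.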

Conflict of interest statement

The authors have no conflicts of interest to declare.

Figures

Figure 1.
GeniePool workflow. Raw genomic NGS data from SRA are preprocessed according to GATK’s best practices and stored efficiently using the Parquet format in a cloud data lake architecture. Preprocessed data are available via either a REST API or a designated web UI accompanied by BioSample data that provide information regarding specific samples.
Figure 2.
GeniePool’s UI. Genomic coordinates can be searched for variants within NGS samples from SRA. Results are displayed in a table with selectable rows. Variants can be filtered by sample attributes. Selecting a variant generates an interactive graph displaying relevant samples per study. Clicking a bar provides direct links for additional information regarding the study and each of the samples harbouring the variant.
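The filtering and per-study grouping described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustration of the idea only, not GeniePool's implementation: the record layout, the attribute names and the placeholder sample and study identifiers below are all made up for the example.

```python
from collections import Counter
from typing import Any, Dict, List

# Illustrative records for samples carrying one variant. All identifiers and
# attribute values are placeholders, not real SRA or BioSample data.
SAMPLES: List[Dict[str, Any]] = [
    {"sample": "SAMPLE_A", "study": "STUDY_1", "attributes": {"tissue": "blood", "sex": "female"}},
    {"sample": "SAMPLE_B", "study": "STUDY_1", "attributes": {"tissue": "skin", "sex": "male"}},
    {"sample": "SAMPLE_C", "study": "STUDY_2", "attributes": {"tissue": "blood", "sex": "male"}},
]


def filter_samples(samples: List[Dict[str, Any]], **wanted: str) -> List[Dict[str, Any]]:
    """Keep samples whose attributes match every key=value pair (case-insensitive)."""
    def matches(sample: Dict[str, Any]) -> bool:
        attrs = sample["attributes"]
        return all(
            str(attrs.get(key, "")).lower() == str(value).lower()
            for key, value in wanted.items()
        )
    return [sample for sample in samples if matches(sample)]


def samples_per_study(samples: List[Dict[str, Any]]) -> Counter:
    """Count samples per study, i.e. the bar heights of a per-study chart."""
    return Counter(sample["study"] for sample in samples)
```

For example, `samples_per_study(filter_samples(SAMPLES, tissue="blood"))` counts one sample each for `STUDY_1` and `STUDY_2`, which corresponds to two bars of height one in a per-study graph.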
