Cell Genom. 2022 Jan 12;2(1):100085.
doi: 10.1016/j.xgen.2021.100085. Epub 2022 Jan 13.

Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space

Michael C Schatz et al. Cell Genom. 2022.

Abstract

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement, adds security measures for active threat detection and monitoring, and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consist of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.

Figures

Graphical abstract
Figure 1
Inverting the model for data sharing. (Left) In the traditional model, project data (shown in purple, orange, and green) are copied to multiple sites where they are accessed by users on institutional computing clusters. Under this model, each institution must establish its own data center, and collaboration is achieved primarily through copying files between data centers. (Right) In the inverted model, users connect to a cloud-enabled resource such as the AnVIL to remotely access and analyze the data without copying. In this model, users virtually access a unified data center, allowing for deeper collaboration and sharing of the results.
Figure 2
Overview of the AnVIL ecosystem. (Top) The AnVIL is a federated cloud environment for the analysis of large genomic and related datasets. The AnVIL is built on a set of established components that bring together widely used platforms. The Terra platform provides a compute environment with secure data and analysis sharing capabilities. Dockstore provides standards-based sharing of containerized tools and workflows. R/Bioconductor, Jupyter, and Galaxy provide environments for users at different skill levels to construct and execute analyses. The Gen3 data commons framework provides data and metadata ingest, querying, and organization. (Bottom) The AnVIL has been used in a number of flagship NHGRI and other genomics projects. Summary of the genomics datasets available within the AnVIL as of December 2021, as shown at https://anvilproject.org/data. WGS, whole-genome sequencing; WXS, whole-exome sequencing.