Review. Trends Genet. 2019 Mar;35(3):223-234.
doi: 10.1016/j.tig.2018.12.006. Epub 2019 Jan 25.

Data Lakes, Clouds, and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data


Robert L Grossman. Trends Genet. 2019 Mar.

Abstract

Data commons collate data with cloud computing infrastructure and commonly used software services, tools, and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the past few years, data commons have been used to analyze, harmonize, and share large-scale genomics datasets. Data ecosystems can be built by interoperating multiple data commons. It can be quite labor intensive to curate, import, and analyze the data in a data commons. Data lakes provide an alternative to data commons and simply provide access to data, with the data curation and analysis deferred until later and delegated to those that access the data. We review software platforms for managing, analyzing, and sharing genomic data, with an emphasis on data commons, but also cover data ecosystems and data lakes.

Keywords: cancer genomics clouds; data clouds; data commons; data sharing.


Figures

Figure 1. Some of the Important Differences between Data Clouds and Data Commons.
Figure 2. Building a Data Commons. Data commons support the entire life cycle of data, including defining the data model, importing data, cleaning data, exploring data, analyzing data, and then sharing new research discoveries.
Figure 3. Key Figure: Data Commons Framework Services. This diagram shows how data commons framework services can support multiple data commons and an ecosystem of workspaces, notebooks, and applications.
Figure 4. Data Platforms. Data platforms can be categorized along four axes: the data architecture, the extent of the data curation and harmonization, the analysis architecture of a resource, and the analysis architecture of the ecosystem. The red lines can be viewed as classifying platforms using parallel coordinates and these four dimensions. The top line is the parallel coordinates associated with the National Cancer Institute (NCI) Cancer Research Data Commons, the line below is the parallel coordinates for the NCI Genomic Data Commons, the two lines below are two possible architectures for data lakes, while the bottom line is an architecture for a repository of files. Abbreviations: API, application programming interface; DCF, data commons framework; NA, not applicable; SaaS, software as a service.

