Review
Genome Biol. 2019 May 29;20(1):109.
doi: 10.1186/s13059-019-1724-1.

Genomics and data science: an application within an umbrella

Fábio C P Navarro et al. Genome Biol.

Abstract

Data science allows the extraction of practical insights from large-scale data. Here, we contextualize it as an umbrella term, encompassing several disparate subdomains. We focus on how genomics fits as a specific application subdomain, in terms of well-known 3 V data and 4 M process frameworks (volume-velocity-variety and measurement-mining-modeling-manipulation, respectively). We further analyze the technical and cultural "exports" and "imports" between genomics and other data-science subdomains (e.g., astronomy). Finally, we discuss how data value, privacy, and ownership are pressing issues for data science applications, in general, and are especially relevant to genomics, due to the persistent nature of DNA.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
A holistic view of biomedical data science. a Biomedical data science emerged at the confluence of large-scale datasets connecting genomics, metabolomics, wearable devices, proteomics, health records, and imaging to statistics and computer science. b The 4 M processes framework. c The 5 V data framework
Fig. 2
Data volume growth in genomics versus other disciplines. a Data volume growth in genomics in the context of other domains and data infrastructure (computing power and network throughput). Continuous lines indicate the amount of data archived in public repositories in genomics (SRA), astronomy (Earth Data, NASA), and sociology (Harvard Dataverse). Data infrastructure measures such as computing power (TOP500 supercomputers) and network throughput (IP traffic) are also included. Dashed lines indicate projections of future growth in data volume and infrastructure capacity for the next decade. b Cumulative number of datasets generated for whole genome sequencing (WGS) and whole exome sequencing (WES) in comparison with molecular structure datasets such as X-ray and electron microscopy (EM). PDB, Protein Data Bank; SRA, Sequence Read Archive
Fig. 3
Variety of sequencing assays. Number of new sequencing protocols published per year. Popular protocols are highlighted in their year of publication and their connection to omes
Fig. 4
Technical exchanges between genomics and other data science subdisciplines. The background area displays the total number of publications per year for the terms: a hidden Markov model, b scale-free network, c latent Dirichlet allocation. Continuous lines indicate the fraction of papers related to topics in genomics and in other disciplines
Fig. 5
Open source adoption in genomics and other data science subdisciplines. The number of GitHub commits (upper panel) and new GitHub repositories (lower panel) per year for a variety of subfields. Subfield repositories were selected by GitHub topics such as genomics, astronomy, geography, molecular dynamics (Mol. Dynamics), quantum chemistry (Quantum Chem.), and ecology
