Review
J Cell Physiol. 2014 Dec;229(12):1896-900.
doi: 10.1002/jcp.24662.

Big data bioinformatics

Casey S Greene et al. J Cell Physiol. 2014 Dec.

Erratum in

Abstract

Recent technological advances allow for high-throughput profiling of biological systems in a cost-efficient manner. The low cost of data generation is leading us to the "big data" era. The availability of big data provides unprecedented opportunities but also raises new challenges for data mining and analysis. In this review, we introduce key concepts in the analysis of big data, including "machine learning" algorithms, with both "unsupervised" and "supervised" examples. We note packages for the R programming language that are available to perform machine learning analyses. In addition to programming-based solutions, we review webservers that allow users with limited or no programming background to perform these analyses on large data compendia.

Figures

Fig. 1
The difference between supervised and unsupervised machine learning. A: In supervised machine learning, a training dataset with labeled classes, for example, case or control, is provided. A model is trained to maximally differentiate between cases and controls, and then the classes of new samples are determined. B: In an unsupervised machine learning model, all samples are unlabeled. Clustering algorithms, an example of unsupervised methods, discover groups of samples that are highly similar to each other and distinct from other samples.
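
The contrast described in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the nearest-centroid classifier, the small k-means routine, and the two-feature toy samples are all assumptions chosen for brevity, and the review itself points to R packages rather than Python.

```python
import random
from statistics import mean


def centroid(points):
    """Component-wise mean of equal-length feature vectors."""
    return [mean(column) for column in zip(*points)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_nearest_centroid(samples, labels):
    """Supervised: the labels decide which samples form each class centroid."""
    return {
        label: centroid([s for s, l in zip(samples, labels) if l == label])
        for label in set(labels)
    }


def predict(model, sample):
    """Assign a new sample to the class with the closest centroid."""
    return min(model, key=lambda label: squared_distance(model[label], sample))


def kmeans(samples, k, iterations=20, seed=0):
    """Unsupervised: group samples by similarity; no labels are consulted."""
    centers = random.Random(seed).sample(samples, k)
    assignment = [0] * len(samples)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda j: squared_distance(centers[j], s))
            for s in samples
        ]
        for j in range(k):
            members = [s for s, a in zip(samples, assignment) if a == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = centroid(members)
    return assignment


# Hypothetical toy data: two features per sample, invented for illustration.
controls = [[0.0, 0.2], [0.3, -0.1], [-0.2, 0.1], [0.1, 0.0]]
cases = [[5.0, 5.1], [4.8, 5.2], [5.3, 4.9], [5.1, 5.0]]
samples = controls + cases
labels = ["control"] * 4 + ["case"] * 4

# A: supervised. Train on labeled samples, then classify new ones.
model = train_nearest_centroid(samples, labels)
new_predictions = [predict(model, [4.9, 5.0]), predict(model, [0.1, 0.1])]

# B: unsupervised. Only the feature vectors are passed in.
clusters = kmeans(samples, k=2)
```

In the supervised half the class names come from the training labels. In the unsupervised half the cluster indices are arbitrary, so any mapping from cluster to "case" or "control" has to be made afterwards by the analyst.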
Fig. 2
Unsupervised analyses discover the predominant signals in the data. For example, principal component analysis applied to a dataset combined from two large studies of breast cancer identifies the study (METABRIC or TCGA) as the most important principal component (PC1). Such confounding factors have thus far made applying unsupervised analysis methods to broad compendia challenging, so this class of methods is most frequently used within large homogeneous datasets.
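
The confounding effect described in this caption can be reproduced on a small scale. The sketch below is an illustration under stated assumptions, not the analysis behind the figure: the eight three-feature samples are invented, the two "studies" differ only by a constant offset of 5, and the first principal component is found by plain power iteration rather than a library routine.

```python
from statistics import mean


def center_columns(rows):
    """Subtract each feature's mean so the data are centered at zero."""
    means = [mean(column) for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance (n - 1 denominator) of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def mat_vec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov and renormalize to unit length."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical data: a "subtype" signal moves features 0 and 1 in opposite
# directions; study B is study A shifted by +5 on every feature.
study_a = [[1.0, -1.0, 0.1], [1.2, -0.9, -0.1], [-1.0, 1.0, 0.0], [-1.1, 1.1, 0.2]]
study_b = [[6.0, 4.0, 5.1], [6.2, 4.1, 4.9], [4.0, 6.0, 5.0], [3.9, 6.1, 5.2]]

centered = center_columns(study_a + study_b)
cov = covariance_matrix(centered)
pc1 = first_principal_component(cov)

# Project each sample onto PC1; the first four scores belong to study A.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]

# Share of total variance carried by PC1 (its eigenvalue over the trace).
eigenvalue = sum(x * y for x, y in zip(pc1, mat_vec(cov, pc1)))
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
```

With these toy values the PC1 scores separate the two studies completely, while the subtype signal is pushed to a later component. That is the same pattern the caption reports for the combined METABRIC and TCGA data.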

