Challenges of Big Data Analysis

Jianqing Fan et al. Natl Sci Rev. 2014 Jun;1(2):293-314.
doi: 10.1093/nsr/nwt032.

Abstract

Big Data bring new opportunities to modern society and new challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be found with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and how these features drive paradigm changes in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions underlying most statistical methods cannot be validated for Big Data because of incidental endogeneity; violating these assumptions can lead to wrong statistical inference and, consequently, wrong scientific conclusions.

Keywords: Big Data; data storage; high dimensional data; incidental endogeneity; large-scale optimization; massive data; massively parallel data processing; noise accumulation; random projection; scalability; spurious correlation.


Figures

Figure 1
Scatter plots of projections of the observed data (n = 100 from each class) onto the first two principal components of the best m-dimensional selected feature space. Projected data marked “•” belong to the first class and those marked “▲” belong to the second class.
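For intuition, a minimal Python sketch of the kind of experiment behind Figure 1 follows, assuming two Gaussian classes whose means differ only in the first 10 coordinates; the sizes, mean shift, and feature-selection rule (ranking by absolute two-sample mean gap) are illustrative assumptions, not the authors' exact setup. As m grows past the true sparsity, selected noise features accumulate and the two classes become harder to separate in the projected space.

import numpy as np

rng = np.random.default_rng(0)
n, d, s = 100, 1000, 10          # per-class sample size, dimension, sparsity

mu = np.zeros(d)
mu[:s] = 1.5                     # sparse mean shift for the second class
X1 = rng.standard_normal((n, d))        # class 1 ~ N(0, I)
X2 = rng.standard_normal((n, d)) + mu   # class 2 ~ N(mu, I)

def project_best_m(m):
    # Select the m features with the largest absolute mean gap, then
    # project the pooled data onto their first two principal components.
    gap = np.abs(X1.mean(axis=0) - X2.mean(axis=0))
    idx = np.argsort(gap)[::-1][:m]
    Z = np.vstack([X1[:, idx], X2[:, idx]])
    Z = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T                 # 2-D projection of all 2n points

for m in (2, 10, 100, 1000):
    P = project_best_m(m)
    sep = np.linalg.norm(P[:n].mean(axis=0) - P[n:].mean(axis=0))
    print(f"m = {m:4d}: class-mean separation in 2-D = {sep:.2f}")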
Figure 2
Illustration of spurious correlation. (a) Distribution of the maximum absolute sample correlation coefficient between X_1 and {X_j}_{j≠1}. (b) Distribution of the maximum absolute sample correlation coefficient between X_1 and the closest linear projection of any four members of {X_j}_{j≠1} onto X_1. Here the dimension d is 800 or 6,400 and the sample size n is 60. The results are based on 1,000 simulations.
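The simulation in panel (a) is easy to reproduce in outline: draw n = 60 i.i.d. standard normal observations of d variables, so any sample correlation with X_1 is spurious by construction, and record the maximum absolute correlation over each replication. A minimal numpy sketch:

import numpy as np

rng = np.random.default_rng(0)
n, n_sim = 60, 1000

for d in (800, 6400):
    max_corr = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, d))
        Xc = X - X.mean(axis=0)
        Xc /= np.linalg.norm(Xc, axis=0)   # standardize each column
        r = Xc[:, 1:].T @ Xc[:, 0]         # sample corr(X_1, X_j), j != 1
        max_corr[b] = np.abs(r).max()
    print(f"d = {d:4d}: median max |corr| = {np.median(max_corr):.2f}")

Even though all variables are independent, the median maximum correlation is far from zero and grows with d, which is the spurious-correlation phenomenon the figure illustrates.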
Figure 3
Illustration of incidental endogeneity on a microarray gene expression data set. Left panel: distribution of the sample correlations Corr^(X_j, Y) for j = 1, …, 12,718. Right panel: distribution of the sample correlations Corr^(X_j, ε^), where ε^ denotes the residual noise after the Lasso fit. We provide the distributions of the sample correlations using both the raw data and the permuted data.
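A minimal sketch of this diagnostic, on synthetic exogenous data (an assumption; the paper applies it to 12,718 microarray genes): fit the Lasso, compute the sample correlation between each covariate and the residuals, and compare against a permutation reference that breaks any covariate-residual dependence.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 120, 500                          # illustrative sizes (assumption)
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)    # exogenous noise by construction

resid = y - Lasso(alpha=0.1).fit(X, y).predict(X)
corrs = np.array([np.corrcoef(X[:, j], resid)[0, 1] for j in range(d)])

# Permutation reference: shuffling the residuals breaks any dependence
# between covariates and noise, as in the paper's permuted-data curves.
perm = rng.permutation(resid)
corrs_perm = np.array([np.corrcoef(X[:, j], perm)[0, 1] for j in range(d)])

print(f"raw : sd of corr(X_j, resid) = {corrs.std():.3f}")
print(f"perm: sd of corr(X_j, resid) = {corrs_perm.std():.3f}")
# On data with incidental endogeneity the raw distribution deviates
# visibly from the permutation reference; here the two should agree.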
Figure 4
Visualization of the penalty functions. In all cases, λ = 1. For SCAD and MCP, different values of γ are chosen, as shown in the graphs.
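For reference, the penalties shown in Figure 4 have standard closed forms; a short sketch (λ = 1 as in the figure, while γ = 3.7 for SCAD and γ = 2 for MCP are common choices assumed here, since the figure's exact values are not recoverable from the caption):

import numpy as np

def l1(t, lam=1.0):
    # Lasso (L1) penalty.
    return lam * np.abs(t)

def scad(t, lam=1.0, gamma=3.7):
    # SCAD penalty; requires gamma > 2.
    a = np.abs(t)
    quad = (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1))
    return np.where(a <= lam, lam * a,
                    np.where(a <= gamma * lam, quad,
                             lam**2 * (gamma + 1) / 2))

def mcp(t, lam=1.0, gamma=2.0):
    # Minimax concave penalty (MCP); requires gamma > 1.
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a**2 / (2 * gamma),
                    gamma * lam**2 / 2)

t = np.linspace(-4, 4, 9)
print(np.round(scad(t), 2))   # constant for |t| > gamma*lam: no bias on large signals
print(np.round(mcp(t), 2))

Unlike the L1 penalty, SCAD and MCP flatten out for large |t|, which is what reduces the estimation bias on strong signals.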
Figure 5
Diagnostics of the modeling assumptions of the FGMM on a microarray gene expression data set. Left panel: distribution of the sample correlations Corr^(X_j, ε^) for j = 1, …, 12,718. Right panel: distribution of the sample correlations Corr^(X_j, ε^) and Corr^(X_j^2, ε^) for only the 18 selected genes. Here ε^ denotes the residual noise after the FGMM fit.
Figure 6
An illustration of Cloudera's open-source Hadoop distribution (source: Cloudera website).
Figure 7
An illustration of the HDFS architecture.
Figure 8
An illustration of the MapReduce paradigm for the symbol-counting task. Mappers are applied to every element of the input sequences and emit intermediate (key, value) pairs. Reducers are applied to all values associated with the same key. Between the map and reduce stages are intermediate steps involving distributed sorting and grouping.
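A minimal single-machine sketch of this paradigm for the symbol-counting task (real deployments distribute the map, shuffle/sort, and reduce stages across a Hadoop cluster; here they are mimicked locally):

from itertools import groupby
from operator import itemgetter

def mapper(sequence):
    # Emit an intermediate (key, value) pair for every symbol seen.
    for symbol in sequence:
        yield (symbol, 1)

def reducer(key, values):
    # Combine all values associated with the same key.
    return (key, sum(values))

inputs = ["abca", "bbc", "cab"]

# Map stage: apply the mapper to every element of the input sequences.
intermediate = [kv for seq in inputs for kv in mapper(seq)]

# Shuffle stage: distributed sorting and grouping by key (local here).
intermediate.sort(key=itemgetter(0))
grouped = groupby(intermediate, key=itemgetter(0))

# Reduce stage: one reducer call per distinct key.
counts = [reducer(k, (v for _, v in group)) for k, group in grouped]
print(counts)   # [('a', 3), ('b', 4), ('c', 3)]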
Figure 9
A typical Hadoop cluster (source: Wikipedia).
Figure 10
An illustration of the cloud computing paradigm.
Figure 11
Plots of the median errors in preserving the distances between pairs of data points versus the reduced dimension k in large-scale microarray data. Here “RP” stands for random projection and “PCA” stands for principal component analysis.
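A minimal sketch of the RP side of this comparison, assuming Gaussian input data in place of the paper's microarray data: project onto k dimensions with a scaled Gaussian random matrix and measure the relative error in pairwise distances.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5000                 # illustrative sizes (assumption)
X = rng.standard_normal((n, d))

def pdist(A):
    # All pairwise Euclidean distances between the rows of A.
    sq = (A * A).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.sqrt(np.maximum(d2, 0.0))

iu = np.triu_indices(n, k=1)
orig = pdist(X)[iu]

for k in (10, 50, 200, 1000):
    R = rng.standard_normal((d, k)) / np.sqrt(k)   # E||x R||^2 = ||x||^2
    err = np.abs(pdist(X @ R)[iu] / orig - 1.0)
    print(f"k = {k:4d}: median relative distance error = {np.median(err):.3f}")

The error shrinks roughly like 1/sqrt(k), independent of the ambient dimension d, which is why random projection scales well for Big Data.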

