Challenges of Big Data Analysis

Jianqing Fan et al. Natl Sci Rev. 2014 Jun;1(2):293-314.
doi: 10.1093/nsr/nwt032.

Abstract

Big Data bring new opportunities to modern society and new challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be found with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and how these features drive paradigm changes in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions underlying most statistical methods cannot be validated for Big Data because of incidental endogeneity; violating these assumptions can lead to wrong statistical inference and, consequently, wrong scientific conclusions.

Keywords: Big Data; data storage; high dimensional data; incidental endogeneity; large-scale optimization; massive data; massively parallel data processing; noise accumulation; random projection; scalability; spurious correlation.


Figures

Figure 1
Scatter plots of projections of the observed data (n = 100 from each class) onto the first two principal components of the best m-dimensional selected feature space. Projected data marked “•” belong to the first class and those marked “▲” belong to the second class.
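For intuition, a minimal Python sketch of the kind of experiment behind Figure 1 follows, assuming two Gaussian classes whose means differ only in the first 10 coordinates; the sizes, mean shift, and feature-selection rule (ranking by absolute two-sample mean gap) are illustrative assumptions, not the authors' exact setup. As m grows past the true sparsity, selected noise features accumulate and the two classes become harder to separate in the projected space.

import numpy as np

rng = np.random.default_rng(0)
n, d, s = 100, 1000, 10          # per-class sample size, dimension, sparsity

mu = np.zeros(d)
mu[:s] = 1.5                     # sparse mean shift for the second class
X1 = rng.standard_normal((n, d))        # class 1 ~ N(0, I)
X2 = rng.standard_normal((n, d)) + mu   # class 2 ~ N(mu, I)

def project_best_m(m):
    # Select the m features with the largest absolute mean gap, then
    # project the pooled data onto their first two principal components.
    gap = np.abs(X1.mean(axis=0) - X2.mean(axis=0))
    idx = np.argsort(gap)[::-1][:m]
    Z = np.vstack([X1[:, idx], X2[:, idx]])
    Z = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T                 # 2-D projection of all 2n points

for m in (2, 10, 100, 1000):
    P = project_best_m(m)
    sep = np.linalg.norm(P[:n].mean(axis=0) - P[n:].mean(axis=0))
    print(f"m = {m:4d}: class-mean separation in 2-D = {sep:.2f}")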
Figure 2
Illustration of spurious correlation. (a) Distribution of the maximum absolute sample correlation coefficient between X_1 and {X_j}_{j≠1}. (b) Distribution of the maximum absolute sample correlation coefficient between X_1 and the closest linear projection of any four members of {X_j}_{j≠1} onto X_1. Here the dimension d is 800 or 6,400 and the sample size n is 60. The results are based on 1,000 simulations.
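The simulation in panel (a) is easy to reproduce in outline: draw n = 60 i.i.d. standard normal observations of d variables, so any sample correlation with X_1 is spurious by construction, and record the maximum absolute correlation over each replication. A minimal numpy sketch:

import numpy as np

rng = np.random.default_rng(0)
n, n_sim = 60, 1000

for d in (800, 6400):
    max_corr = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, d))
        Xc = X - X.mean(axis=0)
        Xc /= np.linalg.norm(Xc, axis=0)   # standardize each column
        r = Xc[:, 1:].T @ Xc[:, 0]         # sample corr(X_1, X_j), j != 1
        max_corr[b] = np.abs(r).max()
    print(f"d = {d:4d}: median max |corr| = {np.median(max_corr):.2f}")

Even though all variables are independent, the median maximum correlation is far from zero and grows with d, which is the spurious-correlation phenomenon the figure illustrates.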
Figure 3
Illustration of incidental endogeneity on a microarray gene expression data set. Left panel: distribution of the sample correlations Corr^(X_j, Y) for j = 1, …, 12,718. Right panel: distribution of the sample correlations Corr^(X_j, ε^), where ε^ denotes the residual noise after the Lasso fit. We provide the distributions of the sample correlations using both the raw data and the permuted data.
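A minimal sketch of this diagnostic, on synthetic exogenous data (an assumption; the paper applies it to 12,718 microarray genes): fit the Lasso, compute the sample correlation between each covariate and the residuals, and compare against a permutation reference that breaks any covariate-residual dependence.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 120, 500                          # illustrative sizes (assumption)
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)    # exogenous noise by construction

resid = y - Lasso(alpha=0.1).fit(X, y).predict(X)
corrs = np.array([np.corrcoef(X[:, j], resid)[0, 1] for j in range(d)])

# Permutation reference: shuffling the residuals breaks any dependence
# between covariates and noise, as in the paper's permuted-data curves.
perm = rng.permutation(resid)
corrs_perm = np.array([np.corrcoef(X[:, j], perm)[0, 1] for j in range(d)])

print(f"raw : sd of corr(X_j, resid) = {corrs.std():.3f}")
print(f"perm: sd of corr(X_j, resid) = {corrs_perm.std():.3f}")
# On data with incidental endogeneity the raw distribution deviates
# visibly from the permutation reference; here the two should agree.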
Figure 4
Visualization of the penalty functions. In all cases, λ = 1. For SCAD and MCP, different values of γ are chosen, as shown in the graphs.
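For reference, the penalties shown in Figure 4 have standard closed forms; a short sketch (λ = 1 as in the figure, while γ = 3.7 for SCAD and γ = 2 for MCP are common choices assumed here, since the figure's exact values are not recoverable from the caption):

import numpy as np

def l1(t, lam=1.0):
    # Lasso (L1) penalty.
    return lam * np.abs(t)

def scad(t, lam=1.0, gamma=3.7):
    # SCAD penalty; requires gamma > 2.
    a = np.abs(t)
    quad = (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1))
    return np.where(a <= lam, lam * a,
                    np.where(a <= gamma * lam, quad,
                             lam**2 * (gamma + 1) / 2))

def mcp(t, lam=1.0, gamma=2.0):
    # Minimax concave penalty (MCP); requires gamma > 1.
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a**2 / (2 * gamma),
                    gamma * lam**2 / 2)

t = np.linspace(-4, 4, 9)
print(np.round(scad(t), 2))   # constant for |t| > gamma*lam: no bias on large signals
print(np.round(mcp(t), 2))

Unlike the L1 penalty, SCAD and MCP flatten out for large |t|, which is what reduces the estimation bias on strong signals.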
Figure 5
Diagnostics of the modeling assumptions of the FGMM on a microarray gene expression data set. Left panel: distribution of the sample correlations Corr^(X_j, ε^) for j = 1, …, 12,718. Right panel: distribution of the sample correlations Corr^(X_j, ε^) and Corr^(X_j^2, ε^) for only the 18 selected genes. Here ε^ denotes the residual noise after the FGMM fit.
Figure 6
An illustration of Cloudera's open-source Hadoop distribution (source: Cloudera website).
Figure 7
An illustration of the HDFS architecture.
Figure 8
An illustration of the MapReduce paradigm for the symbol-counting task. Mappers are applied to every element of the input sequences and emit intermediate (key, value) pairs. Reducers are applied to all values associated with the same key. Between the map and reduce stages are intermediate steps involving distributed sorting and grouping.
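A minimal single-machine sketch of this paradigm for the symbol-counting task (real deployments distribute the map, shuffle/sort, and reduce stages across a Hadoop cluster; here they are mimicked locally):

from itertools import groupby
from operator import itemgetter

def mapper(sequence):
    # Emit an intermediate (key, value) pair for every symbol seen.
    for symbol in sequence:
        yield (symbol, 1)

def reducer(key, values):
    # Combine all values associated with the same key.
    return (key, sum(values))

inputs = ["abca", "bbc", "cab"]

# Map stage: apply the mapper to every element of the input sequences.
intermediate = [kv for seq in inputs for kv in mapper(seq)]

# Shuffle stage: distributed sorting and grouping by key (local here).
intermediate.sort(key=itemgetter(0))
grouped = groupby(intermediate, key=itemgetter(0))

# Reduce stage: one reducer call per distinct key.
counts = [reducer(k, (v for _, v in group)) for k, group in grouped]
print(counts)   # [('a', 3), ('b', 4), ('c', 3)]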
Figure 9
A typical Hadoop cluster (source: Wikipedia).
Figure 10
An illustration of the cloud computing paradigm.
Figure 11
Plots of the median errors in preserving the distances between pairs of data points versus the reduced dimension k in large-scale microarray data. Here “RP” stands for random projection and “PCA” stands for principal component analysis.
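A minimal sketch of the RP side of this comparison, assuming Gaussian input data in place of the paper's microarray data: project onto k dimensions with a scaled Gaussian random matrix and measure the relative error in pairwise distances.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5000                 # illustrative sizes (assumption)
X = rng.standard_normal((n, d))

def pdist(A):
    # All pairwise Euclidean distances between the rows of A.
    sq = (A * A).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.sqrt(np.maximum(d2, 0.0))

iu = np.triu_indices(n, k=1)
orig = pdist(X)[iu]

for k in (10, 50, 200, 1000):
    R = rng.standard_normal((d, k)) / np.sqrt(k)   # E||x R||^2 = ||x||^2
    err = np.abs(pdist(X @ R)[iu] / orig - 1.0)
    print(f"k = {k:4d}: median relative distance error = {np.median(err):.3f}")

The error shrinks roughly like 1/sqrt(k), independent of the ambient dimension d, which is why random projection scales well for Big Data.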

