Fast principal component analysis of large-scale genome-wide data
- PMID: 24718290
- PMCID: PMC3981753
- DOI: 10.1371/journal.pone.0093766
Abstract
Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years, and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not scale adequately. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
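To illustrate the general idea behind randomized PCA, the following is a minimal NumPy sketch of a randomized range finder with power iterations, in the spirit of the randomized methods of Halko et al. listed in the references. It is not flashpca's implementation. The function name `randomized_pca`, its parameters (`n_oversamples`, `n_power_iter`, `seed`), and the simulated two-population genotype matrix are all assumptions made for illustration only.

```python
import numpy as np


def randomized_pca(X, n_components=2, n_oversamples=10, n_power_iter=4, seed=0):
    """Approximate the top principal components of X (samples x features).

    Hypothetical helper for illustration; not part of flashpca.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)  # centre each column (e.g. each SNP)
    n_samples, n_features = Xc.shape
    k = min(n_components + n_oversamples, n_samples, n_features)

    # Project the data onto k random directions, then orthonormalise.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((n_features, k)))

    # Power (subspace) iterations sharpen the estimate of the leading subspace.
    for _ in range(n_power_iter):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)

    # Exact SVD of the small (k x n_features) matrix B = Q^T Xc.
    B = Q.T @ Xc
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small

    scores = U[:, :n_components] * s[:n_components]      # projected samples
    variances = s[:n_components] ** 2 / (n_samples - 1)  # component variances
    return scores, Vt[:n_components].T, variances


# Simulated data (an assumption, not from the paper): 200 individuals drawn
# from two populations with different allele frequencies at 500 SNPs.
rng = np.random.default_rng(1)
freqs = rng.uniform(0.1, 0.9, size=(2, 500))
labels = np.repeat([0, 1], 100)
genotypes = rng.binomial(2, freqs[labels]).astype(float)  # entries in {0, 1, 2}

scores, loadings, variances = randomized_pca(genotypes, n_components=2)

# Reference value from a full SVD, feasible here only because the data are small.
s_exact = np.linalg.svd(genotypes - genotypes.mean(axis=0), compute_uv=False)
exact_top_variance = s_exact[0] ** 2 / (genotypes.shape[0] - 1)
```

In this sketch the only full-size operations are matrix products with the centred data, while the decomposition itself is performed on a matrix with `n_components + n_oversamples` rows. That is the usual reason randomized approaches scale better than a full eigen-decomposition when only the top few components are needed.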
Figures
… 50,000 subsets, and smartpca was stopped after 100,000 sec. The results shown are averages over 3 runs. Results for N ≤ 15,000 are based on subsamples of the original dataset (N = 16,002; light blue background), whereas results for N ≥ 50,000 are based on duplicating the original samples (light yellow background).
References
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909.
- Halko N, Martinsson PG, Shkolnisky Y, Tygert M (2011) An Algorithm for the Principal Component Analysis of Large Data Sets. SIAM Journal on Scientific Computing 33: 2580–2594.
- Halko N, Martinsson PG, Tropp JA (2011) Finding Structure with Randomness: Probabilistic Algorithms for Matrix Decompositions. SIAM Review 53: 217–288.
