Scalable probabilistic PCA for large-scale genetic variation data

Aman Agrawal¹, Alec M Chiu², Minh Le³, Eran Halperin^{3,4,5,6,7}, Sriram Sankararaman^{3,4,6}

Affiliations

¹ Department of Computer Science, Indian Institute of Technology, Delhi, India.
² Bioinformatics Interdepartmental Program, University of California, Los Angeles, California, United States of America.
³ Department of Computer Science, University of California, Los Angeles, California, United States of America.
⁴ Department of Human Genetics, University of California, Los Angeles, California, United States of America.
⁵ Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, California, United States of America.
⁶ Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, California, United States of America.
⁷ Institute of Precision Health, University of California, Los Angeles, California, United States of America.

PMID: 32469896
PMCID: PMC7286535
DOI: 10.1371/journal.pgen.1008773

Aman Agrawal et al. PLoS Genet. 2020 May 29;16(5):e1008773. doi: 10.1371/journal.pgen.1008773. eCollection 2020 May.

Abstract

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
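The abstract states only that ProPCA is based on a probabilistic generative model and does not spell out the algorithm. As a rough illustration of how such a model can yield principal components, the sketch below runs a generic expectation-maximization (EM) iteration for PCA in the zero-noise limit of probabilistic PCA. This is a textbook-style sketch, not the authors' implementation: the function name `em_pca`, its parameters, and the SNPs-by-individuals layout of `Y` are assumptions made for illustration.

```python
import numpy as np

def em_pca(Y, k, n_iter=100, seed=0):
    """Estimate an orthonormal basis for the top-k PC subspace of Y.

    Y is assumed to be a (SNPs x individuals) matrix whose rows have
    already been mean-centered. Generic EM for PCA, for illustration only.
    """
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    C = rng.standard_normal((m, k))  # random initial loadings (m x k)
    for _ in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y, shape (k x n)
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: loadings C = Y X^T (X X^T)^{-1}, shape (m x k)
        C = np.linalg.solve(X @ X.T, X @ Y.T).T
    # Orthonormalize so the columns form a basis of the estimated subspace
    Q, _ = np.linalg.qr(C)
    return Q
```

Each iteration touches `Y` only through matrix products, which is why iterations of this kind can scale to large genotype matrices. The returned columns span the estimated subspace; recovering individual PCs in order would need a further small decomposition within that subspace.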

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. ProPCA is computationally efficient.**
Comparison of runtimes on simulated genotype data as the numbers of individuals and SNPs are varied. Figures 1a and 1b display the total runtime on datasets containing 100,000 SNPs, six subpopulations, F_st = 0.01, and between 10,000 and 1,000,000 individuals. We report the mean and standard deviation over ten trials. Figure 1b compares the runtimes of all algorithms excluding PLINK_SVD, which ran successfully only up to a sample size of 70,000. Figure 1c displays the total runtime on datasets containing 100,000 individuals, six subpopulations, F_st = 0.01, and between 10,000 and 1,000,000 SNPs. All methods were capped at a maximum of 100 hours and 64 GB of memory and were run using default settings. We were unable to include bigstatsr in the SNP benchmark as it does not allow for monomorphic SNPs.
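The caption does not say how the genotypes were simulated. One common way to generate structured genotype data with a given number of subpopulations and a target F_st is the Balding-Nichols model, sketched below at small scale. The choice of model, the function name `simulate_genotypes`, and the uniform range for ancestral allele frequencies are all assumptions for illustration and may differ from the authors' setup.

```python
import numpy as np

def simulate_genotypes(n_ind, n_snps, n_pops=6, fst=0.01, seed=0):
    """Simulate genotypes (0/1/2) under an assumed Balding-Nichols model."""
    rng = np.random.default_rng(seed)
    # Ancestral allele frequencies; the (0.05, 0.95) range is an assumption
    anc = rng.uniform(0.05, 0.95, size=n_snps)
    # Beta parameters giving mean `anc` and drift controlled by F_st
    a = anc * (1 - fst) / fst
    b = (1 - anc) * (1 - fst) / fst
    pop_freq = rng.beta(a, b, size=(n_pops, n_snps))
    # Assign each individual to a subpopulation uniformly at random
    labels = rng.integers(0, n_pops, size=n_ind)
    # Each genotype is the count of alternate alleles over two draws
    G = rng.binomial(2, pop_freq[labels])
    return G, labels

G, labels = simulate_genotypes(n_ind=200, n_snps=500)
```

At the sizes used in the benchmark, a dense integer matrix like `G` would not fit comfortably in memory, so this sketch is only meant for small-scale experimentation.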

**Fig 2. Principal components uncover population and geographic structure in the UK Biobank.**
We used ProPCA to compute PCs on the UK Biobank data. Figure 2a plots the first two principal components, which reveal population structure. Figure 2b shows geographic structure by plotting the scores of 276,736 unrelated White British individuals on the first principal component at their birth location coordinates.

**Fig 3. Selection scan for the first five principal components in White British individuals in the UK Biobank.**
A Manhattan plot with the −log₁₀ p values associated with the test of selection displayed for the first five principal components for the unrelated White British subset of the UK Biobank. The red line represents the Bonferroni adjusted significance level (α = 0.05). Significant loci are labeled. Signals above −log₁₀(p) = 18 were capped at this value for better visualization.
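The two thresholds in this caption can be made concrete with a small calculation. The sketch below computes a Bonferroni-adjusted significance line on the −log₁₀ scale and applies the display cap of 18. The helper names and the number of tests passed in the example are hypothetical, since the caption does not state how many tests the correction covers.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Return the Bonferroni-adjusted significance line on the -log10 scale."""
    return -math.log10(alpha / n_tests)

def capped_neglog10(p, cap=18.0):
    """Return -log10(p), capped at `cap` for plotting (cap also used for p <= 0)."""
    if p <= 0:
        return cap
    return min(-math.log10(p), cap)

# Hypothetical test count chosen only to illustrate the calculation
line = bonferroni_threshold(alpha=0.05, n_tests=100_000)
```

Capping changes only how very small p-values are drawn; it does not affect which loci pass the significance line.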
