Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 May 29;16(5):e1008773.
doi: 10.1371/journal.pgen.1008773. eCollection 2020 May.

Scalable probabilistic PCA for large-scale genetic variation data

Affiliations

Scalable probabilistic PCA for large-scale genetic variation data

Aman Agrawal et al. PLoS Genet. .

Abstract

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. ProPCA is computationally efficient.
Comparison of runtimes over simulated genotype data varied over individuals and SNPs. Figures 1a and 1b display the total runtime containing 100, 000 SNPs, six subpopulations, Fst = 0.01 and individuals varying from 10, 000 to 1, 000, 000. We report the mean and standard deviation over ten trials. Figure 1b compares the runtimes of all algorithms excluding PLINK_SVD which could only run successfully up to a sample size of 70, 000. Figure 1c displays the total runtime containing 100, 000 individuals, six subpopulations, Fst = 0.01, and SNPs varying from 10, 000 to 1, 000, 000. All methods were capped to a maximum of 100 hours and a maximum memory of 64 GB and run using default settings. We were unable to include bigstatsr in the SNP benchmark as it does not allow for monomorphic SNPs.
Fig 2
Fig 2. Principal components uncover population and geographic structure in the UK Biobank.
We used ProPCA to compute PCs on the UK Biobank data. Figure 2a shows the first two principal components to reveal population structure. Figure 2b shows geographic structure by plotting the score of 276, 736 unrelated White British individuals on the first principal component on their birth location coordinates.
Fig 3
Fig 3. Selection scan for the first five principal components in the white British individuals in the UK Biobank.
A Manhattan plot with the −log10 p values associated with the test of selection displayed for the first five principal components for the unrelated White British subset of the UK Biobank. The red line represents the Bonferroni adjusted significance level (α = 0.05). Significant loci are labeled. Signals above −log10(p) = 18 were capped at this value for better visualization.

References

    1. Novembre J, Ramachandran S. Perspectives on human population structure at the cusp of the sequencing era. Annual review of genomics and human genetics. 2011;12:245–274. - PubMed
    1. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko A, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456(7219):274. - PMC - PubMed
    1. Yang WY, Novembre J, Eskin E, Halperin E. A model-based approach for analysis of spatial structure in genetic data. Nature genetics. 2012;44(6):725–731. 10.1038/ng.2285 - DOI - PMC - PubMed
    1. Baran Y, Quintela I,Carracedo Á, Pasaniuc B, Halperin E. Enhanced localization of genetic samples through linkage-disequilibrium correction. The American Journal of Human Genetics. 2013;92(6):882–894. 10.1016/j.ajhg.2013.04.023 - DOI - PMC - PubMed
    1. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nature reviews Genetics. 2010;11(7):459 10.1038/nrg2813 - DOI - PMC - PubMed

Publication types

Substances