Bioinformatics. 2020 Aug 15;36(16):4449-4457.
doi: 10.1093/bioinformatics/btaa520.

Efficient toolkit implementing best practices for principal component analysis of population genetic data

Florian Privé et al. Bioinformatics. 2020.

Abstract

Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.

Results: For example, we find that PC19-PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16-18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest to anyone using PCA in their analyses of genetic data, as well as of other omics data.
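The shrinkage bias mentioned above can be made concrete with a small sketch. It is written in plain Python for illustration only and does not use or reproduce the bigsnpr API. It assumes genotypes have already been centered and scaled in the same way as the data used to compute the PCA, and it assumes that corrected scores for the same individuals are already available from some bias-correcting method. The function names, and the choice of a no-intercept least-squares slope to summarize the bias as one coefficient per PC, are hypothetical.

```python
def simple_projection(genotypes, loadings):
    """Project one individual onto each PC by a plain dot product.

    `genotypes` is assumed to be centered and scaled already;
    `loadings` holds one list of per-variant weights per PC.
    """
    return [sum(g * w for g, w in zip(genotypes, pc)) for pc in loadings]


def shrinkage_coefficient(simple_scores, corrected_scores):
    """No-intercept least-squares slope of corrected scores on simple scores.

    A value above 1 means the simple projections are shrunk towards zero.
    """
    numerator = sum(s * c for s, c in zip(simple_scores, corrected_scores))
    denominator = sum(s * s for s in simple_scores)
    if denominator == 0:
        raise ValueError("simple scores are all zero; the slope is undefined")
    return numerator / denominator


# Toy data (hypothetical): four variants, two PCs.
toy_loadings = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
toy_scores = simple_projection([1.0, -1.0, 2.0, 0.0], toy_loadings)

# Toy scores on one PC for three individuals, before and after correction.
toy_coef = shrinkage_coefficient([2.0, -4.0, 1.0], [3.0, -6.0, 1.5])
```

In this toy case the corrected scores are exactly 1.5 times the simple ones, so the estimated coefficient is 1.5. Real coefficients would be estimated separately for each PC.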

Availability and implementation: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.

Supplementary information: Supplementary data are available at Bioinformatics online.
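One of the pitfalls named in the Motivation is uneven population sizes. A common way to reduce the imbalance before computing PCA is to randomly subsample the over-represented groups. The sketch below is a minimal illustration in plain Python, not the authors' code: the function name, the per-group cap and the group labels are all hypothetical.

```python
import random
from collections import defaultdict


def subsample_groups(labels, max_per_group, seed=0):
    """Return sorted indices, keeping at most `max_per_group` per label."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    members_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        members_by_label[label].append(index)

    kept = []
    for members in members_by_label.values():
        if len(members) > max_per_group:
            # Only over-represented groups are randomly reduced.
            members = rng.sample(members, max_per_group)
        kept.extend(members)
    return sorted(kept)


# Hypothetical labels: one large group and two small ones.
example_labels = ["British"] * 8 + ["Irish"] * 3 + ["Other"] * 2
example_kept = subsample_groups(example_labels, max_per_group=3)
```

Groups at or below the cap are kept whole, so only the largest groups lose individuals.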

Figures

Fig. 1.
PC loadings 1–40 reported by the UK Biobank. Column indices of variants in the data, ordered by chromosome and physical position, are represented on the x-axis, and the value of loadings is represented on the y-axis. Points are hex-binned.
Fig. 2.
Outlier detection in the 1000 Genomes (1000G) project, using prob_dist (Section 3.4). (a) Distribution of statistics (S). (b) PC scores 13–20 of 1000G, colored by the statistic (S) used to define outliers. A few points with higher values for this statistic S appear as outliers in PC17–PC20. (c) PC scores 13–20 of 1000G, colored by being detected as an outlier. Threshold of being an outlier is determined based on histogram (a). (Color version of this figure is available at Bioinformatics online.)
Fig. 3.
PC scores 1–8 of the 1000 Genomes project. Black points are the 60% of individuals used for computing PCA. Red points are the remaining 40% of individuals, projected by simply multiplying their genotypes by the corresponding PC loadings. Blue points are the same remaining 40% of individuals, projected using the OADP transformation. Estimated shrinkage coefficients for these eight PCs are 1.01 (PC1), 1.02, 1.06, 1.09, 1.50 (PC5), 1.69, 1.98 and 1.39. (Color version of this figure is available at Bioinformatics online.)
Fig. 4.
PC scores 27–50 computed on the UK Biobank using 48 942 individuals of diverse ancestries. These individuals are the ones resulting from removing all related individuals and randomly subsampling the British and Irish individuals. Different colors represent different self-reported ancestries. (Color version of this figure is available at Bioinformatics online.)
Fig. 5.
Proposed pipeline for computing PCs using R packages bigsnpr and bigutilsr
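The caption of Fig. 2 describes choosing an outlier threshold from the distribution of a per-individual statistic S. As a rough illustration of that idea, the sketch below applies Tukey's upper fence (third quartile plus 1.5 times the interquartile range) to a list of scores. This is a standard rule swapped in for the example and is not necessarily the rule used by prob_dist or by the packages. The function names and toy values are hypothetical.

```python
import statistics


def tukey_upper_fence(values, k=1.5):
    """Upper fence Q3 + k * IQR, with quartiles from the standard library."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def flag_outliers(values, k=1.5):
    """Flag every value that lies strictly above the upper fence."""
    fence = tukey_upper_fence(values, k)
    return [value > fence for value in values]


# Toy statistic values (hypothetical): seven typical scores and one extreme.
toy_statistic = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 5.0]
toy_flags = flag_outliers(toy_statistic)
```

Only the last toy value is flagged. With strongly skewed statistics a plain fence may flag too many points, so a skewness-aware rule may be preferable in practice.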
