Genet Epidemiol. 2021 Feb;45(1):82-98.
doi: 10.1002/gepi.22356. Epub 2020 Sep 14.

locStra: Fast analysis of regional/global stratification in whole-genome sequencing studies

Georg Hahn et al. Genet Epidemiol. 2021 Feb.

Abstract

locStra is an R package for the analysis of regional and global population stratification in whole-genome sequencing (WGS) studies, where regional stratification refers to the substructure defined by the loci in a particular region of the genome. Population substructure can be assessed based on the genetic covariance matrix, the genomic relationship matrix, and the unweighted/weighted genetic Jaccard similarity matrix. Using a sliding window approach, the regional similarity matrices are compared with the global ones, based on user-defined window sizes and metrics, for example, the correlation between regional and global eigenvectors. An algorithm for the specification of the window size is provided. As the implementation fully exploits sparse matrix algebra and is written in C++, the analysis is highly efficient. Even on single cores, for realistic study sizes (several thousand subjects, several million rare variants per subject), the runtime for the genome-wide computation of all regional similarity matrices typically does not exceed one hour, enabling an unprecedented investigation of regional stratification across the entire genome. The package is applied to three WGS studies, illustrating the varying patterns of regional substructure across the genome and its beneficial effects on association testing.

Keywords: population stratification; population substructure; regional analysis; similarity matrix; whole-genome sequencing.
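
The sliding-window comparison described in the abstract can be illustrated with a small, self-contained sketch. This is not the locStra implementation or its API (the package is written in R and C++ with sparse matrix algebra); it is a plain-Python toy under several assumptions: each subject is represented as a set of carried variant indices, only the unweighted Jaccard matrix is used, windows are non-overlapping, the leading eigenvector is obtained by power iteration, and the metric is the Pearson correlation between regional and global leading eigenvectors. All function names are hypothetical.

```python
import math

def jaccard_matrix(carriers):
    """Unweighted Jaccard similarity between subjects.

    carriers[i] is the set of variant indices carried by subject i.
    Assumption: two empty sets are given similarity 0.
    """
    n = len(carriers)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            union = len(carriers[i] | carriers[j])
            s = len(carriers[i] & carriers[j]) / union if union else 0.0
            J[i][j] = J[j][i] = s
    return J

def leading_eigenvector(M, iters=200):
    """Approximate the dominant eigenvector of a symmetric matrix by power iteration."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(M[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all-zero matrix: no direction to converge to
            return v
        v = [x / norm for x in w]
    return v

def correlation(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / math.sqrt(va * vb)

def stratification_scan(carriers, n_variants, window):
    """Compare each regional Jaccard matrix with the global one.

    Returns a list of (start, stop, correlation) tuples, one per
    non-overlapping window of `window` consecutive variant indices.
    """
    global_vec = leading_eigenvector(jaccard_matrix(carriers))
    results = []
    for start in range(0, n_variants, window):
        stop = min(start + window, n_variants)
        regional = [{v for v in c if start <= v < stop} for c in carriers]
        regional_vec = leading_eigenvector(jaccard_matrix(regional))
        results.append((start, stop, correlation(global_vec, regional_vec)))
    return results

# Toy data: four subjects, eight variants (hypothetical values).
toy = [{0, 1, 2, 3}, {0, 1, 2}, {4, 5, 6, 7}, {5, 6}]
scan = stratification_scan(toy, n_variants=8, window=4)
```

A window in which the correlation is far from the values seen elsewhere would be a candidate region where local substructure departs from the genome-wide pattern. The dense nested loops here scale quadratically in the number of subjects per window and are meant only to make the logic explicit.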

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

FIGURE 1
Rare variants (RVs) for the super population EUR of the 1000 Genomes Project. Correlation of regional to global eigenvectors for chromosomes 5 (top left), 7 (top right), 12 (bottom left), and 16 (bottom right). Covariance matrix, GRM matrix, s-matrix, and Jaccard matrix. Window size 128,000 RVs
FIGURE 2
Common variants for the super population EUR of the 1000 Genomes Project. Correlation of regional to global eigenvectors for chromosomes 5 (top left), 7 (top right), 12 (bottom left), and 16 (bottom right). Covariance matrix, GRM matrix, s-matrix, and Jaccard matrix. Window size 100 RVs
FIGURE 3
Super population EUR of the 1000 Genomes Project. Runtime in seconds as a function of the window sizes across all chromosomes for the computation of the covariance matrix (top left), GRM matrix (top right), s-matrix (bottom left), and Jaccard matrix (bottom right). All plots show the minimal and maximal runtimes for any of the chromosomes, as well as the mean runtime averaged across all chromosomes. Logarithmic scale on the x- and y-axes
FIGURE 4
Super population EUR of the 1000 Genomes Project with British (GBR), Finnish (FIN), Iberian (IBS), Utah resident (CEU), and Toscani (TSI) subgroups. First two principal components for the GRM similarity matrix computed with PLINK2 (left) and locStra (right) for chromosome 1 (top) and chromosome 6 (bottom)
FIGURE 5
Super population EUR of the 1000 Genomes Project. Left: Mean correlation across all windows as a function of the window size. Right: Mean correlation across all windows multiplied by the logarithm of the number of windows, again as a function of the window size. Input data are the correlations between global and regional eigenvectors of the Jaccard matrices for different window sizes
FIGURE 6
Super population AFR of the 1000 Genomes Project. Setting as in Figure 5. Left: Mean correlation across all windows as a function of the window size. Right: Mean correlation across all windows multiplied by the logarithm of the number of windows, again as a function of the window size
FIGURE 7
Costa Rica population isolate. First two principal components for the Jaccard similarity matrix. All chromosomes combined. Three outliers are marked with red crosses
FIGURE 8
All 22 chromosomes of the Costa Rica population isolate. First two principal components for the Jaccard similarity matrix computed for the middle window of a stratification scan with window size 105. Separate plot for each chromosome (starting with chromosome 1 in the top left corner and continuing in a row-wise fashion). The three outliers of Figure 7 are marked again in each subplot in red
FIGURE 9
Costa Rica population isolate. First two principal components for the Jaccard similarity matrix. All chromosomes combined. The samples contained in the branches of the 22 regional plots of Figure 8 are both colored by chromosome and labeled with their chromosome number
FIGURE 10
Q-Q plot for a chromosome-wide regression of all common SNPs using a global, a regional, and a combined global and regional adjustment according to the model of Equation (1). Chromosome 1 of the 1000 Genomes Project. Global and regional eigenvectors were computed on the covariance matrix


Publication types