Genet Epidemiol. 2021 Feb;45(1):82-98.

doi: 10.1002/gepi.22356. Epub 2020 Sep 14.

locStra: Fast analysis of regional/global stratification in whole-genome sequencing studies

Georg Hahn¹, Sharon M Lutz¹, Julian Hecker², Dmitry Prokopenko³, Michael H Cho², Edwin K Silverman², Scott T Weiss², Christoph Lange¹; NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium

Affiliations

¹ Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA.
² Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA.
³ Massachusetts General Hospital, Harvard University, Boston, Massachusetts, USA.

PMID: 32929743
PMCID: PMC7856019
DOI: 10.1002/gepi.22356

Georg Hahn et al. Genet Epidemiol. 2021 Feb.

Abstract

locStra is an R package for the analysis of regional and global population stratification in whole-genome sequencing (WGS) studies, where regional stratification refers to the substructure defined by the loci in a particular region of the genome. Population substructure can be assessed based on the genetic covariance matrix, the genomic relationship matrix (GRM), and the unweighted/weighted genetic Jaccard similarity matrix. Using a sliding window approach, the regional similarity matrices are compared with the global ones, based on user-defined window sizes and metrics, for example, the correlation between regional and global eigenvectors. An algorithm for the specification of the window size is provided. As the implementation fully exploits sparse matrix algebra and is written in C++, the analysis is highly efficient. Even on a single core, for realistic study sizes (several thousand subjects, several million rare variants per subject), the runtime for the genome-wide computation of all regional similarity matrices typically does not exceed one hour, enabling an unprecedented investigation of regional stratification across the entire genome. The package is applied to three WGS studies, illustrating the varying patterns of regional substructure across the genome and its beneficial effects on association testing.

Keywords: population stratification; population substructure; regional analysis; similarity matrix; whole-genome sequencing.
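
The sliding-window comparison described in the abstract can be illustrated with a minimal sketch. It is not the locStra implementation: it is written in plain Python rather than R/C++, uses dense lists instead of sparse matrices, and assumes non-overlapping windows, the unweighted Jaccard matrix only, power iteration for the leading eigenvector, and the absolute Pearson correlation as the comparison metric. All function names (`jaccard_matrix`, `leading_eigenvector`, `pearson`, `scan`) are hypothetical and chosen for illustration.

```python
import math


def jaccard_matrix(genotypes):
    """Unweighted Jaccard similarity between subjects.

    genotypes: one row per subject, each a list of 0/1 flags that say
    whether the subject carries the variant at that locus.
    Assumption: two subjects with no variants at all get similarity 0.
    """
    carried = [{j for j, g in enumerate(row) if g} for row in genotypes]
    n = len(carried)
    sim = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            union = len(carried[a] | carried[b])
            value = len(carried[a] & carried[b]) / union if union else 0.0
            sim[a][b] = sim[b][a] = value
    return sim


def leading_eigenvector(matrix, iterations=500):
    """Approximate the leading eigenvector by power iteration."""
    n = len(matrix)
    vec = [1.0 + 0.01 * i for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # all-zero matrix: no direction to report
            return nxt
        vec = [x / norm for x in nxt]
    return vec


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def scan(genotypes, window_size):
    """Compare each regional leading eigenvector with the global one.

    Returns (start, stop, |correlation|) per window; the absolute value
    removes the arbitrary sign of an eigenvector.
    """
    global_vec = leading_eigenvector(jaccard_matrix(genotypes))
    n_variants = len(genotypes[0])
    results = []
    for start in range(0, n_variants, window_size):
        stop = min(start + window_size, n_variants)
        regional = [row[start:stop] for row in genotypes]
        regional_vec = leading_eigenvector(jaccard_matrix(regional))
        results.append((start, stop, abs(pearson(global_vec, regional_vec))))
    return results
```

Calling `scan(genotypes, window_size)` on a 0/1 subject-by-variant matrix returns one correlation per window. Windows with low correlation are candidates for regional substructure that differs from the global pattern. At realistic study sizes the dense loops above would be far too slow, which is why the abstract emphasizes sparse matrix algebra.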

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

**FIGURE 1**
Rare variants (RVs) for the super population EUR of the 1000 Genomes Project. Correlation of regional to global eigenvectors for chromosomes 5 (top left), 7 (top right), 12 (bottom left), and 16 (bottom right). Covariance matrix, GRM matrix, s-matrix, and Jaccard matrix. Window size 128,000 RVs

**FIGURE 2**
Common variants for the super population EUR of the 1000 Genomes Project. Correlation of regional to global eigenvectors for chromosomes 5 (top left), 7 (top right), 12 (bottom left), and 16 (bottom right). Covariance matrix, GRM matrix, s-matrix, and Jaccard matrix. Window size 100 RVs

**FIGURE 3**
Super population EUR of the 1000 Genomes Project. Runtime in seconds as a function of the window sizes across all chromosomes for the computation of the covariance matrix (top left), GRM matrix (top right), s-matrix (bottom left), and Jaccard matrix (bottom right). All plots show the minimal and maximal runtimes for any of the chromosomes, as well as the mean runtime averaged across all chromosomes. Logarithmic scale on the x- and y-axes

**FIGURE 4**
Super population EUR of the 1000 Genomes Project with British (GBR), Finnish (FIN), Iberian (IBS), Utah resident (CEU), and Toscani (TSI) subgroups. First two principal components for the GRM similarity matrix computed with PLINK2 (left) and *locStra* (right) for chromosome 1 (top) and chromosome 6 (bottom)

**FIGURE 5**
Super population EUR of the 1000 Genomes Project. Left: Mean correlation across all windows as a function of the window size. Right: Mean correlation across all windows multiplied by the logarithm of the number of windows, again as a function of the window size. Input data are the correlations between global and regional eigenvectors of the Jaccard matrices for different window sizes

**FIGURE 6**
Super population AFR of the 1000 Genomes Project. Setting as in Figure 5. Left: Mean correlation across all windows as a function of the window size. Right: Mean correlation across all windows multiplied by the logarithm of the number of windows, again as a function of the window size

**FIGURE 7**
Costa Rica population isolate. First two principal components for the Jaccard similarity matrix. All chromosomes combined. Three outliers are marked with red crosses

**FIGURE 8**
All 22 chromosomes of the Costa Rica population isolate. First two principal components for the Jaccard similarity matrix computed for the middle window of a stratification scan with window size 10⁵. Separate plot for each chromosome (starting with chromosome 1 in the top left corner and continuing in a row-wise fashion). The three outliers of Figure 7 are marked again in each subplot in red

**FIGURE 9**
Costa Rica population isolate. First two principal components for the Jaccard similarity matrix. All chromosomes combined. The samples contained in the branches of the 22 regional plots of Figure 8 are both colored by chromosome and labeled with their chromosome number

**FIGURE 10**
Q-Q plot for a chromosome-wide regression of all common SNPs using a global adjustment, a regional adjustment, and a combined global and regional adjustment according to the model of Equation (1). Chromosome 1 of the 1000 Genomes Project. Global and regional eigenvectors were computed on the covariance matrix
