Nucleic Acids Res. 2023 May 8;51(8):e44.
doi: 10.1093/nar/gkad149.

Rye: genetic ancestry inference at biobank scale

Andrew B Conley et al. Nucleic Acids Res.

Abstract

Biobank projects are generating genomic data for many thousands of individuals. Computational methods are needed to handle these massive datasets, including genetic ancestry (GA) inference tools. Current methods for GA inference do not scale to biobank-size genomic datasets. We present Rye, a new algorithm for GA inference at biobank scale. We compared the accuracy and runtime performance of Rye to the widely used RFMix, ADMIXTURE and iAdmix programs and applied it to a dataset of 488,221 genome-wide variant samples from the UK Biobank. Rye infers GA based on principal component analysis of genomic variant samples from ancestral reference populations and query individuals. The algorithm's accuracy is powered by Metropolis-Hastings optimization and its speed is provided by non-negative least squares regression. Rye produces highly accurate GA estimates for three-way admixed populations (African, European and Native American) compared to RFMix and ADMIXTURE ($R^2 = 0.998$ to $1.00$), and shows a 50× runtime improvement compared to ADMIXTURE on the UK Biobank dataset. Rye analysis of UK Biobank samples demonstrates how it can be used to infer GA at both continental and subcontinental levels. We discuss user considerations and options for the use of Rye; the program and its documentation are distributed on the GitHub repository: https://github.com/healthdisparities/rye.

Figures

Figure 1.
Overview of the Rye algorithm. Rye utilizes eigenvectors (PC vectors) and eigenvalues generated by PCA of reference and query individual genome-wide variant samples (left panel). Ancestry group-representative PC vectors are weighted via Metropolis-Hastings optimization of ancestry group-mean PC vectors. Non-negative least squares regression (NNLS) is used to estimate GA fractions via comparison of query individual PC vectors and the weighted ancestry group representative PC vectors.
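
The NNLS step described in the Figure 1 caption can be illustrated with a minimal sketch. It is not Rye's implementation: the solver below is a plain projected-gradient method written for self-containment (a real pipeline would more likely call an established routine such as `scipy.optimize.nnls`), the PC coordinates are made-up toy values, the function names are hypothetical, and rescaling the coefficients so they sum to 1 is an assumption about how fractions are reported.

```python
def nnls_projected_gradient(columns, target, n_iter=2000):
    """Minimise ||A x - b||^2 subject to x >= 0 by projected gradient descent.

    columns: column vectors of A, one per ancestry group (toy PC vectors).
    target:  the vector b, one query individual's PC coordinates.
    """
    n_groups = len(columns)
    # 1 / trace(A^T A) never exceeds 1 / (largest eigenvalue of A^T A),
    # so gradient steps on 0.5 * ||A x - b||^2 cannot diverge.
    step = 1.0 / sum(v * v for col in columns for v in col)
    x = [0.0] * n_groups
    for _ in range(n_iter):
        # residual r = A x - b
        residual = [
            sum(x[j] * columns[j][i] for j in range(n_groups)) - target[i]
            for i in range(len(target))
        ]
        # gradient g = A^T r
        grad = [sum(c * r for c, r in zip(col, residual)) for col in columns]
        # gradient step, then projection onto the non-negative orthant
        x = [max(0.0, xj - step * gj) for xj, gj in zip(x, grad)]
    return x


def ancestry_fractions(columns, target):
    """Rescale non-negative coefficients to sum to 1 (assumed convention)."""
    coefs = nnls_projected_gradient(columns, target)
    total = sum(coefs)
    return [c / total for c in coefs] if total > 0 else coefs


# Hypothetical group-representative PC vectors (4 PCs each), for illustration only.
reference = {
    "African": [0.9, 0.1, 0.0, 0.1],
    "European": [-0.3, 0.8, 0.1, 0.0],
    "Native American": [-0.2, -0.4, 0.9, 0.1],
}
# Toy query built as 0.5 * African + 0.3 * European + 0.2 * Native American.
query = [0.32, 0.21, 0.21, 0.07]

fractions = ancestry_fractions(list(reference.values()), query)
for group, fraction in zip(reference, fractions):
    print(f"{group}: {fraction:.3f}")
```

Because the toy query is an exact mixture of the three reference vectors, the recovered fractions match the mixing proportions used to build it. With real data the fit is approximate, and the non-negativity constraint keeps small negative coefficients from appearing as negative ancestry.
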
Figure 2.
Accuracy and runtime performance. (A) GA estimates for African (blue), European (orange), and Native American (red) ancestry are compared for Rye (y-axis) and RFMix (x-axis) for n = 2190 Admixed American and reference individuals. (B) Accuracy of Rye measured by residual sum of squares (RSS) and $R^2$ across a range of optimization rounds and iterations. (C) Runtime performance of Rye across a range of optimization rounds and iterations.
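
The two accuracy measures named in panel (B) can be sketched as follows. The caption does not say how they were computed, so this is an assumption: the residual sum of squares is taken over paired per-individual estimates, and the second measure is taken to be the squared Pearson correlation. The function names and the four paired values are hypothetical and are not results from the paper.

```python
def residual_sum_of_squares(estimates, reference):
    """Sum of squared differences between paired ancestry estimates."""
    return sum((e - r) ** 2 for e, r in zip(estimates, reference))


def r_squared(estimates, reference):
    """Squared Pearson correlation between paired estimates (assumed definition)."""
    n = len(estimates)
    mean_e = sum(estimates) / n
    mean_r = sum(reference) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimates, reference))
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov * cov / (var_e * var_r)


# Hypothetical ancestry fractions for four individuals from two methods.
method_a = [0.10, 0.52, 0.79, 0.33]
method_b = [0.12, 0.50, 0.80, 0.30]

rss = residual_sum_of_squares(method_a, method_b)
r2 = r_squared(method_a, method_b)
print(f"RSS = {rss:.4f}, R^2 = {r2:.4f}")
```

A low RSS indicates that the two methods agree in absolute terms, while a high squared correlation only indicates that they rank individuals consistently, so the two measures are complementary.
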
Figure 3.
GA inference on the UK Biobank (UKBB). (A) PCA of UKBB participants (gray) and ancestry group reference samples (colored as shown). (B) PCA of UKBB participants labeled by self-identified ethnicity (colored as shown). (C) Ancestry and admixture patterns for UKBB participants, organized by self-identified ethnicity groups. Ancestry fractions (colored as shown) are indicated for each individual. The White ethnic group is not shown to scale owing to its large size; all other groups are scaled based on the number of participants. (D) Runtime comparison for ADMIXTURE and Rye, decomposed into model building (the optimization step for Rye) and GA projection steps.
Figure 4.
Fine-scale GA inference with Rye. Results for UKBB, 1KGP and Native American reference and query individuals are shown. GA estimates for (A) East Asian, (B) South Asian, (C) African and (D) European query individuals from UKBB are shown along with 1KGP and Native American reference populations.
