This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Sep 25:2023.11.17.567650.

doi: 10.1101/2023.11.17.567650.

Fast and accurate local ancestry inference with Recomb-Mix

Yuan Wei¹, Degui Zhi², Shaojie Zhang¹

Affiliations

¹ Department of Computer Science, University of Central Florida, Orlando, FL, USA.
² McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.

PMID: 38014185
PMCID: PMC10680832
DOI: 10.1101/2023.11.17.567650

Yuan Wei et al. bioRxiv. 2024.

Update in

Recomb-Mix: fast and accurate local ancestry inference.
Wei Y, Zhi D, Zhang S. Bioinformatics. 2025 Jul 1;41(Supplement_1):i180-i188. doi: 10.1093/bioinformatics/btaf227. PMID: 40662780.

Abstract

The availability of large genotyped cohorts brings new opportunities for revealing the high-resolution genetic structure of admixed populations via local ancestry inference (LAI), the process of identifying the ancestry of each segment of an individual haplotype. Though current methods achieve high accuracy in standard cases, LAI remains challenging when reference populations are closely related (e.g., intra-continental), when there are many reference populations, or when the admixture events are deep in time, all of which are increasingly unavoidable in large biobanks. Here, we present a new LAI method, Recomb-Mix. Recomb-Mix integrates elements of existing methods based on the site-based Li and Stephens model and introduces a new graph collapsing trick that simplifies counting paths with the same ancestry label readout. Through comprehensive benchmarking on various simulated datasets, we show that Recomb-Mix is more accurate than existing methods in a diverse set of scenarios while remaining competitive in resource efficiency. We expect that Recomb-Mix will be a useful method for advancing genetic studies of admixed populations.

Figures

**Figure 1:**
An example of local ancestry inference with Recomb-Mix. (A) $G$ is a population graph representing the HMM in Recomb-Mix, constructed from a given reference panel. $G$ contains seven haplotypes with eight sites belonging to two populations (shown in red and blue). $Q$ is a query haplotype of an admixed individual. (B) The transformation of the nodes at sites three and four in $G$ into the nodes at the corresponding sites in $G'$. Nodes in black boxes are the nodes at sites three and four in $G$. Nodes in green boxes are the nodes at sites three and four in $G'$. The filled red and blue nodes are population emission nodes at sites three and four. $r$ is the cross-population penalty. (C) $G'$ is the compact population graph transformed from $G$. $Q$ is assigned an estimated ancestry label at each site (shown in red and blue on the allele values), according to the threading path with the minimum penalty score (shown as bold edges) in $G'$.
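The threading step in panel (C) can be illustrated with a small dynamic program. This is only a minimal sketch under stated assumptions, not the authors' implementation: the compact graph is reduced to one allele set per population per site, staying within a population is free, switching populations costs a constant `r`, and a query allele absent from a population costs a hypothetical `mismatch` penalty. The function name, the data layout, and both penalty values are illustrative choices that do not come from the paper.

```python
from typing import List, Set


def thread_min_penalty(
    query: List[int],
    alleles: List[List[Set[int]]],
    r: float = 1.0,
    mismatch: float = 1.0,
) -> List[int]:
    """Return one ancestry label per site along a minimum-penalty path.

    alleles[s][p] is the set of alleles observed at site s among the
    reference haplotypes of population p (an assumed, simplified stand-in
    for the compact graph). r is the cross-population penalty; mismatch is
    a hypothetical penalty for a query allele not seen in the population.
    """
    n_sites = len(query)
    n_pops = len(alleles[0])

    def emit(s: int, p: int) -> float:
        return 0.0 if query[s] in alleles[s][p] else mismatch

    # cost[p]: minimum penalty of any path ending in population p so far.
    cost = [emit(0, p) for p in range(n_pops)]
    back: List[List[int]] = []
    for s in range(1, n_sites):
        best_prev = min(range(n_pops), key=lambda p: cost[p])
        new_cost, ptr = [], []
        for p in range(n_pops):
            stay = cost[p]
            switch = cost[best_prev] + r
            if stay <= switch:
                new_cost.append(stay + emit(s, p))
                ptr.append(p)
            else:
                new_cost.append(switch + emit(s, p))
                ptr.append(best_prev)
        cost = new_cost
        back.append(ptr)

    # Trace back from the cheapest final population.
    p = min(range(n_pops), key=lambda q: cost[q])
    labels = [p]
    for ptr in reversed(back):
        p = ptr[p]
        labels.append(p)
    labels.reverse()
    return labels


# Toy panel: population 0 carries allele 0 everywhere, population 1 allele 1.
panel = [[{0}, {1}] for _ in range(6)]
labels = thread_min_penalty([0, 0, 0, 1, 1, 1], panel, r=1.0)
```

In this toy case a single switch between sites three and four is cheaper than three mismatches, so the path changes population once. Raising `r` makes the path more reluctant to switch, which is why an isolated mismatch does not by itself trigger an ancestry change.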

**Figure 2:**
The squared Pearson’s correlation coefficient $r^{2}$ of three-way inter-continental simulated datasets on FLARE, G-Nomix, Loter, Recomb-Mix, RFMix, and SALAI-Net. Markers were filtered with minor allele frequency $\leq 0.005$ and minor allele count $\leq 50$. (A) The three-way 15-generation datasets with the reference panel sizes 100, 250, 500, and 1,000 (values are in Supplemental Table S3). (B) The three-way 500-reference datasets with the generations 15, 50, 100, and 200 (values are in Supplemental Table S4).
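For readers unfamiliar with the metric, one plausible way to compute a squared Pearson correlation between true and inferred ancestry is sketched below. The exact definition used in the benchmark is not given in this excerpt, so the choices here are assumptions: each population is turned into a 0/1 indicator per site, $r^{2}$ is computed per population, and the per-population values are averaged. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def squared_pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


def mean_ancestry_r2(truth: List[int], inferred: List[int], n_pops: int) -> float:
    """Average r^2 over populations, using 0/1 ancestry indicators per site.

    Populations whose indicator is constant in either vector are skipped
    (an assumed convention, since r^2 is undefined there).
    """
    values = []
    for p in range(n_pops):
        t = [1.0 if a == p else 0.0 for a in truth]
        i = [1.0 if a == p else 0.0 for a in inferred]
        r2 = squared_pearson(t, i)
        if not math.isnan(r2):
            values.append(r2)
    return sum(values) / len(values) if values else float("nan")


perfect = mean_ancestry_r2([0, 0, 1, 1], [0, 0, 1, 1], n_pops=2)
unrelated = mean_ancestry_r2([0, 0, 1, 1], [0, 1, 0, 1], n_pops=2)
```

Unlike a plain per-site accuracy rate, this measure is not inflated when one ancestry dominates the haplotype, because a constant prediction carries no correlation with the truth.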

**Figure 3:**
The squared Pearson’s correlation coefficient $r^{2}$ of seven-way inter-continental simulated datasets on FLARE, G-Nomix, Loter, Recomb-Mix, RFMix, and SALAI-Net. Markers were filtered with minor allele frequency $\leq 0.005$ and minor allele count $\leq 50$. (A) The seven-way 15-generation datasets with the reference panel sizes 250, 500, and 1,000 (values are in Supplemental Table S8). The reference panel size 100 case was not included because the number of markers was too small and may have influenced the outcome after the filtering. (B) The seven-way 500-reference datasets with the generations 15, 50, 100, and 200 (values are in Supplemental Table S9).

**Figure 4:**
Sample haplotypes inferred by FLARE, G-Nomix, Loter, Recomb-Mix, RFMix, and SALAI-Net with the ground truth of ancestry labels. (A) An inferred sample haplotype from a three-way 15-generation 500-reference inter-continental simulated dataset. (B) An inferred sample haplotype from a seven-way 15-generation 500-reference inter-continental simulated dataset.

**Figure 5:**
The squared Pearson’s correlation coefficient $r^{2}$ of three-way intra-continental simulated datasets on FLARE, G-Nomix, Loter, Recomb-Mix, RFMix, and SALAI-Net. Markers were filtered with minor allele frequency $\leq 0.005$ and minor allele count $\leq 50$. (A) The three-way 15-generation datasets with the reference panel sizes 250, 500, and 1,000 (values are in Supplemental Table S12). The reference panel size 100 case was not included because the number of markers was too small and may have influenced the outcome after the filtering. (B) The three-way 500-reference datasets with the generations 15, 50, 100, and 200 (values are in Supplemental Table S13).

**Figure 6:**
The performance of local ancestry inference with generations 15, 50, 100, and 200 of the three-way 500-misspecified-reference inter-continental simulated datasets on FLARE, G-Nomix, Loter, Recomb-Mix, RFMix, and SALAI-Net. (A) The squared Pearson’s correlation coefficient $r^{2}$ (values are in Supplemental Table S17). Markers were filtered with minor allele frequency $\leq 0.005$ and minor allele count $\leq 50$. (B) The average accuracy rates (values are in Supplemental Table S18).

**Figure 7:**
The average global ancestry proportions in the TGP Chromosome 18 data using four reference ancestries from the HGDP data. Descriptions of the populations are in Supplemental Table S1.

**Figure 8:**
The average global ancestry proportions in the HGDP Chromosome 18 data using four reference ancestries from the TGP data. Descriptions of the populations are in Supplemental Table S1.
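Average global ancestry proportions such as those in Figures 7 and 8 can be derived from per-site local ancestry labels. The sketch below assumes the simplest convention, which may differ from the one used in the paper: each haplotype's proportion for a population is the fraction of its sites carrying that label (sites are weighted equally rather than by genetic or physical length), and the reported value is the mean over haplotypes. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def global_proportions(labels: List[int], n_pops: int) -> List[float]:
    """Fraction of sites assigned to each population on one haplotype."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(p, 0) / total for p in range(n_pops)]


def average_proportions(haplotypes: List[List[int]], n_pops: int) -> List[float]:
    """Mean of the per-haplotype proportions across a group of haplotypes."""
    per_hap = [global_proportions(h, n_pops) for h in haplotypes]
    return [sum(h[p] for h in per_hap) / len(per_hap) for p in range(n_pops)]


# Two toy haplotypes with two reference ancestries (labels 0 and 1).
avg = average_proportions([[0, 0, 1, 1], [0, 0, 0, 0]], n_pops=2)
```

Weighting by segment length in centiMorgans instead of by site count would be a natural alternative when marker density is uneven along the chromosome.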

**Figure 9:**
The discrete AIM (dAIM) density in the HGDP dataset per population on Chromosome 18. Each bin is 1 centiMorgan (cM), showing the markers’ dAIM percentage.
