BMC Bioinformatics. 2024 Feb 20;25(1):76.
doi: 10.1186/s12859-024-05703-y.

SNVstory: inferring genetic ancestry from genome sequencing data


Audrey E Bollas et al. BMC Bioinformatics. 2024.

Abstract

Background: Genetic ancestry, inferred from genomic data, is a quantifiable biological parameter. While much of the human genome is identical across populations, it is estimated that as much as 0.4% of the genome can differ due to ancestry. This variation is primarily characterized by single nucleotide variants (SNVs), which are often unique to specific genetic populations. Knowledge of a patient's genetic ancestry can inform clinical decisions, from genetic testing and health screenings to medication dosages, based on ancestral disease predispositions. Nevertheless, the current reliance on self-reported ancestry can introduce subjectivity and exacerbate health disparities. While genomic sequencing data enables objective determination of a patient's genetic ancestry, existing approaches are limited to ancestry inference at the continental level.

Results: To address this challenge and create an objective, measurable metric of genetic ancestry, we present SNVstory, a method built upon three independent machine learning models for accurately inferring the sub-continental ancestry of individuals. We also introduce a novel method for simulating individual samples from the aggregate allele frequencies of known populations. SNVstory includes a feature-importance scheme, unique among open-source ancestry tools, which allows the user to track the ancestral signal contributed by a given gene or locus. We evaluated SNVstory using a clinical exome sequencing dataset, comparing self-reported ethnicity and race to our inferred genetic ancestry, and demonstrated the capability of the algorithm to estimate ancestry from 36 different populations with high accuracy.

Conclusions: SNVstory represents a significant advance in methods to assign genetic ancestry, opening the door to ancestry-informed care. SNVstory, an open-source model, is packaged as a Docker container for enhanced reliability and interoperability. It can be accessed from https://github.com/nch-igm/snvstory.

Keywords: Genetic ancestry prediction; Genetic variation; Machine learning; Model interpretation; Personalized medicine.


Conflict of interest statement

No competing interests: AEB, AR, DC, JBG, and PW. ERM: Qiagen N.V., supervisory board member, honorarium, and stock-based compensation. Singular Genomics Systems, Inc., board of directors, honorarium, and stock-based compensation.

Figures

Fig. 1
Schematic of ancestry inference model strategy. The workflow visualizes each dataset separately with colored boxes and arrows: gnomAD (blue), 1kGP (yellow), and SGDP (red). For the gnomAD synthetic-based matrix, allele frequencies for each variant for each population given in gnomAD are used to create a distribution of reference, heterozygous and homozygous alleles for each population. A matrix format is created by converting the distributions into 0s, 1s, and 2s for each locus for samples in each population. For 1kGP and SGDP, a matrix format is built directly from variants in the VCF. For the model architecture, continental model labels are shown in white boxes, and the number of labels in the corresponding subcontinental models is below in brackets
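The synthetic-matrix step in the Fig. 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `simulate_genotypes`, the example frequencies, and the use of Hardy-Weinberg proportions to turn an allele frequency into genotype probabilities are all assumptions made for illustration. Each simulated sample gets a 0 (homozygous reference), 1 (heterozygous), or 2 (homozygous alternate) at every locus.

```python
import numpy as np


def simulate_genotypes(alt_freqs, n_samples, seed=0):
    """Draw an (n_samples, n_variants) matrix of 0/1/2 genotype codes.

    ASSUMPTION (not stated in the source): genotypes follow Hardy-Weinberg
    proportions, i.e. P(0) = (1-p)^2, P(1) = 2p(1-p), P(2) = p^2, where p is
    the population's alternate-allele frequency at that variant.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(alt_freqs, dtype=float)
    # Two independent allele draws per locus: Binomial(2, p) yields exactly
    # the Hardy-Weinberg genotype proportions listed above.
    return rng.binomial(2, p, size=(n_samples, p.size))


# Hypothetical allele frequencies for two made-up populations at 4 variants.
pop_freqs = {
    "pop_a": [0.05, 0.50, 0.90, 0.00],
    "pop_b": [0.60, 0.10, 0.20, 1.00],
}

# Stack the simulated samples into a feature matrix X with labels y,
# the shape a downstream classifier would expect.
X = np.vstack([simulate_genotypes(f, 100, seed=i)
               for i, f in enumerate(pop_freqs.values())])
y = np.repeat(list(pop_freqs.keys()), 100)
print(X.shape, y.shape)
```

Drawing from a binomial keeps the sketch short; an explicit three-way categorical draw per locus would be equivalent under the same assumption.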
Fig. 2
Continental ancestry inference model performance. A–D Confusion matrices of the 1kGP model using SGDP as validation (A), SGDP model using 1kGP as validation (B), gnomAD model using 1kGP as validation (C), and gnomAD model using SGDP as validation (D). E Macro-averaged ROC curves. F Macro-averaged precision–recall curves
Fig. 3
Gene-level global feature importance in ancestry inference using SNVstory’s gnomAD continental model. This figure illustrates the mean absolute SHAP values aggregated for each gene, derived from 2800 training samples. The analysis highlights the top 20 genes that significantly influence ancestry inference, emphasizing the role of specific alleles in determining ancestry labels
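The gene-level aggregation described in the Fig. 3 caption can be sketched as follows. This is an illustration under stated assumptions rather than SNVstory's code: the function name `gene_importance` and the toy inputs are hypothetical, the SHAP values are taken as an already-computed samples-by-variants array for a single ancestry label, and per-gene aggregation is assumed to be a sum of each variant's mean absolute SHAP value (the caption does not say how variants are combined).

```python
from collections import defaultdict

import numpy as np


def gene_importance(shap_values, variant_genes, top_k=20):
    """Rank genes by aggregated mean absolute SHAP value.

    shap_values   : (n_samples, n_variants) array of SHAP values.
    variant_genes : gene name for each variant column (hypothetical mapping).
    ASSUMPTION: a gene's score is the SUM of its variants' mean |SHAP|.
    """
    shap_values = np.asarray(shap_values, dtype=float)
    # Mean absolute SHAP value of each variant across all samples.
    per_variant = np.abs(shap_values).mean(axis=0)
    totals = defaultdict(float)
    for gene, value in zip(variant_genes, per_variant):
        totals[gene] += float(value)
    # Highest-scoring genes first, truncated to the requested number.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy example: 2 samples, 3 variants, two of which map to the same gene.
toy_shap = [[1.0, -2.0, 3.0],
            [-1.0, 2.0, -1.0]]
toy_genes = ["GENE_A", "GENE_B", "GENE_A"]
ranking = gene_importance(toy_shap, toy_genes)
print(ranking)
```

Summing favors genes with many informative variants; averaging within a gene would instead favor genes whose individual variants are strong, so the choice changes the ranking.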
Fig. 4
SNVstory ancestry report. The representative output of model results from SNVstory for a European sample taken from the 1kGP dataset

