Int J Mol Sci. 2023 Oct 11;24(20):15095.
doi: 10.3390/ijms242015095.

A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe

Anna Kloska et al. Int J Mol Sci. 2023.

Abstract

Data obtained by massively parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data have the potential to distinguish samples from different populations, especially adjacent populations of common origin. Machine learning (ML) techniques appear especially well suited to analyzing the large datasets that MPS produces. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, while being relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective was to evaluate empirically whether the genetic provenance of genetically similar individuals of Slavic descent can be discerned, with the overarching goal of categorizing DNA specimens from representatives of different Slavic populations. Raw sequencing data were pre-processed to obtain a 1200-character-long binary vector. Three classifiers were used: Random Forest, Support Vector Machine (SVM), and XGBoost. The most promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846 to 1.000 for all classes.

Keywords: SVM; biogeographic ancestry; biogeographic origin; machine learning.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Histogram evaluating the number of polymorphisms, based on GA results. For the chosen classifiers, the best results were obtained with 300 SNPs; adding further SNPs lowered the evaluation metrics.
Figure 2
Confusion matrices for each fold for the most-promising classifier. The presented matrices show the results aggregated across all five folds of the experiment and give an overview of the classification outcomes for all four classes, which helps in assessing the model's performance.
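
As a rough illustration of how per-fold confusion matrices can be aggregated and turned into the accuracy and per-class F1-scores quoted in the abstract, here is a minimal sketch. The counts are invented toy values (two folds instead of five), the function names are hypothetical, and the convention that rows are true classes and columns are predicted classes is an assumption; none of this is taken from the paper.

```python
from typing import List

Matrix = List[List[int]]


def aggregate_folds(folds: List[Matrix]) -> Matrix:
    """Sum per-fold confusion matrices element-wise."""
    n = len(folds[0])
    return [[sum(f[i][j] for f in folds) for j in range(n)] for i in range(n)]


def accuracy(cm: Matrix) -> float:
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_f1(cm: Matrix) -> List[float]:
    """F1 = 2*TP / (2*TP + FP + FN) for each class.

    Assumes rows are true classes and columns are predicted classes.
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy data: two folds, four classes (made-up counts, not the paper's results).
fold_1 = [[5, 0, 0, 0], [0, 4, 1, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
fold_2 = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 1, 4, 0], [0, 0, 0, 5]]

total_cm = aggregate_folds([fold_1, fold_2])
overall_accuracy = accuracy(total_cm)
f1_scores = per_class_f1(total_cm)
```

Summing the folds before computing metrics gives a single pooled score per class; averaging per-fold scores instead is an equally common choice and can give slightly different numbers.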
Figure 3
The pipeline of the proposed method. To prepare exome data in a form suitable for machine learning (ML) methods, the data went through a pre-processing and feature-selection stage, which aimed to select the key SNPs for classification.
Figure 4
Nationality distribution of the data. Three Slavic populations were tested along with a non-Slavic group consisting of samples of European origin from the 1000 Genomes Project [7].
Figure 5
Data pre-processing scheme. The exome sequence of each individual was transformed into a summary table in which each sample is represented by one row and each column represents one combination of SNP and genotype. If a sample carries that genotype at the given SNP, the value 1 is entered in the table; otherwise, the value 0 is entered.
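
The encoding described in this caption can be sketched as follows. This is only an illustration under stated assumptions: the sample IDs, SNP identifiers, genotype strings, and the `build_summary_table` helper are all hypothetical, and the real pipeline starts from exome sequencing data rather than a ready-made dictionary of genotype calls.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (SNP identifier, genotype)


def build_summary_table(
    samples: Dict[str, Dict[str, str]],
) -> Tuple[List[Column], Dict[str, List[int]]]:
    """Turn per-sample genotype calls into one binary row per sample.

    Each column is one (SNP, genotype) pair seen in at least one sample.
    A cell is 1 if the sample has that genotype at that SNP, else 0.
    """
    columns = sorted(
        {(snp, gt) for calls in samples.values() for snp, gt in calls.items()}
    )
    table = {
        sample_id: [1 if calls.get(snp) == gt else 0 for snp, gt in columns]
        for sample_id, calls in samples.items()
    }
    return columns, table


# Hypothetical genotype calls (invented identifiers and values).
samples = {
    "sample_A": {"rs0001": "A/A", "rs0002": "C/T"},
    "sample_B": {"rs0001": "A/G", "rs0002": "C/T"},
    "sample_C": {"rs0001": "A/A"},  # no call at rs0002
}

columns, table = build_summary_table(samples)
```

Here a missing call simply yields 0 in every column for that SNP; how the original study handled missing genotypes is not stated in this excerpt.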

References

    1. Boidot R., Dalens L., Niogret J., Kaderbhei C.G. Is there a role for large exome sequencing in the management of metastatic nonsmall cell lung cancer: A brief report of real life. Front. Oncol. 2022;12:863057.
    2. Nelis M., Esko T., Mägi R., Zimprich F., Zimprich A., Toncheva D., Karachanak S., Piskáčková T., Balaščák I., Peltonen L., et al. Genetic structure of Europeans: A view from the north–east. PLoS ONE. 2009;4:e5472. doi: 10.1371/journal.pone.0005472.
    3. Zou J., Huss M., Abid A., Mohammadi P., Torkamani A., Telenti A. A primer on deep learning in genomics. Nat. Genet. 2019;51:12–18. doi: 10.1038/s41588-018-0295-5.
    4. Sheehan S., Song Y.S. Deep learning for population genetic inference. PLoS Comput. Biol. 2016;12:e1004845. doi: 10.1371/journal.pcbi.1004845.
    5. Angermueller C., Pärnamaa T., Parts L., Stegle O. Deep learning for computational biology. Mol. Syst. Biol. 2016;12:878. doi: 10.15252/msb.20156651.