Front Genet. 2022 Oct 3:13:971242.
doi: 10.3389/fgene.2022.971242. eCollection 2022.

A machine learning approach for missing persons cases with high genotyping errors

Meng Huang et al. Front Genet. 2022.

Abstract

Estimating the relationships between individuals is a fundamental challenge in many fields. In particular, relationship estimation can provide valuable information for missing persons cases. The recently developed investigative genetic genealogy approach uses high-density single nucleotide polymorphisms (SNPs) to determine close and more distant relationships, in which hundreds of thousands to tens of millions of SNPs are generated either by microarray genotyping or whole-genome sequencing. Current studies usually assume that the SNP profiles were generated with minimal errors. However, in missing persons cases, the DNA samples can be highly degraded, and the SNP profiles generated from these samples usually contain many errors. In this study, a machine learning approach was developed for estimating relationships from high-error SNP profiles. In this approach, a hierarchical classification strategy was employed to first classify the relationships by degree and then classify the relationship types within each degree separately. For each classification, feature selection was implemented to improve performance. Both simulated and real data sets with various genotyping error rates were used to evaluate this approach, which was more accurate and more robust than the individual measures for SNP profiles with genotyping errors. In addition, the highest accuracy was obtained when the training and test sets had the same genotyping error rate; estimating the genotyping error rate of the SNP profiles is therefore critical to obtaining high accuracy in relationship estimation.
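
The two-stage strategy described in the abstract (classify the degree first, then the relationship type within that degree, each stage with its own feature subset) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class names, the three toy feature columns, and all numeric values are hypothetical, and a simple nearest-centroid classifier stands in for the Random Forest and Support Vector Machine models named in the figure captions so that the example needs only the Python standard library.

```python
from collections import defaultdict
from math import dist


class NearestCentroid:
    """Stand-in classifier (not the paper's model): picks the closest class mean."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, row):
        return min(self.centroids_, key=lambda lab: dist(row, self.centroids_[lab]))


class HierarchicalKinshipClassifier:
    """Stage 1 predicts the degree; stage 2 predicts the type within that degree."""

    def __init__(self, degree_features, type_features):
        self.degree_features = degree_features  # column indices used in stage 1
        self.type_features = type_features      # {degree: column indices for stage 2}

    @staticmethod
    def _select(row, columns):
        return [row[c] for c in columns]

    def fit(self, X, degrees, types):
        self.degree_model_ = NearestCentroid().fit(
            [self._select(r, self.degree_features) for r in X], degrees)
        self.type_models_ = {}
        for degree in set(degrees):
            cols = self.type_features[degree]
            pairs = [(r, t) for r, d, t in zip(X, degrees, types) if d == degree]
            self.type_models_[degree] = NearestCentroid().fit(
                [self._select(r, cols) for r, _ in pairs], [t for _, t in pairs])
        return self

    def predict_one(self, row):
        degree = self.degree_model_.predict_one(
            self._select(row, self.degree_features))
        cols = self.type_features[degree]
        return degree, self.type_models_[degree].predict_one(self._select(row, cols))


# Hypothetical toy features per pair: [kinship-like value, k0-like value, extra measure].
X = [
    [0.250, 0.01, 0.30], [0.240, 0.02, 0.32],   # parent-child
    [0.250, 0.25, 0.31], [0.260, 0.27, 0.29],   # full siblings
    [0.125, 0.50, 0.80], [0.120, 0.52, 0.78],   # half siblings
    [0.130, 0.49, 0.20], [0.125, 0.50, 0.22],   # grandparent-grandchild
]
degrees = [1, 1, 1, 1, 2, 2, 2, 2]
types = ["parent-child"] * 2 + ["full-sibling"] * 2 \
      + ["half-sibling"] * 2 + ["grandparent-grandchild"] * 2

model = HierarchicalKinshipClassifier(
    degree_features=[0], type_features={1: [1], 2: [2]}
).fit(X, degrees, types)

print(model.predict_one([0.23, 0.22, 0.30]))
print(model.predict_one([0.14, 0.50, 0.25]))
```

Training one model per degree lets each stage rely on the measures that separate its own classes, which is the motivation the abstract gives for running feature selection separately for each classification.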

Keywords: feature selection; genetic genealogy; genotyping error; hierarchical classification; kinship estimation; machine learning; missing person; single nucleotide polymorphisms.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
The relationships among the 8 UTAH/CEPH cell line samples (filled in with blue color).
FIGURE 2
Experimental design and workflow of the whole study. The hierarchical classification was implemented with the simulation data, but not the real data, as the sample size of the real data was too small.
FIGURE 3
Classification algorithm comparisons between Random Forest (RF) and Support Vector Machine (SVM). Two algorithms (left four plots for RF; right four plots for SVM) were employed to conduct forward feature selection with 10-fold cross-validation for relationship degree, relationship types within the 1st degree, relationship types within the 2nd degree, and relationship types within the 3rd degree. Different genotyping error rates were presented with different colors. The x-axis is the number of selected measures (or features) in each step of the forward selection. GER = genotyping error rate of the test dataset.
FIGURE 4
Feature selection for classifying relationship degrees using data simulated with various genotyping errors. (A) The accuracies by the forward feature selection with various genotyping errors, in which the ranking of the features differed between genotyping error rates (GERs) (e.g., the first features were K1 and j7 with GER = 0 and 0.1, respectively), and (B) the counts of the commonly selected features across all genotyping errors (e.g., K1 was selected in the feature selections with all six genotyping errors). The features to the left of the red dashed line were selected as final features. Final = classification with the selected top-performing features; GER = genotyping error rate of the test dataset.
FIGURE 5
Classification accuracies with the selected features ranked as in Figures 4 and 6 for relationship degree and types using data simulated with various genotyping errors. GER = genotyping error rate of the test dataset. (A) The accuracies by the forward feature selection for the relationship degrees, (B) the accuracies by the forward feature selection for the 1st degree relationships, (C) the accuracies by the forward feature selection for the 2nd degree relationships, (D) the accuracies by the forward feature selection for the 3rd degree relationships.
FIGURE 6
Forward feature selections and classification accuracies for relationship types using data simulated with various genotyping error rates. (A) The accuracies for the 1st degree relationships, (B) the accuracies for the 2nd degree relationships, (C) the accuracies for the 3rd degree relationships, (D) the counts of the selected features for the 1st degree relationships across all genotyping errors (e.g., K1 and K0 were selected in the feature selections with all six genotyping errors), (E) the counts of the selected features for the 2nd degree relationships across all genotyping errors, and (F) the counts of the selected features for the 3rd degree relationships across all genotyping errors. The features to the left of the red dashed line were selected as final features. Final = classification with the final features; GER = genotyping error rate of the test dataset.
FIGURE 7
The impact of missing data on relationship degree classification. The label of the x-axis represents missing rates with two different sets of features (top-performing features vs. K1). GER = genotyping error rate of the test dataset.
FIGURE 8
The effect of genotyping error rate of the training dataset in relationship type classification. (A) the accuracies with the final selected features for the 1st degree relationships in the training and test datasets with different genotyping error rates, (B) the accuracies with the final selected features for the 2nd degree relationships in the training and test datasets with different genotyping error rates, (C) the accuracies with the final selected features for the 3rd degree relationships in the training and test datasets with different genotyping error rates. The labels of the x-axis (A–C) represent the different genotyping error rates of the test dataset. The labels of the y-axis (A–C) represent classification accuracy. GER-train = genotyping error rate of the training dataset.
FIGURE 9
The accuracies of classifying relationship degrees with the UTAH family in Figure 1. The accuracies were estimated with the final selected features for the relationship degrees in the simulated training and real test datasets (UTAH family) with different genotyping error rates. (A) Test dataset with 419K SNPs, (B) test dataset with 4M SNPs. The labels of the x-axis (A,B) represent the different genotyping error rates of the training dataset. The labels of the y-axis (A,B) represent classification accuracy. In the legend, GER denotes the genotyping error rate of the test dataset. The GERs of the test datasets are represented using different colors and icons. GEDmatch denotes the accuracies obtained from the GEDmatch website. K1 denotes the accuracies obtained from KING-robust.
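
The forward feature selection with 10-fold cross-validation mentioned in the Figure 3 caption can be sketched as a greedy loop that, at each step, adds the measure giving the highest cross-validated accuracy. This is only an illustration under stated assumptions: the function names and the toy data are hypothetical, the folds are formed by simple interleaving rather than by any scheme described in the source, and a nearest-centroid rule is swapped in for the Random Forest and SVM classifiers so that the example runs with the standard library alone.

```python
from math import dist


def centroid_predict(train_X, train_y, row):
    """Stand-in classifier: return the label whose class mean is closest to row."""
    sums, counts = {}, {}
    for x, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}
    return min(centroids, key=lambda lab: dist(row, centroids[lab]))


def cv_accuracy(X, y, columns, k):
    """Pooled accuracy across k interleaved folds, using only the given columns."""
    Xs = [[row[c] for c in columns] for row in X]
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(Xs), k))
        train_X = [x for i, x in enumerate(Xs) if i not in test_idx]
        train_y = [lab for i, lab in enumerate(y) if i not in test_idx]
        correct += sum(
            centroid_predict(train_X, train_y, Xs[i]) == y[i] for i in test_idx)
    return correct / len(Xs)


def forward_select(X, y, k=10):
    """Greedy forward selection; returns [(selected columns, accuracy), ...] per step."""
    remaining = list(range(len(X[0])))
    selected, history = [], []
    while remaining:
        best = max(remaining, key=lambda c: cv_accuracy(X, y, selected + [c], k))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), cv_accuracy(X, y, selected, k)))
    return history


# Hypothetical toy data: 20 pairs, two classes, three made-up measures.
X, y = [], []
for i in range(20):
    first_degree = i % 2 == 0
    y.append("1st" if first_degree else "2nd")
    X.append([
        ((i // 2) % 5) / 10,                                     # uninformative
        (0.25 if first_degree else 0.125) + (i % 3) * 0.005,     # separates classes
        (0.5 if first_degree else 0.4) + ((i // 2) % 4) * 0.1,   # overlapping classes
    ])

history = forward_select(X, y, k=10)
for columns, accuracy in history:
    print(len(columns), columns, round(accuracy, 3))
```

Each entry of `history` corresponds to one point on the x-axis of such a plot (the number of selected measures), so the accuracy curve can be read directly from the returned list.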
