PLoS Comput Biol. 2008 Jun 13;4(6):e1000093.

doi: 10.1371/journal.pcbi.1000093.

Machine-learning approaches for classifying haplogroup from Y chromosome STR data

Joseph Schlecht¹, Matthew E Kaplan, Kobus Barnard, Tatiana Karafet, Michael F Hammer, Nirav C Merchant

PMID: 18551166
PMCID: PMC2396484
DOI: 10.1371/journal.pcbi.1000093

Joseph Schlecht et al. PLoS Comput Biol. 2008.

Affiliation

¹ Computer Science Department, University of Arizona, Tucson, Arizona, USA.

Abstract

Genetic variation on the non-recombining portion of the Y chromosome contains information about the ancestry of male lineages. Because of their low rate of mutation, single nucleotide polymorphisms (SNPs) are the markers of choice for unambiguously classifying Y chromosomes into related sets of lineages known as haplogroups, which tend to show geographic structure in many parts of the world. However, performing the large number of SNP genotyping tests needed to properly infer haplogroup status is expensive and time consuming. A novel alternative for assigning a sampled Y chromosome to a haplogroup is presented here. We show that by applying modern machine-learning algorithms we can infer with high accuracy the proper Y chromosome haplogroup of a sample by scoring a relatively small number of Y-linked short tandem repeats (STRs). Learning is based on a diverse ground-truth data set comprising pairs of SNP test results (haplogroup) and corresponding STR scores. We apply several independent machine-learning methods in tandem to learn formal classification functions. The result is an integrated high-throughput analysis system that automatically classifies large numbers of samples into haplogroups in a cost-effective and accurate manner.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Frequency of 30 haplogroups determined by SNP-typing a geographically diverse sample of 8,414 chromosomes.**
This set of chromosomes, typed at 15 Y-linked STRs, was used as a ground-truth training set (see text for explanation). Haplogroups are named according to the mutation-based nomenclature, which retains the major haplogroup information (i.e., 18 capital letters) followed by the name of the terminal mutation that the sample is positive for (see Figure S1).

**Figure 2. Average accuracy of each classifier per haplogroup for cross-validation on the ground-truth training set.**
Standard error bars are shown for each point. The lower panel shows the frequency of the 8,414 samples among haplogroups. Support vector machines have the best overall performance, especially for haplogroups with fewer samples in the training data.

**Figure 3. Average accuracy of the tandem approach for cross-validation on the ground-truth training set.**
The average proportion of samples with agreement for all four classification methods is also shown. The haplogroups with the highest rate of tandem disagreement have a low representation in the training data.
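The agreement measure in Figure 3 can be illustrated with a minimal sketch. The abstract does not spell out how the tandem approach combines the classifiers, so the rule below (accept a haplogroup only when every classifier returns the same label, otherwise flag the sample) is an assumption, and the function names `tandem_call` and `agreement_rate` and the labels `hg1`/`hg2` are hypothetical.

```python
from typing import Optional, Sequence


def tandem_call(predictions: Sequence[str]) -> Optional[str]:
    """Return the shared label if every classifier agrees, else None.

    Assumed combination rule: unanimous agreement is required.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    first = predictions[0]
    return first if all(p == first for p in predictions) else None


def agreement_rate(per_sample_predictions: Sequence[Sequence[str]]) -> float:
    """Proportion of samples on which all classifiers agree."""
    if not per_sample_predictions:
        raise ValueError("need at least one sample")
    agreed = sum(tandem_call(p) is not None for p in per_sample_predictions)
    return agreed / len(per_sample_predictions)


# Two samples, four classifiers each; labels are placeholders.
calls = [
    ["hg1", "hg1", "hg1", "hg1"],  # unanimous
    ["hg1", "hg2", "hg1", "hg1"],  # one dissent, so the sample is flagged
]
rate = agreement_rate(calls)
```

Under this rule, samples without unanimous agreement would be the natural candidates for confirmatory SNP typing.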

**Figure 4. Frequency of 30 Y chromosome haplogroups inferred from a previously published sample of 1,527 Asian Y chromosomes.**
The samples were typed with 9 Y-STRs and a battery of Y-linked SNPs. Haplogroup frequencies are statistically significantly different from those in our ground-truth training set (Figure 1). Haplogroups are named according to the mutation-based nomenclature, which retains the major haplogroup information (i.e., 18 capital letters) followed by the name of the terminal mutation that the sample is positive for (see Figure S1).

**Figure 5. Average accuracy of each classifier per haplogroup for cross-validation on the 9-locus public STR data.**
Standard error bars are shown for each point. The lower panel shows the frequency of haplogroups in the 1,527-sample public data set.

**Figure 6. Average accuracy and agreement of the tandem approach for cross-validation on the 9-locus public STR data.**

**Figure 7. Test creation process for decision trees.**
Samples from four haplogroups in data set $A$ are passed through locus-specific allele test conditions at each branch of the decision tree. The test for locus $X_i$ is chosen so that $i = \arg\max_l \mathrm{IG}(A, X_l)$ and $X_j$ so that $j = \arg\max_l \mathrm{IG}(B_1, X_l)$.
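The locus selection in Figure 7 can be sketched as follows, reading IG as the usual entropy-based information gain. This is an illustration under stated assumptions rather than the authors' implementation: it splits on the exact allele value at each locus (the caption does not say whether tests are equality or threshold conditions), and the names `entropy`, `information_gain` and `best_locus` and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the haplogroup labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(alleles: Sequence[int], labels: Sequence[Hashable]) -> float:
    """IG(A, X_l): label entropy minus the weighted entropy after
    partitioning the samples by their allele at locus l."""
    n = len(labels)
    groups: Dict[int, List[Hashable]] = {}
    for allele, label in zip(alleles, labels):
        groups.setdefault(allele, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_locus(samples: Sequence[Sequence[int]], labels: Sequence[Hashable]) -> int:
    """Index i = argmax_l IG(A, X_l) over all loci."""
    n_loci = len(samples[0])
    return max(
        range(n_loci),
        key=lambda l: information_gain([s[l] for s in samples], labels),
    )


# Toy data: two loci, two placeholder haplogroups.
samples = [(10, 14), (10, 15), (11, 14), (11, 15)]
labels = ["hg1", "hg1", "hg2", "hg2"]
chosen = best_locus(samples, labels)
```

In the toy data the first locus separates the two labels perfectly, so it is selected; the same function would then be applied recursively to each resulting subset, such as $B_1$ in the caption.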

**Figure 8. Bayesian likelihood construction and evaluation.**
For each haplogroup, the density functions $f_1, \ldots, f_L$ are constructed as normalized histograms from the training data. Given a sample $x = (x_1, \ldots, x_L)$, its likelihood under a haplogroup is the product of its evaluated locus bin frequencies.
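The construction in Figure 8 can be sketched in a few lines. The code below is a minimal illustration, assuming one histogram bin per integer allele value and no smoothing (neither detail is given in the caption); `fit_histograms`, `likelihood` and the toy data are hypothetical names and values.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

Histogram = Dict[int, float]  # allele value -> relative frequency


def fit_histograms(
    samples: Sequence[Sequence[int]], labels: Sequence[str]
) -> Dict[str, List[Histogram]]:
    """Build normalized per-locus histograms f_1..f_L for each haplogroup."""
    by_group: Dict[str, List[Sequence[int]]] = defaultdict(list)
    for x, y in zip(samples, labels):
        by_group[y].append(x)
    model: Dict[str, List[Histogram]] = {}
    for hg, xs in by_group.items():
        n, n_loci = len(xs), len(xs[0])
        model[hg] = [
            {a: c / n for a, c in Counter(x[l] for x in xs).items()}
            for l in range(n_loci)
        ]
    return model


def likelihood(model: Dict[str, List[Histogram]], hg: str, x: Sequence[int]) -> float:
    """Product over loci of the bin frequency f_l(x_l) under haplogroup hg."""
    p = 1.0
    for f_l, x_l in zip(model[hg], x):
        p *= f_l.get(x_l, 0.0)  # unseen allele: frequency 0 (no smoothing assumed)
    return p


# Toy training set: two loci, placeholder haplogroup labels.
train = [(10, 14), (10, 15), (11, 14)]
train_labels = ["hg1", "hg1", "hg2"]
model = fit_histograms(train, train_labels)
p_hg1 = likelihood(model, "hg1", (10, 14))
```

Because an allele never seen in a haplogroup's training samples drives the whole product to zero, a practical version would likely add some form of smoothing; that choice is outside what the caption describes.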

**Figure 9. Maximum margin hyperplane used in support vector machines.**
Example showing the hyperplane with maximal margin of separation between samples from two different haplogroups. The shaded points lying on the margin define the support vectors.
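For reference, the maximal-margin hyperplane in Figure 9 corresponds to the standard textbook hard-margin formulation below. The symbols $w$, $b$, $x_i$ and $y_i \in \{-1, +1\}$ are conventional notation, not taken from the source, and the classifier used in the paper may differ (for example, a soft margin or a kernel).

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \ldots, n .
\end{aligned}
```

The margin width is $2 / \lVert w \rVert$, and the support vectors are the samples for which the constraint holds with equality, that is, the shaded points lying on the margin.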

**Figure 10. Y chromosome haplogroup hierarchy.**
Only the top-level haplogroups are shown.
