Machine-learning approaches for classifying haplogroup from Y chromosome STR data

Joseph Schlecht et al. PLoS Comput Biol. 2008 Jun 13;4(6):e1000093.
doi: 10.1371/journal.pcbi.1000093.

Abstract

Genetic variation on the non-recombining portion of the Y chromosome contains information about the ancestry of male lineages. Because of their low rate of mutation, single nucleotide polymorphisms (SNPs) are the markers of choice for unambiguously classifying Y chromosomes into related sets of lineages known as haplogroups, which tend to show geographic structure in many parts of the world. However, performing the large number of SNP genotyping tests needed to properly infer haplogroup status is expensive and time consuming. A novel alternative for assigning a sampled Y chromosome to a haplogroup is presented here. We show that by applying modern machine-learning algorithms we can infer with high accuracy the proper Y chromosome haplogroup of a sample by scoring a relatively small number of Y-linked short tandem repeats (STRs). Learning is based on a diverse ground-truth data set comprising pairs of SNP test results (haplogroup) and corresponding STR scores. We apply several independent machine-learning methods in tandem to learn formal classification functions. The result is an integrated high-throughput analysis system that automatically classifies large numbers of samples into haplogroups in a cost-effective and accurate manner.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Frequency of 30 haplogroups determined by SNP-typing a geographically diverse sample of 8,414 chromosomes.
This set of chromosomes, typed at 15 Y-linked STRs, was used as a ground-truth training set (see text for explanation). Haplogroups are named according to the mutation-based nomenclature, which retains the major haplogroup information (i.e., 18 capital letters) followed by the name of the terminal mutation that the sample is positive for (see Figure S1).
Figure 2. Average accuracy of each classifier per haplogroup for cross-validation on the ground-truth training set.
Standard error bars are shown for each point. The lower panel shows the frequency of the 8,414 samples among haplogroups. The support vector machine classifier has the best overall performance, especially for haplogroups with a smaller number of samples in the training data.
Figure 3. Average accuracy of the tandem approach for cross-validation on the ground-truth training set.
The average proportion of samples with agreement for all four classification methods is also shown. The haplogroups with the highest rate of tandem disagreement have a low representation in the training data.
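
The agreement statistic in this caption can be illustrated with a minimal sketch. It assumes that a tandem call is accepted only when every classifier returns the same haplogroup, which is one reading of "agreement for all four classification methods"; the function names and the toy predictions are hypothetical and do not come from the paper.

```python
def tandem_call(predictions):
    """Return the shared haplogroup if every classifier agrees, else None.

    Assumption: "tandem agreement" is treated as unanimous agreement.
    """
    first = predictions[0]
    return first if all(p == first for p in predictions) else None


def agreement_rate(per_sample_predictions):
    """Proportion of samples on which all classifiers agree."""
    calls = [tandem_call(p) for p in per_sample_predictions]
    return sum(c is not None for c in calls) / len(calls)


# Hypothetical predictions from four classifiers for four samples.
toy_predictions = [
    ["R", "R", "R", "R"],  # unanimous
    ["J", "J", "I", "J"],  # one classifier disagrees
    ["E", "E", "E", "E"],  # unanimous
    ["Q", "R", "Q", "Q"],  # one classifier disagrees
]

rate = agreement_rate(toy_predictions)
```

Under this reading, samples without a unanimous call would be flagged for follow-up rather than assigned a haplogroup.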
Figure 4. Frequency of 30 Y chromosome haplogroups inferred from a previously published sample of 1,527 Asian Y chromosomes.
The samples were typed with 9 Y-STRs and a battery of Y-linked SNPs. Haplogroup frequencies are statistically significantly different from those in our ground-truth training set (Figure 1). Haplogroups are named according to the mutation-based nomenclature, which retains the major haplogroup information (i.e., 18 capital letters) followed by the name of the terminal mutation that the sample is positive for (see Figure S1).
Figure 5. Average accuracy of each classifier per haplogroup for cross-validation on the 9-locus public STR data.
Standard error bars are shown for each point. The lower panel shows the frequency of haplogroups in the 1,527 sample public data set.
Figure 6. Average accuracy and agreement of the tandem approach for cross-validation on the 9-locus public STR data.
Figure 7. Test creation process for decision trees.
Samples from four haplogroups in data set A are passed through locus-specific allele test conditions at each branch of the decision tree. The test for locus $X_i$ is chosen so that $i = \arg\max_l \{IG(A, X_l)\}$, and $X_j$ so that $j = \arg\max_l \{IG(B_1, X_l)\}$.
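
The locus selection rule in this caption can be sketched as follows, reading IG as information gain (the reduction in label entropy after a split). This is an illustrative sketch only: it assumes each test splits on the exact allele value at a locus, which may differ from the test conditions used in the paper, and the helper names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of haplogroup labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(samples, labels, locus):
    """Entropy reduction from splitting the samples on the allele at `locus`."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[locus], []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_locus(samples, labels):
    """Index l that maximizes IG(samples, X_l), i.e. the arg max in the caption."""
    n_loci = len(samples[0])
    return max(range(n_loci), key=lambda l: information_gain(samples, labels, l))


# Hypothetical toy data: two loci, two haplogroups.
toy_samples = [(13, 24), (13, 25), (14, 24), (14, 25)]
toy_labels = ["R", "R", "J", "J"]

chosen = best_locus(toy_samples, toy_labels)
```

In the toy data the first locus separates the two haplogroups completely, so it is selected; the same rule would then be applied recursively to each resulting subset, as with $B_1$ in the caption.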
Figure 8. Bayesian likelihood construction and evaluation.
For each haplogroup, the density functions $f_1, \ldots, f_L$ are constructed as normalized histograms from the training data. Given a sample $x = (x_1, \ldots, x_L)$, its likelihood under a haplogroup is the product of its evaluated locus bin frequencies.
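
A minimal sketch of this construction, assuming one histogram bin per observed allele value and no smoothing (so an allele never seen in training gives a likelihood of zero). The paper may treat unseen alleles or priors differently; the function names and training tuples below are hypothetical.

```python
from collections import Counter


def fit_histograms(training):
    """Build one normalized allele histogram per locus for a single haplogroup.

    `training` is a list of equal-length tuples of STR repeat counts.
    """
    n = len(training)
    n_loci = len(training[0])
    return [
        {allele: count / n
         for allele, count in Counter(s[l] for s in training).items()}
        for l in range(n_loci)
    ]


def likelihood(sample, histograms):
    """Product over loci of the bin frequency of the sample's allele."""
    p = 1.0
    for allele, hist in zip(sample, histograms):
        p *= hist.get(allele, 0.0)  # unseen allele: frequency 0 (no smoothing)
    return p


# Hypothetical training samples for one haplogroup, typed at two loci.
training_R = [(13, 24), (13, 24), (13, 25), (14, 24)]
hists_R = fit_histograms(training_R)

seen = likelihood((13, 24), hists_R)    # 0.75 * 0.75
unseen = likelihood((15, 24), hists_R)  # allele 15 never observed at locus 0
```

Classification would then compare this likelihood, weighted by any prior, across all haplogroups and pick the largest.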
Figure 9. Maximum margin hyperplane used in support vector machines.
Example showing the hyperplane with maximal margin of separation between samples from two different haplogroups. The shaded points lying on the margin define the support vectors.
Figure 10. Y chromosome haplogroup hierarchy.
Only the top-level haplogroups are shown.
