PLoS One. 2011 May 12;6(5):e14802.

doi: 10.1371/journal.pone.0014802.

Genetic classification of populations using supervised learning

Michael Bridges¹, Elizabeth A Heron, Colm O'Dushlaine, Ricardo Segurado; International Schizophrenia Consortium (ISC); Derek Morris, Aiden Corvin, Michael Gill, Carlos Pinto

Collaborators

International Schizophrenia Consortium (ISC):
Derek W Morris, Colm O'Dushlaine, Elaine Kenny, Emma M Quinn, Michael Gill, Aiden Corvin, Michael C O'Donovan, George K Kirov, Nick J Craddock, Peter A Holmans, Nigel M Williams, Lucy Georgieva, Ivan Nikolov, N Norton, H Williams, Draga Toncheva, Vihra Milanova, Michael J Owen, Christina M Hultman, Paul Lichtenstein, Emma F Thelander, Patrick Sullivan, Andrew McQuillin, Khalid Choudhury, Susmita Datta, Jonathan Pimm, Srinivasa Thirumalai, Vinay Puri, Robert Krasucki, Jacob Lawrence, Digby Quested, Nicholas Bass, Hugh Gurling, Caroline Crombie, Gillian Fraser, Soh Leh Kuan, Nicholas Walker, David St Clair, Douglas H R Blackwood, Walter J Muir, Kevin A McGhee, Ben Pickard, Pat Malloy, Alan W Maclean, Margaret Van Beck, Naomi R Wray, Peter M Visscher, Stuart Macgregor, Michele T Pato, Helena Medeiros, Frank Middleton, Celia Carvalho, Christopher Morley, Ayman Fanous, David Conti, James A Knowles, Carlos Paz Ferreira, Antonio Macedo, M Helena Azevedo, Carlos N Pato, Jennifer L Stone, Douglas M Ruderfer, Manuel A R Ferreira, Shaun M Purcell, Jennifer L Stone, Kimberly Chambert, Douglas M Ruderfer, Finny Kuruvilla, Stacey B Gabriel, Kristin Ardlie, Mark J Daly, Edward M Scolnick, Pamela Sklar

Affiliation

¹ Astrophysics Group, Cavendish Laboratory, Cambridge, United Kingdom.

PMID: 21589856
PMCID: PMC3093382
DOI: 10.1371/journal.pone.0014802

Michael Bridges et al. PLoS One. 2011.

Abstract

There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case-control studies and quality control (when participants in a study have been genotyped at different laboratories). This last application is of particular importance in the era of large-scale genome-wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations, are termed unsupervised. Supervised methods, on the other hand, are able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results, that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large-scale genome-wide association studies.
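The supervised setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: a nearest-centroid rule stands in for the neural network and support vector machine, the genotypes are simulated, and every name and parameter (`simulate_population`, `holdout_accuracy`, the allele frequencies, the sample sizes) is a hypothetical choice made for illustration. The sketch shows the two ingredients that make a method supervised: population labels are used during training, and accuracy is measured on held-out individuals, with a same-population control for comparison.

```python
import random


def simulate_population(n_individuals, allele_freqs, rng):
    """Simulate genotypes: 0, 1 or 2 copies of an allele at each SNP."""
    return [
        [sum(rng.random() < f for _ in range(2)) for f in allele_freqs]
        for _ in range(n_individuals)
    ]


def centroid(samples):
    """Mean genotype vector of a group of individuals."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(individual, centroids):
    """Assign the label whose training centroid is nearest."""
    return min(centroids, key=lambda label: squared_distance(individual, centroids[label]))


def holdout_accuracy(pop_a, pop_b, train_fraction=0.5):
    """Train on the first part of each labelled group, test on the remainder."""
    cut_a = int(len(pop_a) * train_fraction)
    cut_b = int(len(pop_b) * train_fraction)
    centroids = {"A": centroid(pop_a[:cut_a]), "B": centroid(pop_b[:cut_b])}
    test = [(x, "A") for x in pop_a[cut_a:]] + [(x, "B") for x in pop_b[cut_b:]]
    correct = sum(classify(x, centroids) == label for x, label in test)
    return correct / len(test)


rng = random.Random(0)
n_snps = 50                 # one window of SNPs (assumed size)
freqs_a = [0.2] * n_snps    # assumed allele frequencies, population A
freqs_b = [0.8] * n_snps    # deliberately exaggerated difference, population B

pop_a = simulate_population(100, freqs_a, rng)
pop_b = simulate_population(100, freqs_b, rng)
accuracy_distinct = holdout_accuracy(pop_a, pop_b)

# Control: two samples drawn from the same population should be
# indistinguishable, so accuracy should stay near chance (0.5).
pop_a_second = simulate_population(100, freqs_a, rng)
accuracy_control = holdout_accuracy(pop_a, pop_a_second)

print(accuracy_distinct, accuracy_control)
```

Comparing the accuracy for two candidate populations against a same-population control mirrors the comparison reported in the figure captions (for example P1 against P2 versus P1 against P1).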

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. An example of a 3-layer neural network with 7 input nodes, 3 nodes in the hidden layer and 5 output nodes.**
Each line represents one weight.
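A forward pass through a network of the shape in Figure 1 can be sketched as follows. The layer sizes (7, 3, 5) come from the caption; full connectivity, the tanh and softmax activations, the omission of bias terms and the random initial weights are assumptions made here for illustration, since the caption does not specify them.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """One weight per (input node, output node) pair; biases omitted (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def forward(x, w_hidden, w_output):
    """Propagate an input vector through the hidden and output layers."""
    # Hidden layer: weighted sum of inputs followed by tanh (assumed activation).
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    # Output layer: weighted sum of hidden activations.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_output]
    # Softmax (assumed) turns the outputs into class probabilities.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(1)
w_hidden = init_layer(7, 3, rng)   # 7 input nodes -> 3 hidden nodes
w_output = init_layer(3, 5, rng)   # 3 hidden nodes -> 5 output nodes

# Count the connections in a fully connected 7-3-5 network.
n_weights = sum(len(row) for row in w_hidden) + sum(len(row) for row in w_output)

# A hypothetical input: seven genotype values coded 0, 1 or 2.
outputs = forward([0, 1, 2, 1, 0, 2, 1], w_hidden, w_output)
print(n_weights, outputs)
```

Under the full-connectivity assumption the network has 7 × 3 + 3 × 5 = 36 weights, one per connecting line.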

**Figure 2. An example of a two-dimensional feature space for data of known class divided by three hyperplanes p1, p2 and p3.**
Clearly p1 divides most efficiently.

**Figure 3. Estimated values for a 50 SNP sliding window for P1∶P1 (top), P1∶P2 (middle), P1∶P3 (bottom).**
The is essentially zero everywhere except for a small region approximately halfway along the chromosome. The horizontal dotted line is the value of .

**Figure 4. Estimated values for a 100 SNP sliding window for P1∶P1 (top), P1∶P2 (middle), P1∶P3 (bottom).**
The horizontal dotted line is the value of . Note that although is always non-negative, the estimator may become negative for small values of .

**Figure 5. Estimated values for a 500 SNP sliding window for P1∶P1 (top), P1∶P2 (middle) and P1∶P3 (bottom).**
The horizontal dotted line is the value of .

**Figure 6. Classification with windows of 50 contiguous, non-overlapping SNPs for P1 against P2 (solid lines) with classification results for a sample of P1 against P1 (dotted lines) shown for comparison.**
The regions enclosed between the lines illustrate 1σ confidence intervals.

**Figure 7. Top panel shows classification with windows of 50 contiguous, non-overlapping SNPs for P1 against P3 (solid lines) with classification results for a sample of P3 against P3 (dotted lines) shown for comparison.**
The regions enclosed between the lines illustrate 1σ confidence intervals.

**Figure 8. Top panel shows classification with windows of 50 contiguous, non-overlapping SNPs for P2 against P3 (solid lines) with classification results for a sample of P2 against P2 (dotted lines) shown for comparison.**
The regions enclosed between the lines illustrate 1σ confidence intervals.

**Figure 9. Classification with windows of 100 (dot-dashed), 50 (dashed) and 20 (solid) contiguous, non-overlapping SNPs for P1 against P2.**
Note that as the window size increases, the accuracy converges to the *most* accurate classification, indicating that the ANN is successfully discarding irrelevant information. For clarity we have added an offset to each spectrum and omitted the ordinate axis; the horizontal lines represent classification in each case.

**Figure 10. Receiver Operating Characteristic (ROC) curve, that is, a plot of true positive rate (TPR) against false positive rate (FPR), of the neural network classifier trained using the first 50 SNPs using P1∶P2 (solid curve).**
A random classifier (dotted curve) is shown for comparison.

**Figure 11. Receiver Operating Characteristic (ROC) curve of the neural network classifier trained using 50 SNPs from 1950 to 2000, also for P1∶P2 (solid curve).**
A random classifier (dotted curve) is shown for comparison.
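The ROC curves in Figures 10 and 11 plot TPR against FPR as the decision threshold is varied. A minimal sketch of that construction follows; the scores and labels are made-up toy values, not results from the paper, and the function names `roc_points` and `auc` are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    labels: 1 for the positive class, 0 for the negative class.
    A higher score means the classifier favours the positive class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]   # threshold above every score: nothing called positive
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past every individual sharing this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy classifier outputs for six individuals (illustrative values only).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
print(points, area)
```

A random classifier traces the diagonal from (0, 0) to (1, 1), giving an area of 0.5, which is why it serves as the reference curve in both figures.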

**Figure 12. SVM classification with windows of 50 contiguous, non-overlapping SNPs for P1 against P2 (solid lines) with classification results for a sample of P1 against P1 (dotted lines) shown for comparison.**

**Figure 13. ANN classification with windows of 50 contiguous, non-overlapping SNPs for P1 against P2 (solid lines) with classification results for a sample of P1 against P1 (dotted lines) shown for comparison.**

Grants and funding

Wellcome Trust/United Kingdom
