BMC Bioinformatics. 2010 Oct 26;11 Suppl 8(Suppl 8):S3.

doi: 10.1186/1471-2105-11-S8-S3.

A machine learning pipeline for quantitative phenotype prediction from genotype data

Giorgio Guzzetta¹, Giuseppe Jurman, Cesare Furlanello

PMID: 21034428
PMCID: PMC2966290
DOI: 10.1186/1471-2105-11-S8-S3

Giorgio Guzzetta et al. BMC Bioinformatics. 2010.

Affiliation

¹ Fondazione Bruno Kessler, Trento, Italy.

Abstract

Background: Quantitative phenotypes emerge everywhere in systems biology and biomedicine, either because of a direct interest in quantitative traits or because high individual variability makes it hard or impossible to classify samples into distinct categories, as is often the case with complex common diseases. Machine learning approaches to genotype-phenotype mapping may significantly improve Genome-Wide Association Studies (GWAS) results by explicitly focusing on predictivity and optimal feature selection in a multivariate setting. It is, however, essential that stringent and well-documented Data Analysis Protocols (DAP) are used to control sources of variability and ensure reproducibility of results. We present a genome-to-phenotype pipeline of machine learning modules for quantitative phenotype prediction. The pipeline can be applied for the direct use of whole-genome information in functional studies. As a realistic example, the problem of fitting complex phenotypic traits in heterogeneous stock mice from single nucleotide polymorphisms (SNPs) is considered here.

Methods: The core element in the pipeline is the L1L2 regularization method based on the naïve elastic net. The method simultaneously provides a regression model and a dimensionality reduction procedure suitable for correlated features. Model and SNP markers are selected through a DAP originally developed in the MAQC-II collaborative initiative of the U.S. FDA for the identification of clinical biomarkers from microarray data. The L1L2 approach is compared with standard Support Vector Regression (SVR) and with Reversible Jump Markov Chain Monte Carlo (MCMC). Algebraic indicators of stability of partial lists are used for model selection; the final panel of markers is obtained by a procedure at the chromosome scale, termed 'saturation', to recover SNPs in Linkage Disequilibrium with those selected.
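As an illustration of the L1L2 idea only, the following is a minimal sketch of a naïve elastic net fitted by coordinate descent. It is not the authors' implementation: the scaling of the objective, the function names (`soft_threshold`, `naive_elastic_net`), the parameters `lam1` and `lam2`, and the toy data are all assumptions made for this example.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator induced by the L1 penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def naive_elastic_net(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for the (assumed) objective
        0.5 * ||y - X b||^2 + lam1 * ||b||_1 + 0.5 * lam2 * ||b||_2^2.
    X is a list of rows (samples x SNPs); y is a list of phenotype values.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            denom = col_sq[j] + lam2
            if denom == 0.0:
                continue  # all-zero column and no L2 penalty: leave weight at 0
            # Inner product of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam1) / denom
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Hypothetical toy data: columns 0 and 1 are identical (perfectly correlated),
# column 2 is unrelated to the phenotype.
X = [[1, 1, 1], [-1, -1, 1], [1, 1, -1], [-1, -1, -1]]
y = [2.0, -2.0, 2.0, -2.0]
beta = naive_elastic_net(X, y, lam1=0.5, lam2=1.0)
print(beta)
```

In this toy case the two identical columns end up sharing the weight equally while the unrelated column is set to zero, which is the behaviour that makes an L1L2 penalty attractive when features are correlated.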

Results: With respect to both MCMC and SVR, comparable accuracies are obtained by the L1L2 pipeline. Good agreement is also found between SNPs selected by the L1L2 algorithms and candidate loci previously identified by a standard GWAS. The combination of L1L2-based feature selection with a saturation procedure tackles the issue of neglecting highly correlated features that affects many feature selection algorithms.
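The saturation step can be pictured with a small sketch. The abstract does not specify how correlation is measured or how the threshold is chosen, so the version below assumes absolute Pearson correlation between genotype vectors on one chromosome; the names `pearson`, `saturate`, the SNP identifiers, and the threshold value are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length genotype vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return cov / sqrt(var_a * var_b)


def saturate(genotypes, selected, threshold):
    """For each selected (top-ranked) SNP, recover the other SNPs on the same
    chromosome whose absolute correlation with it reaches the threshold.
    genotypes maps SNP id -> list of allele counts, one per sample."""
    recovered = {}
    for ref in selected:
        recovered[ref] = [
            snp
            for snp, values in genotypes.items()
            if snp not in selected
            and abs(pearson(genotypes[ref], values)) >= threshold
        ]
    return recovered


# Hypothetical genotypes for four SNPs on one chromosome (six samples).
genotypes = {
    "snpA": [0, 1, 2, 1, 0, 2],
    "snpB": [0, 1, 2, 1, 0, 2],  # identical to snpA
    "snpC": [2, 1, 0, 1, 2, 0],  # mirror image of snpA
    "snpD": [1, 0, 1, 2, 1, 0],  # weakly related to snpA
}
top_correlated = saturate(genotypes, selected=["snpA"], threshold=0.9)
print(top_correlated)
```

Here `snpB` and `snpC` are recovered alongside the selected `snpA`, whereas `snpD` is not, mirroring how saturation is meant to bring back highly correlated markers that a sparse selector would otherwise drop.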

Conclusions: The L1L2 pipeline has proven effective in terms of marker selection and prediction accuracy. This study indicates that machine learning techniques may support quantitative phenotype prediction, provided that adequate DAPs are employed to control bias in model selection.

Figures

**Figure 1**
**Distance and regression weights for top-correlated SNPs** For each top-ranked SNP, a set of corresponding top-correlated SNPs at a given correlation threshold is identified. All chromosome distances from the reference top-ranked SNP and all regression weights are pooled together across all top-ranked SNPs. Numbers inside boxplots indicate the number of top-correlated SNPs; the number of top-ranked SNPs is 51 for the CD8 phenotype shown here. (a) Distributions of chromosome distances between top-ranked and top-correlated SNPs (bp, natural log scale). (b) Distribution of L1L2 regression weights for top-correlated SNPs. The average weight of top-ranked SNPs is 0.07.

**Figure 2**
**Top-ranked and top-correlated SNPs for the CD8+ phenotype** SNPs selected for the CD8+ phenotype by SVR (*red*), L1L2 (*green*) and GWAS [16] (horizontal segments; levels of gray indicate the probability of association; *black:* probability 1, *white:* probability 0). For SVR and L1L2, top-ranked SNPs and the corresponding top-correlated SNPs are shown.

**Figure 3**
**Accuracy-stability plot for model selection** Accuracy-stability plot for the CD8+ phenotype for 15 development / validation splits. The measure for accuracy is the mean squared error between predicted and actual value of the phenotype, averaged over the 10 Cross-Validations; the measure for stability is the Canberra complete distance for partial lists [15].

**Figure 4**
Data Analysis Protocols for the machine learning methods
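The stability axis of the accuracy-stability plot (Figure 3) relies on a Canberra-type distance between ranked SNP lists. The sketch below shows only the basic Canberra distance between two complete rankings of the same items, averaged over all pairs of lists; it is a simplification and does not implement the complete distance for partial lists of [15] or any normalisation. The function names are hypothetical.

```python
from itertools import combinations


def canberra_rank_distance(list_a, list_b):
    """Canberra distance between the rank vectors of two ranked lists
    that contain exactly the same items (ranks start at 1)."""
    rank_a = {item: r for r, item in enumerate(list_a, start=1)}
    rank_b = {item: r for r, item in enumerate(list_b, start=1)}
    if rank_a.keys() != rank_b.keys():
        raise ValueError("this simplified version needs identical item sets")
    return sum(
        abs(rank_a[item] - rank_b[item]) / (rank_a[item] + rank_b[item])
        for item in rank_a
    )


def mean_pairwise_distance(ranked_lists):
    """Average Canberra distance over all pairs of lists;
    lower values mean the rankings agree more (higher stability)."""
    pairs = list(combinations(ranked_lists, 2))
    return sum(canberra_rank_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical rankings of three SNPs from three resampling runs.
runs = [["a", "b", "c"], ["b", "a", "c"], ["a", "b", "c"]]
stability = mean_pairwise_distance(runs)
print(stability)
```

Because the denominator is the sum of the two ranks, disagreements near the top of the lists weigh more than disagreements near the bottom, which suits marker panels where the highest-ranked SNPs matter most.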

References

1. Lee SH, van der Werf JHJ, Hayes BJ, Goddard ME, Visscher PM. Predicting unobserved phenotypes for complex traits from whole-genome SNP data. PLoS Genetics. 2008;4(10):e1000231. doi: 10.1371/journal.pgen.1000231.
2. Casci T. Fitting phenotypes. Nature Reviews Genetics. 2008;9:896–897. doi: 10.1038/nrg2495.
3. Cupples LA, Beyene J, Bickeboller H, Daw EW, Fallin MD, Gauderman WJ, Ghosh S, Goode E, Hauser E, Hinrichs A, Kent J, Martin L, Martinez M, Neuman R, Province M, Szymczak S, Wilcox M, Ziegler A, MacCluer J, Almasy L. Genetic Analysis Workshop 16: Strategies for genome-wide association study analyses. BMC Proceedings. 2009;3(Suppl 7):S1. doi: 10.1186/1753-6561-3-s7-s1.
4. Moore JH, Asselbergs FW, Williams SM. Bioinformatics challenges for genome-wide association studies. Bioinformatics. 2010;26(4):445–455. doi: 10.1093/bioinformatics/btp713.
5. Wooten E, Iyer L, Montefusco M, Hedgepeth A, Payne D, Kapur N, Housman D, Mendelsohn M, Huggins G. Application of Gene Network Analysis Techniques Identifies AXIN1/PDIA2 and Endoglin Haplotypes Associated with Bicuspid Aortic Valve. PLoS ONE. 2010;5:e8830. doi: 10.1371/journal.pone.0008830.
