The validation and assessment of machine learning: a game of prediction from high-dimensional data

Tune H Pers¹, Anders Albrechtsen, Claus Holst, Thorkild I A Sørensen, Thomas A Gerds

PMID: 19652722
PMCID: PMC2716515
DOI: 10.1371/journal.pone.0006287

Tune H Pers et al. PLoS One. 2009 Aug 4;4(8):e6287.

Affiliation

¹ Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Kongens Lyngby, Denmark.

Abstract

In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available to guide the choice of an appropriate machine learning tool in a new application. Initial development of an overall strategy thus often involves testing and comparing multiple methods on the same data set. Such comparisons are particularly difficult in situations prone to over-fitting, where the number of subjects is low compared to the number of potential predictors. This article presents a game that provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study, where the aim is to predict fat oxidation capacity from conventional factors and high-dimensional metabolomics data. Three players chose support vector machines, LASSO, and random forests, respectively.
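The combination of a strictly proper scoring rule with bootstrap cross-validation can be illustrated with a minimal Python sketch. It is not the authors' R code: the Brier score stands in for the scoring rule, the binary toy data are simulated, and the names `brier_score`, `bootstrap_cv`, `fit_null`, and `fit_threshold` are hypothetical. Each round trains a strategy on a bootstrap sample and scores it on the subjects left out of that sample; reusing the same seed gives every strategy identical splits, which is one plausible way to keep the comparison fair.

```python
import random


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def bootstrap_cv(xs, ys, fit, n_boot=100, seed=0):
    """Train on bootstrap samples, score on the subjects left out of each sample."""
    rng = random.Random(seed)  # same seed -> same splits for every strategy
    n = len(ys)
    scores = []
    for _ in range(n_boot):
        train_idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        in_bag = set(train_idx)
        test_idx = [i for i in range(n) if i not in in_bag]
        if not test_idx:  # extremely unlikely, but avoids dividing by zero
            continue
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        probs = [model(xs[i]) for i in test_idx]
        scores.append(brier_score(probs, [ys[i] for i in test_idx]))
    return sum(scores) / len(scores), scores


def fit_null(train_x, train_y):
    """Null model: ignore the predictor and predict the training prevalence."""
    prevalence = sum(train_y) / len(train_y)
    return lambda x: prevalence


def fit_threshold(train_x, train_y):
    """Hypothetical one-predictor strategy: outcome rate on each side of the median."""
    cut = sorted(train_x)[len(train_x) // 2]
    overall = sum(train_y) / len(train_y)
    low = [y for x, y in zip(train_x, train_y) if x < cut]
    high = [y for x, y in zip(train_x, train_y) if x >= cut]
    p_low = sum(low) / len(low) if low else overall
    p_high = sum(high) / len(high) if high else overall
    return lambda x: p_high if x >= cut else p_low


# Simulated toy data (an assumption, unrelated to the Nugenob Study).
data_rng = random.Random(42)
xs = [data_rng.gauss(0, 1) for _ in range(60)]
ys = [1 if x + 0.5 * data_rng.gauss(0, 1) > 0 else 0 for x in xs]

null_mean, null_scores = bootstrap_cv(xs, ys, fit_null, seed=1)
thr_mean, thr_scores = bootstrap_cv(xs, ys, fit_threshold, seed=1)
print("null model mean Brier score:", round(null_mean, 3))
print("threshold model mean Brier score:", round(thr_mean, 3))
```

A lower mean score indicates better out-of-sample predictions. The per-round scores returned alongside the mean show how much the estimate varies across bootstrap samples.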

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Game setup in R.**
Extracts from the R script used for setting up the VAML Nugenob game.

**Figure 2. Random forest model.**
Extracts from the R script that THP used for building the random forest model. The number of trees (NT) and the number of variables tried at each split (MT) are obtained as described in the text.

**Figure 3. Support vector machine model.**
Extracts from the R script that AA used for building the support vector machine (SVM) model.

**Figure 4. LASSO model.**
Extracts from the R script that TAG used for building the LASSO model. The shrinkage parameter s is obtained as described in the text.

**Figure 5. Model evaluation.**
Extracts from the R script used for evaluating the random forest model in the VAML Nugenob game. The elements of the list RfPredOob are obtained as described in Figure 2. The other two strategies are evaluated similarly.

**Figure 6. Prediction error curves.**
Performance of the three strategies and the null model. The gray lines represent the performances of the respective prediction model estimated in the 100 bootstrap cross-validation steps. The solid lines represent the mean bootstrap cross-validation performance and the dashed lines represent the apparent performance.
