Comparative Study
. 2014 Apr 24;9(4):e94137.
doi: 10.1371/journal.pone.0094137. eCollection 2014.

A systematic comparison of supervised classifiers


Diego Raphael Amancio et al. PLoS One.

Abstract

Pattern recognition has been employed in a myriad of industrial, commercial and academic applications, and many techniques have been devised to tackle this diversity. Despite the long tradition of pattern recognition research, no single technique yields the best classification in all scenarios; therefore, as many techniques as possible should be considered in high-accuracy applications. Typical related works either focus on the performance of a given algorithm or compare various classification methods. On many occasions, however, researchers who are not experts in machine learning must deal with practical classification tasks without in-depth knowledge of the underlying parameters. The appropriate choice of classifiers and parameters in such practical circumstances is a long-standing problem and is one of the subjects of the current paper. We carried out a performance study of nine well-known classifiers implemented in the Weka framework and compared the influence of parameter configurations on accuracy. The default parameter configuration in Weka was found to provide near-optimal performance in most cases, excluding methods such as the support vector machine (SVM). In addition, the k-nearest neighbor (kNN) method frequently achieved the best accuracy. Under certain conditions, the accuracy of the SVM could be improved by more than 20% relative to its default parameter configuration.
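The paper's experiments use Weka's Java implementations; the general workflow of scoring several classifiers under fixed default settings on the same data can nevertheless be sketched in a small self-contained Python script. The toy dataset and the two classifiers below (a from-scratch kNN and a nearest-centroid rule) are illustrative stand-ins, not the paper's actual setup:

```python
import math
import random

random.seed(42)

def make_dataset(n_per_class=50, sep=3.0):
    """Two Gaussian classes in 2 features, separated along the first axis."""
    data = []
    for label in (0, 1):
        cx = label * sep
        for _ in range(n_per_class):
            data.append(((random.gauss(cx, 1.0), random.gauss(0.0, 1.0)), label))
    random.shuffle(data)
    return data

def knn_predict(train, x, k=1):
    """Plain k-nearest-neighbour majority vote with Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def centroid_predict(train, x):
    """Nearest-centroid classifier: assign to the closest class mean."""
    by_class = {}
    for point, label in train:
        by_class.setdefault(label, []).append(point)
    means = {lab: tuple(sum(c) / len(pts) for c in zip(*pts))
             for lab, pts in by_class.items()}
    return min(means, key=lambda lab: math.dist(means[lab], x))

def accuracy(predict, train, test):
    """Fraction of test points each classifier labels correctly."""
    return sum(predict(train, x) == y for x, y in test) / len(test)

data = make_dataset()
train, test = data[:70], data[70:]
print("kNN (k=1):        %.2f" % accuracy(knn_predict, train, test))
print("nearest centroid: %.2f" % accuracy(centroid_predict, train, test))
```

Running every candidate classifier through the same `accuracy` helper, as here, is the basic shape of the comparison the paper performs at much larger scale.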


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Example of artificial dataset for 10 classes and 2 features (DB2F).
It is possible to note that different classes have different correlations between the features. The separation between the classes is (a) formula image, (b) formula image and (c) formula image.
Figure 2. Behavior of the accuracy rate as the number of features increases.
As more attributes are taken into account, the kNN becomes significantly better than the other pattern recognition techniques.
Figure 3. One-dimensional analysis performed with the parameter formula image of the kNN classifier.
Panel (a) illustrates the default value of the parameter (formula image) with a red vertical dashed line. The accuracy rate associated with the default values of the parameters is denoted by formula image, and the best accuracy rate observed in the neighborhood of the default value of formula image is represented as formula image. The difference between these two quantities is represented by formula image. Panel (b) shows how the accuracy rate varies with formula image in DB2F (each line represents the behavior of a particular dataset in DB2F). Finally, panel (c) displays the distribution of formula image in DB2F.
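The one-dimensional analysis described for Figure 3 amounts to sweeping a single parameter around its default, recording the default accuracy, the best accuracy in the neighborhood, and their difference. A minimal Python sketch with a from-scratch kNN on an invented dataset follows; the default k = 1 and the range of k explored are assumptions for illustration, not the paper's exact protocol:

```python
import math
import random

random.seed(1)

# Invented dataset: two noisy Gaussian classes in 2 features.
def sample(label, n=60):
    cx = 1.5 * label
    return [((random.gauss(cx, 1.0), random.gauss(0.0, 1.0)), label)
            for _ in range(n)]

data = sample(0) + sample(1)
random.shuffle(data)
train, test = data[:80], data[80:]

def knn_accuracy(k):
    """Accuracy of a plain majority-vote kNN on the held-out split."""
    hits = 0
    for x, y in test:
        nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
        votes = [lab for _, lab in nearest]
        hits += (max(set(votes), key=votes.count) == y)
    return hits / len(test)

K_DEFAULT = 1                  # assumed default value of the parameter k
NEIGHBOURHOOD = range(1, 16)   # values of k explored around the default

p_def = knn_accuracy(K_DEFAULT)                       # accuracy at the default
p_best = max(knn_accuracy(k) for k in NEIGHBOURHOOD)  # best accuracy nearby
delta = p_best - p_def   # the default-vs-best gap the figure's panels track
print(f"default={p_def:.2f}  best={p_best:.2f}  delta={delta:.2f}")
```

Because the default value is itself inside the explored neighborhood, the gap is never negative; the distribution of this gap across datasets is what panel (c) summarizes.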
Figure 4. Example of the random parameters analysis.
We use one of the artificial datasets and the kNN classifier. (a) By randomly drawing 1,000 different parameter combinations of kNN we construct a histogram of accuracy rates. The red dashed line indicates the performance achieved with default parameters. (b) The accuracy rate for the default parameters is subtracted from the values obtained for the random drawing. The normalized area of the histogram above zero indicates how easy it is to improve the performance with a random tuning of parameters.
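The random-parameters analysis of Figure 4 boils down to drawing many parameter combinations, scoring each, and measuring the fraction of draws that beat the default configuration. A minimal sketch, assuming a from-scratch kNN with two invented "parameters" (k and inverse-distance weighting) on a toy dataset:

```python
import math
import random

random.seed(7)

# Toy dataset: two Gaussian classes in 2 features.
def sample(label, n=60):
    return [((random.gauss(1.2 * label, 1.0), random.gauss(0.0, 1.0)), label)
            for _ in range(n)]

data = sample(0) + sample(1)
random.shuffle(data)
train, test = data[:80], data[80:]

def knn_accuracy(k, weighted):
    """kNN with two tunable 'parameters': k and inverse-distance weighting."""
    hits = 0
    for x, y in test:
        nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
        score = {}
        for point, lab in nearest:
            w = 1.0 / (math.dist(point, x) + 1e-9) if weighted else 1.0
            score[lab] = score.get(lab, 0.0) + w
        hits += (max(score, key=score.get) == y)
    return hits / len(test)

DEFAULT = (1, False)            # stand-in for the default configuration
p_default = knn_accuracy(*DEFAULT)

# Draw random parameter combinations and record accuracy differences.
diffs = []
for _ in range(200):
    k = random.randint(1, 25)
    weighted = random.random() < 0.5
    diffs.append(knn_accuracy(k, weighted) - p_default)

# Fraction of random draws beating the default: the normalized area above zero.
frac_better = sum(d > 0 for d in diffs) / len(diffs)
print(f"default accuracy={p_default:.2f}  fraction above default={frac_better:.2f}")
```

A value of `frac_better` close to 1 corresponds to the situation the paper reports for kNN and SVM, where most random configurations outperform the defaults.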
Figure 5. Distribution of the difference of accuracy rates observed between the random and default configuration of parameters.
(a) kNN; (b) C4.5; (c) Multilayer Perceptron; (d) Logistic; (e) Random Forest; (f) Simple CART; (g) SVM. Note that, in the case of kNN and SVM classifiers, most of the random configurations yield better results than the default case.

