The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation

doi:10.1186/s13040-021-00244-z

BioData Min. 2021 Feb 4;14(1):13.

Davide Chicco^#¹, Niklas Tötsch^#², Giuseppe Jurman³

Affiliations

¹ Krembil Research Institute, Toronto, Ontario, Canada. davidechicco@davidechicco.it.
² Universität Duisburg-Essen, Essen, Germany.
³ Fondazione Bruno Kessler, Trento, Italy.

^# Contributed equally.

PMID: 33541410
PMCID: PMC7863449
DOI: 10.1186/s13040-021-00244-z

Davide Chicco et al. BioData Min. 2021.

Abstract

Evaluating binary classifications is a pivotal task in statistics and machine learning because it can influence decisions in many areas, for example the prognosis or therapy of patients in critical condition. The scientific community has not yet agreed on a general-purpose statistical indicator for evaluating two-class confusion matrices (made of true positives, true negatives, false positives, and false negatives), even though the advantages of the Matthews correlation coefficient (MCC) over accuracy and F₁ score have already been shown.

In this manuscript, we reaffirm that MCC is a robust metric that summarizes classifier performance in a single value when positive and negative cases are of equal importance. We compare MCC to other metrics that value positive and negative cases equally: balanced accuracy (BA), bookmaker informedness (BM), and markedness (MK). We explain the mathematical relationships between MCC and these indicators, then show some use cases and a bioinformatics scenario where these metrics disagree and where MCC generates a more informative response.

Additionally, we describe three exceptions where BM can be more appropriate: analyzing classifications where dataset prevalence is unrepresentative, comparing classifiers on different datasets, and assessing the random guessing level of a classifier. Except in these cases, we believe that MCC is the most informative among the single metrics discussed, and suggest it as a standard measure for scientists of all fields. A Matthews correlation coefficient close to +1, in fact, means having high values for all the other confusion matrix metrics. The same cannot be said for balanced accuracy, markedness, bookmaker informedness, accuracy, and F₁ score.
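
To make the four metrics compared in the abstract concrete, the following is a minimal Python sketch that computes BA, BM, MK, and MCC from the four confusion matrix counts using their standard textbook definitions. The function name `confusion_metrics` and the example counts are illustrative assumptions, not taken from the article.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Compute BA, BM, MK and MCC from the four confusion matrix counts.

    Assumes every row and column total is non-zero; otherwise some rates
    are undefined and this sketch raises ZeroDivisionError.
    """
    tpr = tp / (tp + fn)  # true positive rate (sensitivity)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    ppv = tp / (tp + fp)  # positive predictive value (precision)
    npv = tn / (tn + fn)  # negative predictive value
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "BA": (tpr + tnr) / 2,
        "BM": tpr + tnr - 1,
        "MK": ppv + npv - 1,
        "MCC": (tp * tn - fp * fn) / denom,
    }

# Hypothetical imbalanced dataset: 100 positives, 10,000 negatives.
scores = confusion_metrics(tp=90, tn=9000, fp=1000, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example BA and BM are high because TPR and TNR are both 0.9, while MK and MCC are much lower because precision is poor. The counts also satisfy MCC² = BM × MK, the standard identity linking the three metrics.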

Keywords: Balanced accuracy; Binary classification; Bookmaker informedness; Confusion matrix; Machine learning; Markedness; Matthews correlation coefficient.

Conflict of interest statement

The authors declare they have no competing interests.

Figures

**Fig. 1**
Relationships between MCC and BM, BA, and MK. Plots indicating the values of MCC in relationship with BM (left), BA (centre), and MK (right) calculated for approximately 8 million confusion matrices with 40 thousand samples each

**Fig. 2**
Pearson correlation between MCC, BM and MK as a function of number of samples N

**Fig. 3**
Indicative example with high true positive rate (TPR) and high true negative rate (TNR). We show the trend of the four basic rates if TPR = 0.9 and TNR = 0.8, and illustrate how positive predictive value (PPV) and negative predictive value (NPV) depend on prevalence (ϕ). Bookmaker informedness (BM) equals 0.7 in this example. At least one of PPV and NPV is high, even if ϕ varies. Both are high only if ϕ is close to 0.6
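
The dependence of PPV and NPV on prevalence described in this caption follows from Bayes' rule. A minimal Python sketch, assuming the caption's TPR = 0.9 and TNR = 0.8, is given here; the function name `predictive_values` and the chosen prevalence values are illustrative assumptions.

```python
def predictive_values(tpr, tnr, prevalence):
    """Return (PPV, NPV) for fixed TPR and TNR at a given prevalence.

    Uses Bayes' rule; prevalence must lie strictly between 0 and 1.
    """
    phi = prevalence
    ppv = tpr * phi / (tpr * phi + (1 - tnr) * (1 - phi))
    npv = tnr * (1 - phi) / (tnr * (1 - phi) + (1 - tpr) * phi)
    return ppv, npv

TPR, TNR = 0.9, 0.8  # values from the caption
print(f"BM = {TPR + TNR - 1:.1f}")

# Prevalence values chosen only for illustration.
for phi in (0.05, 0.2, 0.6, 0.95):
    ppv, npv = predictive_values(TPR, TNR, phi)
    print(f"phi={phi:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

BM depends only on TPR and TNR, so it stays at 0.7 for every prevalence, whereas PPV and NPV move in opposite directions as ϕ changes.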

**Fig. 4**
BM, MCC and MK for classifiers with known randomness. We simulated classifiers with known amounts of randomness. To that end, we generated random lists of reference classes with a given prevalence. A fraction of those classes (the lookup fraction) was copied and used as predicted labels; the remaining ones were generated randomly, matching a given bias. Matching the reference classes with the predicted labels, we determined bookmaker informedness, Matthews correlation coefficient, and markedness (left, centre, and right columns, respectively). The rows differ by the amount of randomness/lookup fraction
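
The simulation described in this caption can be approximated with a short Python sketch. This is not the authors' code: it assumes each sample is copied independently with probability equal to the lookup fraction, and that the bias is the probability of a random prediction being positive. The function name `simulate` and all parameter values are hypothetical.

```python
import math
import random

def simulate(n, prevalence, lookup, bias, seed=0):
    """Simulate a partly random classifier and return (BM, MCC, MK).

    Assumption of this sketch: each prediction copies the reference label
    with probability `lookup`, otherwise it is positive with probability
    `bias`. Degenerate settings can leave a row or column total at zero,
    which raises ZeroDivisionError.
    """
    rng = random.Random(seed)
    tp = tn = fp = fn = 0
    for _ in range(n):
        actual = rng.random() < prevalence
        if rng.random() < lookup:
            predicted = actual               # copied from the reference
        else:
            predicted = rng.random() < bias  # random guess with given bias
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    bm = tp / (tp + fn) + tn / (tn + fp) - 1
    mk = tp / (tp + fp) + tn / (tn + fn) - 1
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return bm, mcc, mk

bm, mcc, mk = simulate(n=100_000, prevalence=0.2, lookup=0.5, bias=0.7)
print(f"BM={bm:.3f}  MCC={mcc:.3f}  MK={mk:.3f}")
```

Under this construction the expected TPR is lookup + (1 − lookup)·bias and the expected TNR is lookup + (1 − lookup)·(1 − bias), so the expected BM equals the lookup fraction whatever the prevalence and bias. That is consistent with the abstract's point that BM suits assessing the random guessing level of a classifier.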

Cited by

Benchmarking MicrobIEM - a user-friendly tool for decontamination of microbiome sequencing data.
Hülpüsch C, Rauer L, Nussbaumer T, Schwierzeck V, Bhattacharyya M, Erhart V, Traidl-Hoffmann C, Reiger M, Neumann AU. BMC Biol. 2023 Nov 23;21(1):269. doi: 10.1186/s12915-023-01737-5. PMID: 37996810.

Signature literature review reveals AHCY, DPYSL3, and NME1 as the most recurrent prognostic genes for neuroblastoma.
Chicco D, Sanavia T, Jurman G. BioData Min. 2023 Mar 4;16(1):7. doi: 10.1186/s13040-023-00325-1. PMID: 36870971.

Predictive Potential of C_max Bioequivalence in Pilot Bioavailability/Bioequivalence Studies, through the Alternative ƒ₂ Similarity Factor Method.
Henriques SC, Paixão P, Almeida L, Silva NE. Pharmaceutics. 2023 Oct 20;15(10):2498. doi: 10.3390/pharmaceutics15102498. PMID: 37896259.

Implementation of IFPTML Computational Models in Drug Discovery Against Flaviviridae Family.
Velásquez-López Y, Ruiz-Escudero A, Arrasate S, González-Díaz H. J Chem Inf Model. 2024 Mar 25;64(6):1841-1852. doi: 10.1021/acs.jcim.3c01796. Epub 2024 Mar 11. PMID: 38466369.

Predicting Satisfaction With Chat-Counseling at a 24/7 Chat Hotline for the Youth: Natural Language Processing Study.
Hornstein S, Lueken U, Wundrack R, Hilbert K. JMIR AI. 2025 Feb 18;4:e63701. doi: 10.2196/63701. PMID: 39965198.
