BMC Bioinformatics. 2024 Apr 20;25(1):155. doi: 10.1186/s12859-024-05769-8.

NoiseCut: a Python package for noise-tolerant classification of binary data using prior knowledge integration and max-cut solutions

Moein E Samadi et al.

Abstract

Background: Classification of binary data arises naturally in many clinical applications, such as patient risk stratification through ICD codes. One of the key practical challenges in data classification using machine learning is to avoid overfitting. Overfitting in supervised learning primarily occurs when a model learns random variations from noisy labels in training data rather than the underlying patterns. While traditional methods such as regularization and early stopping have demonstrated effectiveness in interpolation tasks, addressing overfitting in the classification of binary data, in which predictions always amount to extrapolation, demands extrapolation-enhanced strategies. One such approach is hybrid mechanistic/data-driven modeling, which integrates prior knowledge on input features into the learning process, enhancing the model's ability to extrapolate.

Results: We present NoiseCut, a Python package for noise-tolerant classification of binary data by employing a hybrid modeling approach that leverages solutions of defined max-cut problems. In a comparative analysis conducted on synthetically generated binary datasets, NoiseCut exhibits better overfitting prevention compared to the early stopping technique employed by different supervised machine learning algorithms. The noise tolerance of NoiseCut stems from a dropout strategy that leverages prior knowledge of input features and is further enhanced by the integration of max-cut problems into the learning process.
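
To make the max-cut ingredient concrete, the following sketch brute-forces the maximum cut of a tiny weighted graph. It is only an illustration of the underlying combinatorial problem on assumed example data; it is not NoiseCut's solver or API.

    # Illustrative brute-force max cut on a small weighted graph.
    # The graph and all names here are example assumptions, not NoiseCut internals.
    from itertools import product

    def max_cut_brute_force(n_nodes, weighted_edges):
        """Return (best cut value, node partition) for a graph with n_nodes vertices.
        weighted_edges: iterable of (u, v, w) with 0 <= u, v < n_nodes.
        Feasible only for very small graphs (2**n_nodes assignments)."""
        best_value, best_partition = float("-inf"), None
        for assignment in product((0, 1), repeat=n_nodes):
            # Sum the weights of edges whose endpoints land on different sides of the cut.
            cut_value = sum(w for u, v, w in weighted_edges
                            if assignment[u] != assignment[v])
            if cut_value > best_value:
                best_value, best_partition = cut_value, assignment
        return best_value, best_partition

    edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0), (2, 3, 1.5)]
    print(max_cut_brute_force(4, edges))  # -> (6.5, (0, 1, 1, 0))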

Conclusions: NoiseCut is a Python package for the implementation of hybrid modeling for the classification of binary data. It facilitates the integration of mechanistic knowledge on the input features into learning from data in a structured manner and proves to be a valuable classification tool when the available training data is noisy and/or limited in size. This advantage is especially prominent in medical and biomedical applications where data scarcity and noise are common challenges. The codebase, illustrations, and documentation for NoiseCut are accessible for download at https://pypi.org/project/noisecut/ . The implementation detailed in this paper corresponds to the version 0.2.1 release of the software.
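
The release described here can be pinned when installing from PyPI, for example:

    pip install noisecut==0.2.1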

Keywords: Binary data; Hybrid mechanistic/data-driven modeling; Max-cut problem; Noise-tolerant classification; Overfitting.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

Fig. 1
The visual representation of the illustrative example introduced in Eqs. (1–4)
Fig. 2
A schematic representation of the information flow from binary-represented input data to binary labels. This procedure has been used to generate the synthetic datasets
Fig. 3
A tree-structured FN F: x ∈ {0,1}^N → y ∈ {0,1}, which maps binary-represented data to a binary output. The FN has M first-layer boxes, each operating on a separate subset of the input variables: f_m = F_m(Decimal([x_{m,i}]_{i=1..n_m})). The output box in the second layer processes the outcomes of the first-layer boxes towards the overall output of the FN: y = F_O(Decimal([f_i]_{i=1..M})). (A minimal code sketch of this structure follows the figure captions.)
Fig. 4
Comparison of classifier accuracy on testing datasets between NoiseCut and various ML models for classifying binary data across the entire spectrum of noise intensities, with a consistent 70% training data size. NoiseCut outperforms the others as noise intensifies, demonstrating superior overfitting mitigation across varying noise levels compared to the early stopping approach used by the other ML models
Fig. 5
a. Comparison of ROC curves illustrating the classification performance of NoiseCut alongside other ML models on testing datasets. b. Comparison of computational time between NoiseCut and the other ML models across varying sample sizes. The evaluation is conducted with only 30% of the training data available and 5% noise intensity in the data labeling
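
As referenced in the Fig. 3 caption, the two-layer tree structure can be sketched in plain Python as nested lookup tables. The input grouping and box truth tables below are illustrative assumptions, not NoiseCut's implementation.

    # Illustrative evaluation of a tree-structured function network (FN) as in Fig. 3.
    # Groupings and truth tables are made-up examples, not NoiseCut internals.

    def decimal(bits):
        """Interpret a list of binary values as a decimal index, most significant bit first."""
        return int("".join(str(b) for b in bits), 2)

    def evaluate_fn(x, groups, first_layer_boxes, output_box):
        """x: list of N binary inputs; groups: one tuple of input indices per first-layer box;
        first_layer_boxes: one truth table per box, indexed by Decimal of its inputs;
        output_box: truth table indexed by Decimal of the first-layer outputs f_1..f_M."""
        f = [box[decimal([x[i] for i in g])] for g, box in zip(groups, first_layer_boxes)]
        return output_box[decimal(f)]

    # Example: N = 4 inputs split into M = 2 first-layer boxes of 2 inputs each.
    groups = [(0, 1), (2, 3)]
    first_layer_boxes = [
        [0, 1, 1, 1],  # F_1: OR of its two inputs
        [0, 0, 0, 1],  # F_2: AND of its two inputs
    ]
    output_box = [0, 1, 1, 0]  # F_O: XOR of f_1, f_2
    print(evaluate_fn([1, 0, 1, 1], groups, first_layer_boxes, output_box))  # -> 0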

