BMC Bioinformatics. 2025 Sep 2;26(1):228.

doi: 10.1186/s12859-025-06253-7.

Towards the genome-scale discovery of bivariate monotonic classifiers

Océane Fourquet¹,², Martin S Krejca³, Carola Doerr², Benno Schwikowski⁴

Affiliations

¹ Computational Systems Biomedicine Lab, Institut Pasteur, Université Paris Cité, 25-28 Rue du Dr Roux, 75015, Paris, France.
² LIP6, CNRS, Sorbonne Université, 4 Place Jussieu, 75005, Paris, France.
³ LIX, CNRS, École Polytechnique, Institut Polytechnique de Paris, Honoré d'Estienne d'Orves, 91120, Palaiseau, France.
⁴ Computational Systems Biomedicine Lab, Institut Pasteur, Université Paris Cité, 25-28 Rue du Dr Roux, 75015, Paris, France. benno@pasteur.fr.

PMID: 40898061
PMCID: PMC12403431
DOI: 10.1186/s12859-025-06253-7

Océane Fourquet et al. BMC Bioinformatics. 2025.

Abstract

Background: Bivariate monotonic classifiers (BMCs) are based on pairs of input features. Like many other models used for machine learning, they can capture nonlinear patterns in high-dimensional data. At the same time, they are simple and easy to interpret. Until now, the use of BMCs on a genome scale was hampered by the high computational complexity of the search for pairs of features with a high leave-one-out performance estimate.
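To make the cost of this exhaustive search concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a deliberately simple monotone rule (a sample is predicted positive if it dominates at least one positive training sample in both features) and handles only one orientation (increasing in both features). The names `predict_monotone`, `loo_error`, and `naive_pair_search`, the toy data, and the figure of 20,000 features are illustrative assumptions.

```python
from itertools import combinations
from math import comb


def predict_monotone(train, query):
    """Predict 1 if `query` dominates some positive training point.

    `train` is a list of ((x, y), label) tuples. The resulting decision
    function is non-decreasing in both coordinates (an assumed, simplified
    stand-in for a fitted BMC).
    """
    qx, qy = query
    return int(any(label == 1 and px <= qx and py <= qy
                   for (px, py), label in train))


def loo_error(points, labels):
    """Leave-one-out error: refit without sample i, then predict sample i."""
    data = list(zip(points, labels))
    mistakes = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        mistakes += predict_monotone(rest, point) != label
    return mistakes / len(data)


def naive_pair_search(X, labels):
    """Evaluate the leave-one-out error of every feature pair (exhaustive)."""
    n_features = len(X[0])
    results = {}
    for a, b in combinations(range(n_features), 2):
        points = [(row[a], row[b]) for row in X]
        results[(a, b)] = loo_error(points, labels)
    return results


# Toy data: the label is 1 only when features 0 and 1 are both high;
# feature 2 is noise.
X = [
    [1, 4, 2],
    [4, 1, 5],
    [1, 1, 3],
    [3, 3, 1],
    [4, 4, 4],
    [3, 4, 6],
]
labels = [0, 0, 0, 1, 1, 1]

results = naive_pair_search(X, labels)
best_pair = min(results, key=results.get)
print(best_pair, results[best_pair])

# Number of feature pairs for an assumed 20,000 features.
print(comb(20_000, 2))
```

Every pair requires one refit per left-out sample, so the total work grows with the number of pairs times the number of samples, which is what makes the naive search impractical at genome scale.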

Results: We introduce the fastBMC algorithm, which drastically speeds up the identification of BMCs. The algorithm exploits a mathematical bound on the BMC performance estimate to reduce the number of evaluations while maintaining optimality. We show empirically that, compared to the traditional approach, fastBMC speeds up the computation by a factor of at least 15, even for a small number of features. For two of the three smaller biomedical datasets that we consider here, the resulting ability to consider much larger sets of features translates into significantly improved classification performance. As an example of the high degree of interpretability of BMCs, we discuss a straightforward interpretation of a BMC glioblastoma survival predictor, an immediate novel biomedical hypothesis, options for biomedical validation, and treatment implications. In addition, we study the performance of fastBMC on a larger and well-known breast cancer dataset, validating the benefits of BMCs for biomarker identification and biomedical hypothesis generation.
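The idea of using a bound to skip evaluations without losing optimality can be sketched generically as follows. This is an assumption-laden illustration, not the fastBMC algorithm itself: it assumes a cheap score that never exceeds the expensive error of the same candidate, ranks candidates by that score, and stops once the score exceeds the error of the k-th best candidate found so far. The function `preselect`, its parameters, and the toy numbers are hypothetical.

```python
import heapq


def preselect(candidates, cheap_lower_bound, expensive_error, k):
    """Return the k candidates with the smallest expensive error.

    Assumes cheap_lower_bound(c) <= expensive_error(c) for every candidate.
    Also returns how many expensive evaluations were needed.
    """
    ranked = sorted(candidates, key=cheap_lower_bound)
    best = []  # max-heap on error, stored as (-error, index, candidate)
    evaluations = 0
    for index, candidate in enumerate(ranked):
        # Threshold: the k-th smallest error seen so far (infinite until
        # k candidates have been evaluated).
        threshold = -best[0][0] if len(best) == k else float("inf")
        if cheap_lower_bound(candidate) > threshold:
            # All remaining candidates have an even larger bound, so none
            # of them can enter the top k.
            break
        error = expensive_error(candidate)
        evaluations += 1
        if len(best) < k:
            heapq.heappush(best, (-error, index, candidate))
        elif error < threshold:
            heapq.heapreplace(best, (-error, index, candidate))
    ordered = sorted((-neg_error, index, candidate)
                     for neg_error, index, candidate in best)
    return [(candidate, error) for error, index, candidate in ordered], evaluations


# Toy candidates: name -> (cheap lower bound, expensive error).
toy = {
    "a": (0.05, 0.10),
    "b": (0.10, 0.30),
    "c": (0.12, 0.15),
    "d": (0.20, 0.25),
    "e": (0.40, 0.45),
    "f": (0.50, 0.50),
}

selected, evaluations = preselect(
    list(toy),
    cheap_lower_bound=lambda name: toy[name][0],
    expensive_error=lambda name: toy[name][1],
    k=2,
)
print(selected, evaluations)
```

In this toy run only three of the six candidates need the expensive evaluation, and the result is identical to what an exhaustive evaluation would return, because a candidate whose lower bound already exceeds the threshold cannot improve on the current top k.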

Conclusion: fastBMC enables the rapid construction of robust and interpretable ensemble models from BMCs, facilitating the discovery of gene pairs that are predictive of relevant phenotypes, and of the interactions of these genes in that context.

Availability: We provide the first open-source implementation for learning BMCs, including a Python implementation of fastBMC, together with Python code to reproduce the fastBMC results on real and simulated data in this paper, at https://github.com/oceanefrqt/fastBMC.

Keywords: Algorithms; Bivariate functions; Classification; Interpretability; Monotonic functions; Systems biology.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

**Fig. 1**
Visualization of a bivariate monotonic classifier. A bivariate monotonic classifier (BMC) is a function that is monotonic in both dimensions. The function f thus divides the plane spanned by the two features into two classes. This specific BMC, based on OX40 and CD40 ligand transcripts, is an example from [10].

**Fig. 2**
Operation of the Preselection Algorithm for BMCs. Each circle represents a BMC, with the blue-green gradient indicating […] and the yellow-red gradient representing […]. The interplay of colors reveals the computational dynamics within the algorithm. Specifically, the blue-green gradient indicates that all BMCs are initially ranked according to […]. Subsequently, as illustrated by the yellow-red gradient, the […] values are iteratively computed to define and update the threshold.

**Fig. 3**
Simulated monotonic pairs under different noise levels, with 50 and 160 data points, respectively. The columns correspond to varying noise levels (0.05, 0.1, 0.2, and 0.5)

**Fig. 4**
Running times of naïveBMC and fastBMC on simulated datasets with varying numbers of data points

**Fig. 5**
Performance metrics (accuracy, F1 score, and AUC) of several algorithms on simulated datasets containing 200 features, evaluated across varying numbers of data points and levels of noise. Each subplot represents a different number of data points, showing how the performance metrics change with increasing noise levels

**Fig. 6**
Running times in hours of naïveBMC and fastBMC on subdatasets of varying size, across three biomedical datasets

**Fig. 7**
AUC performance on three biomedical datasets, of ensemble BMCs constructed using fastBMC or naïveBMC, and for varying numbers of features

**Fig. 8**
Ensemble BMC obtained on the glioblastoma dataset using fastBMC. The dots correspond to the (original) training data and the background colors correspond to the model. Red is associated with short-term survival and blue with long-term survival

**Fig. 9**
The SDC4/NDUFA4L2 BMC reveals the association of simultaneous low expression of both genes in glioblastomas with long survival

**Fig. 10**
Application of the BMC to the independent TCGA glioblastoma cohort ([…]-transformed data shown). (a) The BMC from the Reifenberger et al. [19] cohort, applied to the TCGA cohort. (b) Kaplan–Meier survival curves, stratified by BMC-predicted survival groups. The standard log-rank test indicates significantly longer survival in the predicted long-term survival group

**Fig. 11**
The ensembleBMC for the METABRIC data, composed of eight BMCs. Red dots and colored background correspond to short relapse-free status (RFS) while blue ones correspond to long RFS

**Fig. 12**
Comparison of AUC-ROC curves for different classification models applied to the METABRIC dataset on extreme RFS samples. The models include fastBMC, […], […], Random Forest, Decision Trees, and Logistic Regression

**Fig. 13**
Kaplan–Meier curves comparing recurrence-free survival (RFS) for METABRIC intermediate RFS samples, separated based on their predicted RFS using the ensembleBMC trained on the extreme RFS samples. The curves distinguish between samples predicted to have short RFS and those predicted to have long RFS. The log-rank test yielded a p-value of 0.00026, indicating a statistically significant difference between the two groups

**Fig. 14**
The reduction in the number of […] evaluations achieved by the Preselection Algorithm (Algorithm 1) on the dengue dataset (section “Description of the datasets”). Each point represents a set of potentially many BMCs, with the […] and […] corresponding to the coordinates. The graphs along both axes represent the densities (probability density functions; PDFs) of the BMCs. The PDF at the bottom ranges over the whole dataset, the PDF at the left only over the BMCs that are evaluated (approximately 5% of the overall data). The vertical dashed line represents the cutoff point at which no further evaluations were necessary. The horizontal dashed line shows the cutoff point beyond which no additional BMCs were selected among those whose […] was computed. Note that the update of the threshold t is not shown

References

1. Thind AS, Monga I, Thakur PK, Kumari P, Dindhoria K, Krzak M, et al. Demystifying emerging bulk RNA-Seq applications: the application and utility of bioinformatic methodology. Brief Bioinform. 2021;22(6):bbab259. doi:10.1093/bib/bbab259.
2. Zararsız G, Goksuluk D, Korkmaz S, Eldem V, Zararsiz GE, Duru IP, et al. A comprehensive simulation study on classification of RNA-Seq data. PLoS ONE. 2017;12(8):e0182507. doi:10.1371/journal.pone.0182507.
3. Rudin C. Stop explaining black box machine learning models for high-stakes decisions and use interpretable models instead. Nat Mach Intell. 2019;1:206–15. doi:10.1038/s42256-019-0048-x.
4. Miller GA. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychol Rev. 1956;63(2):81–97. doi:10.1037/h0043158.
5. Cowan N. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behav Brain Sci. 2001;24(1):87–114. doi:10.1017/s0140525x01003922.
