2025 Sep 2;26(1):228.
doi: 10.1186/s12859-025-06253-7.

Towards the genome-scale discovery of bivariate monotonic classifiers

Océane Fourquet et al. BMC Bioinformatics.

Abstract

Background: Bivariate monotonic classifiers (BMCs) are based on pairs of input features. Like many other models used for machine learning, they can capture nonlinear patterns in high-dimensional data. At the same time, they are simple and easy to interpret. Until now, the use of BMCs on a genome scale was hampered by the high computational complexity of the search for pairs of features with a high leave-one-out performance estimate.
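To make the notion of a BMC concrete, here is a minimal sketch of one possible representation, assuming a classifier that is non-decreasing in both features (the other monotonicity directions are analogous). The names `predict`, `is_monotone`, and `frontier` are hypothetical and are not taken from the fastBMC code base. The sketch only represents and checks a monotonic decision rule; it does not fit one to data.

```python
from itertools import product
from typing import List, Tuple

Point = Tuple[float, float]


def predict(frontier: List[Point], x: float, y: float) -> int:
    """Return 1 if (x, y) is >= some frontier point in both coordinates, else 0.

    A rule of this form is non-decreasing in x and in y by construction:
    increasing either coordinate can only turn a 0 into a 1.
    """
    return int(any(x >= fx and y >= fy for fx, fy in frontier))


def is_monotone(frontier: List[Point], grid: List[Point]) -> bool:
    """Check on a finite grid that p <= q (componentwise) implies f(p) <= f(q)."""
    return all(
        predict(frontier, *p) <= predict(frontier, *q)
        for p, q in product(grid, repeat=2)
        if p[0] <= q[0] and p[1] <= q[1]
    )


# Hypothetical staircase boundary over two feature values (e.g. two transcripts).
frontier = [(1.0, 3.0), (2.0, 1.5)]
grid = [(x * 0.5, y * 0.5) for x in range(8) for y in range(8)]

print(predict(frontier, 2.5, 2.0))  # above the staircase
print(predict(frontier, 0.5, 5.0))  # left of the staircase
print(is_monotone(frontier, grid))
```

The staircase shape of the decision boundary is what keeps such a model easy to read off a two-dimensional scatter plot.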

Results: We introduce the fastBMC algorithm, which drastically speeds up the identification of BMCs. The algorithm relies on a mathematical bound on the BMC performance estimate, which reduces the number of estimates that must be computed while maintaining optimality. We show empirically that, compared to the traditional approach, fastBMC speeds up the computation by a factor of at least 15, even for a small number of features. For two of the three smaller biomedical datasets that we consider here, the resulting ability to consider much larger sets of features translates into significantly improved classification performance. As an example of the high degree of interpretability of BMCs, we discuss a straightforward interpretation of a BMC glioblastoma survival predictor, an immediate novel biomedical hypothesis, options for biomedical validation, and treatment implications. In addition, we study the performance of fastBMC on a larger and well-known breast cancer dataset, validating the benefits of BMCs for biomarker identification and biomedical hypothesis generation.
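The general idea of bound-based preselection can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a cheap quantity that never exceeds the expensive leave-one-out error of a feature pair, and the names `preselect`, `cheap_lower_bound`, and `exact_error`, as well as the toy numbers, are hypothetical. Candidates are ranked by the cheap bound, the expensive estimate is computed in that order, and the search stops once the bound of the next candidate exceeds the worst error among the current top k, so the result matches an exhaustive search.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Tuple


def preselect(
    candidates: Iterable[Hashable],
    cheap_lower_bound: Callable[[Hashable], float],
    exact_error: Callable[[Hashable], float],
    k: int,
) -> Tuple[List[Tuple[float, Hashable]], int]:
    """Return the k candidates with the smallest exact error and the
    number of exact evaluations that were actually performed.

    Assumes cheap_lower_bound(c) <= exact_error(c) for every candidate c.
    """
    # Rank all candidates by the cheap bound (index breaks ties).
    ranked = sorted(
        (cheap_lower_bound(c), i, c) for i, c in enumerate(candidates)
    )
    best: list = []  # max-heap on error, stored as (-error, index, candidate)
    n_exact = 0
    for bound, i, cand in ranked:
        # Threshold = worst error among the current top k.
        if len(best) == k and bound > -best[0][0]:
            break  # no remaining candidate can enter the top k
        err = exact_error(cand)
        n_exact += 1
        if len(best) < k:
            heapq.heappush(best, (-err, i, cand))
        elif err < -best[0][0]:
            heapq.heapreplace(best, (-err, i, cand))  # threshold update
    selected = [(-neg, cand) for neg, _, cand in sorted(best, reverse=True)]
    return selected, n_exact


# Toy data: hypothetical gene pairs with made-up errors and bounds.
exact = {("g1", "g2"): 0.10, ("g1", "g3"): 0.30, ("g2", "g3"): 0.25,
         ("g1", "g4"): 0.05, ("g3", "g4"): 0.40}
bound = {("g1", "g2"): 0.08, ("g1", "g3"): 0.28, ("g2", "g3"): 0.20,
         ("g1", "g4"): 0.05, ("g3", "g4"): 0.35}

selected, n_exact = preselect(exact, bound.get, exact.get, k=2)
print(selected, n_exact)
```

In this toy run only two of the five exact evaluations are needed, because the third-ranked bound (0.20) already exceeds the worst error in the current top two (0.10).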

Conclusion: fastBMC enables the rapid construction of robust and interpretable ensemble models using BMC, facilitating the discovery of gene pairs predictive of relevant phenotypes and their interaction in that context.

Availability: We provide the first open-source implementation for learning BMCs, in particular a Python implementation of fastBMC, together with Python code to reproduce the fastBMC results on real and simulated data in this paper, at https://github.com/oceanefrqt/fastBMC.

Keywords: Algorithms; Bivariate functions; Classification; Interpretability; Monotonic functions; Systems biology.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Visualization of a bivariate monotonic classifier. A bivariate monotonic classifier (BMC) is a function [formula] that is monotonic in both dimensions. The function f thus divides [formula] into two classes. This specific BMC, based on OX40 and CD40 ligand transcripts, is an example from [10]
Fig. 2
Operation of the Preselection Algorithm for BMCs: Each circle represents a BMC, with the blue-green gradient indicating [formula] and the yellow-red gradient representing [formula]. The interplay of colors reveals the computational dynamics within the algorithm. Specifically, the blue-green gradient indicates that all BMCs are initially ranked according to [formula]. Subsequently, as illustrated by the yellow-red gradient, the [formula] values are iteratively computed to define and update the threshold
Fig. 3
Simulated monotonic pairs under different noise levels, with 50 and 160 data points, respectively. The columns correspond to varying noise levels (0.05, 0.1, 0.2, and 0.5)
Fig. 4
Running times of naïveBMC and fastBMC for simulated datasets of varying number of data points
Fig. 5
Performance metrics (accuracy, F1 score, and AUC) of several algorithms on simulated datasets containing 200 features, evaluated across varying numbers of data points and levels of noise. Each subplot represents a different number of data points, showing how the performance metrics change with increasing noise levels
Fig. 6
Running times in hours of naïveBMC and fastBMC on subdatasets of varying size, across three biomedical datasets
Fig. 7
AUC performance on three biomedical datasets, of ensemble BMCs constructed using fastBMC or naïveBMC, and for varying numbers of features
Fig. 8
Ensemble BMC obtained on the glioblastoma dataset using fastBMC. The dots correspond to the (original) training data and the background colors correspond to the model. The red color is associated with short-term survival and the blue color with long-term survival
Fig. 9
The SDC4/NDUFA4L2 BMC reveals the association of simultaneous low expression of both genes in glioblastomas with long survival
Fig. 10
Application of the BMC to the independent TCGA glioblastoma cohort ([formula]-transformed data shown). a The BMC from the Reifenberger et al. [19] cohort, applied to the TCGA cohort. b Kaplan–Meier survival curves, stratified by BMC-predicted survival groups. The standard log-rank test indicates significantly longer survival in the predicted long-term survival group
Fig. 11
The ensembleBMC for the METABRIC data, composed of eight BMCs. Red dots and colored background correspond to short relapse-free status (RFS) while blue ones correspond to long RFS
Fig. 12
Comparison of AUC-ROC curves for different classification models applied to the METABRIC dataset on extreme RFS samples. The models include fastBMC, [formula], [formula], Random Forest, Decision Trees, and Logistic Regression
Fig. 13
Kaplan–Meier curves comparing recurrence-free survival (RFS) for METABRIC intermediate RFS samples, separated based on their predicted RFS using the ensembleBMC trained on the extreme RFS samples. The curves distinguish between samples predicted to have short RFS and those predicted to have long RFS. The log-rank test yielded a p-value of 0.00026, indicating a statistically significant difference between the two groups
Fig. 14
The reduction in the number of the [formula] evaluations using the Preselection Algorithm (Algorithm 1) on the dengue dataset (section “Description of the datasets”). Each point represents a set of potentially many BMCs with the [formula] and [formula] corresponding to the coordinates. The graphs along both axes represent the densities (probability density functions; PDFs) of the BMCs. The PDF at the bottom ranges over the whole dataset, the PDF at the left only over the BMCs that are evaluated (approximately 5% of the overall data). The vertical dashed line represents the cutoff point at which no further [formula] evaluations were necessary. The horizontal dashed line shows the cutoff point beyond which no additional BMCs were selected among those whose [formula] was computed. Note that the update of the [formula] threshold t is not shown

References

    1. Thind AS, Monga I, Thakur PK, Kumari P, Dindhoria K, Krzak M, et al. Demystifying emerging bulk RNA-Seq applications: the application and utility of bioinformatic methodology. Brief Bioinform. 2021;22(6):bbab259. 10.1093/bib/bbab259.
    2. Zararsız G, Goksuluk D, Korkmaz S, Eldem V, Zararsiz GE, Duru IP, et al. A comprehensive simulation study on classification of RNA-Seq data. PLoS ONE. 2017;12(8):e0182507. 10.1371/journal.pone.0182507.
    3. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell. 2019;1:206–15. 10.1038/s42256-019-0048-x.
    4. Miller GA. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychol Rev. 1956;63(2):81–97. 10.1037/h0043158.
    5. Cowan N. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behav Brain Sci. 2001;24(1):87–114. 10.1017/s0140525x01003922.
