Bioinformatics. 2022 Aug 2;38(15):3749-3758. doi: 10.1093/bioinformatics/btac405

Identifying interactions in omics data for clinical biomarker discovery using symbolic regression

Niels Johan Christensen et al.
Abstract

Motivation: The identification of predictive biomarker signatures from omics and multi-omics data for clinical applications is an active area of research. Recent developments in assay technologies and machine learning (ML) methods have led to significant improvements in predictive performance. However, most high-performing ML methods suffer from complex architectures and lack interpretability.

Results: We present the application of a novel symbolic-regression-based algorithm, the QLattice, to a selection of clinical omics datasets. This approach generates parsimonious, high-performing models that can both predict disease outcomes and reveal putative disease mechanisms, demonstrating the importance of selecting maximally relevant and minimally redundant features in omics-based machine-learning applications. The simplicity and high predictive power of these biomarker signatures make them attractive tools for high-stakes applications in areas such as primary care, clinical decision-making and patient stratification.

Availability and implementation: The QLattice is available as part of a Python package (feyn), which is hosted on the Python Package Index (https://pypi.org/project/feyn/) and can be installed via pip. The documentation provides guides, tutorials and the API reference (https://docs.abzu.ai/). All code and data used to generate the models and plots discussed in this work can be found at https://github.com/abzu-ai/QLattice-clinical-omics.
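Installation follows the standard pip workflow described in the availability note; as a minimal setup sketch (the import check is our addition, not from the paper):

```shell
# Install the feyn package, which ships the QLattice engine, from PyPI
pip install feyn

# Verify the package imports cleanly
python -c "import feyn"
```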

Supplementary information: Supplementary material is available at Bioinformatics online.


Figures

Fig. 1. Metrics of the best model (ranked by BIC) for predicting Alzheimer’s disease. The model is robust, as shown by the relatively small drop in performance from the training set (AUC 0.98) to the test set (AUC 0.92). Receiver operating characteristic (ROC) curves (top) and confusion matrices for the training set (bottom left) and test set (bottom right)
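The ROC AUC values quoted in the captions summarize ranking performance. As an illustrative sketch (not the authors' code), AUC can be computed directly from predicted scores as the probability that a randomly chosen positive example outranks a randomly chosen negative one:

```python
def roc_auc(labels, scores):
    """ROC AUC as the rank probability that a random positive scores
    higher than a random negative; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: perfectly separated scores give AUC 1.0
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # -> 1.0
```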
Fig. 2. Model signal path for AD. A prominent signal contribution from MAPT was found in all 10 models (green). The signal is expressed in terms of mutual information and displayed above the nodes in the model (see Cover and Thomas, 2006). (A color version of this figure appears in the online version of this article.)
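The mutual information displayed above the model nodes measures statistical dependence between a feature and the outcome. A stdlib-only sketch of the underlying estimate from discretized observations (not the authors' implementation):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired discrete observations
    using the plug-in (empirical frequency) estimator."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Perfectly dependent binary variables carry 1 bit of mutual information
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # -> 1.0
```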
Fig. 3. Partial dependence plot for the AD model: marginal effect of MAPT on AD risk
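A partial dependence curve like the one in Fig. 3 averages the model's prediction over the dataset while holding one feature at each grid value. A generic sketch, with the model, feature names and values entirely hypothetical:

```python
def partial_dependence(model, rows, feature, grid):
    """For each grid value, fix `feature` to that value in every row and
    average the model's predictions (the feature's marginal effect)."""
    curve = []
    for value in grid:
        preds = [model({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

# Toy model: risk rises linearly with the (hypothetical) MAPT level
model = lambda row: 0.1 * row["MAPT"] + 0.01 * row["age"]
rows = [{"MAPT": 2.0, "age": 60}, {"MAPT": 5.0, "age": 80}]
print(partial_dependence(model, rows, "MAPT", [0.0, 10.0]))
```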
Fig. 4. ROC AUC scores (top) for the selected three-feature model for insulin response. Confusion matrices (bottom left: train; bottom right: test)
Fig. 5. Decision boundaries of the selected model. We keep the feature C2CD2L fixed at the values corresponding to the 0.25, 0.50 and 0.75 quantiles
Fig. 6. Distributions of the two classes for the variables PDK4 (top), PHF23 (bottom left) and the linear combination found in the second model of Table 2 (bottom right)
Fig. 7. A representative model for predicting Hepatocellular Carcinoma. A prominent signal contribution from chr17_59473060_59483266 is found in all 10 models. The signal is expressed in terms of mutual information and displayed above the nodes in the model (Cover and Thomas, 2006)
Fig. 8. Metrics of the best model (ranked by BIC) for predicting Hepatocellular Carcinoma. The model is robust, as shown by identical performance on the training set (AUC 1.0) and the test set (AUC 1.0). ROC curves (top) and confusion matrices for training set (bottom left) and test set (bottom right)
Fig. 9. Left: HCC pairplot for features in the selected model. Right: 2D response of the model predictions, with training data overlaid. The decision boundary separates the two outcome areas
Fig. 10. Metrics of the best model of the first fold (ranked by BIC) for predicting Breast Cancer outcomes. The model is not overfitted, as shown by the identical performance on the training set (AUC 0.66) and the test set (AUC 0.66). ROC curves (top) and confusion matrices for training set (bottom left) and test set (bottom right)
Fig. 11. Pairwise Pearson correlation (absolute value) heatmap of the gene expression features in the models shown in Equation (1)
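The heatmap in Fig. 11 is built from pairwise absolute Pearson correlations between features. A self-contained sketch of the underlying computation (illustrative, not the authors' code):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly anti-correlated toy vectors: |r| = 1, the strongest
# possible cell in an absolute-value correlation heatmap
print(abs(pearson([1, 2, 3], [6, 4, 2])))  # -> 1.0
```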
Fig. 12. Decision boundary for three of the models at the head of each k-fold. The top-right area (green) indicates a higher probability of death compared to the remaining area (purple). (A color version of this figure appears in the online version of this article.)
Fig. 13. Metrics of the best model of the first fold (ranked by BIC) for predicting Breast Cancer outcomes. The model shows some degree of overfitting, as indicated by the drop in performance from the training set (AUC 0.75) to the test set (AUC 0.67). ROC curves (top) and confusion matrices for training set (bottom left) and test set (bottom right)
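The BIC ranking used throughout the captions trades model fit against complexity. The standard formula penalizes each parameter by ln(n); a sketch of that score (the textbook definition, not necessarily the authors' exact scoring):

```python
from math import log

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k*ln(n) - 2*ln(L). Lower is better."""
    return n_params * log(n_samples) - 2.0 * log_likelihood

# With n = 100 samples, an extra parameter must improve log-likelihood
# by about ln(100)/2 ≈ 2.3 before the larger model wins on BIC
print(bic(-50.0, 2, 100) < bic(-49.0, 3, 100))  # -> True
```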
