Sci Rep. 2016 Nov 23;6:36812.
doi: 10.1038/srep36812.

Logic models to predict continuous outputs based on binary inputs with an application to personalized cancer therapy

Theo A Knijnenburg et al. Sci Rep. 2016.

Abstract

Mining large datasets using machine learning approaches often leads to models that are hard to interpret and not amenable to the generation of hypotheses that can be experimentally tested. We present 'Logic Optimization for Binary Input to Continuous Output' (LOBICO), a computational approach that infers small and easily interpretable logic models of binary input features that explain a continuous output variable. Applying LOBICO to a large cancer cell line panel, we find that logic combinations of multiple mutations are more predictive of drug response than single gene predictors. Importantly, we show that the use of the continuous information leads to robust and more accurate logic models. LOBICO implements the ability to uncover logic models around predefined operating points in terms of sensitivity and specificity. As such, it represents an important step towards practical application of interpretable logic models.

Figures

Figure 1. Workflow of LOBICO.
LOBICO has two main inputs: (1) a binary matrix of samples by features (depicted in the blue box) and (2) a continuous vector with a value for each of the samples (depicted in the orange boxes). Here, the binary matrix contains the mutation status of 60 cancer genes measured across 642 cancer cell lines, and the continuous vector contains the IC50 of each cell line in response to Afatinib, an EGFR/ERBB2 inhibitor. The continuous vector is transformed into a binary vector and a sample-specific weight vector using a binarization scheme. Specifically, the IC50s are binarized using a threshold, leading to a set of sensitive and a set of resistant cell lines. The distances of the original IC50s to the binarization threshold are represented in the weight vector, which is normalized per class. Then, LOBICO finds the optimal logic model of features (gene mutations) that minimizes the total weight of misclassified samples (cell lines). In this case, the optimal 2-input OR logic formula is ‘EGFR OR ERBB2’ (depicted in the white box).
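The workflow in this caption can be illustrated with a minimal Python sketch. It is not the LOBICO implementation: the optimal model is found here by exhaustive search over 2-input OR formulas rather than by the integer linear program the paper uses, the convention that a low IC50 means "sensitive" is an assumption, and the function names, toy IC50 values, and toy mutation vectors are all made up for illustration.

```python
from itertools import combinations


def binarize(values, threshold):
    """Return class labels and per-class normalized weights.

    Assumption: a value below the threshold counts as sensitive (True).
    Each weight is the distance to the threshold divided by the total
    distance within the sample's class, so each class sums to 1.
    """
    labels = [v < threshold for v in values]
    dist = [abs(v - threshold) for v in values]
    class_total = {
        c: sum(d for d, l in zip(dist, labels) if l == c) for c in (True, False)
    }
    weights = [
        d / class_total[l] if class_total[l] > 0 else 0.0
        for d, l in zip(dist, labels)
    ]
    return labels, weights


def weighted_error(pred, labels, weights):
    """Total weight of the misclassified samples."""
    return sum(w for p, l, w in zip(pred, labels, weights) if p != l)


def best_or_pair(features, labels, weights):
    """Exhaustively search all 2-input OR formulas (stand-in for the ILP)."""
    best = None
    for a, b in combinations(sorted(features), 2):
        pred = [x or y for x, y in zip(features[a], features[b])]
        err = weighted_error(pred, labels, weights)
        if best is None or err < best[0]:
            best = (err, (a, b))
    return best


# Hypothetical toy data: six cell lines, three genes.
ic50 = [0.1, 0.2, 0.3, 2.0, 3.0, 4.0]
mutations = {
    "EGFR": [True, True, False, False, False, False],
    "ERBB2": [False, False, True, False, False, False],
    "TP53": [True, False, True, True, False, True],
}

labels, weights = binarize(ic50, threshold=1.0)
best = best_or_pair(mutations, labels, weights)
print(best)
```

On this toy data the search selects the pair EGFR and ERBB2 with zero weighted error, because their OR matches the sensitive class exactly. Samples close to the threshold carry little weight, so misclassifying them costs less than misclassifying clearly sensitive or clearly resistant samples.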
Figure 2. Multi-predictor models outperform single predictor models.
Scatter plot with the 10-fold cross-validation (CV) error for single predictor models (x-axis) and the best (lowest CV error) multi-predictor model (y-axis). Each point represents one of the 142 drugs. Statistically significant models are highlighted in blue. Multi-predictor models that have a CV error lower than 0.35 and at least a 25% improvement upon the single predictor model are highlighted in magenta. The two examples discussed in the text are highlighted in bold typeface.
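The magenta highlighting rule in this caption can be written as a small predicate. This is a sketch under one stated assumption: "25% improvement" is read as a relative reduction in CV error compared with the single predictor model, which the caption does not spell out. The function name and default parameter names are hypothetical.

```python
def is_highlighted(single_cv_error, multi_cv_error,
                   max_error=0.35, min_improvement=0.25):
    """Return True if a multi-predictor model meets both caption criteria.

    Assumption: improvement = (single - multi) / single.
    """
    if single_cv_error <= 0:
        return False
    improvement = (single_cv_error - multi_cv_error) / single_cv_error
    return multi_cv_error < max_error and improvement >= min_improvement


print(is_highlighted(0.40, 0.28))  # low error and a 30% relative reduction
print(is_highlighted(0.40, 0.34))  # low error but only a 15% reduction
print(is_highlighted(0.60, 0.40))  # large reduction but error not below 0.35
```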
Figure 3. LOBICO’s use of continuous output leads to robust and accurate models.
(a) Heatmaps depicting the feature importance (FI) scores across the 60 gene mutations for the logic models inferred to explain the drug response to the PI3K/mTOR inhibitor BEZ235. The upper heatmap represents FI scores for the 2-input OR model (K = 2, M = 1) using three different binarization thresholds for logic models with binarized output, i.e. not using the sample-specific weights. The middle of the three heatmaps represents the same FI scores, but for logic models with continuous output, i.e. using the sample-specific weights. The bottom two heatmaps depict FI scores aggregated across all model complexities, using the standard binarization threshold (t = 0.05), for both the logic models with and without the sample-specific weights. The labels of the gene mutations with a large FI in any of these heatmaps are printed below. The ‘ground truth’ features, i.e. the expected or annotated targets of this drug, PTEN and PIK3CA, are printed in bold. (b) Scatter plot with the average Pearson correlation coefficients of the similarity of FI scores across the binarization thresholds for inferred logic models without (x-axis) and with (y-axis) the sample-specific weights. Each point represents one of the 142 drugs. The correlation scores are computed using the model-complexity-specific FI scores. The grey bars on top and to the right of the scatter plot represent histograms of these correlation scores for models without and with the sample-specific weights, respectively. (c) Scatter plot with the importance of the ground truth features for inferred logic models without (x-axis) and with (y-axis) the sample-specific weights. Each point represents one of the 49 drugs, for which ground truth features were available. The importance scores of the ground truth features were derived from aggregated FI scores.
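The robustness measure in panel (b) can be sketched as follows, assuming that the "average Pearson correlation" is the mean over all pairs of binarization thresholds of the correlation between their FI vectors. That pairing scheme is an assumption, and the FI values below are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mean_pairwise_correlation(fi_by_threshold):
    """Average the correlation over every pair of FI vectors (assumed scheme)."""
    pairs = list(combinations(fi_by_threshold, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical FI vectors over four genes at three thresholds.
stable_fi = [[0.5, 0.4, 0.1, 0.0], [0.6, 0.3, 0.1, 0.0], [0.5, 0.5, 0.0, 0.0]]
unstable_fi = [[0.9, 0.0, 0.1, 0.0], [0.0, 0.8, 0.0, 0.2], [0.1, 0.0, 0.0, 0.9]]

print(mean_pairwise_correlation(stable_fi))
print(mean_pairwise_correlation(unstable_fi))
```

A model family whose important features stay the same as the threshold moves scores close to 1, while one whose selected features change with the threshold scores much lower.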
Figure 4. LOBICO finds solutions at different operating points.
(a) ROC space with LOBICO solutions to explain drug sensitivity to the MEK1/2 inhibitor AZD6244. Blue crosses indicate the TPR and FPR at which the solution was found. The logic formula of the solutions is printed next to the blue crosses. The color of the genes in a formula indicates their FI. Colors range from black (moderately important) to bright red (highly important). For comparison, the best single predictor solutions are visualized in green. Pink arrows point to solutions discussed in the text. The inlay depicts the histogram of IC50s for AZD6244 together with the binarization threshold, which divides the cell lines into 91 cell lines that are sensitive to AZD6244 and 515 that are resistant. (b) Average FI scores for a group of 6 MEK/RAF inhibitors (including AZD6244), for high specificity solutions (orange) and high sensitivity solutions (magenta). High specificity solutions were defined as solutions with FPR < 10%. Conversely, high sensitivity solutions were defined as solutions with TPR > 90%. The FI scores of all solutions on the Pareto front (ROC curve) that met these respective criteria across the six drugs were averaged. We distinguished between positive terms, indicating mutations (Mut.) and negated terms, indicating wild-type (WT). The two genes with the highest average FI score as mutants were printed at the top of their FI bar. The two genes with the highest average FI score as wild-types were printed at the bottom of their FI bar. (c,d) Similar to (b), but for a group of two PI3K inhibitors and a group of two AURKA/B inhibitors, respectively.
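The operating-point criteria in this caption can be sketched in a few lines of Python. The sketch computes TPR and FPR for a binary prediction, keeps the non-dominated points in ROC space as a Pareto front, and applies the FPR < 10% and TPR > 90% filters. The function names and the example (FPR, TPR) points are hypothetical and do not come from the paper.

```python
def tpr_fpr(pred, labels):
    """True positive rate and false positive rate of a binary prediction."""
    tp = sum(1 for p, l in zip(pred, labels) if p and l)
    fp = sum(1 for p, l in zip(pred, labels) if p and not l)
    pos = sum(1 for l in labels if l)
    neg = len(labels) - pos
    return tp / pos, fp / neg


def pareto_front(points):
    """Keep (fpr, tpr) points that no other point dominates.

    A point is dominated if another point has FPR no higher and TPR no
    lower, and differs from it in at least one coordinate.
    """
    front = []
    for fpr, tpr in points:
        dominated = any(
            f <= fpr and t >= tpr and (f, t) != (fpr, tpr) for f, t in points
        )
        if not dominated:
            front.append((fpr, tpr))
    return sorted(front)


# Hypothetical solutions as (FPR, TPR) pairs.
solutions = [(0.05, 0.4), (0.10, 0.4), (0.2, 0.7), (0.3, 0.6), (0.5, 0.95)]
front = pareto_front(solutions)

high_specificity = [(f, t) for f, t in front if f < 0.10]
high_sensitivity = [(f, t) for f, t in front if t > 0.90]
print(front, high_specificity, high_sensitivity)
```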
Figure 5. 3-layer Boolean circuit representing the structure of the LOBICO ILP formulation.
In Layer 1 variables s11, …, sPK are used to select the inputs (x1, x2, …, xP) that are combined using a conjunction (AND gate) to create the K disjunctive terms in Layer 2. These disjunctive terms (the outputs of the AND gates) are represented by variables t1, …, tK. In Layer 3 the disjunctive terms are combined using a disjunction (OR gate) resulting in the inferred binary output variable y′. This figure is adapted from Figure 2.1 in Kamath et al.
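The three layers described in this caption can be evaluated directly for a fixed selection matrix. The sketch below only evaluates such a circuit; it does not solve the ILP that chooses the selection variables. It omits negated inputs, and treating a term with no selected inputs as inactive is an assumption made here, not something the caption states.

```python
def evaluate_dnf(x, s):
    """Evaluate the 3-layer circuit for one sample.

    x: list of P booleans (the inputs x1..xP).
    s: P-by-K matrix of 0/1 values; s[p][k] = 1 selects input p for term k.
    Returns (y_prime, terms), where terms holds t1..tK.
    """
    num_inputs = len(x)
    num_terms = len(s[0])
    terms = []
    for k in range(num_terms):
        # Layer 1: pick the inputs selected for term k.
        selected = [x[p] for p in range(num_inputs) if s[p][k]]
        # Layer 2: AND gate; an empty term is treated as inactive (assumption).
        terms.append(bool(selected) and all(selected))
    # Layer 3: OR gate over the K terms.
    return any(terms), terms


# Hypothetical example with P = 3 inputs and K = 2 terms:
# y' = (x1 AND x2) OR x3
selection = [[1, 0], [1, 0], [0, 1]]

print(evaluate_dnf([True, False, False], selection))
print(evaluate_dnf([True, True, False], selection))
print(evaluate_dnf([False, False, True], selection))
```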

