IEEE/ACM Trans Comput Biol Bioinform. 2021 May-Jun;18(3):811-822.
doi: 10.1109/TCBB.2020.3019237. Epub 2021 Jun 3.

A Pipeline for Integrated Theory and Data-Driven Modeling of Biomedical Data

Vineet K Raghu et al. IEEE/ACM Trans Comput Biol Bioinform. 2021 May-Jun.

Abstract

Genome sequencing technologies have the potential to transform clinical decision making and biomedical research by enabling high-throughput measurements of the genome at a granular level. However, to truly understand mechanisms of disease and predict the effects of medical interventions, high-throughput data must be integrated with demographic, phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods must infer relationships between these data types. We recently proposed a pipeline (CausalMGM) to achieve this. CausalMGM uses probabilistic graphical models to infer the relationships between variables in the data; however, CausalMGM's graphical structure learning algorithm can only handle small datasets efficiently. We propose a new methodology (piPref-Div) that selects the most informative variables for CausalMGM, enabling it to scale. We validate the efficacy of piPref-Div against other feature selection methods and demonstrate how the use of the full pipeline improves breast cancer outcome prediction and provides biologically interpretable views of gene expression data.

Figures

Fig. 1: Pipeline proposed in this work to learn graphical model structure from mixed clinical and omics datasets.
Fig. 2: Illustration of procedure to limit tested parameter range. Figure originally appeared in [7].
Fig. 3: Subsampling procedure to determine empirical probabilities for every edge in the correlation graph. B(λ, S) returns a correlation graph computed upon dataset S with threshold λ. Figure originally appeared in [7].
Fig. 4: Cluster Simulation to generate simulated datasets. Purple nodes are master regulators of a cluster, blue nodes are causal parents of the target variable, and the beige node is the target variable.
Fig. 5: Heatmap of correlation of prior knowledge between sources. Each cell is the percentage of gene-gene pairs in the prior source of the row that are also in the prior source in the column.
Fig. 6: Heatmap of overlapping prior knowledge between sources. Each cell is the correlation between the probabilities given by each source for all gene-gene pairs in the prior source of the row that are also in the prior source in the column.
Fig. 7: Predicted Weight vs. Net Reliability for each prior knowledge source in simulated experiments for piPref-Div for (left) 50 samples and (right) 200 samples.
Fig. 8: Accuracy of predicted clusters for varying amount and reliability of prior knowledge. Sample size was set to 50 (left) and 200 (right). Star represents the best possible performance if optimal correlation thresholds were selected.
Fig. 9: Accuracy of predicted clusters for varying amount and reliability of prior knowledge on large datasets. Sample size was set to 50 (left) and 200 (right).
Fig. 10: AUC of predicting five-year metastasis-free and relapse-free survival using several feature selection methods on six independent breast cancer microarray datasets.
Fig. 11: Stability of learned models for predicting five-year metastasis-free and relapse-free survival using several feature selection methods on six independent breast cancer microarray datasets.
Fig. 12: Graphical model of breast cancer subtype. Size of each edge represents the number of times a similar cluster was selected to be related to Subtype in each of the cross-validation folds.
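
The subsampling idea described in the Fig. 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the use of Pearson correlation, the default subsample size (half the rows), the number of subsamples, and the function names `pearson`, `correlation_graph`, and `edge_probabilities` are all hypothetical choices made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_graph(lam, S):
    """Stand-in for B(lam, S): edges (i, j), i < j, whose |correlation| >= lam.

    S is a list of rows (samples); columns are variables.
    Assumption: the caption does not specify the correlation measure.
    """
    cols = list(zip(*S))
    p = len(cols)
    return {
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if abs(pearson(cols[i], cols[j])) >= lam
    }


def edge_probabilities(data, lam, n_subsamples=20, subsample_size=None, seed=0):
    """Empirical probability of each edge across random subsamples of the rows.

    Assumption: subsamples are drawn without replacement and default to
    half of the rows; both choices are illustrative only.
    """
    rng = random.Random(seed)
    b = subsample_size or max(2, len(data) // 2)
    counts = {}
    for _ in range(n_subsamples):
        S = rng.sample(data, b)
        for edge in correlation_graph(lam, S):
            counts[edge] = counts.get(edge, 0) + 1
    # Edges never observed are omitted; their empirical probability is 0.
    return {edge: c / n_subsamples for edge, c in counts.items()}
```

An edge that appears in nearly every subsample gets a probability near 1, while an edge driven by a few samples appears only occasionally and gets a low probability.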

References

    1. Hira ZM and Gillies DF, “A review of feature selection and feature extraction methods applied on microarray data,” Advances in Bioinformatics, vol. 2015, 2015.
    2. Cun Y and Fröhlich H, “Prognostic gene signatures for patient stratification in breast cancer-accuracy, stability and interpretability of gene selection approaches using prior knowledge on protein-protein interactions,” BMC Bioinformatics, vol. 13, no. 1, p. 69, 2012.
    3. Haury A-C, Gestraud P, and Vert J-P, “The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures,” PLoS ONE, vol. 6, no. 12, p. e28210, 2011.
    4. Koller D and Friedman N, Probabilistic graphical models: principles and techniques. MIT Press, 2009.
    5. Sedgewick AJ, Shi I, Donovan RM, and Benos PV, “Learning mixed graphical models with separate sparsity parameters and stability-based model selection,” BMC Bioinformatics, vol. 17, no. S5, p. 175, 2016.
