J Comput Biol. 2017 Apr;24(4):340-356.
doi: 10.1089/cmb.2016.0100. Epub 2016 Sep 28.

New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data

Grigoriy Gogoshin et al. J Comput Biol. 2017 Apr.

Abstract

Bayesian network (BN) reconstruction is a prototypical systems biology data analysis approach that has been successfully used to reverse engineer and model networks reflecting different layers of biological organization (ranging from genetic to epigenetic to cellular pathway to metabolomic). It is especially relevant in the context of modern (ongoing and prospective) studies that generate heterogeneous high-throughput omics datasets. However, there are both theoretical and practical obstacles to the seamless application of BN modeling to such big data, including computational inefficiency of optimal BN structure search algorithms, ambiguity in data discretization, mixing data types, imputation and validation, and, in general, limited scalability in both reconstruction and visualization of BNs. To overcome these and other obstacles, we present BNOmics, an improved algorithm and software toolkit for inferring and analyzing BNs from omics datasets. BNOmics aims at comprehensive systems biology-type data exploration, including both generating new biological hypotheses and testing and validating existing ones. Novel aspects of the algorithm center around increasing scalability and applicability to varying data types (with different explicit and implicit distributional assumptions) within the same analysis framework. An output and visualization interface to widely available graph-rendering software is also included. Three diverse applications are detailed. BNOmics was originally developed in the context of genetic epidemiology data and is being continuously optimized to keep pace with the ever-increasing inflow of available large-scale omics datasets. As such, software scalability and usability on less-than-exotic computer hardware are a priority, as is the applicability of the algorithm and software to heterogeneous datasets containing many data types: single-nucleotide polymorphisms and other genetic/epigenetic/transcriptome variables, metabolite levels, epidemiological variables, endpoints, phenotypes, etc.

Keywords: Bayesian network(s); big data; omic data; systems biology.
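
The abstract mentions an output interface to widely available graph-rendering software but does not name a format. The following is a minimal sketch of what such an export could look like, assuming Graphviz DOT as the target format. The function name `to_dot`, the node names, and the edge strengths are all hypothetical and are not taken from BNOmics or from the paper's results.

```python
def to_dot(edges, name="bn"):
    """Render directed, weighted edges as Graphviz DOT text.

    edges: iterable of (parent, child, strength) tuples.
    The format choice (DOT) is an assumption made for illustration.
    """
    lines = [f"digraph {name} {{"]
    for parent, child, strength in edges:
        # Each edge carries its strength as a label drawn next to the arrow.
        lines.append(f'  "{parent}" -> "{child}" [label="{strength:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical toy network; the values are invented, not results from the paper.
example_edges = [
    ("snp_1", "metabolite_a", 0.8),
    ("metabolite_a", "phenotype", 2.5),
]
dot_text = to_dot(example_edges)
print(dot_text)
```

Text in this form can be passed to any DOT-aware renderer. Emitting plain text rather than calling a rendering library keeps the sketch free of external dependencies.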

Conflict of interest statement

No competing financial interests exist.

Figures

FIG. 1.
BN reconstruction algorithm kernel pseudocode (Procedures 1 and 2). BN, Bayesian network.
FIG. 2.
BNs reconstructed from the APOE datasets. (a) African Americans from Jackson, Mississippi, (b) non-Hispanic whites from Rochester, Minnesota. Numbers next to BN edges indicate edge strengths. See text for interpretation of edge strength and disconnected nodes. APO_E, APO_A, APO_B, TRIG, CHOL, and HDL stand for levels of apolipoproteins E, AI, and B, triglycerides, cholesterol, and high-density lipoprotein cholesterol, respectively. Numbered nodes indicate corresponding APOE SNPs. APOE, apolipoprotein E; SNP, single-nucleotide polymorphism.
FIG. 3.
(a–c) Visualization of the subnetworks of a BN reconstructed from the ARIC GWAS dataset. (a) Third-order (radius) Markov neighborhoods of blood lipid and epidemiological variables (nodes 1–8). Other numbered nodes correspond to the working SNP designations. Such a fine scale does not permit sensible visualization and is shown for methodology illustration purposes only. (b) Second-order (radius) Markov neighborhoods of blood lipid and epidemiological variables. (c) First-order (radius) Markov neighborhoods of blood lipid and epidemiological variables. Numbers next to BN edges indicate edge strengths. Sex, v1age01, hdl01, totchol, ldl02, trigs, bmi01, and glucos01 stand for gender, age, high-density lipoprotein cholesterol, total cholesterol, low-density lipoprotein cholesterol, triglycerides, BMI, and plasma glucose, respectively. Numbered nodes indicate corresponding SNPs. (d) BN reconstructed from eight non-SNP variables only, for comparison purposes. GWAS, genome-wide association study.
FIG. 4.
(a) BN reconstructed from the ARIC metabolomic profile dataset. (b) Visualization of a first-order (radius) Markov neighborhood subnetwork of the hypertension phenotype node (HYPERT05). Numbers next to BN edges indicate edge strengths. Epidemiological and known metabolite node designations are largely self-explanatory (e.g., V1AGE01, glycerol). X-<…> nodes indicate unknown metabolites. See Zheng et al. for more detail.
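
The captions above refer to first-, second-, and third-order (radius) Markov neighborhoods of selected nodes. One plausible reading, sketched below, treats a radius-k neighborhood as every node within k edges of the center node when edge direction is ignored. That interpretation is an assumption; the captions do not define the term, and the function name `neighborhood` and the toy edges are hypothetical, not part of BNOmics.

```python
from collections import deque


def neighborhood(edges, center, radius):
    """Return the set of nodes within `radius` edges of `center`.

    edges: iterable of (parent, child) pairs from a directed graph.
    Direction is ignored here, which is an assumption about what
    "radius" means in the captions.
    """
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, set()).add(child)
        adjacency.setdefault(child, set()).add(parent)

    # Breadth-first search that stops expanding once the radius is reached.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# Hypothetical toy chain of nodes; not taken from the ARIC networks.
toy_edges = [("s1", "hdl"), ("hdl", "chol"), ("chol", "s2"), ("s2", "s3")]
first = neighborhood(toy_edges, "hdl", 1)
second = neighborhood(toy_edges, "hdl", 2)
third = neighborhood(toy_edges, "hdl", 3)
```

Under this reading, each increase in radius can only add nodes, which is consistent with the third-order panel being the densest of the three.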
