J Comput Biol. 2019 Oct;26(10):1113-1129.
doi: 10.1089/cmb.2019.0036. Epub 2019 Apr 22.

Integration of Multiple Data Sources for Gene Network Inference Using Genetic Perturbation Data

Xiao Liang et al. J Comput Biol. 2019 Oct.

Abstract

The inference of gene networks from large-scale human genomic data is challenging because it is difficult to identify the correct regulators for each gene in a high-dimensional search space. We present a Bayesian approach that integrates external data sources with knockdown data from human cell lines to infer gene regulatory networks. In particular, we assemble multiple data sources, including gene expression data, genome-wide binding data, gene ontology, and known pathways, and use a supervised learning framework to compute prior probabilities of regulatory relationships. We show that our integrated method improves the accuracy of inferred gene networks and extends some previous Bayesian frameworks in both theory and application. We apply our method to two human cell lines, the skin melanoma cell line A375 and the lung cancer cell line A549, to illustrate its capabilities. Our results show that the improvement in performance can vary from cell line to cell line, and that different external data sources may need to be chosen as prior knowledge to obtain better accuracy for different cell lines.
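To make the role of the prior concrete, the following minimal sketch shows how a prior probability for one regulator-target pair could combine with evidence from expression data through Bayes' rule in odds form. It is a generic illustration, not the authors' model: the function name, the gene labels, and the log Bayes factors are all hypothetical, and the abstract does not specify how the data likelihood is computed.

```python
import math


def posterior_edge_probability(prior, log_bayes_factor):
    """Update a prior edge probability with evidence, using Bayes' rule in odds form.

    posterior odds = prior odds * Bayes factor, computed on the log scale.
    `log_bayes_factor` stands in for whatever evidence the data provide
    (an assumption for this sketch; the source does not define it).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    log_prior_odds = math.log(prior) - math.log1p(-prior)
    log_posterior_odds = log_prior_odds + log_bayes_factor
    # Numerically stable logistic transform back to a probability.
    if log_posterior_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_posterior_odds))
    odds = math.exp(log_posterior_odds)
    return odds / (1.0 + odds)


# Hypothetical priors, e.g. as produced by a supervised classifier
# trained on external data sources.
priors = {("TF_A", "gene_1"): 0.30, ("TF_B", "gene_1"): 0.02}

# Hypothetical evidence: identical data support for both candidate edges.
log_bayes_factors = {("TF_A", "gene_1"): 1.2, ("TF_B", "gene_1"): 1.2}

posteriors = {
    pair: posterior_edge_probability(priors[pair], log_bayes_factors[pair])
    for pair in priors
}
```

With identical data support, the pair with the larger prior ends up with the larger posterior, which is the sense in which informative priors can help separate true regulators from the many candidates in a high-dimensional search space.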

Keywords: data integration; gene regulation; machine learning; systems biology.

Conflict of interest statement

The authors declare there are no competing financial interests.

Figures

FIG. 1.
An overview of the approach. We first build a supervised framework for a selected set of target–gene regulatory pairs using external knowledge derived from the literature and existing data sets. Then, we apply machine learning methods to predict the regulatory relationships across all target–gene regulatory pairs for the landmark genes in the LINCS L1000 project. The predicted regulatory relationships are used as the prior probabilities in our Bayesian approach to predict the posterior probabilities.
FIG. 2.
Histograms of the expected number of regulators per target gene predicted using knockdown data in cell line A375. (A) Without the sampling bias correction. (B) After applying the sampling bias correction to the prior.
FIG. 3.
Histograms of the expected number of regulators per target gene predicted using knockdown data in cell line A549. (A) Without the sampling bias correction. (B) After applying the sampling bias correction to the prior.
FIG. 4.
Precision–recall curves for cell line A549 using different data assessed with TRANSFAC and JASPAR. The results are improved by external data integration with or without MCDC. MCDC, model-based clustering with data correction.
FIG. 5.
Inferred directed edges at a posterior probability cutoff of 0.5 from the gene network generated by integrating external data with the knockdown data and MCDC-corrected untreated data. Each node represents a gene and each edge represents a regulatory interaction between the two genes. The width of each edge is in proportion to the inferred posterior probability that the regulatory relationship exists for the corresponding gene pair.
FIG. 6.
True positive edges at a posterior probability cutoff of 0.5 from the gene network generated by integrating external data with the knockdown data and MCDC-corrected untreated data. These true positive edges represent the edges from Figure 5 that are also found in our assessment criteria. Each node represents a gene and each edge represents a regulatory interaction between the two genes. The width of each edge is in proportion to the inferred posterior probability that the regulatory relationship exists for the corresponding gene pair.
FIG. 7.
Precision–recall curves for cell line A375 using different data assessed with TRANSFAC and JASPAR. The results are improved by external data integration with or without MCDC.
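The quantities named in the captions above can be computed directly from a table of posterior edge probabilities. The sketch below is an illustration under stated assumptions: the regulator and target labels, the probabilities, and the reference edge set are made up, and treating the 0.5 cutoff as inclusive is a choice made here, not something the captions specify. The expected number of regulators for a target is the sum of the posterior probabilities of its candidate edges, and precision and recall compare the thresholded edges with a reference set such as one derived from TRANSFAC and JASPAR.

```python
from collections import defaultdict


def expected_regulators(posteriors):
    """Sum posterior edge probabilities per target gene.

    By linearity of expectation, this sum is the expected number of
    regulators of each target.
    """
    totals = defaultdict(float)
    for (_regulator, target), probability in posteriors.items():
        totals[target] += probability
    return dict(totals)


def edges_above(posteriors, cutoff=0.5):
    """Return the directed edges whose posterior probability reaches the cutoff.

    Using >= rather than > is an assumption of this sketch.
    """
    return {pair for pair, p in posteriors.items() if p >= cutoff}


def precision_recall(posteriors, reference, cutoff):
    """Precision and recall of the thresholded edges against a reference set."""
    predicted = edges_above(posteriors, cutoff)
    true_positives = len(predicted & reference)
    # Convention chosen here: precision is 1.0 when nothing is predicted.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical posterior probabilities keyed by (regulator, target).
posteriors = {
    ("R1", "T1"): 0.9,
    ("R2", "T1"): 0.6,
    ("R3", "T1"): 0.1,
    ("R1", "T2"): 0.4,
    ("R2", "T2"): 0.7,
}

# Hypothetical reference edges standing in for a curated database.
reference = {("R1", "T1"), ("R2", "T2"), ("R3", "T1")}

expected = expected_regulators(posteriors)
precision, recall = precision_recall(posteriors, reference, cutoff=0.5)

# Sweeping the cutoff traces out the points of a precision-recall curve.
curve = [precision_recall(posteriors, reference, c) for c in (0.1, 0.5, 0.9)]
```

Lowering the cutoff admits more edges and raises recall, usually at the cost of precision; a histogram of the values in `expected` corresponds to the kind of summary shown in Figures 2 and 3.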
