Genes (Basel). 2020 Aug 24;11(9):985.

doi: 10.3390/genes11090985.

A Comparative Study of Supervised Machine Learning Algorithms for the Prediction of Long-Range Chromatin Interactions

Thomas Vanhaeren¹, Federico Divina¹, Miguel García-Torres¹, Francisco Gómez-Vela¹, Wim Vanhoof², Pedro Manuel Martínez-García³,⁴

Affiliations

¹ Division of Computer Science, Universidad Pablo de Olavide, 41013 Sevilla, Spain.
² Faculty of Computer Science, University of Namur, 5000 Namur, Belgium.
³ Centro Andaluz de Biología Molecular y Medicina Regenerativa (CABIMER), CSIC-Universidad de Sevilla-Universidad Pablo de Olavide, 41092 Sevilla, Spain.
⁴ Facultad de Ciencias y Tecnología, Universidad Isabel I, 09003 Burgos, Spain.

PMID: 32847102
PMCID: PMC7563616
DOI: 10.3390/genes11090985

Thomas Vanhaeren et al. Genes (Basel). 2020.

Abstract

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from the study of long-range chromatin interaction maps provided by Chromatin Conformation Capture-based techniques, which have greatly improved in recent years. Since these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps in cell types and conditions for which experimental chromatin interaction data are not available. Several methods have been based on predictive models trained on one-dimensional (1D) sequencing features, yielding promising results. However, the approaches differ both in how they model chromatin interactions and in the machine learning strategy they rely on, which makes it challenging to compare the performance of existing methods. In this study, we use publicly available 1D sequencing signals to model cohesin-mediated chromatin interactions in two human cell lines and evaluate the prediction performance of six popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines, multi-layer perceptron and deep learning. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other five methods, yielding accuracies of about 95%. We show that chromatin features in close genomic proximity to the anchors cover most of the predictive information, as has been previously reported. Moreover, we demonstrate that gradient boosting models trained with different subsets of chromatin features, unlike the other methods tested, are able to produce accurate predictions. In this regard, and besides architectural proteins, transcription factors are shown to be highly informative. Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as an important determinant of chromatin wiring mediated by cohesin.
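
As a small illustration of what ranking classifiers by accuracy involves, the sketch below computes the fraction of correctly classified loops for several sets of predictions and sorts the models from best to worst. The labels, predictions and model names are invented for illustration only; they are not data or results from the study, and the helper names `accuracy` and `rank_models` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of examples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rank_models(
    y_true: Sequence[int], predictions: Dict[str, Sequence[int]]
) -> List[Tuple[str, float]]:
    """Return (model_name, accuracy) pairs sorted from most to least accurate."""
    scores = [(name, accuracy(y_true, pred)) for name, pred in predictions.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy labels: 1 = loop (positive), 0 = non-loop (negative). Purely illustrative.
toy_labels = [1, 0, 1, 1]
toy_predictions = {
    "model_a": [1, 0, 1, 0],
    "model_b": [1, 0, 1, 1],
    "model_c": [0, 0, 0, 0],
}
ranking = rank_models(toy_labels, toy_predictions)
```

Accuracy alone is only meaningful when positive and negative examples are reasonably balanced; the abstract does not state the class balance, so that is worth checking in the full text.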

Keywords: chromatin interactions; genome architecture; genomics; machine-learning; prediction.

Conflict of interest statement

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Figures

**Figure 1**
Chromatin features associated with RAD21 loops in the GM12878 cell line. Hi-C interaction frequencies are shown in the top panel and ChIA-PET interactions are represented as blue arcs in the second panel. Below them, a genome browser view displays relevant chromatin features, including architectural factors (blue), DNase-seq (black), RNA-seq (yellow), RNA Pol2 (orange), histone marks (green) and transcription factors (red).

**Figure 2**
Illustration of the integrative machine learning schema for the prediction of chromatin loops. (A) Positive and negative RAD21-associated loops were first identified (see Methods). (B) Then, given a loop with length L, we extended L base pairs to its left and right and the extended region (with length 3L) was partitioned into 1500 bins. For each bin, 23 high-throughput sequencing experiments were scored, resulting in a data matrix with rows representing loops and columns representing the scored features. (C) Finally, we trained and tested classifiers using six machine learning algorithms. XGBoost: Gradient boosting; SVM: Support Vector Machines; ANN: Artificial Neural Networks.
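
The windowing described in panel B can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `loop_bins`, the use of floating-point bin edges and the example coordinates are assumptions, and chromosome boundaries are not handled.

```python
from typing import List, Tuple

N_BINS = 1500  # bins per extended loop region, as stated in the caption


def loop_bins(start: int, end: int, n_bins: int = N_BINS) -> List[Tuple[float, float]]:
    """Split the loop, extended by its own length on each side, into equal bins.

    A loop of length L = end - start is extended L base pairs to the left and
    right, giving a region of length 3L that is cut into n_bins equal bins.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    length = end - start
    region_start = start - length      # left edge of the 3L region
    width = 3 * length / n_bins        # each bin spans L / 500 bp when n_bins = 1500
    return [
        (region_start + i * width, region_start + (i + 1) * width)
        for i in range(n_bins)
    ]


# Hypothetical loop with anchors at 100,000 and 150,000 bp (L = 50,000).
bins = loop_bins(100_000, 150_000)
```

With 1500 bins, the first 500 fall upstream of the loop, the middle 500 cover the loop itself and the last 500 fall downstream. If every experiment-bin pair becomes one column, each loop would be described by 23 × 1500 = 34,500 values; the caption does not spell out the column layout, so treat that count as an assumption.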

**Figure 3**
Ranking of the top 10 important features for the prediction of RAD21 chromatin loops. Horizontal bars represent relative importances of featured bins. The terms 'left', 'in' and 'right' are used for bins 1 to 500, 501 to 1000 and 1001 to 1500, respectively. The relative position of each bin within one of these three windows is also included in the feature names.
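
The bin-to-window naming in this caption can be expressed as a short mapping. The sketch below assumes 1-based bin indices, as in the caption; the `experiment_window_position` name format produced by `feature_name` is hypothetical, since the caption does not give the exact format used in the figure.

```python
from typing import Tuple

WINDOW_SIZE = 500                      # bins per window, from the caption
WINDOWS = ("left", "in", "right")      # bins 1-500, 501-1000, 1001-1500


def bin_window(bin_index: int) -> Tuple[str, int]:
    """Map a 1-based bin index (1..1500) to (window name, position within window)."""
    if not 1 <= bin_index <= WINDOW_SIZE * len(WINDOWS):
        raise ValueError("bin_index must be between 1 and 1500")
    window = WINDOWS[(bin_index - 1) // WINDOW_SIZE]
    position = (bin_index - 1) % WINDOW_SIZE + 1
    return window, position


def feature_name(experiment: str, bin_index: int) -> str:
    """Build a feature label; the underscore-separated format is an assumption."""
    window, position = bin_window(bin_index)
    return f"{experiment}_{window}_{position}"
```

For example, bin 501 is the first bin inside the loop, so `feature_name("CTCF", 501)` gives `CTCF_in_1` under this assumed format.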

**Figure 4**
Position-specific importance of selected high-throughput sequencing datasets in the GM12878 cell line. The random forests importance score is shown for the 1500 bins corresponding to the most informative experiments. Coordinates of the x-axis are similar to those of Figure 5. Figures S4D–F display similar plots for all the tested datasets as well as for the Decision Trees and XGBoost algorithms. CTCF: CCCTC-binding factor.

**Figure 5**
Position-specific importance of selected high-throughput sequencing datasets in the K562 cell line. The random forests importance score is shown for the 1500 bins corresponding to the most informative experiments. Coordinates of the x-axis are similar to those described in Figure 3, with the left and right anchor positions represented as red and blue vertical lines, respectively. Figures S4A–C display similar plots for all the tested datasets as well as for the Decision Trees and XGBoost algorithms.

**Figure 6**
Ranking of top 10 important features for the prediction of RAD21 chromatin loops using only features associated with transcription. Horizontal bars represent relative importances as in Figure 3.

**Figure 7**
Accuracies of the Decision trees (DT), Random forests (RF) and XGBoost models trained with specific subsets of chromatin features in K562 (blue) and GM12878 (red).

**Figure 8**
Cross-cell-line accuracies of the DT, RF and XGBoost models trained with specific subsets of chromatin features.
