Mocap: large-scale inference of transcription factor binding sites from chromatin accessibility

doi:10.1093/nar/gkx174

Nucleic Acids Res. 2017 May 5;45(8):4315-4329.

Xi Chen¹, Bowen Yu², Nicholas Carriero³, Claudio Silva², Richard Bonneau¹,²,³

Affiliations

¹ Department of Biology, New York University, New York, NY 10003, USA.
² Department of Computer Science, New York University, New York, NY 10003, USA.
³ Center for Computational Biology, Flatiron Foundation, Simons Foundation, New York, NY 10010, USA.

PMID: 28334916
PMCID: PMC5416775
DOI: 10.1093/nar/gkx174

Xi Chen et al. Nucleic Acids Res. 2017.

Abstract

Differential binding of transcription factors (TFs) at cis-regulatory loci drives the differentiation and function of diverse cellular lineages. Understanding the regulatory interactions that underlie cell fate decisions requires characterizing TF binding sites (TFBS) across multiple cell types and conditions. Techniques such as ChIP-Seq can reveal genome-wide patterns of TF binding, but typically require laborious and costly experiments for each TF-cell-type (TFCT) condition of interest. Chromatin accessibility assays can connect accessible chromatin in one cell type to many TFs through sequence motif mapping. Such methods, however, rarely take into account that the genomic context preferred by each factor differs from TF to TF and from cell type to cell type. To address these differences in TF behavior, we developed Mocap, a method that integrates chromatin accessibility, motif scores, TF footprints, CpG/GC content, evolutionary conservation and other factors in an ensemble of TFCT-specific classifiers. We show that integrating genomic features such as CpG islands improves TFBS prediction in some TFCTs. Further, we describe a method for mapping new TFCTs, for which no ChIP-Seq data exist, onto our ensemble of classifiers and show that our cross-sample TFBS prediction method outperforms several previously described methods.

Figures

**Figure 1.**
Our TFBS prediction pipeline. We compiled a non-redundant set of TF binding motifs and computed genomic features for all candidate motif sites. We trained sparse logistic regression models to predict binding sites (MocapS) for 98 TFCT conditions for which ChIP-Seq data are available in the ENCODE cell types K562, A549 and HepG2. True binding sites are defined as motif sites that overlap ChIP-Seq peaks. For a new TFCT condition, binding sites are inferred from either the unsupervised accessibility classifier (MocapG) or a trained sparse logistic regression classifier chosen by sample mapping using weighted least squares regression (MocapX). Shaded areas mark supervised training steps; unshaded areas mark steps for data acquisition (top) and making predictions (bottom).
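
The labelling rule in the Figure 1 caption (a motif site counts as a true binding site if it overlaps a ChIP-Seq peak) can be sketched as below. This is a minimal illustration, not the authors' code: the function name `label_motif_sites`, the half-open `(start, end)` coordinates and the single-chromosome input are all assumptions made for the example.

```python
from bisect import bisect_right


def label_motif_sites(motif_sites, peaks):
    """Return 1 for each motif site that overlaps any peak, else 0.

    Assumption: all intervals are half-open (start, end) integer
    coordinates on the same chromosome.
    """
    # Merge the peaks so they are sorted and non-overlapping.
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    starts = [s for s, _ in merged]

    labels = []
    for m_start, m_end in motif_sites:
        # Index of the last merged peak that starts before the site ends.
        i = bisect_right(starts, m_end - 1) - 1
        overlaps = i >= 0 and merged[i][1] > m_start
        labels.append(1 if overlaps else 0)
    return labels


if __name__ == "__main__":
    sites = [(10, 20), (50, 60), (100, 110)]
    peaks = [(15, 30), (60, 70), (90, 105)]
    print(label_motif_sites(sites, peaks))
```

Merging the peaks first lets each motif site be checked with one binary search, which matters when the candidate set covers every motif match in the genome.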

**Figure 2.**
Modelling DNase I cut counts as a mixture of negative binomial distributions. (A) Distribution of DNase I cut counts simulated using zero-inflated negative binomial model parameters derived using an EM algorithm (n = 100 000). Red: cut counts from inaccessible regions of the chromatin; blue: cut counts from accessible regions of the chromatin. Inset: the cutoff point is determined by the probability ratio between the accessible and inaccessible components. X and Y axes are on log scales. (B) Hierarchical clustering of the accessibility landscape of ENCODE cell types. The genome is binned into 400 bp windows (overlapping by 200 bp), and each genomic window is classified using the zero-inflated negative binomial mixture model as 1 (accessible) or 0 (inaccessible). Cell types cluster in accordance with their developmental origins.
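
To make the cutoff in the Figure 2 inset concrete, the sketch below finds the smallest cut count at which the accessible component of a two-component negative binomial mixture becomes more probable than the inaccessible one. The parameter values, the function names and the choice to assign the zero-inflation mass to the inaccessible side are assumptions for illustration only; the caption does not give the fitted parameters or these modelling details.

```python
import math


def nb_pmf(k, r, p):
    """Negative binomial pmf: k failures before r successes, success prob p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def accessibility_cutoff(pi_zero, w_bg, bg, fg, max_count=10000):
    """Smallest count where the accessible component is the more probable one.

    pi_zero: weight of the point mass at zero (assumed to count as inaccessible)
    w_bg:    weight of the inaccessible negative binomial component
    bg, fg:  (r, p) parameters of the inaccessible and accessible components
    """
    w_fg = 1.0 - pi_zero - w_bg
    for k in range(max_count + 1):
        p_bg = w_bg * nb_pmf(k, *bg) + (pi_zero if k == 0 else 0.0)
        p_fg = w_fg * nb_pmf(k, *fg)
        if p_fg > p_bg:
            return k
    return None  # no crossover found within max_count


if __name__ == "__main__":
    # Hypothetical parameters, not values from the paper.
    cutoff = accessibility_cutoff(0.3, 0.5, bg=(1.0, 0.5), fg=(5.0, 0.1))
    print("windows with at least", cutoff, "cuts are called accessible")
```

Windows with counts at or above the returned value would be labelled 1 (accessible) and the rest 0, which is the binary encoding used for the clustering in panel B.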

**Figure 3.**
Feature selection and classifier training. (A) Genomic features ranked by their correlation with the ChIP-Seq signal. Barplot showing the average correlation for each genomic feature over 98 TFCT samples. Error bars mark the average ± one standard deviation. (B) Clustering heatmap showing the Pearson correlation coefficient (PCC) between genomic features across motif sites. Red: positive correlation; white: no correlation; blue: negative correlation. (C) Ten-fold cross-validation performance (AUPR) under different shrinkage parameters λ. We tune the shrinkage parameter to approach maximum AUPR. The red dot marks the shrinkage level (sparsity) that corresponds to the maximum 10-fold cross-validation performance. The green dot corresponds to our selected feature combination: the sparsest model that achieved near-optimal cross-validation performance (within one standard error of the maximum). Example TF: SMC3, cell type: K562. (D) Barplot showing the number of times each feature is selected in the 98 trained models. Bar colors are scaled, with red and blue corresponding to more and less commonly selected features, respectively.
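
The selection rule in panel C (take the sparsest model whose cross-validated AUPR is within one standard error of the maximum) can be written in a few lines. The sketch assumes an L1-type penalty in which a larger λ gives a sparser model; the function name and the example numbers are hypothetical.

```python
def select_lambda_one_se(lambdas, mean_aupr, se_aupr):
    """Return (lambda at maximum AUPR, lambda chosen by the one-SE rule).

    Assumption: larger lambda means stronger shrinkage and a sparser model.
    """
    # Index of the best cross-validated mean AUPR (the "red dot").
    best = max(range(len(lambdas)), key=lambda i: mean_aupr[i])
    # Any model scoring at least (max - one standard error) is near-optimal.
    threshold = mean_aupr[best] - se_aupr[best]
    candidates = [i for i in range(len(lambdas)) if mean_aupr[i] >= threshold]
    # Among near-optimal models take the sparsest one (the "green dot").
    chosen = max(candidates, key=lambda i: lambdas[i])
    return lambdas[best], lambdas[chosen]


if __name__ == "__main__":
    lambdas = [0.001, 0.01, 0.1, 1.0]
    mean_aupr = [0.60, 0.62, 0.61, 0.40]
    se_aupr = [0.02, 0.02, 0.02, 0.02]
    print(select_lambda_one_se(lambdas, mean_aupr, se_aupr))
```

Trading a small amount of cross-validated performance for fewer features tends to give models that are easier to interpret and to reuse across TFCT conditions.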

**Figure 4.**
Heatmap clustering TFs based on the Euclidean distance between cross-TF prediction performances (AUPR). Red indicates a large Euclidean distance and relatively poor cross-prediction performance between TFs; blue indicates a smaller Euclidean distance and good cross-prediction performance (where cross-prediction is the use of one TF's MocapS model to predict another TF's binding). TFs are clustered together if they are more likely to share the same sparse logistic regression models for predicting TFBS. Data from multiple cell types, if available, are averaged for each TF.

**Figure 5.**
Cross-sample binding site prediction. Violin plot showing the hold-one-out performance of MocapX in comparison with MocapS (with MocapS models trained in the TFCT), MocapG (with the local chromatin accessibility feature only) and a randomly selected MocapS model (with random mappings between the left-out TFCT and the MocapS model ensemble). AUPR scores are normalized (centered at zero) across the four methods in each TFCT condition. Inset: heatmap showing the weighted feature vectors that are used to compute distances between new TFCTs and TFCTs for which MocapS models have been trained. If no suitable model exists in the trained model pool (no model is predicted to outperform unsupervised MocapG), MocapX will use MocapG for TFBS prediction.

**Figure 6.**
Method comparison between Mocap, CENTIPEDE and PIQ (98 TFCT samples in hold-out chromosome 15). (A) Boxplot showing the overall performance of the CENTIPEDE, PIQ, MocapS, MocapX and MocapG methods in predicting TFBS (n = 98). (B) Boxplot showing the performance of Mocap and CENTIPEDE applied to ATAC-Seq data in GM12878 (n = 23). Performance metrics used are AUPR (top panel), sensitivity at 1% FPR (middle panel) and AUROC (bottom panel).

**Figure 7.**
Genome browser view of predictions made by different methods. Tracks highlight region 85081291–85557900 on chromosome 15 for binding site predictions of ETS1 in K562. We standardized MocapG, MocapS and MocapX (modeled with YY1 in K562) prediction scores into z scores and used a cutoff of z > 3. Cutoffs for PIQ and CENTIPEDE are 700 and 0.99, respectively, as suggested.
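
The thresholding described in the Figure 7 caption (standardize prediction scores into z scores, then keep sites with z > 3) might look like the sketch below. The use of the population standard deviation and the function name `call_sites` are assumptions; the caption does not say which variant was used or over which set of sites the scores were standardized.

```python
from statistics import mean, pstdev


def call_sites(scores, z_cutoff=3.0):
    """Flag scores whose z score exceeds z_cutoff.

    Assumption: scores are standardized over the list passed in,
    using the population standard deviation.
    """
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        # All scores identical: nothing stands out.
        return [False] * len(scores)
    return [(s - mu) / sd > z_cutoff for s in scores]


if __name__ == "__main__":
    # Hypothetical scores: 99 background sites and one strong prediction.
    scores = [0.0] * 99 + [10.0]
    calls = call_sites(scores)
    print(sum(calls), "site(s) pass the z > 3 cutoff")
```

Standardizing first puts the three Mocap variants on a common scale, so a single cutoff can be applied to all of their tracks.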

Cited by

Accurate prediction of cell type-specific transcription factor binding.
Keilwagen J, Posch S, Grau J. Genome Biol. 2019 Jan 10;20(1):9. doi: 10.1186/s13059-018-1614-y. PMID: 30630522. Free PMC article.
Anchor: trans-cell type prediction of transcription factor binding sites.
Li H, Quang D, Guan Y. Genome Res. 2019 Feb;29(2):281-292. doi: 10.1101/gr.237156.118. Epub 2018 Dec 19. PMID: 30567711. Free PMC article.
MICMIC: identification of DNA methylation of distal regulatory regions with causal effects on tumorigenesis.
Tong Y, Sun J, Wong CF, Kang Q, Ru B, Wong CN, Chan AS, Leung SY, Zhang J. Genome Biol. 2018 Jun 5;19(1):73. doi: 10.1186/s13059-018-1442-0. PMID: 29871649. Free PMC article.
Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome.
Karimzadeh M, Hoffman MM. Genome Biol. 2022 Jun 10;23(1):126. doi: 10.1186/s13059-022-02690-2. PMID: 35681170. Free PMC article.
Alternative transcription start sites contribute to acute-stress-induced transcriptome response in human skeletal muscle.
Makhnovskii PA, Gusev OA, Bokov RO, Gazizova GR, Vepkhvadze TF, Lysenko EA, Vinogradova OL, Kolpakov FA, Popov DV. Hum Genomics. 2022 Jul 22;16(1):24. doi: 10.1186/s40246-022-00399-8. PMID: 35869513. Free PMC article.
