Nucleic Acids Res. 2017 May 5;45(8):4315-4329.
doi: 10.1093/nar/gkx174.

Mocap: large-scale inference of transcription factor binding sites from chromatin accessibility

Xi Chen et al. Nucleic Acids Res.

Abstract

Differential binding of transcription factors (TFs) at cis-regulatory loci drives the differentiation and function of diverse cellular lineages. Understanding the regulatory interactions that underlie cell fate decisions requires characterizing TF binding sites (TFBS) across multiple cell types and conditions. Techniques such as ChIP-Seq can reveal genome-wide patterns of TF binding, but typically require laborious and costly experiments for each TF-cell-type (TFCT) condition of interest. Chromosomal accessibility assays can connect accessible chromatin in one cell type to many TFs through sequence motif mapping. Such methods, however, rarely take into account that the genomic context preferred by each factor differs from TF to TF, and from cell type to cell type. To address the differences in TF behaviors, we developed Mocap, a method that integrates chromatin accessibility, motif scores, TF footprints, CpG/GC content, evolutionary conservation and other factors in an ensemble of TFCT-specific classifiers. We show that integration of genomic features, such as CpG islands, improves TFBS prediction in some TFCT conditions. Further, we describe a method for mapping new TFCT conditions, for which no ChIP-Seq data exist, onto our ensemble of classifiers and show that our cross-sample TFBS prediction method outperforms several previously described methods.

Figures

Figure 1.
Our TFBS prediction pipeline. We compiled a non-redundant set of TF binding motifs and computed genomic features for all candidate motif sites. We trained sparse logistic regression models to predict binding sites (MocapS) for 98 TFCT conditions for which ChIP-Seq data are available in the ENCODE cell types K562, A549 and HepG2. True binding sites are defined as motif sites that overlap ChIP-Seq peaks. For a new TFCT condition, binding sites are inferred from either the unsupervised accessibility classifier (MocapG) or a trained sparse logistic regression classifier, chosen by sample mapping using weighted least squares regression (MocapX). Shaded areas mark supervised training steps; unshaded areas mark steps for data acquisition (top) and making predictions (bottom).
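The labelling rule in this caption (a motif site counts as a true binding site when it overlaps a ChIP-Seq peak) can be illustrated with a minimal sketch. The function name `label_motif_sites`, the half-open `(start, end)` interval convention and the single-chromosome input are assumptions made for this illustration; they are not taken from the paper.

```python
from bisect import bisect_right

def label_motif_sites(motif_sites, peaks):
    """Return 1 for each motif site that overlaps any peak, else 0.

    Both inputs are lists of half-open (start, end) intervals on one
    chromosome (an assumed convention for this sketch).
    """
    # Merge overlapping or touching peaks so the list is sorted and disjoint.
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    starts = [s for s, _ in merged]

    labels = []
    for m_start, m_end in motif_sites:
        # Index of the last merged peak that starts before the motif site ends.
        i = bisect_right(starts, m_end - 1) - 1
        overlaps = i >= 0 and merged[i][1] > m_start
        labels.append(1 if overlaps else 0)
    return labels

# Hypothetical coordinates, for illustration only.
sites = [(100, 110), (500, 512), (990, 1000)]
chip_peaks = [(90, 200), (1000, 1100)]
labels = label_motif_sites(sites, chip_peaks)
```

Only the first site falls inside a peak here; the third site ends exactly where a peak begins, which does not count as an overlap under half-open coordinates.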
Figure 2.
Modelling DNase I cut count as a mixture of negative binomial distributions. (A) Distribution of DNase I cut counts simulated using zero-inflated negative binomial model parameters derived with an EM algorithm (n = 100 000). Red: cut counts from inaccessible regions of the chromatin; blue: cut counts from accessible regions of the chromatin. Inset: the cutoff point is determined by the probability ratio between the accessible and inaccessible components. X and Y axes are in log scale. (B) Hierarchical clustering of the accessibility landscape of ENCODE cell types. The genome is binned into 400 bp windows (overlapping by 200 bp), and each window is classified by the zero-inflated negative binomial mixture model as 1 (accessible) or 0 (inaccessible). Cell types cluster in accordance with their developmental origins.
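As a rough illustration of how a count cutoff could follow from the probability ratio between the two mixture components, the sketch below evaluates a zero-inflated two-component negative binomial mixture and generates 400 bp windows that overlap by 200 bp. All parameter values, the parameter names and the choice of a ratio threshold of 1 are assumptions for illustration; the paper's fitted parameters and exact decision rule are not given in this caption.

```python
import math

def nb_log_pmf(k, size, mu):
    """Log pmf of a negative binomial with dispersion `size` and mean `mu`."""
    p = size / (size + mu)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log(1 - p))

def component_probs(k, params):
    """Weighted probability of count k under the inaccessible and accessible parts."""
    inaccessible = params["w_bg"] * math.exp(
        nb_log_pmf(k, params["size_bg"], params["mu_bg"]))
    if k == 0:
        inaccessible += params["w_zero"]  # zero-inflation point mass
    accessible = params["w_fg"] * math.exp(
        nb_log_pmf(k, params["size_fg"], params["mu_fg"]))
    return inaccessible, accessible

def accessibility_cutoff(params, max_count=10000):
    """Smallest count at which the accessible component is the more probable one."""
    for k in range(max_count + 1):
        bg, fg = component_probs(k, params)
        if fg > bg:
            return k
    return None

def sliding_windows(chrom_length, width=400, step=200):
    """Yield (start, end) windows of `width` bp that overlap by width - step."""
    for start in range(0, max(chrom_length - width, 0) + 1, step):
        yield start, start + width

# Hypothetical mixture parameters (weights sum to 1); not fitted to any data.
params = {"w_zero": 0.3, "w_bg": 0.6, "size_bg": 1.0, "mu_bg": 2.0,
          "w_fg": 0.1, "size_fg": 2.0, "mu_fg": 50.0}
cutoff = accessibility_cutoff(params)
windows = list(sliding_windows(1000))
```

A window would then be labelled 1 when its cut count reaches `cutoff` and 0 otherwise. In practice the mixture parameters would come from an EM fit, which this sketch does not implement.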
Figure 3.
Feature selection and classifier training. (A) Genomic features ranked by their correlation with the ChIP-Seq signal. Barplot showing the average correlation for each genomic feature over 98 TFCT samples. Error bars mark the average ± one standard deviation. (B) Clustering heatmap showing the PCC between genomic features across motif sites. Red: positive correlation, white: no correlation, blue: negative correlation. (C) Ten-fold cross-validation performance (AUPR) under different values of the shrinkage parameter λ. We tune the shrinkage parameter to approach the maximum AUPR. The red dot marks the shrinkage level (sparsity) that gives the maximum 10-fold cross-validation performance. The green dot marks our selected feature combination: the sparsest model that achieves near-optimum cross-validation performance (within one standard error of the maximum). Example TF: SMC3, cell type: K562. (D) Barplot showing the number of times each feature is selected in the 98 trained models. Bar colors are scaled; red and blue correspond to more and less commonly selected features, respectively.
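The selection rule in panel (C) resembles the common one-standard-error rule. A minimal sketch follows, assuming that a larger λ always gives a sparser model and that the standard error is taken at the best-performing λ; the function name and the example numbers are hypothetical.

```python
def select_lambda_one_se(lambdas, mean_aupr, se_aupr):
    """Return (best_lambda, selected_lambda) under a one-standard-error rule.

    best_lambda maximizes mean cross-validated AUPR (the red dot);
    selected_lambda is the largest lambda whose mean AUPR is within one
    standard error of that maximum (the green dot).
    """
    best = max(range(len(lambdas)), key=lambda i: mean_aupr[i])
    threshold = mean_aupr[best] - se_aupr[best]
    # Assumption: larger lambda means stronger shrinkage, hence a sparser model.
    candidates = [i for i in range(len(lambdas)) if mean_aupr[i] >= threshold]
    selected = max(candidates, key=lambda i: lambdas[i])
    return lambdas[best], lambdas[selected]

# Hypothetical cross-validation summary, for illustration only.
lambdas = [0.001, 0.01, 0.1, 1.0]
mean_aupr = [0.58, 0.62, 0.61, 0.40]
se_aupr = [0.02, 0.02, 0.02, 0.02]
best_lam, chosen_lam = select_lambda_one_se(lambdas, mean_aupr, se_aupr)
```

Here the maximum sits at λ = 0.01, but λ = 0.1 performs within one standard error of it and is sparser, so it is the one selected.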
Figure 4.
Heatmap clustering TFs based on the Euclidean distance between cross-TF prediction performances (AUPR). Red indicates a large Euclidean distance and relatively poor cross-prediction performance between TFs; blue indicates a smaller Euclidean distance and good cross-prediction performance (where cross-prediction is the use of one TF's MocapS model to predict another TF's binding). TFs cluster together if they are more likely to share the same sparse logistic regression models for predicting TFBS. Data from multiple cell types, if available, are averaged for each TF.
Figure 5.
Cross-sample binding site prediction. Violin plot showing the hold-one-out performance of MocapX in comparison with MocapS (with MocapS models trained in the same TFCT), MocapG (with the local chromatin accessibility feature only) and a randomly selected MocapS model (with random mappings between the left-out TFCT and the MocapS model ensemble). AUPR scores are normalized (centered at zero) across the four methods in each TFCT condition. Inset: Heatmap showing the weighted feature vectors that are used to compute distances between new TFCTs and TFCTs for which MocapS models have been trained. If no suitable model exists in the trained model pool (no model is predicted to outperform unsupervised MocapG), MocapX uses MocapG for TFBS prediction.
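Two steps in this caption lend themselves to a short sketch: centering AUPR scores at zero across the four methods within one TFCT condition, and falling back to MocapG when no trained model is predicted to outperform it. Both functions, their names and the example numbers are hypothetical illustrations; in particular, how the predicted AUPR of each candidate model is obtained (the weighted least squares mapping) is not reproduced here.

```python
def center_scores(aupr_by_method):
    """Subtract the across-method mean so scores are centered at zero."""
    mean = sum(aupr_by_method.values()) / len(aupr_by_method)
    return {method: score - mean for method, score in aupr_by_method.items()}

def choose_model(predicted_aupr_by_model, mocapg_aupr):
    """Pick the trained model with the highest predicted AUPR, or fall back.

    Returns the string "MocapG" when no candidate model is predicted to
    outperform the unsupervised accessibility classifier.
    """
    if not predicted_aupr_by_model:
        return "MocapG"
    best_model = max(predicted_aupr_by_model, key=predicted_aupr_by_model.get)
    if predicted_aupr_by_model[best_model] > mocapg_aupr:
        return best_model
    return "MocapG"

# Hypothetical AUPR values for one TFCT condition.
centered = center_scores(
    {"MocapX": 0.5, "MocapS": 0.6, "MocapG": 0.4, "Random": 0.3})

# Hypothetical predicted AUPR for two candidate trained models.
pick_good = choose_model({"YY1_K562": 0.55, "SMC3_K562": 0.35}, mocapg_aupr=0.40)
pick_fallback = choose_model({"YY1_K562": 0.30}, mocapg_aupr=0.40)
```

After centering, a positive value means the method did better than the average of the four methods in that condition, which makes conditions with very different absolute AUPR comparable in one plot.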
Figure 6.
Method comparison between Mocap, CENTIPEDE and PIQ (98 TFCT samples, hold-out chromosome 15). (A) Boxplot showing the overall performance of the CENTIPEDE, PIQ, MocapS, MocapX and MocapG methods in predicting TFBS (n = 98). (B) Boxplot showing the performance of Mocap and CENTIPEDE applied to ATAC-Seq data in GM12878 (n = 23). The performance metrics are AUPR (top panel), sensitivity at 1% FPR (middle panel) and AUROC (bottom panel).
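For readers unfamiliar with two of these metrics, the sketch below computes sensitivity at a fixed false positive rate and average precision, a common step-wise estimate of AUPR. It assumes binary labels with at least one positive and one negative site; the paper's exact estimator of AUPR is not stated in this caption, and `average_precision` does not treat tied scores specially.

```python
def sensitivity_at_fpr(scores, labels, max_fpr=0.01):
    """Highest true positive rate achievable with false positive rate <= max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        # Consume every site tied at this score so a threshold never splits a tie.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        if fp / n_neg > max_fpr:
            break
        best = tp / n_pos
        i = j
    return best

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            tp += 1
            total += tp / rank
    return total / n_pos

# Hypothetical prediction scores and ChIP-Seq-derived labels.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
sens_strict = sensitivity_at_fpr(scores, labels, max_fpr=0.01)
sens_loose = sensitivity_at_fpr(scores, labels, max_fpr=0.5)
ap = average_precision(scores, labels)
```

Sensitivity at a low FPR emphasizes the top of the ranking, which matters when candidate motif sites vastly outnumber true binding sites.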
Figure 7.
Genome browser view of predictions made by different methods. Tracks highlight region 85081291–85557900 on chromosome 15 for binding site predictions of ETS1 in K562. We standardized the MocapG, MocapS and MocapX (modeled with YY1 in K562) prediction scores into z scores and used a cutoff of z > 3. The cutoffs for PIQ and CENTIPEDE are 700 and 0.99, respectively, as suggested.
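The standardization step described here can be sketched as follows. The function name is hypothetical, and using the population standard deviation over all scored sites is an assumption; the caption does not say which set of sites the z scores were computed over.

```python
from statistics import mean, pstdev

def call_sites_by_zscore(scores, cutoff=3.0):
    """Standardize prediction scores and flag sites with z above the cutoff."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation (an assumption)
    if sigma == 0:
        # All scores are identical, so no site stands out.
        return [False] * len(scores)
    return [(s - mu) / sigma > cutoff for s in scores]

# Hypothetical scores: 99 background sites and one strong prediction.
scores = [0.0] * 99 + [10.0]
calls = call_sites_by_zscore(scores)
```

Standardizing puts scores from different models on a shared scale, so a single cutoff such as z > 3 can be applied to each track.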
