BMC Bioinformatics. 2022 Mar 3;23(1):83.
doi: 10.1186/s12859-022-04615-z.

abc4pwm: affinity based clustering for position weight matrices in applications of DNA sequence analysis

Omer Ali et al. BMC Bioinformatics.

Abstract

Background: Transcription factor (TF) binding motifs are identified by high-throughput sequencing technologies as a means to capture protein-DNA interactions. These motifs are often represented by consensus sequences in the form of position weight matrices (PWMs). With an ever-increasing pool of TF binding motifs from multiple sources, redundancy is difficult to avoid, especially when every source maintains its own database. One solution is to cluster biologically relevant or similar PWMs, whether they come from experimental detection or in silico prediction. However, efficient tools for clustering PWMs are lacking, and assessing the quality of PWM clusters is a further challenge. Therefore, new methods and tools are required to cluster PWMs efficiently and to assess the quality of the resulting clusters.

Results: A new Python package, Affinity Based Clustering for Position Weight Matrices (abc4pwm), was developed. It efficiently clustered PWMs from multiple sources with or without DNA-binding domain (DBD) information, generated a representative motif for each cluster, evaluated the clustering quality automatically, and filtered out incorrectly clustered PWMs. Additionally, it was able to update the human DBD family database automatically, classify known human TF PWMs into their respective DBD families, and perform TF motif searching and motif discovery using a new ensemble learning approach.

Conclusion: This work demonstrates applications of abc4pwm in DNA sequence analysis for various high-throughput sequencing data using ~1770 human TF PWMs. It recovered known TF motifs at gene promoters based on gene expression profiles (RNA-seq) and identified true TF binding targets for motifs predicted from ChIP-seq experiments. abc4pwm is a useful tool for TF motif searching, clustering, quality assessment and integration in multiple types of sequence data analysis, including RNA-seq, ChIP-seq and ATAC-seq.

Keywords: Clustering quality assessment; DNA sequence analysis; DNA-binding domain; Motif searching; Position weight matrices; Transcription factor.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Automatic quality assessment method for PWM clustering. First, a similarity score matrix is calculated for the PWMs in a cluster, and a Z-score is calculated for each row (one row represents one PWM; Z-scores of one PWM versus all others). Then, Z-scores below a threshold (e.g., < −1) are counted to make a frequency count vector, which is sorted, and the top 15% (the default parameter in abc4pwm) are selected as putative poorly clustered PWMs. Finally, the poorly clustered PWMs are identified and removed from the clusters (e.g., PWMs 15, 3, 20 and 21 in the figure).
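The procedure in the Fig. 1 caption can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the abc4pwm implementation: the function name `flag_poorly_clustered` and its parameters are hypothetical, self-similarities on the diagonal are excluded, the population standard deviation is used, low Z-scores are counted per column (how often a PWM looks like an outlier from the other PWMs' point of view), and the 15% cut is rounded up. The caption does not specify any of these details.

```python
import math

import numpy as np


def flag_poorly_clustered(similarity, z_threshold=-1.0, top_fraction=0.15):
    """Return indices of putative poorly clustered PWMs (hypothetical helper).

    similarity: square matrix of pairwise PWM similarity scores for one cluster.
    """
    sim = np.asarray(similarity, dtype=float)
    n = sim.shape[0]
    if n < 2:
        return []

    # Assumption: ignore the diagonal, so each row holds "one PWM versus all others".
    off_diag = ~np.eye(n, dtype=bool)

    # Row-wise mean and (population) standard deviation over off-diagonal entries.
    mean = (sim * off_diag).sum(axis=1) / (n - 1)
    centered = sim - mean[:, None]
    std = np.sqrt(((centered ** 2) * off_diag).sum(axis=1) / (n - 1))
    std = np.where(std < 1e-12, 1.0, std)  # guard against constant rows

    # Row-wise Z-scores, then mark the entries below the threshold.
    z = centered / std[:, None]
    low = (z < z_threshold) & off_diag

    # Assumption: the frequency count vector counts low Z-scores per column.
    counts = low.sum(axis=0)

    # Sort by count (descending) and keep the top fraction with a non-zero count.
    order = np.argsort(-counts, kind="stable")
    k = math.ceil(top_fraction * n)
    return [int(i) for i in order[:k] if counts[i] > 0]
```

With a cluster in which one PWM is dissimilar to all others, that PWM collects a low Z-score in every other row and is the only one flagged. PWMs with a zero count are never returned, even if the 15% quota is not filled.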
Fig. 2
An example of a representative motif for a cluster of PWMs in the bZIP DBD family. Here, the cluster contains five PWMs (ATF2, ATF3, ATF4, ATF4_1 and BACH2) from the bZIP family. The representative motif of this cluster is shown at the top of the figure.
Fig. 3
Boxplot of similarity scores for PWMs in clusters of the bZIP DBD family. The figure shows 15 PWM clusters in the bZIP DBD family. The x-axis indicates, for each cluster, the number of poorly clustered PWMs, with the total number of PWMs in parentheses. The y-axis shows the distribution of PWM similarity scores in each cluster.
Fig. 4
An overview of the abc4pwm workflow, showing all major features of abc4pwm. The purple flow shows the classification module, where input PWMs are divided into DBD families. The clustering module (orange flow) is then applied within each DBD family. Subsequently, the resulting clusters are subjected to quality assessment (green flow), and a representative motif or PWM is created for each cluster. The green dotted line shows the flow where input PWMs skip the DBD assignment step. The orange dotted line shows the flow of the ensemble learning technique for motif prediction.
Fig. 5
Comparison between automatic and manual quality assessment. Dark blue represents manual (eyeballing) assessment of the quality of clustered PWMs, in which 75 PWMs were identified as poorly clustered. Light blue shows the result of the automatic quality assessment provided by abc4pwm for the same clusters, in which 121 PWMs were marked as poor quality. The two results overlap in 58 PWMs (dark green). Of the remaining 17 PWMs identified only by manual evaluation, 15 show mild dissimilarity.
Fig. 6
Comparison of PWM clustering quality with and without DBD information. Green represents good, homogeneous clusters, yellow represents average-quality clusters, and red represents bad-quality clusters (clusters without similar PWMs). The left panel shows clustering results for TF PWMs that were classified into DBD families before clustering (with-DBD), while the right panel shows clustering results for the same set of PWMs clustered directly without considering DBD information (no-DBD).
Fig. 7
Application of abc4pwm in TF binding motif prediction using either ChIP-seq data or gene expression profiles. (A) Application of abc4pwm to ESR1 ChIP-seq data in MCF7 cells, where the predicted de novo motif L20_1 is similar to the known ESR1-1,2,3 motifs (similarity score = 0.89) according to the motif search module of abc4pwm. (B) Application of abc4pwm to RNA-seq data from a TP53 knockout experiment, where the top enriched de novo motif L10_1 in promoters of differentially expressed genes is similar to the known TP53 motifs TP53_6 and TP53_7 (similarity score = 0.85). The left panel shows the motif logos of the TFs; the right panel shows the motif similarity scores output by the abc4pwm search module.
Fig. 8
Applying an ensemble learning approach to predict TF binding motifs from ESR1 ChIP-seq data. First, input data are randomly selected multiple times from all called peaks of the ESR1 ChIP-seq experiment in the MCF7 cell line, and enriched motifs are predicted from each selection using bayesPI2. Then, all predicted PWMs from the multiple selections are clustered and their quality is evaluated by abc4pwm (e.g., three clusters indicated in brown). Representative motifs or PWMs of good-quality clusters are generated and used to search against known PWMs of human TFs (~1770 PWMs) with the search module of abc4pwm (gray box). The top two matched search results (ESR1_M00959 and ESR1_M00191) are displayed along with their similarity scores; the motif images are cropped to highlight the matched areas.
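The repeated random selection step described in the Fig. 8 caption can be sketched as below. This is only an illustration of the resampling idea: the names `subsample_peaks` and `ensemble_predict`, the parameters `n_rounds`, `sample_size` and `seed`, and the `predict_motifs` callable are all hypothetical. In the published workflow the motif predictor is bayesPI2 and the pooled PWMs are then clustered and quality-checked by abc4pwm; neither tool is called here.

```python
import random


def subsample_peaks(peaks, n_rounds=5, sample_size=100, seed=0):
    """Draw several random subsets (without replacement) from the called peaks."""
    rng = random.Random(seed)  # fixed seed so the selections are reproducible
    size = min(sample_size, len(peaks))
    return [rng.sample(peaks, size) for _ in range(n_rounds)]


def ensemble_predict(peaks, predict_motifs, **sampling_options):
    """Pool the motifs predicted from each random subset of peaks.

    predict_motifs is a placeholder callable (subset -> list of PWMs); it
    stands in for an external motif predictor and is an assumption here.
    """
    pooled = []
    for subset in subsample_peaks(peaks, **sampling_options):
        pooled.extend(predict_motifs(subset))
    return pooled
```

The pooled list is what would be handed to the clustering and quality assessment steps, so that only motifs recurring across selections end up in good-quality clusters.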
