Brief Bioinform. 2022 Jan 17;23(1):bbab374.
doi: 10.1093/bib/bbab374.

Assessing deep learning methods in cis-regulatory motif finding based on genomic sequencing data

Shuangquan Zhang et al. Brief Bioinform. 2022.

Abstract

Identifying cis-regulatory motifs from genomic sequencing data (e.g. ChIP-seq and CLIP-seq) is crucial in identifying transcription factor (TF) binding sites and inferring gene regulatory mechanisms for any organism. Since 2015, deep learning (DL) methods have been widely applied to identify TF binding sites and predict motif patterns, with the strengths of offering a scalable, flexible and unified computational approach for highly accurate predictions. As far as we know, 20 DL methods have been developed. However, without a clear and systematic assessment, users will struggle to choose the most appropriate tool for their specific studies. In this manuscript, we evaluated 20 DL methods for cis-regulatory motif prediction using 690 ENCODE ChIP-seq, 126 cancer ChIP-seq and 55 RNA CLIP-seq datasets. Four metrics were investigated: the accuracy of motif finding, the performance of DNA/RNA sequence classification, algorithm scalability and tool usability. The assessment results demonstrated the high complementarity of the existing DL methods. The most suitable model depends primarily on the data size and type and on the method's outputs.

Keywords: CLIP-seq; ChIP-seq; TF binding sites identification; deep learning method assessment; motif prediction.

Figures

Figure 1
ChIP-seq data input and five categories of DL methods. Outcomes include both predicted sequence labels and identified motif patterns.
Figure 2
Schematic overview of the evaluation pipeline. The AEMR score assesses sequence classification ability based on F1 score, recall, precision, PRC, AUC, MCC, specificity and ACC, computed between predicted classification labels and ChIP-seq peak labels. The motif prediction score (with a P-value and a similarity) assesses how well the predicted motifs match the documented TFBSs.
Figure 3
Illustration of evaluation results for the 20 DL tools. (A) For DNA sequence-based analysis, tools are separated by DL method. In each comparative group, tools are ranked by their overall score (grey) from high to low. Four evaluation scores are shown: AEMR (blue), motif prediction score (green), algorithm scalability (pink) and tool usability (yellow). The highest value for each evaluation score is highlighted with a red box. The results of the conventional methods gkmSVM and MEME-ChIP are also shown at the bottom for comparison. (B) For RNA sequence-based analysis, the same columns and labels are used as in A.
Figure 4
Motif analysis on nine cancer types. (A) AEMR scores of the 15 DL methods across the nine cancer types. (B) Box plot of motif enrichment P-values (details in the Method section) of 11 methods for breast cancer. (C) For each cancer type, the average number of identified motifs is calculated for each tool. Note that only motifs that can be matched with existing motif patterns in the database using TOMTOM and TFBSTools are kept. The horizontal red line indicates the highest median value on the y-axis. (D) Motifs shared between the nine cancer types. Motifs shared between breast cancer and colorectal cancer are highlighted in cyan; all other shared links are light grey.
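
The threshold-based classification metrics named in the Figure 2 caption (precision, recall, specificity, ACC, F1 score and MCC) can all be derived from a confusion matrix. The following is a minimal sketch in plain Python, assuming binary labels where 1 marks a ChIP-seq peak sequence and 0 a background sequence. The function names and the toy labels are illustrative and are not taken from the paper. AUC and PRC require continuous prediction scores and are not covered here, and this record does not state how AEMR combines the individual metrics, so that step is not attempted.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = peak, 0 = background)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(y_true, y_pred):
    """Compute the threshold-based metrics listed in the Figure 2 caption."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    specificity = _safe_div(tn, tn + fp)
    accuracy = _safe_div(tp + tn, tp + fp + tn + fn)
    f1 = _safe_div(2 * precision * recall, precision + recall)
    # Matthews correlation coefficient; defined as 0.0 here if any margin is empty.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Toy labels (hypothetical): 4 peak sequences and 4 background sequences.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this toy input the confusion matrix is TP = 3, FP = 1, TN = 3, FN = 1, so precision, recall, specificity, accuracy and F1 all equal 0.75 and MCC equals 0.5. MCC is lower because it also penalizes agreement that would be expected by chance.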
