BMC Bioinformatics. 2019 Apr 5;20(1):174.
doi: 10.1186/s12859-019-2781-x.

Computational enhancer prediction: evaluation and improvements

Hasiba Asma et al. BMC Bioinformatics.

Abstract

Background: Identifying transcriptional enhancers and other cis-regulatory modules (CRMs) is an important goal of post-sequencing genome annotation. Computational approaches provide a useful complement to empirical methods for CRM discovery, but it is critical that we develop effective means to evaluate their performance in terms of estimating their sensitivity and specificity.

Results: We introduce here pCRMeval, a pipeline for in silico evaluation of any enhancer prediction tool that is flexible enough to be applied to the Drosophila melanogaster genome. pCRMeval compares predictions with the extensive existing knowledge of experimentally validated Drosophila CRMs in order to estimate the precision and relative sensitivity of the prediction method. For supervised prediction methods (those that use training data composed of validated CRMs), pCRMeval can also assess the sensitivity of specific training sets. We demonstrate the utility of pCRMeval through evaluation of our SCRMshaw CRM prediction method and training data. By measuring the impact of different parameters on SCRMshaw performance, as assessed by pCRMeval, we develop a more robust version of SCRMshaw, SCRMshaw_HD, that improves the number of predictions while maintaining sensitivity and specificity. Our analysis also demonstrates that SCRMshaw_HD, when applied to increasingly less well-assembled genomes, maintains its strong predictive power with only a minor drop-off in performance.

Conclusion: Our pCRMeval pipeline provides a general framework for evaluation that can be applied to any CRM prediction method, particularly a supervised method. While we make use of it here primarily to test and improve a particular method for CRM prediction, SCRMshaw, pCRMeval should provide a valuable platform to the research community not only for evaluating individual methods, but also for comparing between competing methods.
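
As a rough illustration of the kind of comparison described in the Results, the sketch below scores a set of predicted regions against a set of known CRMs by simple interval overlap. It is not the pCRMeval implementation: the function names, the half-open `(chromosome, start, end)` tuples, and the "any overlap counts as a hit" rule are all assumptions made for this example.

```python
# Illustrative sketch only; not the pCRMeval code.
# Regions are hypothetical (chromosome, start, end) tuples, half-open.

def overlaps(a, b):
    """True if two regions share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def recovery(known, predicted):
    """Fraction of known CRMs overlapped by at least one prediction."""
    if not known:
        return 0.0
    hits = sum(any(overlaps(k, p) for p in predicted) for k in known)
    return hits / len(known)

def overlap_precision(known, predicted):
    """Fraction of predictions that overlap at least one known CRM."""
    if not predicted:
        return 0.0
    hits = sum(any(overlaps(p, k) for k in known) for p in predicted)
    return hits / len(predicted)

known_crms = [("2L", 100, 600), ("2L", 2000, 2500), ("3R", 100, 600)]
predictions = [("2L", 500, 1000), ("2L", 5000, 5500)]

print(recovery(known_crms, predictions))
print(overlap_precision(known_crms, predictions))
```

Because the catalogue of validated CRMs is incomplete, a prediction that overlaps no known CRM is not necessarily a false positive, which is one reason the abstract speaks of relative sensitivity.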

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Performance evaluation for SCRMshaw using pCRMeval. pCRMeval demonstrates that SCRMshaw performs better when trained on real CRM data than when trained on random training data, and better than random expectation. a Aggregate performance for training set sensitivity, REDfly recovery, and expression pattern precision for 29 true training sets, 62 random training sets, and random expectation. b-d Comparison of training set sensitivity (b), REDfly recovery (c), and expression pattern precision (d) for true predictions versus random expectation for each of the 29 training sets
Fig. 2
Performance evaluation for SCRMshaw using pCRMeval on a continuous scale. (a) Training set sensitivity, (b) REDfly recovery, and (c) expression pattern precision for selected training sets (solid lines) compared to the median percentage (dashed line) and 1st and 3rd quartiles (shaded region) of 62 random training sets, and to random expectation (dotted line). Black solid line, mapping1.neuroectoderm; orange solid line, mapping1.somatic_muscle; blue solid line, mapping1.visceral_mesoderm
Fig. 3
SCRMshaw results vary based on analysis starting position. Results are shown based on pCRMeval assessment of training set sensitivity, REDfly recovery, and expression pattern precision for two representative training sets (“mapping1.blastoderm,” “mapping1.dorsal_ectoderm”) with starting position offsets of 0, 5, 15, 40, 80 and 125 base pairs. a Results using a fixed cutoff. b Results using a continuous scale
Fig. 4
The SCRMshaw-HD protocol. The new, more robust SCRMshaw-HD protocol is shown to the left, with the default SCRMshaw protocol to the right. a In both protocols, 500 bp windows are scored with a 250 bp offset between windows. b, c For SCRMshaw-HD, this process is parallelized by running 25 instances of SCRMshaw simultaneously, with each instance starting its first window at a different starting point, corresponding to 0, 10, 20, 30, …, 240 bp from each chromosome/scaffold end. d The output from the individual SCRMshaw runs is concatenated into a single output representing 500 bp windows with 10 bp offsets across the entire genome. e The SCRMshaw scores for each 10 bp genomic window are summed, with any individual score (orange boxes) below the value of the 5000th ranked score reassigned to zero (gray boxes). The summed scores (f) are used as the basis for peak calling. Any peaks with amplitude above the selected amplitude threshold (g, red dot) are then evaluated for SCRMshaw score (h, red dot; see Methods for details). Peaks meeting both criteria (e.g., peak “d” in panel f) are accepted as “top predictions.” Peaks that either fall below the amplitude cutoff (e.g., “b” in panel f) or pass the amplitude cutoff but not the score cutoff (“a” in panel f) are not considered top predictions. In default SCRMshaw (right side of figure), those predictions with SCRMshaw score above the elbow point of the curve of all ranked SCRMshaw scores are considered to be “top” predictions (h, red dot)
Fig. 5
Degree of genome assembly has a minimal impact on SCRMshaw performance. Black boxplots (top) show the aggregate percentage of true positives for eight representative training sets (each shown as a differently colored point), and blue boxplots (bottom) show the aggregate percentage of false positives for the same sets, over a range of simulated qualities of genome assembly. For details about genomes “A” through “J” see Table 3
Fig. 6
Correlation between SCRMshaw scores of corresponding windows in the native and simulated genomes
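
The windowing and score-summing steps described in the Fig. 4 caption (panels a to e) can be sketched as follows. This is a minimal illustration, not the SCRMshaw-HD code: `window_starts`, `hd_bin_scores`, and the `score_fn` callback are hypothetical names, the sequence length is assumed to be a multiple of 10 bp, keeping every score when fewer than 5000 windows exist is an assumption, and peak calling (panels f to h) is left out.

```python
# Illustrative sketch only; not the SCRMshaw-HD implementation.

def window_starts(seq_len, offset, window=500, step=250):
    """Start coordinates of the full-length windows for one instance."""
    return list(range(offset, seq_len - window + 1, step))

def hd_bin_scores(seq_len, score_fn, top_n=5000,
                  window=500, step=250, shift=10):
    """Sum thresholded window scores into bins of `shift` bp.

    score_fn(start, end) is a hypothetical stand-in for the score
    that a single SCRMshaw run would assign to one window.
    """
    # One run per starting offset: 0, 10, ..., 240 (25 runs by default).
    scored = {}
    for offset in range(0, step, shift):
        for start in window_starts(seq_len, offset, window, step):
            scored[start] = score_fn(start, start + window)

    # Scores below the value of the top_n-th ranked score become zero.
    # Assumption: with fewer than top_n windows, every score is kept.
    ranked = sorted(scored.values(), reverse=True)
    cutoff = ranked[top_n - 1] if len(ranked) >= top_n else float("-inf")

    # Add each surviving window score to every bin the window covers.
    bins = [0.0] * (seq_len // shift)
    for start, score in scored.items():
        if score < cutoff:
            continue
        for b in range(start // shift, (start + window) // shift):
            bins[b] += score
    return bins

# Toy usage: every window on a 1000 bp sequence scores 1.0.
toy_bins = hd_bin_scores(1000, lambda start, end: 1.0)
print(len(toy_bins), toy_bins[0], toy_bins[50])
```

With a constant score the summed profile simply counts how many overlapping windows cover each 10 bp bin, which makes the effect of the 10 bp shifts easy to check by hand.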
