BMC Bioinformatics. 2010 Mar 22;11:149.

doi: 10.1186/1471-2105-11-149.

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis

Jens Keilwagen¹, Jan Grau, Stefan Posch, Ivo Grosse

PMID: 20307305
PMCID: PMC2859755
DOI: 10.1186/1471-2105-11-149

Jens Keilwagen et al. BMC Bioinformatics. 2010.

Affiliation

¹ Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany. Jens.Keilwagen@ipk-gatersleben.de

Abstract

Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goals of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While many comparative studies have compared different learning principles or different statistical models, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions.

Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.

Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.
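The Gaussian-like and Laplace-like regimes described in the Results can be illustrated with a minimal sketch. Since this abstract does not reproduce Eqn. (14), the sketch assumes that, for a single free parameter, the prior is a two-component Dirichlet (Beta) density re-expressed in the log-odds parameter λ via θ₁ = 1/(1 + exp(-λ)). The function names and hyperparameter values are illustrative and are not taken from Jstacs.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_prior(lam: float, alpha1: float, alpha2: float) -> float:
    """Unnormalized log-density of the ASSUMED transformed prior.

    Beta density theta1^(alpha1 - 1) * theta2^(alpha2 - 1) multiplied by the
    Jacobian d(theta1)/d(lam) = theta1 * theta2, where theta1 = sigmoid(lam).
    """
    return alpha1 * log_sigmoid(lam) + alpha2 * log_sigmoid(-lam)


def mode(alpha1: float, alpha2: float) -> float:
    """Location of the maximum: sigmoid(lam) = alpha1 / (alpha1 + alpha2)."""
    return math.log(alpha1 / alpha2)


def gaussian_approx(lam: float, alpha1: float, alpha2: float) -> float:
    """Second-order Taylor expansion of log_prior around its mode."""
    m = mode(alpha1, alpha2)
    curvature = alpha1 * alpha2 / (alpha1 + alpha2)  # minus the 2nd derivative at m
    return log_prior(m, alpha1, alpha2) - 0.5 * curvature * (lam - m) ** 2


def tail_slope(lam: float, alpha1: float, alpha2: float, step: float = 1.0) -> float:
    """Finite-difference slope of log_prior between lam and lam + step."""
    return (log_prior(lam + step, alpha1, alpha2) - log_prior(lam, alpha1, alpha2)) / step
```

Under this assumption, the quadratic expansion tracks `log_prior` closely near the mode (the Gaussian-like regime), while for large |λ| the log-density falls off linearly, with slope `-alpha2` on the right and `+alpha1` on the left (the Laplace-like regime).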

Figures

**Figure 1**
**Illustration of the derived prior**. Illustration of the derived prior of Eqn. (14) for one and two free parameters. Figure a) shows the derived prior (red line) for one free parameter λ₁ and α_i ∈ {0.2, 1, 5} in comparison to a Gaussian (black line) and a Laplace prior (green line). Figure b) shows the derived prior for two free parameters λ₁, λ₂ and α_i ∈ {0.2, 1, 5}.

**Figure 2**
**Comparing generatively to discriminatively trained models**. We compare the classification performance of classifiers using the MAP principle (solid line) and the MSP principle (dashed line) with the derived prior on differently-sized training data sets for binding sites of the transcription factor Sp1. For both classifiers, we use a PWM model in the foreground and a Markov model of order 3 in the background. We plot the four performance measures, false positive rate, sensitivity, positive predictive value, and area under the precision-recall curve (AUC-PR), against the percentage of the preliminary training data set used for estimating the parameters. Whiskers indicate two-fold standard errors. We find that the classification performance increases with increasing size of the training data set. For the false positive rate this corresponds to a decreasing curve. For all four measures and all sizes of the data set, we find that the discriminatively trained Markov models yield a consistently higher classification performance than the generatively trained Markov models.

**Figure 3**
**Comparison of different generatively and discriminatively trained models**. We compare the classification performance of Markov models (MM), mixtures of Markov models (mixMM), Markov random fields (MRF), and mixtures of Markov random fields (mixMRF) for a set of donor splice sites [5] using the MAP and the MSP principle, and using the derived prior for all models. We plot the four performance measures false positive rate, area under the ROC curve (AUC-ROC), positive predictive value, and area under the precision-recall curve (AUC-PR) for each of the four models. For the MAP principle (a-d), the comparison shows that mixMM and mixMRF yield a higher classification performance than MM and MRF, respectively, and that mixMRF achieves the highest classification performance of all models with respect to all four performance measures. For the MSP principle (e-h), the comparison shows that mixMM and mixMRF yield a higher classification performance than MM and MRF, respectively, and that mixMRF achieves the highest classification performance of all models with respect to false positive rate and positive predictive value, whereas the highest AUC-PR and AUC-ROC are achieved by mixMM.
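The threshold-based measures named in the captions above can be sketched as follows. This uses the standard textbook definitions of sensitivity, false positive rate, and positive predictive value. How the study chose thresholds and estimated the areas under the curves is not stated in this excerpt, so the helper names and the fixed threshold below are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting 'positive' for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def performance_measures(scores, labels, threshold):
    """Return sensitivity, false positive rate, and positive predictive value."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    nan = float("nan")  # returned when a ratio is undefined (zero denominator)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "false_positive_rate": fp / (fp + tn) if fp + tn else nan,
        "positive_predictive_value": tp / (tp + fp) if tp + fp else nan,
    }


# Toy example: classifier scores for five sequences, two of them true sites.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [True, False, True, False, False]
toy_result = performance_measures(toy_scores, toy_labels, threshold=0.5)
```

A lower false positive rate is better, whereas higher values are better for the other measures, which is why improving performance appears as a decreasing curve for the false positive rate.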

Cited by

Varying levels of complexity in transcription factor binding motifs.
Keilwagen J, Grau J. Nucleic Acids Res. 2015 Oct 15;43(18):e119. doi: 10.1093/nar/gkv577. Epub 2015 Jun 26. PMID: 26116565.
Accurate prediction of cell type-specific transcription factor binding.
Keilwagen J, Posch S, Grau J. Genome Biol. 2019 Jan 10;20(1):9. doi: 10.1186/s13059-018-1614-y. PMID: 30630522.
A general approach for discriminative de novo motif discovery from high-throughput data.
Grau J, Posch S, Grosse I, Keilwagen J. Nucleic Acids Res. 2013 Nov;41(21):e197. doi: 10.1093/nar/gkt831. Epub 2013 Sep 20. PMID: 24057214.
Systems biology data analysis methodology in pharmacogenomics.
Rodin AS, Gogoshin G, Boerwinkle E. Pharmacogenomics. 2011 Sep;12(9):1349-60. doi: 10.2217/pgs.11.76. PMID: 21919609. Review.
New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data.
Gogoshin G, Boerwinkle E, Rodin AS. J Comput Biol. 2017 Apr;24(4):340-356. doi: 10.1089/cmb.2016.0100. Epub 2016 Sep 28. PMID: 27681505.

References

1. Kel AE, Gössling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003;31(13):3576–3579. doi: 10.1093/nar/gkg585.
2. Barash Y, Elidan G, Friedman N, Kaplan T. Modelling dependencies in protein-DNA binding sites. In: RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology. New York, NY, USA: ACM Press; 2003. pp. 28–37.
3. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology. 1997;268:78–94. doi: 10.1006/jmbi.1997.0951.
4. Salzberg SL. A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput Appl Biosci. 1997;13(4):365–376.
5. Yeo G, Burge CB. Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals. Journal of Computational Biology. 2004;11(2-3):377–394. doi: 10.1089/1066527041410418.
