Bioinformatics. 2012 Mar 1;28(5):636-42.
doi: 10.1093/bioinformatics/btr698. Epub 2012 Jan 3.

Evaluating bacterial gene-finding HMM structures as probabilistic logic programs

Søren Mørk et al. Bioinformatics. 2012.

Abstract

Motivation: Probabilistic logic programming offers a powerful way to describe and evaluate structured statistical models. To investigate the practicality of probabilistic logic programming for structure learning in bioinformatics, we undertook a simplified bacterial gene-finding benchmark in PRISM, a probabilistic dialect of Prolog.

Results: We evaluate Hidden Markov Model structures for bacterial protein-coding gene potential, including a simple null model structure, three structures based on existing bacterial gene finders and two novel model structures. We test standard versions as well as acyclic discrete phase-type (ADPH) length modeling and three-state versions of the five model structures. The models are all represented as probabilistic logic programs and evaluated using the PRISM machine learning system in terms of statistical information criteria and gene-finding prediction accuracy, in two bacterial genomes. Neither of our implementations of the two currently most used model structures is the best performing in terms of statistical information criteria or prediction performance, suggesting that better-fitting models might be achievable.
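
As a rough illustration of what comparing model structures by information criteria involves, the sketch below scores two candidate models with AIC and BIC. The abstract does not say which criteria were used, so AIC and BIC are assumed here only as common examples; the model names, log-likelihoods, parameter counts and sample size are all hypothetical and are not results from the paper.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical models: name -> (maximized log-likelihood, free parameters).
# These numbers are invented for illustration only.
models = {
    "simple": (-1050.0, 3),
    "rich": (-1000.0, 60),
}
n_obs = 1000  # hypothetical number of observations

bic_scores = {name: bic(ll, k, n_obs) for name, (ll, k) in models.items()}
best = min(bic_scores, key=bic_scores.get)
print(best, bic_scores)
```

In this invented case the richer model fits better in raw likelihood, but its extra parameters are penalized heavily enough that the simpler model gets the lower score.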

Availability: The source code of all PRISM models, data and additional scripts are freely available for download at: http://github.com/somork/codonhmm.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1. Graphical representation of the conditioning schemes of the underlying structure of the models. (a) iid.psm; (b) eco.psm; (c) i3pmc.psm; (d) mc5.psm; (e) aa.psm; and (f) mm.psm. Squares represent the hidden State (S), Previous State (PS) or Next State (S); circles represent emissions (X) or past emissions (P). The dotted arrows are conditional transition probabilities and the full arrows are conditional emission probabilities.
Fig. 2. E. coli cross-validation ROC curves for the standard models using thresholds over all log odds values in the positive set. Notice that only the area from 0.0–0.1 FPR and 0.9–1.0 TPR is shown.
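
The thresholding described in the Fig. 2 caption can be sketched as follows: each distinct log odds value in the positive set is used as a cutoff, and the true and false positive rates are computed at that cutoff. This is a minimal sketch under assumptions, not the authors' script; the function name `roc_points` and the score lists are hypothetical.

```python
def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) pairs, one per distinct score in the positive set.

    A sequence is called positive when its log odds score is >= the threshold.
    """
    points = []
    for threshold in sorted(set(pos_scores), reverse=True):
        tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

# Hypothetical log odds scores, invented for illustration only.
pos = [2.5, 1.0, 0.3, -0.2]    # scores of true genes
neg = [0.5, -1.0, -2.0, -3.0]  # scores of non-coding sequences

points = roc_points(pos, neg)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Restricting a plot to the 0.0–0.1 FPR and 0.9–1.0 TPR window, as in the figure, would then amount to filtering these points before drawing them.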
