A hidden Markov model that finds genes in E. coli DNA
- PMID: 7984429
- PMCID: PMC308529
- DOI: 10.1093/nar/22.22.4768
Abstract
A hidden Markov model (HMM) has been developed to find protein-coding genes in E. coli DNA, using E. coli genome DNA sequence from the EcoSeq6 database maintained by Kenn Rudd. The HMM includes states that model the codons and their frequencies in E. coli genes, as well as the patterns found in the intergenic region, including repetitive extragenic palindromic sequences and the Shine-Dalgarno motif. To account for potential sequencing errors and/or frameshifts in raw genomic DNA sequence, it allows for the (very unlikely) possibility of insertions and deletions of individual nucleotides within a codon. The parameters of the HMM are estimated using approximately one million nucleotides of annotated DNA in EcoSeq6, and the model is tested on a disjoint set of contigs containing about 325,000 nucleotides. The HMM finds the exact locations of about 80% of the known E. coli genes and approximate locations for about 10%. It also finds several potentially new genes and locates several places where insertion or deletion errors and/or frameshifts may be present in the contigs.
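To make the idea of codon-aware HMM states concrete, the following is a minimal sketch of Viterbi decoding over a toy model with one intergenic state and three codon-position states. It is not the model from the paper: the state names (N, C1, C2, C3), every probability, and the function name viterbi are hypothetical, chosen only for illustration, and the sketch leaves out the insertion/deletion states, the repetitive extragenic palindromic sequences, and the Shine-Dalgarno motif that the abstract mentions.

```python
import math

# Hypothetical toy parameters, NOT the values estimated from EcoSeq6.
# "N" is an intergenic state; "C1".."C3" are the three positions of a codon.
STATES = ["N", "C1", "C2", "C3"]

# Assumption: every sequence starts in the intergenic state.
START = {"N": 1.0, "C1": 0.0, "C2": 0.0, "C3": 0.0}

# Codon positions must be visited in order; a gene can only end after C3.
TRANS = {
    "N": {"N": 0.9, "C1": 0.1},
    "C1": {"C2": 1.0},
    "C2": {"C3": 1.0},
    "C3": {"C1": 0.9, "N": 0.1},
}

# Position-specific emissions stand in for real codon frequencies.
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C1": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    "C2": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    "C3": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
}


def _log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: _log(START[s]) + _log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] is the best predecessor of s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: score[p] + _log(TRANS[p].get(s, 0.0))
            )
            new_score[s] = (
                score[best_prev]
                + _log(TRANS[best_prev].get(s, 0.0))
                + _log(EMIT[s][base])
            )
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score

    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    toy_sequence = "AAAA" + "GCT" * 4 + "AAAA"
    print("".join(s[-1] for s in viterbi(toy_sequence)))
```

In this sketch a predicted gene is simply a maximal run of C1, C2, C3 labels in the decoded path. A fuller model in the spirit of the abstract would use separate states for each codon, weighted by codon frequency, plus low-probability transitions that skip or repeat a nucleotide to absorb frameshift errors.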
Similar articles
- GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998 Feb 15;26(4):1107-15. doi: 10.1093/nar/26.4.1107. PMID: 9461475. Free PMC article.
- [Statistical characteristics in primary structures of functional regions of Escherichia coli genome. II. Non-stationary Markov chains]. Mol Biol (Mosk). 1986 Jul-Aug;20(4):1024-33. PMID: 3531811. Russian.
- Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res. 1994 Nov 11;22(22):4756-67. doi: 10.1093/nar/22.22.4756. PMID: 7984428. Free PMC article.
- Hidden Markov model and its applications in motif findings. Methods Mol Biol. 2010;620:405-16. doi: 10.1007/978-1-60761-580-4_13. PMID: 20652513. Review.
- Optimizing scaleup yield for protein production: Computationally Optimized DNA Assembly (CODA) and Translation Engineering. Biotechnol Annu Rev. 2007;13:27-42. doi: 10.1016/S1387-2656(07)13002-7. PMID: 17875472. Review.
Cited by
- Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 2002 Oct 1;30(19):4103-17. doi: 10.1093/nar/gkf543. PMID: 12364589. Free PMC article. Review.
- ORF organization and gene recognition in the yeast genome. Comp Funct Genomics. 2003;4(3):318-28. doi: 10.1002/cfg.292. PMID: 18629282. Free PMC article.
- Evaluating bacterial gene-finding HMM structures as probabilistic logic programs. Bioinformatics. 2012 Mar 1;28(5):636-42. doi: 10.1093/bioinformatics/btr698. Epub 2012 Jan 3. PMID: 22215819. Free PMC article.
- env sequences of simian immunodeficiency viruses from chimpanzees in Cameroon are strongly related to those of human immunodeficiency virus group N from the same geographic area. J Virol. 2000 Jan;74(1):529-34. doi: 10.1128/jvi.74.1.529-534.2000. PMID: 10590144. Free PMC article.
- Use of artificial genomes in assessing methods for atypical gene detection. PLoS Comput Biol. 2005 Nov;1(6):e56. doi: 10.1371/journal.pcbi.0010056. Epub 2005 Nov 11. PMID: 16292353. Free PMC article.