Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks
- PMID: 8441672
- PMCID: PMC309159
- DOI: 10.1093/nar/21.3.607
Abstract
Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser first scores the sequence of interest for splice sites and for these intron- and exon-specific content measures: codon usage, local compositional complexity, 6-tuple frequency, length distribution and periodic asymmetry. This information is then organized for interpretation by DP. GeneParser uses the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping, and to find the highest-scoring combination of introns and exons subject to these constraints. Weights for the various classification procedures are determined by training a simple feed-forward neural network to maximize the number of correct predictions. In a pilot study, the system has been trained on a set of 56 human gene fragments containing 150 internal exons in a total of 158,691 bp of genomic sequence. When tested against the training data, GeneParser precisely identifies 75% of the exons and correctly predicts 86% of coding nucleotides as coding, while only 13% of non-exon bp are predicted to be coding. This corresponds to a correlation coefficient for exon prediction of 0.85. Because of the simplicity of the network weighting scheme, generalization performance is nearly as good as with the training set.
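The DP step described in the abstract can be illustrated with a small sketch. This is not GeneParser's actual implementation: it is a simplified two-state recurrence that assumes the content measures have already been combined into one per-position exon score and one per-position intron score, and that splice-site evidence is available as per-position donor and acceptor scores. The function name `parse_exons_introns` and all four input arrays are hypothetical, and the neural-network weighting is not modeled. Labeling every position as exactly one of exon or intron is what makes the resulting segments adjacent and non-overlapping.

```python
from typing import List, Tuple

EXON, INTRON = 0, 1  # state indices used by the DP tables


def parse_exons_introns(
    exon_score: List[float],      # hypothetical combined exon content score per position
    intron_score: List[float],    # hypothetical combined intron content score per position
    donor_score: List[float],     # score for an exon->intron switch, intron starting at i
    acceptor_score: List[float],  # score for an intron->exon switch, exon starting at i
) -> Tuple[float, str]:
    """Return the best total score and a label string of 'E' and 'I'."""
    n = len(exon_score)
    if n == 0:
        return 0.0, ""

    # best[s][i]: best score of any labeling of positions 0..i that ends in state s
    best = [[0.0] * n for _ in range(2)]
    # back[s][i]: state at position i-1 on that best labeling
    back = [[s] * n for s in range(2)]
    best[EXON][0] = exon_score[0]
    best[INTRON][0] = intron_score[0]

    for i in range(1, n):
        # End in an exon: either extend the exon or cross an acceptor site.
        stay = best[EXON][i - 1]
        switch = best[INTRON][i - 1] + acceptor_score[i]
        back[EXON][i] = EXON if stay >= switch else INTRON
        best[EXON][i] = max(stay, switch) + exon_score[i]

        # End in an intron: either extend the intron or cross a donor site.
        stay = best[INTRON][i - 1]
        switch = best[EXON][i - 1] + donor_score[i]
        back[INTRON][i] = INTRON if stay >= switch else EXON
        best[INTRON][i] = max(stay, switch) + intron_score[i]

    # Trace back from the better final state to recover the labeling.
    state = EXON if best[EXON][n - 1] >= best[INTRON][n - 1] else INTRON
    total = best[state][n - 1]
    labels = []
    for i in range(n - 1, -1, -1):
        labels.append("E" if state == EXON else "I")
        state = back[state][i]
    return total, "".join(reversed(labels))
```

With toy inputs in which positions 2 and 3 look intronic and are flanked by a donor and an acceptor signal, the call `parse_exons_introns([1, 1, -1, -1, 1, 1], [-1, -1, 1, 1, -1, -1], [0, 0, 0.5, 0, 0, 0], [0, 0, 0, 0, 0.5, 0])` yields the labeling `EEIIEE`. A closer approximation of the approach in the abstract would score whole candidate intervals, so that length distribution could also contribute; the per-position form is used here only to keep the recurrence short.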