Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 1992 Jul 20;226(2):471-9.
doi: 10.1016/0022-2836(92)90961-i.

Determination of eukaryotic protein coding regions using neural networks and information theory

Affiliations

Determination of eukaryotic protein coding regions using neural networks and information theory

R Farber et al. J Mol Biol. .

Abstract

Our previous work applied neural network techniques to the problem of discriminating open reading frame (ORF) sequences taken from introns versus exons. The method counted the codon frequencies in an ORF of a specified length, and then used this codon frequency representation of DNA fragments to train a neural net (essentially a Perceptron with a sigmoidal, or "soft step function", output) to perform this discrimination. After training, the network was then applied to a disjoint "predict" set of data to assess accuracy. The resulting accuracy in our previous work was 98.4%, exceeding accuracies reported in the literature at that time for other algorithms. Here, we report even higher accuracies stemming from calculations of mutual information (a correlation measure) of spatially separated codons in exons, and in introns. Significant mutual information exists in exons, but not in introns, between adjacent codons. This suggests that dicodon frequencies of adjacent codons are important for intron/exon discrimination. We report that accuracies obtained using a neural net trained on the frequency of dicodons is significantly higher at smaller fragment lengths than even our original results using codon frequencies, which were already higher than simple statistical methods that also used codon frequencies. We also report accuracies obtained from including codon and dicodon statistics in all six reading frames, i.e. the three frames on the original and complement strand. Inclusion of six-frame statistics increases the accuracy still further. We also compare these neural net results to a Bayesian statistical prediction method that assumes independent codon frequencies in each position. The performance of the Bayesian scheme is poorer than any of the neural based schemes, however many methods reported in the literature either explicitly, or implicitly, use this method. Specifically, Bayesian prediction schemes based on codon frequencies achieve 90.9% accuracy on 90 codon ORFs, while our best neural net scheme reaches 99.4% accuracy on 60 codon ORFs. "Accuracy" is defined as the average of the exon and intron sensitivities. Achievement of sufficiently high accuracies on short fragment lengths can be useful in providing a computational means of finding coding regions in unannotated DNA sequences such as those arising from the mega-base sequencing efforts of the Human Genome Project. We caution that the high accuracies reported here do not represent a complete solution to the problem of identifying exons in "raw" base sequences. The accuracies are considerably lower from exons of small length, although still higher than accuracies reported in the literature for other methods. Short exon lengths are not uncommon.(ABSTRACT TRUNCATED AT 400 WORDS)

PubMed Disclaimer

Publication types

Associated data

LinkOut - more resources