An empirical analysis of training protocols for probabilistic gene finders

William H Majoros et al. BMC Bioinformatics. 2004 Dec 21;5:206. doi: 10.1186/1471-2105-5-206.

Erratum in: BMC Bioinformatics. 2005;6:193.

Abstract

Background: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has yet been paid to their proper training. The few hints available in the literature, together with anecdotal observations, suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level and then optimize the global parameter structure by some form of ad hoc manual tuning of individual parameters.

Results: We investigated the utility of a more systematic approach to optimizing global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that this method yields a significant improvement in prediction accuracy.
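The abstract does not spell out the optimization loop, but the general shape of a discriminative, accuracy-driven search over global parameters can be sketched. The sketch below is a simple coordinate-search stand-in for the generalized gradient ascent used in the paper, not the authors' implementation; the scoring routine, parameter names, and step sizes are hypothetical illustrations.

```python
# A coordinate-search stand-in for discriminative tuning of the global
# ("meta") parameters of a GHMM gene finder. `score_fn` stands for a
# hypothetical routine that retrains the submodels by MLE under the given
# global parameters, runs the gene finder on a tuning set, and returns a
# prediction-accuracy score.

def discriminative_search(initial_params, step_sizes, score_fn, max_iters=50):
    best = dict(initial_params)
    best_score = score_fn(best)
    for _ in range(max_iters):
        improved = False
        for name, step in step_sizes.items():
            for delta in (+step, -step):
                trial = dict(best)
                trial[name] += delta              # perturb one global parameter
                score = score_fn(trial)
                if score > best_score:            # keep any move that raises
                    best, best_score = trial, score   # tuning-set accuracy
                    improved = True
        if not improved:                          # no single-parameter move helps:
            break                                 # treat as a local optimum
    return best, best_score

# Illustrative use with parameters of the kind named in Figure 5
# (WAM size, mean intron length); the scoring function here is a toy.
if __name__ == "__main__":
    params = {"donor_wam_size": 9.0, "mean_intron_length": 100.0}
    steps = {"donor_wam_size": 1.0, "mean_intron_length": 10.0}
    toy = lambda p: -(p["donor_wam_size"] - 12) ** 2 - (p["mean_intron_length"] - 150) ** 2
    print(discriminative_search(params, steps, toy))
```

The essential design point is that the objective being climbed is prediction accuracy on a tuning set, not training-set likelihood, which is what makes the procedure discriminative.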

Conclusions: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.
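For concreteness, the kind of accuracy measurement referred to here (and plotted at the nucleotide, exon, and whole-gene levels in Figure 1) can be sketched at the nucleotide level as follows. This is a generic sensitivity/specificity calculation in the usual gene-finding sense, not code from the paper, and the half-open, non-overlapping exon-interval convention is an assumption.

```python
# Minimal sketch of nucleotide-level accuracy, the kind of measurement that
# must be taken on strictly held-out data to avoid the inflation noted above.

def coding_bases(exons):
    """Set of coding positions covered by a list of (start, end) exons,
    assuming half-open, non-overlapping intervals."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end))
    return bases

def nucleotide_accuracy(true_exons, predicted_exons):
    truth = coding_bases(true_exons)
    pred = coding_bases(predicted_exons)
    tp = len(truth & pred)
    fn = len(truth - pred)
    fp = len(pred - truth)
    sensitivity = tp / (tp + fn) if truth else 1.0
    specificity = tp / (tp + fp) if pred else 1.0   # "specificity" in the
    return sensitivity, specificity                  # gene-finding sense (precision)

# Example: one predicted exon shifted by 10 bases relative to the annotation.
# nucleotide_accuracy([(100, 200)], [(110, 210)])  ->  (0.9, 0.9)
```

Exon- and gene-level accuracy are conventionally computed analogously, counting a whole exon or gene structure as correct only when its boundaries match exactly; the warning above applies to all three levels.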


Figures

Figure 1. Maximum likelihood versus gradient ascent. Gradient ascent parameter estimation (GRAPE) improves accuracy over MLE at the nucleotide, exon, and whole-gene levels. arab = Arabidopsis thaliana, asp = Aspergillus fumigatus.
Figure 2. Data partitioning for gradient ascent. Separating the training set into an 800-gene MLE set and a 200-gene gradient ascent set provides no improvement over simply performing MLE and GRAPE on the full training set.
Figure 3. Cross-validation versus testing on unseen data. Cross-validation scores provide a reasonably accurate prediction of performance on unseen data. Results shown for A. thaliana only; results for A. fumigatus are given in Table 2.
Figure 4. Evaluation on the training set. Accuracy measurements taken from the training set were artificially inflated, as expected. Results are shown only for A. thaliana; results for A. fumigatus were even more extreme.
Figure 5. Gradient ascent training. Schematic diagram of the gradient ascent training procedure. Of the 29 parameters modified by gradient ascent, some (e.g., WAM size) were used to control the MLE estimation procedure, while others (e.g., mean intron length) were used directly as parameters to the GHMM. Testing of the gradient direction was performed on the 200-gene cross-validation set, which was part of the 1000-gene training set, T.
Figure 6. Cross-validation experiments. Five-fold cross-validation was used both in the gradient ascent and in the MLE-only experiments. For gradient ascent training, MLE was performed on four-fifths of the training set (T) and gradient ascent was then performed on the remaining one-fifth. A separate hold-out set (H) of 1000 genes was used to obtain an unbiased evaluation of all final models.
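The data layout described in Figures 5 and 6 can be sketched as follows. The fold-assignment scheme and the loader names in the usage comment are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of the data layout in Figures 5 and 6: a 1000-gene training
# set T split into five 200-gene folds, with a separate 1000-gene hold-out
# set H reserved for the final, unbiased evaluation.

import random

def five_fold_splits(training_genes, n_folds=5, seed=0):
    """Yield (mle_set, tuning_set) pairs: MLE on four fifths of T,
    gradient-direction testing on the remaining fifth."""
    genes = list(training_genes)
    random.Random(seed).shuffle(genes)
    fold_size = len(genes) // n_folds
    for i in range(n_folds):
        tuning = genes[i * fold_size:(i + 1) * fold_size]
        mle = genes[:i * fold_size] + genes[(i + 1) * fold_size:]
        yield mle, tuning

# Usage: H never enters any fold, so it stays unseen until final scoring.
# T = load_genes("training_1000.gff")   # hypothetical loaders
# H = load_genes("holdout_1000.gff")
# for mle_set, tuning_set in five_fold_splits(T):
#     ...  # MLE on mle_set, gradient ascent scored on tuning_set
# final_accuracy = evaluate(best_model, H)
```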

