BMC Bioinformatics. 2006 Feb 9;7:62.
doi: 10.1186/1471-2105-7-62.

Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources

Mario Stanke et al. BMC Bioinformatics. 2006.

Abstract

Background: To improve gene prediction, extrinsic evidence on gene structure can be collected from sources such as genome-genome comparisons and EST and protein alignments. However, such evidence is often incomplete and uncertain: it is usually not sufficient to recover the complete structure of all genes, and what is available is often unreliable. Extrinsic evidence is therefore most valuable when it is balanced with sequence-intrinsic evidence.

Results: We present a fairly general method for integrating external information. The method evaluates hints about potentially protein-coding regions by means of a Generalized Hidden Markov Model (GHMM) that takes both intrinsic and extrinsic information into account. We used it to extend the ab initio gene prediction program AUGUSTUS into a versatile tool that we call AUGUSTUS+. In this study we focus on hints derived from matches to an EST or protein database, but the approach can include arbitrary user-defined hints. The method is only moderately affected by the length of a database match, and it also exploits the information that can be derived from the absence of such matches. As a special case, AUGUSTUS+ can predict genes under user-defined constraints, e.g. if the positions of certain exons are known. With hints from EST and protein databases, our new approach predicted 89% of the exons in human chromosome 22 correctly.

Conclusion: Sensitive probabilistic modeling of extrinsic evidence such as sequence database matches can increase gene prediction accuracy. When a match of a sequence interval to an EST or protein sequence is used, it should be treated as compound information rather than as information about individual positions.
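
The contrast between compound and per-position treatment of a match can be illustrated with a toy calculation. The sketch below is not the AUGUSTUS+ model: the factor `bonus` and both function names are assumptions chosen purely for illustration. It compares rewarding every matched base independently with rewarding the match once as a whole.

```python
import math


def per_position_log_bonus(match_length: int, bonus: float) -> float:
    """Log-score gain if every matched base is rewarded independently.

    The gain grows linearly with the match length, so long matches
    dominate the score. (Hypothetical scoring, for illustration only.)
    """
    return match_length * math.log(bonus)


def compound_log_bonus(match_length: int, bonus: float) -> float:
    """Log-score gain if the match is rewarded once as a single unit.

    The gain does not depend on the match length, as long as the
    match is non-empty. (Hypothetical scoring, for illustration only.)
    """
    return math.log(bonus) if match_length > 0 else 0.0


if __name__ == "__main__":
    bonus = 1.5  # assumed multiplicative reward, not a value from the paper
    for length in (50, 200, 800):
        print(
            length,
            round(per_position_log_bonus(length, bonus), 1),
            round(compound_log_bonus(length, bonus), 2),
        )
```

Under the per-position scheme, a match four times as long receives four times the log-score gain, whereas the compound scheme gives every non-empty match the same gain. This is one simple way to see why a compound treatment can make a predictor less sensitive to match length.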

Figures

Figure 1
Venn diagram of exons and genes. Area-proportional Venn diagram of three sets of exons (top) and three sets of genes (bottom) for chromosome 22. 'Annotation' refers to the set of 387 genes compiled by the Sanger Institute. Examples: 2271 exons were in the Sanger Institute annotation and were predicted exactly both by AUGUSTUS+ using the Combined hints and by SGP2. The annotation set and the set of AUGUSTUS+ predictions shared 71 identical genes that were not in the set of SGP2 predictions.
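
The region counts in a three-set Venn diagram like Figure 1 can be computed with plain set membership. The following is a minimal sketch that assumes each exon is represented as a `(start, end, strand)` tuple and that "exactly predicted" means identical tuples. The function name and the toy coordinates are hypothetical and are not data from the paper.

```python
from typing import Dict, Hashable, Iterable, Tuple


def venn_regions(
    annotation: Iterable[Hashable],
    augustus_plus: Iterable[Hashable],
    sgp2: Iterable[Hashable],
) -> Dict[Tuple[str, ...], int]:
    """Count the items in each non-empty region of a three-set Venn diagram.

    Each key is the tuple of set names that contain the item, always in the
    order (annotation, augustus_plus, sgp2).
    """
    sets = {
        "annotation": set(annotation),
        "augustus_plus": set(augustus_plus),
        "sgp2": set(sgp2),
    }
    counts: Dict[Tuple[str, ...], int] = {}
    for item in set().union(*sets.values()):
        key = tuple(name for name, members in sets.items() if item in members)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Toy exons as (start, end, strand); invented for illustration.
    annotation = {(100, 200, "+"), (300, 400, "+"), (500, 600, "+")}
    augustus_plus = {(100, 200, "+"), (300, 400, "+"), (700, 800, "-")}
    sgp2 = {(100, 200, "+"), (500, 650, "+")}
    for region, n in sorted(venn_regions(annotation, augustus_plus, sgp2).items()):
        print(region, n)
```

Because exons are compared as whole tuples, an exon with one shifted boundary, such as `(500, 650, "+")` against `(500, 600, "+")`, counts as a different exon, which matches the "exactly predicted" criterion in the caption.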
Figure 2
Combined hints. The information retrieved from a combination of EST and protein database searches. The input DNA sequence contains one gene, of which the dark boxes are the coding parts. First, ESTs matching the DNA sequence are found and clustered. The concatenation of the segments of the input DNA sequence that are aligned to the clustered ESTs is then searched against a protein database. The protein match can be used to infer which part of the EST consensus sequence is coding. In this example, the alignment of the protein started at the first position of its amino acid sequence, so a likely translation start site (start hint) can be inferred.
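
The coordinate bookkeeping implied by the Combined hints procedure can be sketched as follows. This is only an illustration under stated assumptions: the EST-aligned segments are given as sorted, non-overlapping, 0-based, end-exclusive intervals on the forward strand, and the protein search reports the offset in the concatenated sequence where the alignment begins together with the 1-based position in the protein where it starts. The function names and conventions are hypothetical and not taken from the AUGUSTUS+ software.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # genomic interval: 0-based start, exclusive end


def concat_to_genome(segments: List[Segment], offset: int) -> int:
    """Map an offset in the concatenated segments to a genomic coordinate."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    for start, end in segments:
        length = end - start
        if offset < length:
            return start + offset
        offset -= length  # skip this segment and continue in the next one
    raise ValueError("offset lies beyond the concatenated sequence")


def start_hint(
    segments: List[Segment], match_offset: int, protein_start: int
) -> Optional[int]:
    """Return the genomic position of a likely translation start, or None.

    A start hint is proposed only if the alignment begins at the first
    residue of the protein (protein_start == 1), as in Figure 2.
    """
    if protein_start != 1:
        return None
    return concat_to_genome(segments, match_offset)


if __name__ == "__main__":
    # Two EST-supported segments of the input DNA (toy coordinates).
    segments = [(1000, 1100), (1500, 1600)]
    print(start_hint(segments, match_offset=30, protein_start=1))
    print(start_hint(segments, match_offset=120, protein_start=1))
    print(start_hint(segments, match_offset=30, protein_start=5))
```

Keeping the mapping from concatenated offsets back to genomic coordinates in one small function makes it easy to extend the same idea to other hint types, for example the genomic extent of the coding part of the match.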

