Genome Res. 2000 Apr;10(4):511-5.
doi: 10.1101/gr.10.4.511.

GeneID in Drosophila

G Parra et al. Genome Res. 2000 Apr.

Abstract

GeneID is a program, designed with a hierarchical structure, to predict genes in anonymous genomic sequences. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. In this paper we describe how the PWMs for sites and the Markov model of coding DNA were obtained for Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools, but that GeneID is likely to be more efficient in terms of speed and memory usage.
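
The first and last steps described in the abstract can be illustrated with a minimal Python sketch. It is not GeneID's code: the function names (`build_pwm`, `scan`, `assemble`), the uniform background frequency, the pseudocount, and the toy sequences are all assumptions made for illustration. The assembly step uses a simple quadratic-time chaining of non-overlapping exons and ignores reading-frame compatibility.

```python
import math


def build_pwm(sites, background=0.25, pseudocount=1.0):
    """Log-odds PWM from aligned example sites of equal length.

    Assumes a uniform background and add-one smoothing (illustrative choices).
    """
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log((counts[b] / total) / background) for b in "ACGT"})
    return pwm


def score_site(pwm, window):
    """Sum the per-position log-odds for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, seq):
    """Score every window of the sequence; returns (position, score) pairs."""
    width = len(pwm)
    return [(i, score_site(pwm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]


def assemble(exons):
    """Choose non-overlapping exons that maximize the summed score.

    exons: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chain). Quadratic time; frame rules are ignored.
    """
    exons = sorted(exons, key=lambda e: e[1])
    best = []  # best[i]: (score, chain) of the best chain ending at exons[i]
    for i, (start, _end, score) in enumerate(exons):
        top_score, top_chain = 0.0, []
        for j in range(i):
            if exons[j][1] < start and best[j][0] > top_score:
                top_score, top_chain = best[j]
        best.append((top_score + score, top_chain + [exons[i]]))
    return max(best, key=lambda b: b[0], default=(0.0, []))


# Toy donor-like sites and candidate exons (made-up data).
donor_pwm = build_pwm(["GTAAGT", "GTGAGT", "GTAAGA"])
hits = scan(donor_pwm, "CCGTAAGTCC")
best_position = max(hits, key=lambda h: h[1])[0]

candidates = [(0, 10, 2.0), (5, 20, 5.0), (15, 30, 2.5), (25, 40, 1.0)]
total, chain = assemble(candidates)
```

In this toy run the best-scoring window starts at position 2, where the sequence matches the training sites, and the assembled chain keeps the two exons whose combined score is highest without overlap.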

Figures

Figure 1
Predictions obtained by GeneID in the region 462500–477500 from the Adh sequence, compared with the annotation in the standard std3 set. In a first step, GeneID identifies and scores all possible donor (blue) and acceptor (yellow) sites, start codons (green), and stop codons (red) using PWMs. The height of the corresponding spike is proportional to the site score. A total of 4704 sites were generated along this 15,000-bp region by GeneID; only the highest-scoring ones are displayed here. In a second step, GeneID builds all exons compatible with these sites. A total of 11,967 exons were built in this particular region (not displayed). Exons are scored as the sum of the scores of the defining sites, plus the score of their coding potential measured according to a Markov model of order 5. The coding potential is displayed along the DNA sequence (MM_score). Regions shown in strong red are more likely to be coding than regions shown in strong blue. From the set of predicted exons, the gene structure is generated, maximizing the sum of the scores of the assembled exons. Exons assembled in the predicted genes are drawn with heights proportional to their scores. A two-color code is used to indicate frame compatibility: two adjacent exons are frame compatible if the right half of the upstream exon (the remainder) matches the color of the left half of the downstream exon (the frame). Data are from the gff2ps program (available at http://www1.imim.es/~jabril/GFFTOOLS/GFF2PS.html). The input GFF and the configuration files required for gff2ps to generate this diagram can be found at http://www1.imim.es/~gparra/GASP1.
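
The exon score described in the caption (site scores plus a Markov-model coding potential) can be sketched as follows. This is an illustrative simplification, not GeneID's implementation: it uses a single homogeneous Markov chain rather than any frame-specific model, add-one smoothing, and order 2 on made-up training sequences so the toy data stay small (the caption uses order 5). The names `train_markov`, `coding_potential`, and `exon_score` are hypothetical.

```python
import math
from collections import defaultdict


def train_markov(seqs, order, pseudocount=1.0):
    """Return a function prob(context, base) estimated from training sequences.

    Uses add-one smoothing over the four bases (an illustrative assumption).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def coding_potential(seq, coding, noncoding, order):
    """Log-likelihood ratio of the coding model against the non-coding model."""
    return sum(
        math.log(coding(seq[i - order:i], seq[i]) / noncoding(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )


def exon_score(site_scores, seq, coding, noncoding, order):
    """Sum of the defining site scores plus the coding potential of the exon."""
    return sum(site_scores) + coding_potential(seq, coding, noncoding, order)


# Toy training data (made up): GC-rich "coding" versus AT-repeat "non-coding".
ORDER = 2
coding_model = train_markov(["GCCGCCGCCGCCGCCGCC"], ORDER)
noncoding_model = train_markov(["ATATATATATATATAT"], ORDER)

gc_potential = coding_potential("GCCGCCGCC", coding_model, noncoding_model, ORDER)
at_potential = coding_potential("ATATATAT", coding_model, noncoding_model, ORDER)
score = exon_score([1.5, 2.0], "GCCGCC", coding_model, noncoding_model, ORDER)
```

A positive potential means the sequence looks more like the coding training set than the non-coding one, which corresponds to the red regions of the MM_score track. A negative potential corresponds to the blue regions.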
