Comparative Study
Genome Res. 2007 Dec;17(12):1823-36.
doi: 10.1101/gr.6679507. Epub 2007 Nov 7.

Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes

Michael F Lin et al. Genome Res. 2007 Dec.

Abstract

The availability of sequenced genomes from 12 Drosophila species has enabled the use of comparative genomics for the systematic discovery of functional elements conserved within this genus. We have developed quantitative metrics for the evolutionary signatures specific to protein-coding regions and applied them genome-wide, resulting in 1193 candidate new protein-coding exons in the D. melanogaster genome. We have reviewed these predictions by manual curation and validated a subset by directed cDNA screening and sequencing, revealing both new genes and new alternative splice forms of known genes. We also used these evolutionary signatures to evaluate existing gene annotations, resulting in the validation of 87% of genes lacking descriptive names and identifying 414 poorly conserved genes that are likely to be spurious predictions, noncoding, or species-specific genes. Furthermore, our methods suggest a variety of refinements to hundreds of existing gene models, such as modifications to translation start codons and exon splice boundaries. Finally, we performed directed genome-wide searches for unusual protein-coding structures, discovering 149 possible examples of stop codon readthrough, 125 new candidate ORFs of polycistronic mRNAs, and several candidate translational frameshifts. These results affect >10% of annotated fly genes and demonstrate the power of comparative genomics to enhance our understanding of genome organization, even in a model organism as intensively studied as Drosophila melanogaster.

Figures

Figure 1.
Evolutionary signatures for protein-coding gene identification. (A) Within coding regions, triplet substitutions are biased toward conservative codon substitutions (Codon Substitution Frequencies, CSF). Additionally, indels in coding regions are strongly biased to be a multiple of three in length (reading frame conservation; RFC). (B) The color of each codon substitution between the D. melanogaster sequence and an informant sequence corresponds to a log-odds score of observing that substitution in a coding region versus a noncoding region. (C) Quantitative metrics of RFC and CSF distinguish coding and noncoding regions. Shown in blue are 5567 coding exons of well-studied genes and in orange are 22,019 regions chosen uniformly at random from the noncoding part of the genome, with the same length distribution as the exons. The CSF score is length-normalized and the discrete RFC score is dithered by adding random noise uniformly from (−0.5,0.5) for the purposes of visualization.
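The two signatures in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pairwise-alignment input format, and the toy substitution frequencies are all assumptions made for illustration. The RFC-style function reports the fraction of aligned nucleotides that share the most common frame offset, and the CSF-style function sums log-odds scores of codon substitutions under a coding versus a noncoding frequency table.

```python
import math

def rfc_fraction(target: str, informant: str) -> float:
    """Fraction of aligned bases that share the most common frame offset.

    Inputs are two rows of a pairwise alignment with '-' for gaps.
    This is a simplified stand-in for reading frame conservation.
    """
    counts = [0, 0, 0]  # aligned bases seen at each relative frame offset
    t_pos = i_pos = 0   # ungapped positions in target and informant
    for t, i in zip(target, informant):
        if t != "-" and i != "-":
            counts[(t_pos - i_pos) % 3] += 1
        if t != "-":
            t_pos += 1
        if i != "-":
            i_pos += 1
    aligned = sum(counts)
    return max(counts) / aligned if aligned else 0.0

def csf_score(target_codons, informant_codons, coding_freq, noncoding_freq):
    """Sum of log-odds scores over codon pairs that differ.

    coding_freq and noncoding_freq map (codon_a, codon_b) to a frequency.
    Pairs missing from either table are skipped in this sketch.
    """
    score = 0.0
    for pair in zip(target_codons, informant_codons):
        if pair[0] == pair[1]:
            continue  # identical codons carry no substitution evidence here
        if pair in coding_freq and pair in noncoding_freq:
            score += math.log(coding_freq[pair] / noncoding_freq[pair])
    return score

# Toy frequencies, invented for illustration only (not trained values).
CODING = {("AAA", "AAG"): 0.02, ("AAA", "TAA"): 0.0001}
NONCODING = {("AAA", "AAG"): 0.005, ("AAA", "TAA"): 0.005}

in_frame = rfc_fraction("ATGAAACCCGGG", "ATG---CCCGGG")   # 3-base gap
shifted = rfc_fraction("ATGAAACCCGGG", "ATG-AACCCGGG")    # 1-base gap
synonymous = csf_score(["AAA"], ["AAG"], CODING, NONCODING)
nonsense = csf_score(["AAA"], ["TAA"], CODING, NONCODING)
```

In this toy setup a gap of three bases leaves the RFC-style fraction at 1, while a single-base gap lowers it. The synonymous Lys substitution scores positive and the substitution to a stop codon scores negative.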
Figure 2.
New protein-coding exons predicted by evolutionary signatures, examined by manual curation, and validated by cDNA sequencing. (A) The “Evolutionary Signatures” track shows the posterior probability of a protein-coding state in a probabilistic model integrating the RFC and CSF metrics. The “Conservation” track shows the analogous quantity from a model measuring nucleotide conservation only (Siepel et al. 2005). Note the high protein-coding scores of known exons despite lower nucleotide conservation (a,d), the low protein-coding scores of conserved noncoding regions (c,e), and the prediction of a novel exon within an intron of CG4495 (b), subsequently validated (see Fig. 3). Rendered by the UCSC Genome Browser (Kent et al. 2002). (B) Distribution of 1193 new exon predictions throughout the genome. (C) Newly predicted exons were examined by manual curation, 81% leading to new and modified FlyBase gene annotations. Additionally, curation of genes rejected by evolutionary signatures led to the recognition of hundreds of spurious annotations. (D) A sample of predicted new exons was tested by cDNA sequencing with inverse PCR. Surprisingly, 44% of the validated predictions in “intronic” regions revealed a transcript independent of the surrounding gene, and 40% of the validated predictions in “intergenic” regions were part of existing genes. See Fig. 3 for examples.
Figure 3.
Full-length cDNA sequences recovered from exon predictions through inverse PCR. (A) Alternatively spliced transcripts—Exon Shuffling. The clone, IP17639, validates prediction congochr2L7183503 and provides evidence for an alternative transcript of the gene CG4495. Analysis of the embryonic microarray data (Manak et al. 2006) shows this exon is not used in embryogenesis, suggesting stage-specific splicing. Interestingly, the two alternative exons encode 20 identical amino acids at the N-terminal side of the exon. (B) 3′ CDS extension. The clone, IP17355, validates two predicted exons, congochr3R23777966 and congochr3R23778197, and provides evidence for an alternative transcript encoding an additional 126 aa at the C-terminal end of the gene, CG4951. In addition, the clone contains 185 bp of 3′ UTR. (C) New spliced interleaved gene. The clone, IP17336, validates four predicted exons, congochr3R15461397, congochr3R15461180, congochr3R15461031, and congochr3R15460742, and provides evidence for four additional exons. (D) Novel spliced overlapping gene. The clone, IP17407, validates prediction congochr3L18835687 and extends the CDS by 22 aa at the N terminus and 79 aa at the C terminus. The third coding exon overlaps the coding sequence of the gene on the opposite strand, Rad9, such that 45 aa on each strand are encoded in the region of overlap.
Figure 4.
Examples of adjustments to existing annotations based on evolutionary signatures. (A) Translation start adjustment. The annotated coding sequence begins at the indicated ATG, but the informant species show frameshifts, nonsense mutations, and nonconservative substitutions in the immediately downstream region. Strikingly, however, coding signatures begin at a slightly downstream ATG. (B) Incorrect reading frame annotated. The transcript model contains two overlapping reading frames, the slightly longer of which is annotated as the coding sequence; but the evolutionary signatures clearly show that the other is the frame under selection. (C) Nonsense mutation in (the sequenced strain of) D. melanogaster.
Figure 5.
Unusual protein-coding structures identified by evolutionary signatures. (A) A well-conserved 30-aa ORF immediately following the stop codon in the gene Caki suggests translational readthrough. Note the perfect conservation of the putative readthrough stop codon, the “wobble” of the downstream stop codon, and the precipitous loss of conservation following the downstream stop codon, typical of a true translation stop. (B) A well-conserved ORF within the annotated 3′ UTR of CG4468 suggests a dicistronic transcript structure. Note the region of poor conservation that extends precisely from the upstream stop codon to the downstream start codon, suggesting separate translation of the two ORFs. (C) An abrupt change in the reading frame upon which selection appears to act within an exon of CG14047 is suggestive of a “programmed” translational frameshift (see also Supplemental Fig. 2).
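As a rough illustration of the structural part of a readthrough search like the one in panel A, the sketch below takes a transcript sequence and the index of its annotated stop codon and returns the in-frame codons up to the next stop. The function name and input format are hypothetical. It covers only the ORF geometry: deciding whether the extension is actually under coding selection would still require the evolutionary signatures described in the paper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def readthrough_extension(seq: str, annotated_stop: int):
    """Return the in-frame codons between the annotated stop codon and the
    next in-frame stop codon, or None if no downstream stop is found.

    seq is a DNA string; annotated_stop is the 0-based index of the first
    base of the annotated stop codon.
    """
    seq = seq.upper()
    if seq[annotated_stop:annotated_stop + 3] not in STOP_CODONS:
        raise ValueError("annotated_stop does not point at a stop codon")
    codons = []
    # Walk codon by codon, staying in the frame of the annotated stop.
    for pos in range(annotated_stop + 3, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            return codons
        codons.append(codon)
    return None  # sequence ended before a second in-frame stop

# Toy transcript: ATG GCC TGA GCT GGA TAA CC
extension = readthrough_extension("ATGGCCTGAGCTGGATAACC", 6)
open_ended = readthrough_extension("ATGTGAGCTGGA", 3)
```

For the toy transcript the extension is two codons long. A candidate like the 30-aa ORF described for Caki would correspond to a returned list of 30 codons.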

