Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations

Gennifer E Merrihew¹, Colleen Davis, Brent Ewing, Gary Williams, Lukas Käll, Barbara E Frewen, William Stafford Noble, Phil Green, James H Thomas, Michael J MacCoss

PMID: 18653799
PMCID: PMC2556273
DOI: 10.1101/gr.077644.108

Gennifer E Merrihew et al. Genome Res. 2008 Oct.

Genome Res. 2008 Oct;18(10):1660-9.

doi: 10.1101/gr.077644.108. Epub 2008 Jul 24.

Affiliation

¹ University of Washington, Department of Genome Sciences, Seattle, Washington 98195, USA.

Abstract

We describe a general mass spectrometry-based approach for gene annotation of any organism and demonstrate its effectiveness using the nematode Caenorhabditis elegans. We detected 6779 C. elegans proteins (67,047 peptides), including 384 that, although annotated in WormBase WS150, lacked cDNA or other prior experimental support. We also identified 429 new coding sequences that were unannotated in WS150. Nearly half (192/429) of the new coding sequences were confirmed with RT-PCR data. Thirty-three (approximately 8%) of the new coding sequences had been predicted to be pseudogenes, 151 (approximately 35%) reveal apparent errors in gene models, and 245 (57%) appear to be novel genes. In addition, we verified 6010 exon-exon splice junctions within existing WormBase gene models. Our work confirms that mass spectrometry is a powerful experimental tool for annotating sequenced genomes. In addition, the collection of identified peptides should facilitate future proteomics experiments targeted at specific proteins of interest.

Figures

**Figure 1.**
Chromosomal distribution of peptides identified by mass spectrometry in *C. elegans*. Shown here are the distributions of our mass spectral identifications by chromosome location. The chromosomes are binned into sections of ∼100 kb, and the length of the blue line represents the number of spectra mapping to genes in that bin. This figure shows that our peptides are sampled more frequently from genes in the center of the autosomes and more sparsely from genes on the arms of the autosomes and on the sex chromosome. Assuming that peptides are sampled more frequently for abundant proteins, these data suggest that proteins encoded near the center of the autosomes are, on average, expressed at greater abundance than proteins encoded on the arms of the autosomes.
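The binning described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `bin_spectra`, the input layout (one `(chromosome, position)` pair per identified spectrum), and the example coordinates are all assumptions made for the sketch.

```python
from collections import Counter

BIN_SIZE = 100_000  # ~100 kb bins, as stated in the caption


def bin_spectra(hits, bin_size=BIN_SIZE):
    """Count spectra per (chromosome, bin index).

    hits: iterable of (chromosome, position) pairs, one per spectrum.
    This input layout is an assumption of the sketch.
    """
    counts = Counter()
    for chrom, pos in hits:
        # Integer division assigns each position to its bin.
        counts[(chrom, pos // bin_size)] += 1
    return counts


# Hypothetical coordinates for illustration only (not data from the paper).
example_hits = [("I", 50_000), ("I", 99_999), ("I", 100_000), ("X", 250_000)]
example_counts = bin_spectra(example_hits)
```

Plotting the count for each bin along the chromosome would give a profile of the kind the figure describes.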

**Figure 2.**
Splice junction confirmation via mass spectrometry. The confirmation of the splice junction between exon 10 (in blue) and exon 11 (in green) for the *C. elegans* gene C27C12.7 encoding a dipeptidyl peptidase (DPF-2) is illustrated. (A) The unspliced DNA sequence of C27C12.7 between the end of exon 10 and the beginning of exon 11. (B) Exon 10 and exon 11 spliced together. (C) The spliced exons separated into codons. (D) The peptide sequence spanning the splice junction and the representative mass spectrum. The numbers in blue *above* the peptide sequence represent the C-terminal y-ions and the red numbers *below* the peptide sequence represent the N-terminal b-ions. (Blue) y-ions; (red) b-ions; (green) all other ions (a-ions, doubly charged ions, ions from the loss of water or ammonia, etc.).
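To make the b-ion and y-ion labels concrete, the following sketch computes singly charged b- and y-ion m/z values for an unmodified peptide. It assumes standard monoisotopic residue masses (rounded to five decimals) and ignores modifications and higher charge states. Because this caption does not spell out the junction-spanning peptide, the example uses SPASGSALLDLLSR, the peptide named in the Figure 4 caption. The function name `fragment_ions` is hypothetical.

```python
# Standard monoisotopic residue masses in Da (rounded to five decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] is b(i+1), built from the N-terminus;
    y_ions[i] is y(i+1), built from the C-terminus.
    """
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal prefixes
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal suffixes
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SPASGSALLDLLSR")
```

A peptide that spans a splice junction produces fragment series whose masses are only consistent with the spliced sequence, which is why such a spectrum supports the junction.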

**Figure 3.**
Classification of proteins identified from existing or new coding sequences. Of the 6779 proteins identified, 6350 were identified based on the protein-coding genes from WormBase WS150, and 429 were identified using either new GeneFinder predictions, the conserved intergenic data set, or both. Of the 429 new proteins, 33 mapped to predicted pseudogenes in WS150. Of the 33 predicted pseudogenes, 18.2% have been confirmed by RT-PCR. We have identified 151 misannotated protein sequences, and 56.9% of these new coding sequences have RT-PCR confirmation. The last category represents 245 novel or unknown coding sequences, of which 40.8% have RT-PCR confirmation.
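The figures in this caption can be cross-checked against the totals in the abstract with a few lines of arithmetic. The per-category confirmed counts below are inferred by rounding the reported percentages. They are an assumption of this sketch, not values stated in the source.

```python
# (count, percent confirmed by RT-PCR) for each class of new coding sequence,
# as reported in the caption.
categories = {
    "predicted pseudogenes": (33, 18.2),
    "misannotated": (151, 56.9),
    "novel or unknown": (245, 40.8),
}

total_new = sum(n for n, _ in categories.values())

# Inferred confirmed counts (rounded; an assumption, not stated in the source).
confirmed = {
    name: round(n * pct / 100) for name, (n, pct) in categories.items()
}
total_confirmed = sum(confirmed.values())

# Share of each category among the new coding sequences, in percent.
shares = {name: 100 * n / total_new for name, (n, _) in categories.items()}

total_proteins = 6350 + total_new
```

Under this rounding, the three categories sum to the 429 new coding sequences, the inferred confirmed counts sum to the 192 reported in the abstract, and the category shares agree with the abstract's approximate 8%, 35%, and 57%.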

**Figure 4.**
Identification of a novel coding sequence by shotgun proteomics. Three unique peptides were identified in the genomic region 16,652,022–16,654,397 on the X chromosome. This genomic region represents a new ORF from the new GeneFinder prediction set. There are no gene models predicted in this region in WormBase WS150; however, several SAGE tags confirming this gene model have been added since WS150. A mass spectrum from the peptide SPASGSALLDLLSR is shown.

**Figure 5.**
Correction of a misannotated coding sequence. The gene *alh-3* (F36H1.6) contains six exons (pink) and encodes a dehydrogenase in *C. elegans* according to WormBase WS150. We have identified two unique peptides (blue) between exons 2 and 3 that span the genomic region 11,022,575–11,025,702 on chromosome IV. Both peptides lie at least partially within an intron. This gene model has since been fixed in WS180.

**Figure 6.**
Identification of a misannotated coding sequence located in an untranslated region (UTR). WormBase gene model T08A9.11 lies within the genomic region 7,327,554–7,330,510 on chromosome X. The two unique peptides (blue) lie within the 3′ UTR (gray) of the gene in WormBase WS150. A mass spectrum from the peptide SSLTIPDNFVTEGEVPQK, one of the two peptides identified within the 3′ UTR, is shown.

**Figure 7.**
Identification of a translated pseudogene. Two unique peptides (blue) span the conserved intergenic ORF prediction located at 13,386,747–13,387,043 on chromosome IV. In WormBase WS150 these peptides were present within a predicted pseudogene. In a later version of WormBase, this pseudogene has been corrected to a protein-coding gene. A mass spectrum of the peptide DMFAFENVGFTR, one of the two peptides confirming the translation of this pseudogene, is illustrated.

**Figure 8.**
Peptides identified in the insulin/insulin-like growth factor 1 signaling pathway can be used as proteotypic peptides in future targeted analyses. Shown here are the major proteins involved in the insulin/insulin-like growth factor 1 signaling pathway along with peptides identified from the respective proteins.
