Genome Res. 2011 Nov;21(11):1872-81.
doi: 10.1101/gr.127951.111. Epub 2011 Jul 27.

A proteogenomic analysis of Anopheles gambiae using high-resolution Fourier transform mass spectrometry

Raghothama Chaerkady et al. Genome Res. 2011 Nov.

Abstract

Anopheles gambiae is a major mosquito vector responsible for malaria transmission, whose genome sequence was reported in 2002. Genome annotation is a continuing effort, and many of the approximately 13,000 genes listed in VectorBase for Anopheles gambiae are predictions that have still not been validated by any other method. To identify protein-coding genes of An. gambiae based on its genomic sequence, we carried out a deep proteomic analysis using high-resolution Fourier transform mass spectrometry for both precursor and fragment ions. Based on peptide evidence, we were able to support or correct more than 6000 gene annotations including 80 novel gene structures and about 500 translational start sites. An additional validation by RT-PCR and cDNA sequencing was successfully performed for 105 selected genes. Our proteogenomic analysis led to the identification of 2682 genome search-specific peptides. Numerous cases of encoded proteins were documented in regions annotated as intergenic, introns, or untranslated regions. Using a database created to contain potential splice sites, we also identified 35 novel splice junctions. This is a first report to annotate the An. gambiae genome using high-accuracy mass spectrometry data as a complementary technology for genome annotation.
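
The genome search that yields "genome search-specific peptides" can be illustrated with a six-frame translation. The following is a minimal sketch, not the authors' pipeline: the abstract does not name the search software, and real proteogenomic searches match MS/MS spectra against the translated genome rather than exact strings. The function names are hypothetical, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated with base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map (strand, frame offset) to the translated sequence for all six frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def find_peptide(peptide, dna):
    """Return (strand, frame offset, residue index) for every exact match.

    The residue index is relative to the translated frame, not a genome
    coordinate.
    """
    hits = []
    for (strand, offset), protein in six_frame_translations(dna).items():
        start = protein.find(peptide)
        while start != -1:
            hits.append((strand, offset, start))
            start = protein.find(peptide, start + 1)
    return hits
```

A peptide that matches a translated frame but is absent from every annotated protein would be a candidate genome search-specific peptide in this simplified picture.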

Figures

Figure 1.
Flowchart illustrating the proteogenomics analysis steps.
Figure 2.
Mapping of mass spectrometry-derived peptide data onto the VectorBase genome browser. The unique peptides identified by mass spectrometry (rectangular bars) mapped to the known exons of the gene encoding salivary gland secreted protein 4 (SGS4) (AGAP009917-RA). The peptides identified in this study can be viewed as separate tracks on the VectorBase genome browser using the URL http://funcgen.vectorbase.org/gdav/das as the DAS server and “JHU_Ag_v2” as the data source. The JHU_Ag_v2 track shows peptide data as JHU_Ag_xxxx, where JHU and Ag stand for Johns Hopkins University and An. gambiae, respectively, and “xxxx” denotes the serial number of the peptide. The MS/MS spectra of two representative peptides, LESMLEYSDVQIDR (JHU_Ag_24279) and TVDIFVANMITFR (JHU_Ag_41147), are shown.
Figure 3.
Overview of mass spectrometry data used for genome annotation. (A) An estimation of the mass error of peptides in parts per million identified from mass spectrometric analysis of An. gambiae. (B) Chromosomal distribution of peptides identified by mass spectrometry. The number of peptides identified from each chromosome roughly parallels the estimated number of known and novel protein-coding genes in An. gambiae.
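
The mass error in panel A is expressed in parts per million. As a small sketch of that standard calculation (the function name and the example masses are hypothetical and are not values from the study):

```python
def mass_error_ppm(observed_mass, theoretical_mass):
    """Relative mass error in parts per million (ppm).

    A positive value means the observed mass is heavier than the
    theoretical mass of the assigned peptide.
    """
    if theoretical_mass <= 0:
        raise ValueError("theoretical_mass must be positive")
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


# Hypothetical example: a 0.005 Da deviation on a 1000 Da peptide.
example_error = mass_error_ppm(1000.005, 1000.0)
```

Because the error is relative, the same absolute deviation in daltons corresponds to a smaller ppm value for a heavier peptide.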
Figure 4.
(A) N-terminal extension of AGAP011939 using peptides mapping to an upstream intergenic region. Twenty peptides were mapped to an intergenic region upstream of the gene AGAP011939. SNAP predicts a longer gene model that is supported by novel peptides identified upstream of this gene. (B) Identification of a novel protein-coding gene using peptides mapping to an intergenic region. Sixteen peptides were mapped to an intergenic region on chromosome 3R, where the intron of a VectorBase gene model AGAP009515-RA was annotated on the opposite strand. The presence of a novel gene in this region is also indicated by the SNAP prediction program. (C) Correction of a gene structure using peptide mapping to an intron of an annotated gene. Fifteen peptides were identified in the intronic region of the gene AGAP008769. These peptides support two different gene models predicted by SNAP. (D) Identification of peptides translated in a different frame from the annotated protein sequence. Three GSSPs mapped within the coordinates of the sixth exon of the AGAP000622 gene that were not present in the predicted protein product of the gene. However, these peptides were present in the protein product of SNAP prediction and NCBI RefSeq annotation. (E) Identification of a novel protein-coding region using peptides mapping to the UTR of a gene. Five GSSPs mapped to the 3′-UTR region of the AGAP009974 gene. The SNAP prediction model for this genomic region supports a C-terminal extension of the protein encoded by the AGAP009974 gene. (F) Identification of a novel splice form. The peptide, IIEDSDYVAVLFYKPECK, was identified in the MS/MS ion search against the novel splice junction database of hypothetical splice isoforms. This novel splicing event, which occurred between exons 3 and 5 of the AGAP006452-RA gene, is also observed in Culex quinquefasciatus.
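
Panel F relies on a database of hypothetical splice junctions. The source does not describe how that database was built, so the following is only a sketch under stated assumptions: exons are given as same-strand sequences in transcript order, each junction joins the end of one exon to the start of a later one, and junctions between adjacent exons are treated as already annotated. The function name and the `flank` parameter are hypothetical.

```python
from itertools import combinations


def hypothetical_junctions(exons, flank=30, skip_annotated=True):
    """Enumerate exon-exon junction sequences for one gene.

    exons: exon nucleotide sequences in transcript order (assumption).
    flank: nucleotides kept on each side of the junction (assumption).
    Returns a dict keyed by 1-based (upstream exon, downstream exon).
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    junctions = {}
    for i, j in combinations(range(len(exons)), 2):
        if skip_annotated and j == i + 1:
            continue  # adjacent exons: assumed to be the annotated junction
        junctions[(i + 1, j + 1)] = exons[i][-flank:] + exons[j][:flank]
    return junctions
```

In a full workflow, each junction sequence would then be translated and searched against the MS/MS data; a match to the entry keyed (3, 5) would correspond to the kind of exon 3 to exon 5 event described in panel F.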
Figure 5.
Validation of mass spectrometry-derived data using RT-PCR and cDNA sequencing. RT-PCR products were sequenced on both strands, and the resulting sequences were submitted to GenBank. GenBank accession numbers are indicated above each lane. (A) RT-PCR validation of 35 novel genes. (B) RT-PCR validation of 70 gene models that led to the correction of existing VectorBase gene annotations in genebuild AgambP3.4. The genes that belong to novel categories with respect to genebuild AgambP3.6 are marked with an asterisk (*).
