BMC Genomics. 2005 Sep 19;6:128.
doi: 10.1186/1471-2164-6-128.

Genome annotation of Anopheles gambiae using mass spectrometry-derived data

Dário E Kalume et al. BMC Genomics.

Abstract

Background: A large number of animal and plant genomes have been completely sequenced over the last decade and are now publicly available. Although genomes can be sequenced rapidly, identifying protein-coding genes remains difficult. Protein sequence data allow direct confirmation of protein-coding genes. Mass spectrometry has recently emerged as a powerful tool for proteomic studies. Protein identification by mass spectrometry is usually carried out by searching against databases of known proteins or transcripts. This approach generally cannot identify proteins that have not yet been predicted or whose transcripts have not been identified.

Results: We searched 3,967 mass spectra from 16 LC-MS/MS runs of Anopheles gambiae salivary gland homogenates against the Anopheles gambiae genome database. This allowed us to validate 23 known transcripts and 50 novel transcripts. In addition, a novel gene was identified on the basis of peptides that matched a genomic region where no gene was known and no transcript had been predicted. The amino termini of proteins encoded by two predicted transcripts were confirmed based on N-terminally acetylated peptides sequenced by tandem mass spectrometry. Finally, six sequence polymorphisms could be annotated based on experimentally obtained peptide sequences.

Conclusion: The peptide sequences from this study were mapped onto the genomic sequence using the distributed annotation system available at Ensembl and can be visualized in the context of all other existing annotations. The strategy described in this paper can be used to correct and confirm genome annotations and permit discovery of novel proteins in a high-throughput manner by mass spectrometry.

Figures

Figure 1
A workflow depicting the steps involved in mass spectrometry data analysis for genome annotation. In this case, the tryptic peptide mixture derived from digestion of Anopheles gambiae salivary gland proteins was analyzed by liquid chromatography tandem mass spectrometry (LC-MS/MS). The mass spectrometry data were searched against the NCBI non-redundant protein database to identify known or novel transcripts from An. gambiae. The data were also searched against the An. gambiae genome database to identify novel protein-coding genes. A careful bioinformatics analysis was then performed to use the peptide data for correcting genomic annotations.
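The genome-database search in this workflow can be illustrated with a minimal Python sketch, assuming a six-frame translation followed by exact string matching. This is a simplification and not the authors' pipeline: a real search engine scores spectra against candidate peptides rather than matching known sequences. The function names and the 30-nucleotide test sequence are hypothetical; only the peptide MVVDGTFLR comes from the paper, and the codons chosen to encode it are arbitrary.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; codons not in the table become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_peptide(genome, peptide):
    """Search all six reading frames for an exact peptide match.

    Returns a list of (strand, start, end) tuples, where start and end
    are 1-based inclusive coordinates on the forward strand.
    """
    hits = []
    n = len(genome)
    span = 3 * len(peptide)
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                offset = frame + 3 * pos  # 0-based start on the searched strand
                if strand == "+":
                    hits.append((strand, offset + 1, offset + span))
                else:
                    hits.append((strand, n - offset - span + 1, n - offset))
                pos = protein.find(peptide, pos + 1)
    return hits


# Hypothetical toy sequence: two flanking bases, 27 nt encoding MVVDGTFLR,
# and one trailing base.
toy_genome = "CC" + "ATGGTTGTTGATGGTACTTTTCTTCGT" + "A"
print(find_peptide(toy_genome, "MVVDGTFLR"))  # [('+', 3, 29)]
```

A peptide that matches a translated frame but no annotated transcript is the kind of hit that would point to a candidate novel gene in this workflow.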
Figure 2
A screenshot depicting the mapping of mass spectrometry-derived peptide sequence data onto the genome sequence in 'ContigView' of the Ensembl Genome Browser. The red rectangles on the 'MS data JHU' track are peptide sequences obtained through tandem mass spectrometry. The brown rectangles on the 'Ensembl trans.' track are known Ensembl transcripts. Ten peptides are shown to match two exons in the known transcript encoding D7-related protein 1. The prefix JHU refers to Johns Hopkins University and is followed by a unique identifier for each peptide. Clicking on a peptide accession number links to a page containing additional information, including its sequence.
Figure 3
MS/MS spectra from four different instances that were used for annotation of the An. gambiae genome. (A) MS/MS spectrum of a peptide whose sequence validated an exon of a novel transcript encoding a protein of the peroxidase family ([ENSANG:P00000000593]). (B) MS/MS spectrum of a peptide used to identify a novel protein-coding gene. (C) MS/MS spectrum of a peptide that maps to a predicted UTR of a known transcript encoding Antigen 5-related 1 protein. (D) MS/MS spectrum corresponding to a peptide that is acetylated at its N-terminus. The acetyl moiety is denoted by Ac.
Figure 4
A screenshot depicting validation of a novel transcript using four mass spectrometry-derived peptide sequences. The figure shows a novel transcript ([ENSANG:T00000000593]) on chromosome 3L, with the red rectangles on the 'MS Data' track corresponding to peptide sequences obtained through mass spectrometry.
Figure 5
A screenshot depicting identification of a novel gene. Two peptide sequences, AAAYCADPSLLFAR and MVVDGTFLR, were mapped onto the forward strand of chromosome 2R at scaffold coordinates 3012881–3012922 (JHU_431) and 3012836–3012862 (JHU_432), respectively. No known or novel transcripts are annotated where these two peptides matched, as shown.
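The coordinates in this caption can be sanity-checked with simple arithmetic: a peptide that maps to the genome without crossing an intron should span exactly three nucleotides per residue. The short Python sketch below applies that check to the two peptides and coordinates given above; the function names are illustrative and not part of the original work.

```python
def span_length(start, end):
    """Length of a 1-based inclusive genomic interval."""
    return end - start + 1


def is_contiguous_hit(peptide, start, end):
    """True if the interval covers exactly three nucleotides per residue."""
    return span_length(start, end) == 3 * len(peptide)


def same_frame(start_a, start_b):
    """True if two same-strand hits start in the same reading frame."""
    return (start_a - start_b) % 3 == 0


# Peptides and scaffold coordinates as given in the Figure 5 caption.
hits = {
    "JHU_431": ("AAAYCADPSLLFAR", 3012881, 3012922),
    "JHU_432": ("MVVDGTFLR", 3012836, 3012862),
}

for accession, (peptide, start, end) in hits.items():
    print(accession, len(peptide), span_length(start, end),
          is_contiguous_hit(peptide, start, end))
# JHU_431 14 42 True
# JHU_432 9 27 True

print(same_frame(hits["JHU_431"][1], hits["JHU_432"][1]))  # True
```

Both intervals have the expected length (14 residues over 42 nucleotides, 9 residues over 27 nucleotides), and the two start positions differ by 45, a multiple of three, so the hits lie in the same reading frame. This is consistent with, but does not by itself prove, the two peptides belonging to one coding region.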
Figure 6
A screenshot depicting the correction of annotation using peptide sequence data. Panel A – Ensembl known gene [ENSANG:G00000018539] has two transcripts. Three peptides (JHU_0096, JHU_0097 and JHU_0098) align to the untranslated regions of both transcripts in this region. Panel B – Peptides are mapped onto the intronic regions of the Ensembl novel transcript [ENSANG:T00000018280].
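The kind of check behind this figure can be sketched in Python as follows: given a peptide's genomic interval and a transcript model, report whether the peptide falls in coding sequence, a UTR, an intron, or outside the transcript. Everything here is an illustrative assumption, including the function names, the dictionary layout and the toy coordinates; none of it reflects the actual Ensembl models. The sketch assumes a forward-strand transcript with exons sorted by position, and it treats any overlap with an exon as exonic.

```python
def overlaps(a, b):
    """True if two 1-based inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_hit(hit, transcript):
    """Classify a peptide interval against one forward-strand transcript.

    transcript is a dict with 'exons' (sorted list of (start, end)) and
    'cds' ((start, end) of the coding span in genomic coordinates).
    """
    exons = transcript["exons"]
    cds_start, cds_end = transcript["cds"]
    tx_start, tx_end = exons[0][0], exons[-1][1]

    if hit[1] < tx_start or hit[0] > tx_end:
        return "intergenic"
    if not any(overlaps(hit, exon) for exon in exons):
        return "intron"
    if cds_start <= hit[0] and hit[1] <= cds_end:
        return "cds"
    return "utr"


# Hypothetical transcript model with three exons; not real Ensembl data.
toy_transcript = {
    "exons": [(100, 200), (300, 400), (500, 650)],
    "cds": (150, 560),
}

for hit in [(160, 186), (210, 236), (600, 626), (700, 726)]:
    print(hit, classify_hit(hit, toy_transcript))
# (160, 186) cds
# (210, 236) intron
# (600, 626) utr
# (700, 726) intergenic
```

Peptides classified as "utr" or "intron" are the interesting cases for annotation correction, since an experimentally observed peptide is evidence that the region is translated.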
Figure 7
MS/MS spectra of five different peptides that identify coding SNPs (cSNPs). (A) The amino acid change Ala→Ser is shown in the peptide SFASDGTDVTVR, which matches the protein [ENSANG:P00000029569]. (B) The sequence of the peptide CNAEAEKVHTSSK, which matches the D7-related 3 protein precursor ([ENSANG:P00000025580]), shows the amino acid change Asp→His. (C) The peptide VPYDTKYDTVEGDYPLVVK, corresponding to the putative 5'-nucleotidase precursor ([ENSANG:P00000012716]), shows the amino acid change Ile→Val. (D) The amino acid change Tyr→Ser is identified in the peptide LLPAEYGDGVSVPR, which corresponds to the peroxidase precursor ([ENSANG:P00000000593]). (E) Two changes, Leu→Gln and Thr→Ala, occur in the same peptide SQNPASPAGSLGGKDVVSK, which corresponds to the TRIO protein ([ENSANG:P00000017522]). The amino acid changes representing the cSNPs in the five proteins are boxed.
