Mol Cell Proteomics. 2011 Jan;10(1):M110.002527.
doi: 10.1074/mcp.M110.002527. Epub 2010 Oct 28.

Proteogenomic analysis of polymorphisms and gene annotation divergences in prokaryotes using a clustered mass spectrometry-friendly database

Gustavo A de Souza et al. Mol Cell Proteomics. 2011 Jan.

Abstract

Precise annotation of genes or open reading frames is still a difficult task that results in divergence even for data generated from the same genomic sequence. This has an impact on subsequent proteomic studies and also compromises the characterization of clinical isolates, whose many specific genetic variations may not be represented in the selected database. We recently developed software called multistrain mass spectrometry prokaryotic database builder (MSMSpdbb) that can merge protein databases from several sources in a proteomics-friendly manner and can be applied to any prokaryotic organism. We generated a database for the Mycobacterium tuberculosis complex (using three strains of Mycobacterium bovis and five of M. tuberculosis) and analyzed data collected from two laboratory strains and two clinical isolates of M. tuberculosis. We identified 2561 proteins, of which 24 were present in M. tuberculosis H37Rv samples but not annotated in the M. tuberculosis H37Rv genome. We were also able to identify 280 nonsynonymous single amino acid polymorphisms and confirm 367 translational start sites. As a proof of concept, we applied the database to whole-genome DNA sequencing data of one of the clinical isolates, which allowed the validation of 116 predicted single amino acid polymorphisms and the annotation of 131 N-terminal start sites. Moreover, we identified regions not present in the original M. tuberculosis H37Rv sequence, indicating strain divergence or errors in the reference sequence. In conclusion, we demonstrated the potential of using a merged database to better characterize laboratory or clinical bacterial strains.
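The idea of merging protein databases from several strains can be illustrated with a minimal Python sketch. It is not the MSMSpdbb algorithm, which the abstract does not describe in detail; it only assumes that identical sequences annotated in different strains can be collapsed into one entry that remembers its sources. The function name `merge_strain_databases`, the toy sequences, and the data layout are all hypothetical.

```python
from collections import defaultdict


def merge_strain_databases(databases):
    """Collapse identical protein sequences across strains (illustrative only).

    databases: dict mapping strain name -> {protein_id: sequence}.
    Returns a dict mapping each unique sequence to a sorted list of
    (strain, protein_id) pairs that carry exactly that sequence.
    """
    merged = defaultdict(list)
    for strain, proteins in databases.items():
        for protein_id, sequence in proteins.items():
            merged[sequence.upper()].append((strain, protein_id))
    return {seq: sorted(sources) for seq, sources in merged.items()}


# Toy input: invented sequences, with real-looking locus names used only as labels.
toy = {
    "H37Rv": {"Rv0001": "MKTAYIAK", "Rv0002": "MSDNPR"},
    "CDC1551": {"MT0001": "MKTAYIAK", "MT0002": "MSDNPK"},
}
merged = merge_strain_databases(toy)
for seq, sources in merged.items():
    print(seq, sources)
```

In this toy case the two strains share one identical protein, which is stored once, while the pair differing by a single residue stays as two entries. A real builder would have to decide how to represent such near-identical variants compactly, which this sketch does not attempt.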

Figures

Fig. 1.
TSS choice validation for protein Rv0390. A, The entry in FASTA format is shown. The underlined regions delimit the expected tryptic peptides of two predicted TSS choices, a valine and a methionine (bold, underlined). These tryptic peptides were artificially appended to the end of the entry after the letter code “O” (bold). B, Fragmentation profile of peptide SYAGDITPLQAWEMLSDNPR.
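The entry layout described in this caption can be sketched in Python under stated assumptions. The longest predicted form of the protein is kept as the main sequence, and the N-terminal tryptic peptide of each alternative start site is appended after the letter "O". The peptide from an internal start site has to be added explicitly because digesting the longest form would not produce it: the residue before the internal start is generally not a cleavage site. The function names, the cleavage rule (after K or R, not before P), the use of "O" before every appended peptide, and the toy sequence are assumptions, not details taken from the paper.

```python
import re


def n_terminal_tryptic_peptide(sequence):
    """Return the first tryptic peptide: cleave after K or R unless followed by P."""
    match = re.search(r"[KR](?!P)", sequence)
    return sequence if match is None else sequence[: match.end()]


def append_tss_peptides(longest_form, alt_start_offsets, separator="O"):
    """Append the N-terminal tryptic peptide of each alternative start site.

    longest_form: protein sequence from the most upstream predicted start.
    alt_start_offsets: 0-based positions of alternative starts within it.
    separator: letter placed before each appended peptide (assumed layout).
    """
    peptides = [
        n_terminal_tryptic_peptide(longest_form[offset:])
        for offset in alt_start_offsets
    ]
    return longest_form + "".join(separator + peptide for peptide in peptides)


# Invented toy sequence: a valine start at position 0 and a methionine start at 6.
entry = append_tss_peptides("VTARPLMSYAGDKTPLR", [6])
print(entry)
```

Here the longest form alone would yield the peptide VTARPLMSYAGDK on digestion, so the shorter MSYAGDK expected from the methionine start is only searchable because it is appended after the separator.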
Fig. 2.
Identification of regions predicted as noncoding. A, The entry in FASTA format is shown, with the predicted N-terminal tryptic peptide underlined. The sequence in bold is present in a region initially predicted as noncoding in all eight genomes used in this work. B, The fragmentation pattern of sequence MEGDAGAGQLNPADANK is shown, indicating that this region is indeed coding. This amino acid sequence is not present in any other gene of the database.
Fig. 3.
Identification of M. tuberculosis H37Rv unannotated genes. A, Schematic representation of the genomic region containing the gene MT2297 from the M. tuberculosis CDC1551 strain. Black boxes indicate gene annotation. In M. tuberculosis H37Rv and M. tuberculosis F11 genomes, the gene is not annotated but the genomic region is nonetheless present. B, Fragmentation pattern of peptide ADLYAAVDAMR from MT2297, present in M. tuberculosis H37Rv fractions.
Fig. 4.
Missing region of the M. tuberculosis H37Rv genome. A, Alignment of selected gene sequences from M. tuberculosis CDC1551, H37Ra, and H37Rv genomes, illustrating a deletion region in M. tuberculosis H37Rv that includes genes MT2420/MRA_2374 to MT2422/MRA_2376. Interestingly, M. tuberculosis H37Rv and M. tuberculosis H37Ra share the same ancestor, but the M. tuberculosis H37Ra genome sequence shares more similarities with the M. tuberculosis CDC1551 genome than with the original M. tuberculosis H37Rv genome sequence. B, Fragmentation pattern of peptide AQAAALEAEHQAIVR from MT2420, found in M. tuberculosis H37Rv (ATCC27294) whole cell lysates, indicating that the deletion reported in the original M. tuberculosis H37Rv sequencing effort is incorrect.

MS/MS information tables (MaxQuant output) are openly available at www.proteomecommons.org under the Hash code: dXuxNwU84QKYzzkLfmpU8Mcv6p277wRTOWXjRuWEH/WkkdAyYT/DeWm3ILF43l3lLZF7MMchNwPBwWa6G16fo6KhRrIAAAAAAAAC/w = =. All RAW files used in this work can be downloaded using the Hash code: EIH2o0QZ9mMIXgurpLpJ34rgf1PQHXKOIa0EUOX0NIZ+bJdOOsdkXvcCQ9N5ZUqtlAEDZ/TQaoPn/uTOvpR5SPQuAyAAAAAAAAB0Cw = =.
