Review

Nat Methods. 2014 Nov;11(11):1114-25.

doi: 10.1038/nmeth.3144.

Proteogenomics: concepts, applications and computational strategies

Alexey I Nesvizhskii¹

Affiliation

¹ [1] Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA. [2] Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA.

PMID: 25357241
PMCID: PMC4392723
DOI: 10.1038/nmeth.3144

Alexey I Nesvizhskii. Nat Methods. 2014 Nov.

Abstract

Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry-based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of new sequencing technologies such as RNA-seq and dramatic improvements in the depth and throughput of mass spectrometry-based proteomics, the pace of proteogenomic research has greatly accelerated. Here I review the current state of proteogenomic methods and applications, including computational strategies for building and using customized protein sequence databases. I also draw attention to the challenge of false positive identifications in proteogenomics and provide guidelines for analyzing the data and reporting the results of proteogenomic studies.

Conflict of interest statement

Competing Financial Interests

The author declares no competing financial interests.

Figures

**Figure 1. Peptide and protein identification in shotgun proteomics**
A) Overview of shotgun proteomics. Proteins are digested into peptides, which are separated by liquid chromatography coupled online to a mass spectrometer and then analyzed by the mass spectrometer, which generates tandem mass (MS/MS) spectra. B) Peptides are most commonly identified using a sequence database search approach. Traditionally, experimental MS/MS spectra are matched with theoretical spectra predicted for each peptide contained in a protein sequence database. Sequence tag-assisted database searching starts with the extraction of short sequence tags from the spectrum, followed by database searching in which the list of candidate peptides is restricted to only those peptides that contain one of the extracted tags, allowing for mutations in the sequences of candidate database peptides. A peptide sequence can also be extracted directly from the spectrum using de novo sequencing (the extracted sequences can then be searched against a protein sequence database to find the exact or a homologous peptide).
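The traditional database search in panel B can be illustrated with a toy sketch. It assumes fully tryptic cleavage (after K or R, but not before P), singly charged b and y fragment ions, no modifications, and a simple shared-peak count as the score. The function names (`tryptic_digest`, `fragment_mzs`, `best_match`) and the 0.02 m/z tolerance are illustrative choices, not the scoring of any particular search engine, and a real search would also filter candidates by precursor mass.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def fragment_mzs(peptide):
    """Sorted m/z values of singly charged b and y ions (the theoretical spectrum)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion: N-terminal prefix
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion: C-terminal suffix
    return sorted(ions)


def shared_peak_count(observed_mzs, peptide, tol=0.02):
    """Number of theoretical fragments with an observed peak within tol."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mzs)
        for theo in fragment_mzs(peptide)
    )


def best_match(observed_mzs, proteins, tol=0.02):
    """Return the database peptide with the highest shared-peak count."""
    candidates = [pep for protein in proteins for pep in tryptic_digest(protein)]
    return max(candidates, key=lambda pep: shared_peak_count(observed_mzs, pep, tol))


# Toy usage: a noise-free "spectrum" simulated from the peptide AAK.
database = ["MKPRAAK"]            # digests to ["MKPR", "AAK"]
spectrum = fragment_mzs("AAK")
match = best_match(spectrum, database)
```

A peptide absent from the searched database can never be returned by `best_match`, which is why novel peptides require a customized database.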

**Figure 2. The concept of proteogenomics**
In a proteogenomics approach, genomics (DNA sequencing, expressed sequence tags (ESTs)) and transcriptomics (RNA-Seq, ribosome profiling) data are used to generate customized protein sequence databases that help interpret proteomics (LC-MS/MS) data. In turn, the proteomics data provide protein-level validation of the gene expression data and help refine gene models. The enhanced gene models can help improve protein sequence databases for traditional proteomics analysis.
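One simple way to derive a customized database from nucleotide sequence is a six-frame translation, sketched below. This is only one of several possible strategies and is shown here as an assumption, not as the specific procedure depicted in the figure. The sketch uses the standard genetic code, splits each translated frame at stop codons, and keeps segments of a minimum length; `six_frame_segments` and `min_length` are hypothetical names.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_segments(dna, min_length=7):
    """Return (frame label, protein segment) pairs from all six reading frames.

    Each frame is split at stop codons ('*'); segments shorter than
    min_length residues are discarded.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    entries = []
    for strand, seq in strands.items():
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_length:
                    entries.append((f"{strand}{frame + 1}", segment))
    return entries


# Toy usage: ATG GCC AAG TTT GGT CGT TAA encodes MAKFGR followed by a stop.
entries = six_frame_segments("ATGGCCAAGTTTGGTCGTTAA", min_length=6)
```

Six-frame databases grow quickly with genome size, which is one reason the statistical treatment of novel peptide identifications matters.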

**Figure 3. Types of peptides identified in proteogenomics**
Peptides identified by searching customized protein sequence databases are mapped onto the genome. Intergenic peptides map to regions located between annotated gene models, whereas intragenic peptides map to genomic regions contained within or in close proximity to an annotated gene model. Intragenic peptides can be further categorized based on the annotation of the corresponding gene model (e.g. ‘protein-coding gene’, ‘long noncoding RNA (lncRNA) gene’, and ‘pseudogene’). The majority of peptides map to a protein-coding gene and can be divided into Exon and exon-exon junction (Junction) peptides. Novel peptides include peptides mapping to untranslated regions (3′ or 5′ UTR peptides) or introns (Intron peptides), peptides spanning the boundary between the coding sequence region and the neighboring UTR or intron region (Exon boundary), peptides spanning unannotated (alternative) splice junctions (Alt junction), and out-of-frame peptides (Alt frame).
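A heavily simplified sketch of this categorization follows. It assumes a peptide maps to a single contiguous genomic interval, a single gene model described by its span and its coding exons, half-open coordinates on one strand, and no reading-frame or splice-junction information, so the Junction, Alt junction and Alt frame classes are not handled. The function name and the returned labels are illustrative.

```python
def classify_peptide(pep_start, pep_end, gene_start, gene_end, coding_exons):
    """Assign a coarse location class to a peptide mapped to [pep_start, pep_end).

    coding_exons is a list of (start, end) half-open intervals of coding
    sequence within the gene span [gene_start, gene_end).
    """
    # No overlap with the gene span at all.
    if pep_end <= gene_start or pep_start >= gene_end:
        return "intergenic"

    # Fully contained in one coding exon.
    for ex_start, ex_end in coding_exons:
        if ex_start <= pep_start and pep_end <= ex_end:
            return "exon"

    # Entirely upstream or downstream of the coding region, but inside the gene.
    cds_start = min(start for start, _ in coding_exons)
    cds_end = max(end for _, end in coding_exons)
    if pep_end <= cds_start or pep_start >= cds_end:
        return "UTR"

    # Partial overlap with a coding exon means the peptide crosses its edge.
    overlaps_exon = any(
        pep_start < ex_end and ex_start < pep_end for ex_start, ex_end in coding_exons
    )
    return "exon boundary" if overlaps_exon else "intron"


# Toy gene model: gene span 100-1000 with two coding exons.
exons = [(200, 300), (500, 600)]
labels = [
    classify_peptide(start, end, 100, 1000, exons)
    for start, end in [(10, 40), (210, 240), (290, 320), (350, 380), (120, 150)]
]
```

In practice the classification would use full transcript annotations (strand, frame, UTR exons and splice junctions) rather than a single interval test.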

**Figure 4. Statistical assessment of peptide identifications in proteogenomics**
MS/MS spectra are searched against a customized protein sequence database that includes target sequences for the organism of interest, i.e. a reference protein database and predicted protein sequences (containing novel peptides). In addition, two decoy databases (e.g. reversed sequences) of the same sizes as the target reference and predicted databases are appended to the target databases. The best database peptide match for each spectrum is selected for further analysis. Peptide identifications are classified as known or novel (for a decoy peptide, the class, ‘known’ or ‘novel’, is determined by the class of the target sequence from which the decoy was generated). When using simple filtering based on database search scores, the numbers of target and decoy peptide identifications passing a certain score threshold are counted and used to estimate the FDR corresponding to that threshold. FDR analysis should be done separately for known and novel peptides (class-specific FDR) because the numbers of known and novel sequences in the searched customized sequence database differ, and because a novel peptide is less likely to be correctly identified than a known peptide. When using more advanced methods based on computing posterior peptide probabilities, both the database search scores and the peptide class (known or novel) should be taken into consideration.
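The simple score-threshold version of this procedure can be sketched as follows. The sketch assumes one best match per spectrum, decoy databases of the same size as their target counterparts, and the common estimate FDR ≈ (number of decoys) / (number of targets) above the threshold. The tuple layout and the example scores are made up for illustration.

```python
from collections import defaultdict


def class_specific_fdr(psms, threshold):
    """Estimate FDR per peptide class at a score threshold.

    psms: iterable of (score, is_decoy, peptide_class) tuples, one best
    match per spectrum; peptide_class is "known" or "novel".
    Returns {class: decoys / targets}, or None for a class with no targets.
    """
    targets = defaultdict(int)
    decoys = defaultdict(int)
    for score, is_decoy, peptide_class in psms:
        if score >= threshold:
            if is_decoy:
                decoys[peptide_class] += 1
            else:
                targets[peptide_class] += 1
    return {
        cls: (decoys[cls] / targets[cls] if targets[cls] else None)
        for cls in set(targets) | set(decoys)
    }


def pooled_fdr(psms, threshold):
    """Same estimate with known and novel identifications pooled together."""
    passing = [is_decoy for score, is_decoy, _ in psms if score >= threshold]
    n_decoys = sum(passing)
    n_targets = len(passing) - n_decoys
    return n_decoys / n_targets if n_targets else None


# Made-up example: (score, is_decoy, class).
psms = [
    (50, False, "known"), (45, False, "known"), (40, False, "known"),
    (38, True, "known"), (35, False, "known"),
    (42, False, "novel"), (39, True, "novel"), (37, False, "novel"),
    (20, True, "novel"),
]
per_class = class_specific_fdr(psms, threshold=30)
pooled = pooled_fdr(psms, threshold=30)
```

In this toy data the pooled estimate (2 decoys / 6 targets) sits between the known-class estimate (1/4) and the novel-class estimate (1/2), which illustrates how a pooled FDR can understate the error rate among novel peptides.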
