J Proteome Res. 2021 Jun 4;20(6):3353-3364.
doi: 10.1021/acs.jproteome.1c00264. Epub 2021 May 17.

Personalized Proteome: Comparing Proteogenomics and Open Variant Search Approaches for Single Amino Acid Variant Detection

Renee Salz et al. J Proteome Res.

Abstract

Discovery of variant peptides such as a single amino acid variant (SAAV) in shotgun proteomics data is essential for personalized proteomics. Both the resolution of shotgun proteomics methods and the search engines have improved dramatically, allowing for confident identification of SAAV peptides. However, it is not yet known if these methods are truly successful in accurately identifying SAAV peptides without prior genomic information in the search database. We studied this in unprecedented detail by exploiting publicly available long-read RNA sequences and shotgun proteomics data from the gold standard reference cell line NA12878. Searching spectra from this cell line with the state-of-the-art open modification search engine ionbot against carefully curated search databases resulted in 96.7% false-positive SAAVs and an 85% lower true positive rate than searching with peptide search databases that incorporate prior genetic information. While adding genetic variants to the search database remains indispensable for correct peptide identification, inclusion of long-read RNA sequences in the search database contributes only 0.3% new peptide identifications. These findings reveal the differences in SAAV detection that result from various approaches, providing guidance to researchers studying SAAV peptides and developers of peptide spectrum identification tools.
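
The two headline figures in the abstract (the fraction of false-positive SAAVs from the open search, and the lower true positive rate relative to a variant-containing database) can be illustrated with a minimal sketch. It assumes each search yields a set of variant peptide sequences and that a variant counts as a true positive when it appears in a genome-derived set of expected variant peptides. The function name, the peptide strings, and the exact definitions of both metrics are illustrative assumptions, not the authors' pipeline.

```python
def summarize_saav_calls(open_search, variant_db_search, genome_supported):
    """Summarize SAAV calls from two search strategies.

    All three arguments are sets of variant peptide sequences.
    The metric definitions below are assumptions made for illustration.
    """
    open_tp = open_search & genome_supported        # open-search calls backed by the genome
    open_fp = open_search - genome_supported        # open-search calls with no genomic support
    db_tp = variant_db_search & genome_supported    # calls from the variant-containing database

    # Fraction of open-search variant calls that lack genomic support.
    fp_fraction = len(open_fp) / len(open_search) if open_search else 0.0
    # Relative drop in true positives compared with the variant-containing search.
    tp_reduction = 1 - len(open_tp) / len(db_tp) if db_tp else 0.0

    return {
        "false_positive_fraction": fp_fraction,
        "true_positive_reduction": tp_reduction,
    }


# Toy, made-up peptide sets (not data from the study).
genome_supported = {"AAVLK", "GTSEVR", "LLQDK", "MNPTR"}
variant_db_search = {"AAVLK", "GTSEVR", "LLQDK", "MNPTR"}
open_search = {"AAVLK", "QQWEK", "TTYPR", "HHGLK"}

summary = summarize_saav_calls(open_search, variant_db_search, genome_supported)
print(summary)
```

In this toy case, three of the four open-search calls lack genomic support, and the open search recovers one of the four true variants found with the variant-containing database.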

Keywords: deep proteomics; direct RNA sequencing; long-read RNA sequence; open search.

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Creation of the search databases. (A) Three databases were built to compare different sources of sequences: one with only translations of transcriptome sequences (ONT), one with only the reference proteome (GENCODE), and one with the union of the two. This comparison is denoted with a blue square. Variants from NA12878 were incorporated into the combination database from A, and the result was compared to the combination database without variants. This comparison is denoted with a red square. (B) Number of (predicted) ORFs in the different sources used to construct the VF search database and their overlap. The sources included the GENCODE v29 reference ORFs and the predicted ORFs from ONT RNAseq. Two ORF prediction tools (ANGEL and SQANTI) were used to determine candidate ORFs, and their intersection was included in the final search database.
Figure 2
Detectable peptides per method. Theoretical (upper pie charts) and observed (lower pie charts) proportions of peptides when searching against VC (right) or VF (left) search databases. The charts show the percentages of matched peptides attributed only to GENCODE proteins, only to ONT proteins, or to proteins in both databases.
Figure 3
Detection of variant peptides using (combination) VF and VC databases. (A) Variant PSMs (left) and unique peptides (right) attributed to genome-supported variant peptides. (B) PSM and peptide counts found by each method.
Figure 4
Properties of detected variants compared to those expected. (A) Groups of variant peptides being compared. All circles, including all overlaps, are compared with each other. (B) Differences in length distribution between variant peptides detected by the different variant detection methods. (C) Normalized (divided by max) frequency of variation per original (reference) amino acid.
Figure 5
False-negative variant misidentifications. (A) Investigation of causes of misidentification of peptides in the VF set. (B) Scores of those misidentified peptides in the VF vs VC set. Each point corresponds to one false-negative variant peptide. The Percolator PSM score is used. Color corresponds to delta retention time.
Figure 6
False-positive misidentifications. (A) False-positive misidentifications are genome-unsupported (US) variants predicted by the VF method. The Venn diagram highlights the subset of variants investigated in this figure. These 2998 variants were predicted by ionbot to be variant peptides but were not found with the variant-containing set. All but seven were variants unsupported by genome information. (B) Relative score distributions of genome-supported vs unsupported variants in the VF set. (C) Unexpected modifications found with the VC set corresponding to all “false-positive” predicted variant PSMs in the VF set.
Figure 7
Underlying SNPs detected at the protein level. (A) Variant peptide abundance vs reference counterpart abundance, split by zygosity and search database and square root-transformed. (B) Separating heterozygous variants in the variant-containing database by whether more variant peptide was found (variant-biased) or more of the reference counterpart was found (reference-biased) revealed differences in allele frequency distributions. (C) Ratio variability of genes with two or more variant peptides. Ratio is defined by the variant counterpart abundance divided by variant peptide abundance. The y axis shows max - min per gene.
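
The per-gene ratio variability described for Figure 7C can be sketched as follows. This is a minimal illustration, assuming each record carries a gene name, the counterpart abundance, and the variant peptide abundance. The function name, the record layout, and the toy numbers are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def ratio_variability(records):
    """Return max - min of the abundance ratio per gene.

    records: iterable of (gene, counterpart_abundance, variant_abundance).
    Only genes with two or more usable variant peptides are reported,
    mirroring the caption's "two or more variant peptides" condition.
    """
    ratios = defaultdict(list)
    for gene, counterpart, variant in records:
        if variant > 0:  # skip records where the ratio is undefined
            ratios[gene].append(counterpart / variant)
    return {
        gene: max(values) - min(values)
        for gene, values in ratios.items()
        if len(values) >= 2
    }


# Toy, made-up abundances (not data from the study).
records = [
    ("GENE_A", 10.0, 5.0),  # ratio 2.0
    ("GENE_A", 6.0, 6.0),   # ratio 1.0
    ("GENE_B", 4.0, 2.0),   # only one variant peptide, so excluded
]
variability = ratio_variability(records)
print(variability)
```

A value near zero for a gene would mean its variant peptides show similar ratios, while a large value would mean the ratios disagree across peptides of the same gene.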
