Estimating tissue-specific peptide abundance from public RNA-Seq data

Angela Frentzen¹, Jason A Greenbaum¹, Haeuk Kim¹, Bjoern Peters¹,², Zeynep Koşaloğlu-Yalçın¹

Affiliations

¹ Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, San Diego, CA, United States.
² Department of Medicine, University of California, San Diego, San Diego, CA, United States.

PMID: 36713080
PMCID: PMC9878344
DOI: 10.3389/fgene.2023.1082168

Angela Frentzen et al. Front Genet. 2023 Jan 12;14:1082168. doi: 10.3389/fgene.2023.1082168. eCollection 2023.

Abstract

Several novel MHC class I epitope prediction tools additionally incorporate the abundance levels of the peptides' source antigens and have shown improved performance for predicting immunogenicity. Such tools require the user to input the MHC alleles and peptide sequences of interest, as well as the abundance levels of the peptides' source proteins. However, such expression data is often not directly available to users, and retrieving the expression level of a peptide's source antigen from public databases is not trivial. We have developed the Peptide eXpression annotator (pepX), which takes a peptide as input, identifies from which proteins the peptide can be derived, and returns an estimate of the expression level of those source proteins from selected public databases. We have also investigated how the abundance level of a peptide can be best estimated in cases when it can originate from multiple transcripts and proteins and found that summing up transcript-level expression values performs best in distinguishing ligands from decoy peptides.
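The lookup described in the abstract can be pictured with a small sketch. This is not the pepX implementation: the protein sequences, identifiers, TPM values, and function names below are all hypothetical, and a plain substring search stands in for whatever indexing the tool actually uses.

```python
# Hypothetical toy data: protein sequences, the transcript encoding each
# protein, and transcript-level TPM values for one tissue.
PROTEINS = {
    "protA1": "MKTLLVAGSPEPTIDEKQ",
    "protA2": "MKTLLPEPTIDEKQRR",
    "protB1": "MSSPEPTIDEWW",
}
PROTEIN_TO_TRANSCRIPT = {"protA1": "txA1", "protA2": "txA2", "protB1": "txB1"}
TRANSCRIPT_TPM = {"txA1": 12.5, "txA2": 3.0, "txB1": 0.5}


def source_proteins(peptide):
    """Return the IDs of all proteins whose sequence contains the peptide."""
    return [pid for pid, seq in PROTEINS.items() if peptide in seq]


def total_transcript_tpm(peptide):
    """Sum TPM over the distinct transcripts that encode a source protein."""
    transcripts = {PROTEIN_TO_TRANSCRIPT[p] for p in source_proteins(peptide)}
    return sum(TRANSCRIPT_TPM.get(t, 0.0) for t in transcripts)


print(source_proteins("PEPTIDE"))       # proteins containing the peptide
print(total_transcript_tpm("PEPTIDE"))  # summed transcript-level TPM
```

Collecting transcripts into a set before summing avoids counting a transcript twice if several protein records map to it.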

Keywords: RNA sequencing; RNA-Seq; cancer; ligands; peptide (pep); prediction; tool.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
pepX Database Schema. The three proteome-related tables of the database (in blue) catalog peptides, proteins, transcripts and genes. The three expression-related tables of the database (in green) track gene- and transcript-level TPM data and associated study details.

**FIGURE 2**
Expression data from TCGA correlates well with patient-specific RNA-Seq data. **(A)** The mean, median, and geometric mean (geomean) TPM were calculated for the gene PDCD1 for the 470 TCGA-SKCM patients (each dot represents one patient). **(B)** For each patient in the Hugo (Hugo et al., 2016) and Riaz (Riaz et al., 2017) datasets, all genes were considered and TPM values were correlated to the TCGA-SKCM mean, median, and geomean TPM. Spearman correlation coefficients were calculated. **(C)** For each patient, the TPM values were separated into ranges for both the patient-specific (x-axis) and the TCGA median (y-axis) TPM values. For each TPM range combination, the fraction of genes expressed within the corresponding TPM ranges is shown as a percentage and is also color-coded.
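A minimal sketch of the comparison in Figure 2, using only the Python standard library: per-gene cohort summaries (mean, median, geomean) followed by a Spearman correlation against one patient's TPM values. The cohort and patient numbers are invented, and the pseudocount in the geometric mean is an assumption made here to handle zero TPMs; the caption does not say how zeros were treated.

```python
import math
import statistics


def geomean_tpm(values, pseudocount=1.0):
    """Geometric mean of TPMs. The pseudocount is an assumption of this
    sketch; it keeps a single zero TPM from forcing the result to zero."""
    logs = [math.log(v + pseudocount) for v in values]
    return math.exp(sum(logs) / len(logs)) - pseudocount


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical cohort: rows are patients, columns are four genes.
cohort = [
    [0.0, 5.0, 40.0, 100.0],
    [2.0, 7.0, 10.0, 300.0],
    [1.0, 9.0, 20.0, 200.0],
]
per_gene = list(zip(*cohort))
cohort_mean = [statistics.fmean(g) for g in per_gene]
cohort_median = [statistics.median(g) for g in per_gene]
cohort_geomean = [geomean_tpm(g) for g in per_gene]

patient = [0.5, 30.0, 12.0, 150.0]  # hypothetical patient-specific TPMs
print(spearman(patient, cohort_median))
```

Because Spearman correlation depends only on ranks, it is insensitive to the large dynamic range of TPM values, which is one reason it suits this kind of comparison.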

**FIGURE 3**
Considerations for retrieving peptide abundance levels. Due to alternative splicing, genes can produce multiple different proteins. **(A)** The different protein sequences usually share amino acid stretches encoded by the same exons. It is also possible that different genes share amino acid stretches, particularly genes from the same gene family. Peptide A (highlighted in red), for example, can be retrieved from two proteins of Gene A and from one protein of Gene B, while Peptide B (highlighted in blue) can be retrieved from three proteins of Gene. **(B)** Performance comparison of different ways to aggregate TPM values of multiple source proteins in distinguishing ligands of the HLA Ligand Atlas from decoy peptides. Summing up TPM values (total TPM) from all genes a peptide can be retrieved from performs best, followed by using the maximum TPM of all genes (Wilcoxon Test, p ≤ .0001). **(C)** Performance comparison of scaling the TPM values considering the number of proteins a gene encodes and the number of proteins a peptide occurs in for ligands from the HLA Ligand Atlas. The total TPM significantly outperformed the total scaled TPM values (Wilcoxon Test, p ≤ .0001). **(D)** Ligand elution datasets used in this study and the expression datasets we used to retrieve abundance levels. **(E)** Performance comparison of different ways to aggregate TPM values of multiple source proteins in distinguishing ligands of the six validation datasets from decoy peptides. **(F)** Performance comparison of total TPM and total scaled TPM in distinguishing ligands of the six validation datasets from decoy peptides.
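The aggregation options compared in Figure 3 can be illustrated with a toy version of the Peptide A scenario. The TPM values and per-gene protein counts are hypothetical, and the scaling formula is only one plausible reading of the caption (gene TPM weighted by the fraction of the gene's proteins that contain the peptide); the caption does not give the authors' exact formula.

```python
# Hypothetical gene-level TPMs, loosely following Peptide A in panel (A):
# the peptide occurs in two proteins of gene A and one protein of gene B.
GENE_TPM = {"geneA": 30.0, "geneB": 8.0}
PROTEINS_PER_GENE = {"geneA": 3, "geneB": 2}  # assumed protein counts
PEPTIDE_HITS = {"geneA": 2, "geneB": 1}       # proteins containing the peptide


def total_tpm(hits):
    """Sum the TPM of every gene the peptide can be retrieved from."""
    return sum(GENE_TPM[g] for g in hits)


def max_tpm(hits):
    """Take the highest TPM among the peptide's source genes."""
    return max(GENE_TPM[g] for g in hits)


def total_scaled_tpm(hits):
    """Assumed scaling: weight each gene's TPM by the fraction of its
    proteins that contain the peptide, then sum across genes."""
    return sum(GENE_TPM[g] * hits[g] / PROTEINS_PER_GENE[g] for g in hits)


print(total_tpm(PEPTIDE_HITS))         # 30 + 8
print(max_tpm(PEPTIDE_HITS))           # larger of the two genes
print(total_scaled_tpm(PEPTIDE_HITS))  # 30 * 2/3 + 8 * 1/2
```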

**FIGURE 4**
**(A)** Performance comparison of different ways to aggregate transcript-level TPM values of multiple source proteins in distinguishing ligands of the HLA Ligand Atlas from decoy peptides. Summing up TPM values (total TPM) from all transcripts a peptide can be retrieved from performs best, followed by using the maximum TPM of all transcripts (Wilcoxon Test, p ≤ .0001). **(B)** Performance comparison of different ways to aggregate transcript-level TPM values of multiple source proteins in distinguishing ligands of the four validation datasets from decoy peptides. **(C)** Performance comparison of using transcript-level and gene-level TPM values in distinguishing ligands of the HLA Ligand Atlas from decoy peptides. The total transcript-level TPM significantly outperformed the total gene-level TPM (Wilcoxon Test, p ≤ .0001). **(D)** Performance comparison of using transcript-level and gene-level TPM values in distinguishing ligands of the four validation datasets from decoy peptides. The mean AUC of the total transcript-level TPM is higher than that of the total gene-level TPM, but the difference is not significant (Wilcoxon Test, p > .05).
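How an abundance estimate is scored against ligands and decoys can be sketched as a ROC AUC, computed here as the fraction of ligand/decoy pairs in which the ligand has the higher abundance (ties count as half). This is an illustrative assumption about the evaluation, not the authors' pipeline, and the scores are invented; they are not meant to reproduce the reported results.

```python
def roc_auc(ligand_scores, decoy_scores):
    """ROC AUC as the probability that a random ligand outranks a random
    decoy, with ties counted as half a win."""
    wins = 0.0
    for lig in ligand_scores:
        for dec in decoy_scores:
            if lig > dec:
                wins += 1.0
            elif lig == dec:
                wins += 0.5
    return wins / (len(ligand_scores) * len(decoy_scores))


# Hypothetical total TPM values for the same peptides under two estimates.
ligand_transcript = [25.0, 8.0, 40.0, 3.0]
decoy_transcript = [1.0, 0.0, 5.0, 2.0]
ligand_gene = [30.0, 8.0, 45.0, 3.0]
decoy_gene = [4.0, 0.0, 9.0, 2.0]

auc_transcript = roc_auc(ligand_transcript, decoy_transcript)
auc_gene = roc_auc(ligand_gene, decoy_gene)
print(auc_transcript, auc_gene)
```

An AUC of 0.5 means the abundance estimate does not separate ligands from decoys at all, while 1.0 means every ligand has higher estimated abundance than every decoy.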

Grants and funding

U24 CA248138/CA/NCI NIH HHS/United States
