Mol Cell Proteomics. 2022 Apr;21(4):100220.

doi: 10.1016/j.mcpro.2022.100220. Epub 2022 Feb 26.

Proteogenomic Analysis of Breast Cancer Transcriptomic and Proteomic Data, Using De Novo Transcript Assembly: Genome-Wide Identification of Novel Peptides and Clinical Implications

P S Hari¹, Lavanya Balakrishnan¹, Chaithanya Kotyada¹, Arivusudar Everad John¹, Shivani Tiwary², Nameeta Shah³, Ravi Sirdeshmukh⁴

Affiliations

¹ Mazumdar Shaw Center for Translational Research, Narayana Health, Bangalore, India.
² Simulation and Modeling Sciences, Pfizer Pharma GmBH, Berlin, Germany.
³ Mazumdar Shaw Center for Translational Research, Narayana Health, Bangalore, India. Electronic address: nameeta.shah@gmail.com.
⁴ Mazumdar Shaw Center for Translational Research, Narayana Health, Bangalore, India; Institute of Bioinformatics, International Tech Park, Bangalore, India; Health Sciences, Manipal Academy of Higher Education, Manipal, India. Electronic address: ravisirdeshmukh@gmail.com.

PMID: 35227895
PMCID: PMC9020135
DOI: 10.1016/j.mcpro.2022.100220

P S Hari et al. Mol Cell Proteomics. 2022 Apr.

Abstract

We carried out a proteogenomic analysis of the breast cancer transcriptomic and proteomic data available from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) resource to identify novel peptides arising from alternative splicing events and other noncanonical expression. Our pipeline consisted of de novo transcript assembly, a six-frame-translated custom database, and a combination of search engines. The 4,387 novel peptide sequences initially identified were further screened with the PepQuery validation tool (CPTAC), which yielded 1,558 novel peptides. We used these 1,558 PepQuery-validated peptides to examine functional and clinical significance, leaving the rest to be verified with other validation tools and approaches. The novel peptides mapped to known gene sequences as well as to genomic regions not yet defined as translated: 580 mapped to known protein-coding genes, 147 to non-protein-coding genes, and 831 belonged to novel translational sequences. The novel peptides belonging to protein-coding genes represented alternatively spliced events or 5' or 3' extensions, whereas the others represented translation from pseudogenes, long noncoding RNAs, or uncharacterized protein-coding sequences, mostly from the intronic regions of known genes. Seventy-six of the 580 protein-coding genes were associated with cancer hallmark genes, which included key oncogenes, transcription factors, kinases, and cell surface receptors. Survival association analysis of the 76 novel peptide sequences revealed 10 of them to be significant, and we present a panel of six novel peptides whose high expression was strongly associated with poor survival of patients with the human epidermal growth factor receptor 2-enriched subtype. Our analysis represents a landscape of novel peptides of different types that may be expressed in breast cancer tissues; whether they are present in full-length functional proteins needs further investigation.
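The six-frame translation step mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the standard genetic code is assumed, and incomplete or ambiguous codons are simply handled as shown in the comments.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(seq):
    """Translate a transcript in three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translate("ATGGCCTAA")
```

In a real custom database, each frame would typically be split at stop codons (`*`) and short fragments discarded before searching; those thresholds are not stated in this abstract.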

Keywords: CPTAC; alternative splicing; breast cancer; de novo transcript assembly; proteogenomics.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
**Proteogenomic analysis and identification of novel peptides.** A, a schematic view of the proteogenomic pipeline. Breast cancer transcriptomic and proteomic data from the CPTAC resource were used for the analysis. The pipeline includes *de novo* assembly of RNA-Seq reads followed by six-frame translation to create a custom database, which is searched against the MS/MS files from the proteomics analysis. The custom database generated for each sample was searched against the respective mgf files using the search engines X!Tandem, MSGF+, and Tide. PeptideShaker was used for integrated identification of the candidate peptides and their corresponding proteins. The known peptides (RefSeq) were then filtered out from the total identifications to obtain the list of novel peptides, which were subjected to ACTG analysis followed by categorization into different peptide categories using custom scripts, as described under the Experimental Procedures section. The novel peptides obtained were then validated using PepQuery. The novel peptides validated by PepQuery were categorized into those that map to protein-coding genes, noncoding genes, and uncharacterized ORFs. Numbers shown in *brackets* represent the number of novel peptides in the respective groups. The different types of novel peptides obtained after ACTG categorization are also shown. The novel peptides mapping to known protein-coding genes were mapped to cancer hallmark genes and further assessed for clinical relevance in breast cancer by survival analysis. Validation with DeepMass:Prism is briefly discussed in the Results section. B, chromosome-wise distribution of the PepQuery-validated peptides as revealed by ACTG, with their respective categories indicated by the color code. The details of the categories are explained under the Experimental Procedures section.
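The step in which known (RefSeq) peptides are filtered out of the total identifications might look roughly like the sketch below. The function name and the example sequences are hypothetical, and treating isoleucine and leucine as equivalent (they are isobaric in MS/MS) is an assumption made here, not a detail stated in the caption.

```python
def find_novel_peptides(identified, reference_proteins):
    """Return peptides that do not occur in any reference protein sequence.

    I and L are collapsed to L before comparison, because the two residues
    have the same mass (an assumption of this sketch).
    """
    reference = [prot.replace("I", "L") for prot in reference_proteins]
    novel = []
    for peptide in identified:
        key = peptide.replace("I", "L")
        if not any(key in prot for prot in reference):
            novel.append(peptide)
    return novel


# Toy data for illustration only.
reference_proteins = ["MKTAYIAKQR", "MSLLPEPTIDEK"]
identified = ["TAYLAK", "PEPTIDEK", "GWQDVR"]
novel = find_novel_peptides(identified, reference_proteins)
```

A linear scan like this is adequate for small examples; for a full proteome, an indexed lookup of reference peptides would be a more practical choice.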

**Fig. 2**
**Density distribution plot (p value) of novel peptides identified in proteogenomic analysis by CPTAC and those from our analysis.** Details about the number of peptides identified in our analysis as compared with CPTAC analysis are as follows: CPTAC—422, our analysis—1,558, and overlap—38. The basis and details of these numbers are given in the Results section.

**Fig. 3**
**Schematic representation of novel peptide categories to understand their functional and clinical significance.** Of the 1,558 peptides validated by PepQuery (Fig. 1), 580 mapped to known protein-coding genes, 147 mapped to noncoding genes, and 831 mapped to uncharacterized ORFs. The different types of peptides seen in each of the categories, along with the respective numbers, are depicted in the pie chart. The peptides (n = 580) corresponding to 501 protein-coding genes were further mapped to cancer hallmarks to identify their tumor relevance. Seventy-six of them mapped to cancer hallmarks, and the corresponding novel peptide sequences were further subjected to survival analysis as described under the Experimental Procedures and Results sections. The survival association plots for significant peptide sequences are given in Figure 4.

**Fig. 4**
**Survival analysis of novel peptide sequences mapping to protein-coding genes.** Survival plots for the novel peptide sequences belonging to eight genes, namely FADD, FLT1, ALDOA, CXCL16, FGFR1, PLCB3, PPP2R2A, and RPA1, found to exhibit significant survival association are provided along with the respective peptide sequences. The novel peptides were quantified at transcript level using the breast cancer RNA-Seq data from TCGA. *Red line* represents high-expression group of patients, whereas *blue line* indicates low-expression group of patients. Number of patients at risk in the high- and low-expression groups are also shown. A, peptides showing survival association in luminal (FADD) and basal subtypes (FLT1). B, peptides showing survival association in HER2-enriched subtype (ALDOA, CXCL16, FGFR1, PLCB3, PPP2R2A, and RPA1). ALDOA, aldolase A; CXCL16, C-X-C motif chemokine ligand 16; FADD, Fas-associated death domain; FGFR1, fibroblast growth factor receptor 1; FLT1, Fms-related receptor tyrosine kinase 1; HER2, human epidermal growth factor receptor 2; PLCB3, phospholipase C beta 3; PPP2R2A, protein phosphatase 2 regulatory subunit B alpha; RPA1, replication protein A1.
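To make the high- versus low-expression comparison concrete, here is a minimal Kaplan-Meier sketch in plain Python. It is an illustration under stated assumptions, not the authors' procedure: the median split, the function names, and the toy data are all hypothetical, and the caption does not say how the expression cutoff was chosen.

```python
import statistics


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with an event.

    times  : follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def split_by_median(expression):
    """Label each patient True (high) if expression exceeds the median."""
    median = statistics.median(expression)
    return [value > median for value in expression]


# Toy data for illustration only.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
is_high = split_by_median([5.0, 1.0, 7.0, 2.0])
```

Each group's curve would be computed separately and compared with a log-rank test, which is omitted here.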

**Fig. 5**
**MS/MS spectra of novel peptides of eight protein-coding genes with survival association as shown in Figure 4.** The details of the peptides and their corresponding genes are given in supplemental Table S5.
