Mol Cell Proteomics. 2022 Apr;21(4):100220.
doi: 10.1016/j.mcpro.2022.100220. Epub 2022 Feb 26.

Proteogenomic Analysis of Breast Cancer Transcriptomic and Proteomic Data, Using De Novo Transcript Assembly: Genome-Wide Identification of Novel Peptides and Clinical Implications

P S Hari et al. Mol Cell Proteomics. 2022 Apr.

Abstract

We carried out a proteogenomic analysis of the breast cancer transcriptomic and proteomic data available from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) resource to identify novel peptides arising from alternative splicing events and other noncanonical expression. Our pipeline consisted of de novo transcript assembly, a six-frame-translated custom database, and a combination of search engines. The 4,387 novel peptide sequences initially identified were screened with the PepQuery validation tool (CPTAC), which yielded 1,558 novel peptides. We examined these 1,558 PepQuery-validated peptides for functional and clinical significance, leaving the rest to be verified with other validation tools and approaches. The novel peptides mapped to known gene sequences as well as to genomic regions not yet defined as translated: 580 mapped to known protein-coding genes, 147 to non-protein-coding genes, and 831 belonged to novel translational sequences. The novel peptides belonging to protein-coding genes represented alternative splicing events or 5' or 3' extensions, whereas the others represented translation from pseudogenes, long noncoding RNAs, or uncharacterized protein-coding sequences, mostly from the intronic regions of known genes. Seventy-six of the 580 peptides mapping to protein-coding genes were associated with cancer hallmark genes, which included key oncogenes, transcription factors, kinases, and cell surface receptors. Survival association analysis of these 76 novel peptide sequences revealed 10 of them to be significant, and we present a panel of six novel peptides whose high expression was strongly associated with poor survival in patients with the human epidermal growth factor receptor 2 (HER2)-enriched subtype. Our analysis presents a landscape of novel peptides of different types that may be expressed in breast cancer tissues; whether they occur in full-length functional proteins needs further investigation.
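
The six-frame translation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard genetic code and plain DNA strings as input; the function names are hypothetical and this is not the authors' code. Stop codons are shown as `*` and any codon containing a non-ACGT base as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(seq):
    """Translate a transcript in three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translate("ATGGCCTAA")
print(frames["+1"])  # MA*
```

In a real custom database, each frame would typically be split at stop codons and written out as FASTA entries; that step is omitted here.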

Keywords: CPTAC; alternative splicing; breast cancer; de novo transcript assembly; proteogenomics.

Conflict of interest statement

The authors declare no competing interests.

Figures

Graphical abstract
Fig. 1
Proteogenomic analysis and identification of novel peptides. A, a schematic view of the proteogenomic pipeline. Breast cancer transcriptomic and proteomic data from the CPTAC resource were used for the analysis. The pipeline includes de novo assembly of RNA-Seq reads followed by six-frame translation for custom database creation to search against the MS/MS files from the proteomics analysis. The custom database generated for each of the samples was searched against the respective mgf files using the search engines X!Tandem, MSGF+, and Tide. PeptideShaker was used for integrated identification of the candidate peptides and their corresponding proteins. The known peptides (RefSeq) were then filtered out from the total identifications to get the list of novel peptides, which were then subjected to ACTG analysis followed by categorization into different peptide categories using custom scripts as described under the Experimental Procedures section. The novel peptides obtained were then validated using PepQuery. The novel peptides validated by PepQuery were categorized into those that map to protein-coding genes, noncoding genes, and uncharacterized ORFs. Numbers shown in brackets represent the number of novel peptides in the respective groups. The different types of novel peptides obtained after ACTG categorization are also shown. The novel peptides mapping to known protein-coding genes were mapped to cancer hallmark genes and further assessed for clinical relevance in breast cancer by carrying out survival analysis. Validation with DeepMass:Prism is briefly discussed in the Results section. B, chromosome-wise distribution of the PepQuery-validated peptides as revealed by ACTG, with their respective categories indicated by the color code. The details of the categories are explained under the Experimental Procedures section.
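
The filtering of known peptides described in panel A can be sketched as follows. This is an illustration under assumptions, not the authors' script: a peptide is treated as known if it occurs as an exact substring of any reference protein sequence, and the function name and toy sequences are hypothetical.

```python
def split_known_and_novel(peptides, reference_proteins):
    """Split identified peptides into known and novel lists.

    Assumption: a peptide counts as known when it is an exact substring
    of at least one reference protein (for example, RefSeq sequences).
    """
    known, novel = [], []
    for peptide in peptides:
        if any(peptide in protein for protein in reference_proteins):
            known.append(peptide)
        else:
            novel.append(peptide)
    return known, novel


# Toy data, invented purely for illustration.
reference = ["MKTAYIAKQRQISFVK", "MSDNGELEDKPPAPPVR"]
identified = ["AYIAKQR", "ELEDKPP", "QWERTYK"]

known, novel = split_known_and_novel(identified, reference)
print(known)  # ['AYIAKQR', 'ELEDKPP']
print(novel)  # ['QWERTYK']
```

For databases of realistic size, an indexed lookup would be preferable to this linear scan.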
Fig. 2
Density distribution plot (p value) of novel peptides identified in proteogenomic analysis by CPTAC and those from our analysis. Details about the number of peptides identified in our analysis as compared with CPTAC analysis are as follows: CPTAC—422, our analysis—1,558, and overlap—38. The basis and details of these numbers are given in the Results section.
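
The counts in this caption imply how many peptides are unique to each analysis. A small worked sketch follows; the function name is hypothetical, and only the three counts come from the caption.

```python
def overlap_summary(n_a, n_b, n_shared):
    """Derive set-specific counts, union size, and Jaccard index."""
    only_a = n_a - n_shared
    only_b = n_b - n_shared
    union = only_a + only_b + n_shared
    return {
        "only_a": only_a,
        "only_b": only_b,
        "shared": n_shared,
        "union": union,
        "jaccard": n_shared / union,
    }


# a = CPTAC analysis (422), b = this analysis (1,558), shared = 38.
summary = overlap_summary(422, 1558, 38)
print(summary["only_a"], summary["only_b"], summary["union"])  # 384 1520 1942
```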
Fig. 3
Schematic representation of novel peptide categories to understand their functional and clinical significance. Of the 1,558 peptides validated by PepQuery (Fig. 1), 580 were found to map to known protein-coding genes, 147 mapped to noncoding genes, and 831 mapped to uncharacterized ORFs. The different types of peptides seen in each of the categories, along with the respective numbers, are depicted using the pie chart. The peptides (n = 580) corresponding to 501 protein-coding genes were further mapped to cancer hallmarks to identify their tumor relevance. Seventy-six of them mapped to cancer hallmarks, and the corresponding novel peptide sequences were further subjected to survival analysis as described under the Experimental Procedures and Results sections. The survival association plots for significant peptide sequences are given in Figure 4.
Fig. 4
Survival analysis of novel peptide sequences mapping to protein-coding genes. Survival plots for the novel peptide sequences belonging to eight genes, namely FADD, FLT1, ALDOA, CXCL16, FGFR1, PLCB3, PPP2R2A, and RPA1, found to exhibit significant survival association are provided along with the respective peptide sequences. The novel peptides were quantified at transcript level using the breast cancer RNA-Seq data from TCGA. Red line represents high-expression group of patients, whereas blue line indicates low-expression group of patients. Number of patients at risk in the high- and low-expression groups are also shown. A, peptides showing survival association in luminal (FADD) and basal subtypes (FLT1). B, peptides showing survival association in HER2-enriched subtype (ALDOA, CXCL16, FGFR1, PLCB3, PPP2R2A, and RPA1). ALDOA, aldolase A; CXCL16, C-X-C motif chemokine ligand 16; FADD, Fas-associated death domain; FGFR1, fibroblast growth factor receptor 1; FLT1, Fms-related receptor tyrosine kinase 1; HER2, human epidermal growth factor receptor 2; PLCB3, phospholipase C beta 3; PPP2R2A, protein phosphatase 2 regulatory subunit B alpha; RPA1, replication protein A1.
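
Survival curves for high- and low-expression groups such as those in this figure are commonly drawn with the Kaplan-Meier estimator. The sketch below is a generic illustration, not the authors' procedure: the median split used to define the two groups is an assumption (the caption does not state the cutoff), the function names are hypothetical, and the toy follow-up data are invented.

```python
from statistics import median


def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  -- follow-up time per patient
    events -- 1 if the event (death) was observed, 0 if censored
    Returns [(time, survival_probability)] at each time with an event.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def split_by_median(expression):
    """Assumed grouping rule: True marks the high-expression group."""
    cutoff = median(expression)
    return [value > cutoff for value in expression]


# Toy data: four patients, the second one censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Comparing the two groups would additionally require a significance test such as the log-rank test, which is not shown here.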
Fig. 5
MS/MS spectra of novel peptides of eight protein-coding genes with survival association as shown in Figure 4. The details of the peptides and their corresponding genes are given in supplemental Table S5.
