PLoS Comput Biol. 2022 Nov 10;18(11):e1010702.
doi: 10.1371/journal.pcbi.1010702. eCollection 2022 Nov.

Protein prediction models support widespread post-transcriptional regulation of protein abundance by interacting partners

Himangi Srivastava et al. PLoS Comput Biol.

Abstract

Protein and mRNA levels correlate only moderately. The availability of proteogenomics data sets with protein and transcript measurements from matching samples is providing new opportunities to assess the degree to which protein levels in a system can be predicted from mRNA information. Here we examined the contributions of input features in protein abundance prediction models. Using large proteogenomics data from 8 cancer types within the Clinical Proteomic Tumor Analysis Consortium (CPTAC) data set, we trained models to predict the abundance of over 13,000 proteins using matching transcriptome data from up to 958 tumor or normal adjacent tissue samples each, and compared predictive performances across algorithms, data set sizes, and input features. Over one-third of proteins (4,648) showed relatively poor predictability (elastic net r ≤ 0.3) from their cognate transcripts. Moreover, we found widespread occurrences where the abundance of a protein is considerably less well explained by its own cognate transcript level than by that of one or more trans-locus transcripts. The incorporation of additional trans-locus transcript abundance data as input features increasingly improved the ability to predict sample protein abundance. Transcripts that contribute to non-cognate protein abundance are primarily those encoding known or predicted interaction partners of the protein of interest, including not only large multi-protein complexes as previously shown, but also small stable complexes in the proteome with only one or a few stable interacting partners. Network analysis further shows a complex proteome-wide interdependency of protein abundance on the transcript levels of multiple interacting partners. The predictive model analysis here therefore supports the view that protein-protein interactions, including those in small protein complexes, exert post-transcriptional influence on proteome composition more broadly than previously recognized. Moreover, the results suggest that mRNA and protein co-expression analysis may have utility for finding gene interactions and predicting expression changes in biological systems.
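
The central comparison in the abstract, a protein that is poorly explained by its own transcript but well explained by a partner's transcript, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic rather than CPTAC measurements, the sample count and noise levels are arbitrary, and a plain Pearson correlation stands in for the test-set correlation of the trained elastic net models. The r ≤ 0.3 and r ≥ 0.6 cutoffs are the ones quoted for poor and good predictability.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def predictability(r):
    """Label a correlation using the cutoffs quoted in the text."""
    if r >= 0.6:
        return "well predicted"
    if r <= 0.3:
        return "poorly predicted"
    return "intermediate"


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples = 200         # hypothetical number of samples

# Synthetic assumption: the protein level follows the transcript of an
# interacting partner, and is unrelated to the protein's own transcript.
partner_mrna = [rng.gauss(0, 1) for _ in range(n_samples)]
self_mrna = [rng.gauss(0, 1) for _ in range(n_samples)]
protein = [0.9 * p + rng.gauss(0, 0.3) for p in partner_mrna]

r_self = pearson(self_mrna, protein)
r_trans = pearson(partner_mrna, protein)

print("self transcript:   ", round(r_self, 2), predictability(r_self))
print("partner transcript:", round(r_trans, 2), predictability(r_trans))
```

In this toy setting the partner transcript carries nearly all of the signal, which mirrors the pattern reported for pairs such as PCCB and PCCA, although the real models use many features and regularization.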

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Genewise dispersion of protein predictability from transcriptome data.
Box plots of test set correlation coefficients between the transcript-predicted and actual protein level for each protein are shown across five feature sets (columns: single/self transcript; CORUM interactors; STRING 800 high-confidence associated proteins; STRING 200 low-confidence associated proteins; and all transcripts) and three algorithms (multiple linear regression, elastic net, and random forest). In each plot, the x axis denotes the number of additive CPTAC data sets used to train the models as described in Methods; box: interquartile range (IQR); whiskers: ±1.5 IQR; notch: SEM.
Fig 2. Pathway enrichment of proteins with good and poor predictability.
A. Tree plots showing the clustering and relationships of gene ontology terms that are significantly enriched among proteins whose abundances are well predicted by their own transcripts (r ≥ 0.6). B. Tree plots of terms enriched among proteins whose abundances are poorly predicted by their own transcripts (r ≤ 0.3).
Fig 3. Proteins with improved predicted levels after inclusion of additional transcript features.
Four proteins with substantial predictability from transcriptome data upon the inclusion of additional features are shown: A. PCCB, B. CMC1, C. PSMG2, D. SMCR8. For each protein, the transcript-trained prediction of protein level is plotted on the x axis and the actual protein level is plotted on the y axis. For PCCB, the lack of variance in predicted protein levels from the self-transcript model is due to the regularization of the elastic net model, and corresponds to a lack of correlation between PCCB mRNA and protein (see Fig 4). Blue: train set; brown: test set. Columns denote the transcript feature set used to train the model. The number of features used to train the model in each feature set is shown inside each plot. r: correlation coefficient.
Fig 4. mRNA-Protein correlations of PCCB and CMC1 with functionally associated proteins.
Two examples of proteins whose abundance is better explained by another transcript are shown. A. PCCB protein level is predicted by PCCA transcript but not its own transcript. B. CMC1 protein level is explained by MT-CO1 transcript level but not its own transcript. Substantial correlations across transcripts and proteins (≥ 0.4) are bolded.
Fig 5. mRNA-Protein correlations of PSMG2 and SMCR8 with functionally associated proteins.
Two examples of proteins whose abundance is better explained by another transcript are shown. A. PSMG2 protein level is predicted by PSMG1 transcript but not its own transcript. B. SMCR8 protein level is explained by C9orf72 transcript level but not its own transcript. Substantial correlations across transcripts and proteins (≥ 0.4) are bolded.
Fig 6. Directed graphs of protein and transcript interrelationships identify candidate regulatory genes.
A-C. Examples of directed graphs constructed from genome-wide relationships of transcript-predicted proteins, containing members of A. the propionyl-CoA carboxylase complex; B. the cytochrome c oxidase, mitochondrial complex; C. the PI4K2A-WASH complex, the RICH1/AMOT polarity complex, and others. In each subgraph, orange nodes have outflow edges only (i.e., they are contributing transcripts in the prediction models). Blue nodes are nodes that are connected to other nodes via at least one inflow edge (i.e., they represent proteins, and optionally also transcripts if they also have outward edges). Orange edges represent positive coefficients of the transcripts to the target proteins in the elastic net models; gray edges represent negative coefficients. All edges are directed from transcript to protein, and the widths of the edges are scaled by the weight. D. A highly connected subgraph of mitochondrial ribosome subunits containing 73 nodes and 834 edges. E. Persistent community detection and network representation of preferential node connections, showing a hierarchical relationship between the 28S and 39S subcomplexes and the assembled 55S mitochondrial ribosome. F. Network representation of hub nodes defined as 15% of nodes ranked by betweenness centrality, which predicts a potential role of LACTB as a critical hub that lies upstream of multiple large and small mitochondrial ribosomal protein subunits. Node colors represent the pie chart diagram of the corresponding GO biological process described in the table. SHAP values of three proteins (MRPL20, MRPL19, MRPS34) are highlighted showing top model contributors.
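
The node and edge coloring rules in the Fig 6 legend can be sketched in a few lines. This is only an illustration of the stated rules, not the authors' implementation: the transcript-to-protein pairs are taken from Figs 4 and 5, while GENE_X and all coefficient values are hypothetical placeholders.

```python
# Each edge is (transcript, protein, coefficient) and points from the
# contributing transcript to the predicted protein.
# Pairs follow Figs 4 and 5; the weights and GENE_X are hypothetical.
edges = [
    ("PCCA", "PCCB", 0.8),
    ("MT-CO1", "CMC1", 0.6),
    ("PSMG1", "PSMG2", 0.7),
    ("C9orf72", "SMCR8", 0.5),
    ("GENE_X", "PCCB", -0.2),  # hypothetical negative contributor
]


def node_colors(edge_list):
    """Orange: outflow edges only. Blue: at least one inflow edge."""
    nodes = set()
    has_inflow = set()
    for source, target, _weight in edge_list:
        nodes.update((source, target))
        has_inflow.add(target)
    return {n: "blue" if n in has_inflow else "orange" for n in nodes}


def edge_color(weight):
    """Orange for positive coefficients, gray for negative ones."""
    return "orange" if weight > 0 else "gray"


colors = node_colors(edges)
for source, target, weight in edges:
    print(f"{source} ({colors[source]}) -> {target} ({colors[target]}): "
          f"{edge_color(weight)} edge, width proportional to {abs(weight)}")
```

A node that both receives and sends edges would be colored blue under this rule, matching the legend's note that blue nodes may also act as transcripts.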
