[Preprint]. 2023 Jun 5:2023.06.05.543335.
doi: 10.1101/2023.06.05.543335.

Illuminating Dark Proteins using Reactome Pathways

Timothy Brunson et al. bioRxiv.

Abstract

Limited knowledge about a substantial portion of protein-coding genes, known as "dark" proteins, hinders our understanding of their functions and potential therapeutic applications. To address this, we leveraged Reactome, the most comprehensive, open-source, open-access pathway knowledgebase, to contextualize dark proteins within biological pathways. By integrating multiple resources and employing a random forest classifier trained on 106 protein/gene pairwise features, we predicted functional interactions between dark proteins and Reactome-annotated proteins. We then developed three scores to measure the interactions between dark proteins and Reactome pathways, utilizing enrichment analysis and fuzzy logic simulations. Correlation analysis of these scores with an independent single-cell RNA sequencing dataset provided supporting evidence for this approach. Furthermore, systematic natural language processing (NLP) analysis of over 22 million PubMed abstracts and manual checking of the literature associated with 20 randomly selected dark proteins reinforced the predicted interactions between proteins and pathways. To enhance the visualization and exploration of dark proteins within Reactome pathways, we developed the Reactome IDG portal, deployed at https://idg.reactome.org, a web application featuring tissue-specific protein and gene expression overlay, as well as drug interactions. Our integrated computational approach, together with the user-friendly web platform, offers a valuable resource for uncovering potential biological functions and therapeutic implications of dark proteins.
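
The enrichment step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which enrichment test was used, so the one-sided hypergeometric test below is an assumption (a common choice for over-representation analysis), and the function name and toy numbers are hypothetical. It asks how surprising it is that a protein's predicted FI partners overlap a pathway's gene set as much as they do.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_pathway, n_partners, n_overlap):
    """Return P(X >= n_overlap) for a hypergeometric draw.

    n_universe: number of genes in the background set
    n_pathway:  number of background genes annotated to the pathway
    n_partners: number of predicted FI partners of the query protein
    n_overlap:  number of partners that fall in the pathway
    """
    max_overlap = min(n_pathway, n_partners)
    total = comb(n_universe, n_partners)
    # Sum the upper tail; comb() returns 0 for impossible combinations.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_partners - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy example (hypothetical numbers): 10 background genes, 5 in the pathway,
# 2 predicted partners, both of which are pathway members.
p_value = hypergeom_enrichment_pvalue(10, 5, 2, 2)
```

A smaller p-value indicates stronger over-representation of the pathway among the predicted partners; in practice such p-values would usually be corrected for multiple testing across pathways.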

Figures

Figure 1. Analysis workflow to place dark proteins in the context of Reactome pathways via machine learning, enrichment analysis, and mathematical modeling. FIs: Functional Interactions.
Figure 2. The performance of the trained random forest and its feature importance.
Figure 3. Distributions of the three scores used to quantify interacting pathways for proteins.
All three scores have been scaled between 0 and 1 for comparison purposes. A: Violin plot displaying the three interacting pathway scores for proteins that are annotated (IsAnnotated = true) and not annotated (IsAnnotated = false) in Reactome; B: Zoomed-in view of the two simulation scores from A, Average_Activation and Average_Inhibition; C: Box plot presenting the interacting pathway scores for proteins categorized as Tbio, Tchem, Tclin, and Tdark. P-values in A and B were determined using the Welch two-sample t-test, while p-values in C were based on ANOVA.
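
The scaling and the Welch two-sample comparison described in this caption can be sketched as follows. This is an illustrative sketch only: min-max scaling is assumed as the way scores were mapped to the range 0 to 1 (the caption does not say how), the function names are hypothetical, and only the Welch t statistic and its Welch-Satterthwaite degrees of freedom are computed, not the p-value.

```python
from statistics import mean, variance


def min_max_scale(values):
    """Linearly rescale values to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard errors of the two sample means.
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (se2_a + se2_b) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Toy scores (hypothetical) for annotated vs. not-annotated proteins.
scaled = min_max_scale([2, 4, 6])
t_stat, dof = welch_t([1, 2, 3], [2, 3, 4])
```

The p-value would then come from the t distribution with `dof` degrees of freedom, for example via a statistics library.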
Figure 4. scRNA-seq analysis results support predicted interacting pathways by showing a significantly positively skewed distribution of correlations between enrichment scores from predicted FIs and scRNA-seq co-expression (A) and unbiased distributions between annotated and not-annotated dark and not-dark proteins (B).
The right-most panel in B shows the numbers of interacting pathways used for correlation calculation for individual proteins. P-value: ****: p <= 1.00e-04, ***: 1.00e-04 < p <= 1.00e-03, **: 1.00e-03 < p <= 1.00e-02, *: 1.00e-02 < p <= 5.00e-02, ns: 5.00e-02 < p <= 1.00e+00.
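
The significance legend in this caption amounts to a threshold lookup, sketched below. The function name `significance_label` is hypothetical, and the sketch assumes that "ns" covers p-values above 0.05, which is the usual reading of such legends.

```python
def significance_label(p):
    """Map a p-value to the star annotation used in the figure legend."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p <= 1e-4:
        return "****"
    if p <= 1e-3:
        return "***"
    if p <= 1e-2:
        return "**"
    if p <= 5e-2:
        return "*"
    return "ns"  # assumed to mean 0.05 < p <= 1


labels = [significance_label(p) for p in (0.00005, 0.0005, 0.005, 0.03, 0.2)]
```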
Figure 5. BERT-based NLP workflow to systematically analyze PubMed abstracts to validate the interacting pathways predicted based on the trained random forest.
A: Illustration of the workflow. B: Detailed workflow with the inputs and outputs of the major steps shown.
Figure 6. BERT-based NLP analysis results support predicted interacting pathways for proteins by showing a significantly positively skewed distribution.
A: The distribution of Pearson correlations between NLP-based annotation scores and predicted FI-based enrichment scores is significantly positively skewed. B: The correlation difference analysis for annotated and not-annotated dark and not-dark proteins. C: As A but for dark proteins only. The right-most panels in B and C show the numbers of interacting pathways used for correlation calculation for individual proteins. P-value: ****: p <= 1.00e-04, ***: 1.00e-04 < p <= 1.00e-03, **: 1.00e-03 < p <= 1.00e-02, *: 1.00e-02 < p <= 5.00e-02, ns: 5.00e-02 < p <= 1.00e+00.
Figure 7. Major features of the homepage of the Reactome IDG portal, using the predicted pathways of TANC1, a dark gene (https://idg.reactome.org/search/TANC1), as an example.
A. The scatter plot view shows interacting pathways as dots colored and grouped based on their top-level pathways annotated in Reactome. Pathways are ordered based on the original Reactome hierarchical structure. B. The network view shows interacting pathways in a network where nodes represent pathways and edges represent the overlap of genes annotated in the two linked pathways. The two views can be switched by clicking the icon button at the bottom-left corner. C. A scatter plot showing the number of FI partners of TANC1 vs. the FI score predicted from the trained random forest classifier. D. A scatter plot showing the number of pairwise relationships of TANC1 collected for individual features. The features are colored and grouped based on their types.
Figure 8. Pathway and network views of an interacting pathway of TANC1, Assembly and cell surface presentation of NMDA receptors (https://idg.reactome.org/PathwayBrowser/#/R-HSA-9609736&FLG=TANC1&FLGINT&DSKEYS=0&SIGCUTOFF=0.75&FLGFDR=0.05&FIVIZ).
A. The Reactome-IDG pathway browser showing the enhanced pathway diagram overlaid with tissue-specific gene expression data (Artery - Aorta from GTEx) and protein-protein interaction data (BioGridBioPlexStringDB|Homo Sapiens). In this diagram view, entities interacting with TANC1 based on FI Score >= 0.75 have their borders highlighted in magenta. B. The drug/target interaction view, opened by clicking the purple circle with a number at the top-left corner of an entity in the pathway diagram view. C. The FI network view of the pathway, displayed after clicking the network view button in the button pane. Proteins in the network are highlighted based on their expression values for the selected tissue. Detailed information for individual proteins may be displayed in the information panel by right-clicking the proteins. Overlaid protein-protein interactions can be shown in a popup panel by clicking the “show pairwise” button (not shown) in the information panel.

