J Proteome Res. 2018 Dec 7;17(12):4267-4278.
doi: 10.1021/acs.jproteome.8b00393. Epub 2018 Oct 9.

Identifying High-Priority Proteins Across the Human Diseasome Using Semantic Similarity

Edward Lau et al. J Proteome Res.

Abstract

Identifying the genes and proteins associated with a biological process or disease is a central goal of the biomedical research enterprise. However, relatively few systematic approaches are available that provide objective evaluation of the genes or proteins known to be important to a research topic, and hence researchers often rely on the subjective evaluation of domain experts and laborious manual literature review. Computational bibliometric analysis, in conjunction with text mining and data curation, attempts to automate this process and return prioritized proteins for any given research topic. We describe here a method to identify and rank protein-topic relationships by calculating the semantic similarity between a protein and a query term in the biomedical literature while adjusting for the impact and immediacy of associated research articles. We term the calculated metric the weighted copublication distance (WCD) and show that it compares well to related approaches in predicting benchmark protein lists in multiple biological processes. We used WCD to extract prioritized "popular proteins" across multiple cell types, subanatomical regions, and standardized vocabularies containing over 20 000 human disease terms. The collection of protein-disease associations across the resulting human "diseasome" supports data analytical workflows to perform reverse protein-to-disease queries and functional annotation of experimental protein lists. We envision that the described improvement to the popular proteins strategy will be useful for annotating protein lists and guiding method development efforts as well as generating new hypotheses on understudied disease proteins using bibliometric information.

Keywords: bibliometric analysis; diseasome; high-priority proteins; normalized copublication distance; popular proteins; semantic similarity; targeted proteomics; weighted copublication distance.
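
The abstract names a normalized copublication distance but does not give its formula. As a minimal sketch, assuming the distance takes the same form as the normalized Google distance applied to article counts, it could be computed as below. The function name, the argument names, and the handling of proteins that are never copublished with the term are all illustrative assumptions. A weighted variant might pass sums of per-article weights instead of raw counts, which is also an assumption.

```python
import math

def normalized_copublication_distance(n_protein, n_term, n_both, n_total):
    """Normalized-Google-distance-style score; smaller means closer.

    n_protein: articles mentioning the protein (hypothetical count)
    n_term:    articles matching the query term
    n_both:    articles mentioning both
    n_total:   articles in the whole corpus
    """
    if n_both == 0:
        # Assumption: a protein never copublished with the term is infinitely distant.
        return math.inf
    log_p, log_t = math.log(n_protein), math.log(n_term)
    numerator = max(log_p, log_t) - math.log(n_both)
    denominator = math.log(n_total) - min(log_p, log_t)
    return numerator / denominator
```

Under this form, a protein that appears in every article on the topic, and nowhere else, scores 0, and the score grows as the overlap shrinks.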

Figures

Figure 1.
Modeling the immediacy and impact of protein-associated publications. (a) Immediacy of a publication is modeled using a Weibull distribution such that recent publications published within the past decade are given greater weights than older publications that are associated with a protein. (b) Impact of a publication is modeled using a logistic transformation of the log10 citation count of the publication retrieved via the Europe PubMed Central (PMC) API. (c) Scatterplot of weighted copublication distance (WCD) versus publication counts. The top 10 prioritized proteins in three diseases (cystic fibrosis, diabetes mellitus, and hypertrophic cardiomyopathy) as measured by WCD are given as examples (labeled in red).
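
The two weighting curves in panels (a) and (b) can be sketched as follows. The caption does not state the Weibull shape and scale, the logistic midpoint and steepness, or how the two weights are combined, so every parameter value here is a hypothetical placeholder. Using the Weibull survival function for immediacy, adding 1 to the citation count before taking the logarithm, and multiplying the two weights are likewise assumptions.

```python
import math

def immediacy_weight(age_years, shape=2.0, scale=10.0):
    """Weibull survival function: 1 for a new article, decaying with age.

    shape and scale are illustrative values, not the published parameters.
    """
    return math.exp(-((max(age_years, 0.0) / scale) ** shape))

def impact_weight(citations, midpoint=1.0, steepness=2.0):
    """Logistic transform of log10(citations + 1).

    The +1 avoids log10(0) for uncited articles (an assumption).
    """
    x = math.log10(citations + 1)
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

def article_weight(age_years, citations):
    # Combining the two weights by multiplication is an assumption.
    return immediacy_weight(age_years) * impact_weight(citations)
```

With these placeholder values, an article published 2 years ago outweighs one published 20 years ago with the same citation count, which matches the behavior the caption describes.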
Figure 2.
Receiver operating characteristic (ROC) analysis of protein list prediction. Area-under-ROC (AUROC) metric is used to compare the performance of weighted copublication distance (WCD) versus unadjusted normalized copublication distance (NCD) (Lam et al. 2015) and two published approaches GLAD4U (Jourquin et al. 2012) and PURPOSE (Yu et al. 2018) on 12 query terms. The results are compared against curated benchmark protein lists retrieved from the Comparative Toxicogenomics Database (CTD) or Gene Ontology (GO).
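
The AUROC comparison can be illustrated with a small, dependency-free sketch. It uses the rank interpretation of AUROC: the probability that a randomly chosen benchmark protein is scored above a randomly chosen non-benchmark protein, with ties counted as half. The function and its inputs are illustrative and are not the authors' evaluation code.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    scores: higher means more strongly predicted.
    labels: True for proteins on the benchmark list.
    Ties count as half. Runs in O(n_pos * n_neg), which is fine for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because a copublication distance is smaller for more closely associated proteins, distances would be negated before scoring, for example `auroc([-d for d in distances], labels)`.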
Figure 3.
Prioritized proteins across the human diseasome. (a) Schematics for precompiling popular protein terms from three standard vocabularies related to human diseases and disease processes. (b) Distribution of total proteins per term in three vocabularies. The distribution of number of total (left) and significantly associated (right) proteins per term in each vocabulary (P ≤ 0.05). (c) Correlation matrix of protein associations for 832 Disease Ontology (DO) terms with 50 or more proteins associated at P ≤ 0.05. A minimal spanning tree of DO terms based on similarity of associated proteins. A protein network is constructed using all DO terms associated with any proteins as nodes. Edges connect pairs of DO terms with κ ≥ 0 or θ50 ≥ 0.2, from which a minimal spanning tree is constructed. (d,e) Zoomed-in views of selected disease labels around two network nodes.
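
The spanning-tree step in panel (c) can be sketched with Kruskal's algorithm. The caption does not say how term-to-term similarity is converted into an edge weight, so the example assumes a weight of 1 minus the similarity. The disease terms and similarity values in the usage note are hypothetical.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm.

    edges: iterable of (weight, u, v) tuples.
    Returns the chosen edges in order of increasing weight.
    If the graph is disconnected, the result is a spanning forest.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the component root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it cannot close a cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree
```

For three hypothetical terms with similarities 0.9 (A, B), 0.8 (B, C), and 0.1 (A, C), the edge list `[(0.1, "A", "B"), (0.2, "B", "C"), (0.9, "A", "C")]` yields a tree that keeps the two most similar pairs and drops the A to C edge.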
Figure 4.
Enriched terms in reverse protein-to-disease query (DO and HPO) versus Gene Ontology. (a) Schematics for performing reverse (protein-to-term) queries using precompiled popular protein lists in the human diseasome. (b) Enriched terms (hypergeometric test P ≤ 0.05) from (top) DO, (middle) HPO, and (bottom) GO Biological Processes were associated with differentially expressed genes (limma adjusted, P ≤ 0.01) in a microarray data set from a rodent model of heart failure. (c) Relationship between assigned DO, HPO, and GO terms. Top associated terms are shown for each significantly up-regulated (blue) or down-regulated (red) transcript (limma adjusted P ≤ 0.01) in the microarray data set from a rodent model of heart failure. The alluvial streams link the top enriched term of DO to the corresponding terms in HPO and GO for each transcript. For example, a number of up-regulated transcripts are associated with the “familial atrial fibrillation” term in DO, corresponding in part to the “arrhythmia” term in HPO and to the “regulation of heart rate by cardiac conduction” term in GO.
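
The hypergeometric enrichment test named in panel (b) can be sketched as a one-sided upper-tail probability. The argument names are illustrative, and the choice of background set (for example, all proteins with at least one term association) is an assumption the caption does not specify.

```python
from math import comb

def hypergeom_enrichment_p(overlap, list_size, term_size, background):
    """P(X >= overlap) when list_size proteins are drawn without replacement
    from a background in which term_size proteins are associated with the term.
    """
    upper = min(list_size, term_size)
    total = comb(background, list_size)
    tail = sum(
        comb(term_size, k) * comb(background - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

A term would be reported as enriched when this value falls at or below the chosen threshold, 0.05 in the caption. Correction for testing many terms is not shown here.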

References

    1. Fortunato S; Bergstrom CT; Borner K; Evans JA; Helbing D; Milojević S; Petersen AM; Radicchi F; Sinatra R; Uzzi B; Vespignani A; Waltman L; Wang D; Barabasi AL Science of science. Science 2018, 359, eaao0185.
    2. Lam MP; Venkatraman V; Xing Y; Lau E; Cao Q; Ng DC; Su AI; Ge J; Van Eyk JE; Ping P Data-Driven Approach To Determine Popular Proteins for Targeted Proteomics Translation of Six Organ Systems. J. Proteome Res. 2016, 15, 4126–4134.
    3. Yu KH; Lee TM; Wang CS; Chen YJ; Re C; Kou SC; Chiang JH; Kohane IS; Snyder M Systematic Protein Prioritization for Targeted Proteomics Studies through Literature Mining. J. Proteome Res. 2018, 17, 1383–1396.
    4. Lam MP; Venkatraman V; Cao Q; Wang D; Dincer TU; Lau E; Su AI; Xing Y; Ge J; Ping P; Van Eyk JE Prioritizing Proteomics Assay Development for Clinical Translation. J. Am. Coll. Cardiol. 2015, 66, 202–204.
    5. Mora MI; Molina M; Odriozola L; Elortza F; Mato JM; Sitek B; Zhang P; He F; Latasa MU; Avila MA; Corrales FJ Prioritizing Popular Proteins in Liver Cancer: Remodelling One-Carbon Metabolism. J. Proteome Res. 2017, 16, 4506–4514.
