Identifying High-Priority Proteins Across the Human Diseasome Using Semantic Similarity

Edward Lau¹, Vidya Venkatraman², Cody T Thomas³, Joseph C Wu¹, Jennifer E Van Eyk², Maggie P Y Lam³

Affiliations

¹ Stanford Cardiovascular Institute, Stanford University, Stanford, California 94305, United States.
² Advanced Clinical Biosystems Research Institute, Department of Medicine and The Heart Institute, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States.
³ Department of Medicine, Division of Cardiology, Consortium for Fibrosis Research and Translation, Anschutz Medical Campus, University of Colorado Denver, Aurora, Colorado 80045, United States.

PMID: 30256117
PMCID: PMC6606054
DOI: 10.1021/acs.jproteome.8b00393

Edward Lau et al. J Proteome Res. 2018 Dec 7;17(12):4267-4278. doi: 10.1021/acs.jproteome.8b00393. Epub 2018 Oct 9.

Abstract

Identifying the genes and proteins associated with a biological process or disease is a central goal of the biomedical research enterprise. However, relatively few systematic approaches are available that provide objective evaluation of the genes or proteins known to be important to a research topic, and hence researchers often rely on subjective evaluation by domain experts and laborious manual literature review. Computational bibliometric analysis, in conjunction with text mining and data curation, attempts to automate this process and return prioritized proteins in any given research topic. We describe here a method to identify and rank protein-topic relationships by calculating the semantic similarity between a protein and a query term in the biomedical literature while adjusting for the impact and immediacy of associated research articles. We term the calculated metric the weighted copublication distance (WCD) and show that it compares well to related approaches in predicting benchmark protein lists in multiple biological processes. We used WCD to extract prioritized "popular proteins" across multiple cell types, subanatomical regions, and standardized vocabularies containing over 20 000 human disease terms. The collection of protein-disease associations across the resulting human "diseasome" supports data analytical workflows to perform reverse protein-to-disease queries and functional annotation of experimental protein lists. We envision that the described improvement to the popular proteins strategy will be useful for annotating protein lists and guiding method development efforts as well as generating new hypotheses on understudied disease proteins using bibliometric information.
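The abstract does not state the distance formula. As a rough illustration of what an unweighted copublication distance between a protein and a query term could look like, the sketch below assumes a form modeled on the Normalized Google Distance, computed from publication counts. The function name, argument names, and the formula itself are assumptions made for illustration and may differ from the authors' definition; the immediacy and impact weighting that distinguishes WCD is not shown here.

```python
import math


def normalized_copublication_distance(n_topic, n_protein, n_both, n_total):
    """Illustrative co-publication distance (assumed NGD-style form).

    n_topic   -- publications matching the query term
    n_protein -- publications associated with the protein
    n_both    -- publications matching both
    n_total   -- publications in the whole corpus
    Smaller values mean the protein and the term co-occur more tightly.
    """
    if min(n_topic, n_protein) <= 0 or n_total <= max(n_topic, n_protein):
        raise ValueError("counts must be positive and smaller than the corpus")
    if n_both == 0:
        # Never co-published: treat as infinitely distant.
        return math.inf
    log_t, log_p = math.log(n_topic), math.log(n_protein)
    numerator = max(log_t, log_p) - math.log(n_both)
    denominator = math.log(n_total) - min(log_t, log_p)
    return numerator / denominator


# A protein that shares most of its papers with the topic ranks closer
# than one that shares only a few.
close = normalized_copublication_distance(1000, 200, 150, 1_000_000)
far = normalized_copublication_distance(1000, 200, 5, 1_000_000)
```

Ranking proteins by ascending distance then gives a prioritized list for the query term.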

Keywords: bibliometric analysis; diseasome; high-priority proteins; normalized copublication distance; popular proteins; semantic similarity; targeted proteomics; weighted copublication distance.

Figures

**Figure 1.**
Modeling the immediacy and impact of protein-associated publications. (a) Immediacy of a publication is modeled using a Weibull distribution such that publications from the past decade are given greater weight than older publications associated with a protein. (b) Impact of a publication is modeled using a logistic transformation of the log₁₀ citation count of the publication retrieved via the Europe PubMed Central (PMC) API. (c) Scatterplot of weighted copublication distance (WCD) versus publication counts. The top 10 prioritized proteins in three diseases (cystic fibrosis, diabetes mellitus, and hypertrophic cardiomyopathy) as measured by WCD are given as examples (labeled in red).
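The two weighting ideas in panels (a) and (b) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of the Weibull survival function for immediacy, all parameter values (`shape`, `scale`, `midpoint`, `steepness`), the `+ 1` offset for uncited papers, and the choice to combine the two weights by multiplication are assumptions.

```python
import math


def immediacy_weight(age_years, shape=2.0, scale=10.0):
    """Weibull survival function of publication age.

    Equals 1 for a paper published today and decays toward 0 as the
    paper ages; shape and scale are hypothetical values.
    """
    if age_years < 0:
        raise ValueError("age_years must be non-negative")
    return math.exp(-((age_years / scale) ** shape))


def impact_weight(citations, midpoint=1.0, steepness=2.0):
    """Logistic transform of log10(citations + 1).

    The +1 keeps uncited papers defined; midpoint and steepness are
    hypothetical values.
    """
    if citations < 0:
        raise ValueError("citations must be non-negative")
    x = math.log10(citations + 1)
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def publication_weight(age_years, citations):
    """Combine both weights by multiplication (an assumed choice)."""
    return immediacy_weight(age_years) * impact_weight(citations)
```

Under these assumed parameters, a recent, well-cited article contributes more to a protein's score than an old or rarely cited one.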

**Figure 2.**
Receiver operating characteristic (ROC) analysis of protein list prediction. Area-under-ROC (AUROC) metric is used to compare the performance of weighted copublication distance (WCD) versus unadjusted normalized copublication distance (NCD) (Lam et al. 2015) and two published approaches GLAD4U (Jourquin et al. 2012) and PURPOSE (Yu et al. 2018) on 12 query terms. The results are compared against curated benchmark protein lists retrieved from the Comparative Toxicogenomics Database (CTD) or Gene Ontology (GO).
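For readers unfamiliar with the metric, AUROC can be computed directly from its rank interpretation: the probability that a randomly chosen benchmark protein is scored above a randomly chosen non-benchmark protein. The sketch below is a generic stdlib implementation, not the evaluation code used in the paper; the function name and the toy inputs are hypothetical. Because WCD is a distance, a score such as the negated distance would be passed in so that higher means more relevant.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- higher means more strongly predicted
    labels -- truthy for benchmark (positive) proteins
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: negate distances so that closer proteins score higher.
distances = [0.2, 0.3, 0.7, 0.9]
in_benchmark = [1, 1, 0, 0]
example_auc = auroc([-d for d in distances], in_benchmark)
```

The pairwise loop is quadratic in list size, which is acceptable for illustration; a rank-based formulation would scale better.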

**Figure 3.**
Prioritized proteins across the human diseasome. (a) Schematics for precompiling popular protein terms from three standard vocabularies related to human diseases and disease processes. (b) Distribution of the number of total (left) and significantly associated (right, P ≤ 0.05) proteins per term in each of the three vocabularies. (c) Correlation matrix of protein associations for 832 Disease Ontology (DO) terms with 50 or more proteins associated at P ≤ 0.05. A minimal spanning tree of DO terms based on similarity of associated proteins. A protein network is constructed using all DO terms associated with any proteins as nodes. Edges connect pairs of DO terms with κ ≥ 0 or θ₅₀ ≥ 0.2, from which a minimal spanning tree is constructed. (d,e) Zoomed-in views of selected disease labels around two network nodes.

**Figure 4.**
Enriched terms in reverse protein-to-disease query (DO and HPO) versus Gene Ontology. (a) Schematics for performing reverse (protein-to-term) queries using precompiled popular protein lists in the human diseasome. (b) Enriched terms (hypergeometric test P ≤ 0.05) from (top) DO, (middle) HPO, and (bottom) GO Biological Processes were associated with differentially expressed genes (limma adjusted, P ≤ 0.01) in a microarray data set from a rodent model of heart failure. (c) Relationship between assigned DO, HPO, and GO terms. Top associated terms are shown for each significantly up-regulated (blue) or down-regulated (red) transcript (limma adjusted P ≤ 0.01) in the microarray data set from a rodent model of heart failure. The alluvial streams link the top enriched term of DO to the corresponding terms in HPO and GO for each transcript. For example, a number of up-regulated transcripts are associated with the “familial atrial fibrillation” term in DO, corresponding in part to the “arrhythmia” term in HPO and to the “regulation of heart rate by cardiac conduction” term in GO.
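The hypergeometric enrichment test named in panel (b) asks how likely it is to see at least the observed overlap between an experimental protein list and a term's protein list by chance. A minimal stdlib sketch follows; the function and argument names are hypothetical, and the choice of background universe and any multiple-testing correction are left out because the caption does not specify them.

```python
from math import comb


def hypergeom_enrichment_p(universe, term_size, list_size, overlap):
    """Upper-tail hypergeometric P value, P(X >= overlap).

    universe  -- number of proteins in the background
    term_size -- proteins annotated to the term
    list_size -- proteins in the experimental list
    overlap   -- proteins shared by the list and the term
    """
    if not (0 <= term_size <= universe and 0 <= list_size <= universe):
        raise ValueError("term_size and list_size must lie within the universe")
    max_overlap = min(term_size, list_size)
    if not 0 <= overlap <= max_overlap:
        raise ValueError("overlap is out of range")
    total = comb(universe, list_size)
    tail = sum(
        comb(term_size, k) * comb(universe - term_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20 000 background proteins, a term with 150
# proteins, and an experimental list of 300 proteins sharing 12 of them.
p_value = hypergeom_enrichment_p(20000, 150, 300, 12)
```

A term would be reported as enriched when this P value falls at or below the chosen threshold (P ≤ 0.05 in the caption).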

References

1. Fortunato S; Bergstrom CT; Borner K; Evans JA; Helbing D; Milojević S; Petersen AM; Radicchi F; Sinatra R; Uzzi B; Vespignani A; Waltman L; Wang D; Barabasi AL. Science of science. Science 2018, 359, eaao0185.
2. Lam MP; Venkatraman V; Xing Y; Lau E; Cao Q; Ng DC; Su AI; Ge J; Van Eyk JE; Ping P. Data-Driven Approach To Determine Popular Proteins for Targeted Proteomics Translation of Six Organ Systems. J. Proteome Res. 2016, 15, 4126–4134.
3. Yu KH; Lee TM; Wang CS; Chen YJ; Re C; Kou SC; Chiang JH; Kohane IS; Snyder M. Systematic Protein Prioritization for Targeted Proteomics Studies through Literature Mining. J. Proteome Res. 2018, 17, 1383–1396.
4. Lam MP; Venkatraman V; Cao Q; Wang D; Dincer TU; Lau E; Su AI; Xing Y; Ge J; Ping P; Van Eyk JE. Prioritizing Proteomics Assay Development for Clinical Translation. J. Am. Coll. Cardiol. 2015, 66, 202–204.
5. Mora MI; Molina M; Odriozola L; Elortza F; Mato JM; Sitek B; Zhang P; He F; Latasa MU; Avila MA; Corrales FJ. Prioritizing Popular Proteins in Liver Cancer: Remodelling One-Carbon Metabolism. J. Proteome Res. 2017, 16, 4506–4514.
