Nat Commun. 2022 Sep 9;13(1):5304.
doi: 10.1038/s41467-022-33026-0.

Integrating and formatting biomedical data as pre-calculated knowledge graph embeddings in the Bioteque

Adrià Fernández-Torras et al. Nat Commun. 2022.

Abstract

Biomedical data is accumulating at a fast pace, and integrating it into a unified framework, so that multiple views of a given biological event can be considered simultaneously, is a major challenge. Here we present the Bioteque, a resource of unprecedented size and scope that contains pre-calculated biomedical descriptors derived from a large knowledge graph comprising more than 450 thousand biological entities and 30 million relationships between them. The Bioteque integrates, harmonizes, and formats data collected from over 150 data sources, covering 12 types of biological entities (e.g., genes, diseases, drugs) linked by 67 types of associations (e.g., 'drug treats disease', 'gene interacts with gene'). We show how Bioteque descriptors facilitate the assessment of high-throughput protein-protein interactome data and the prediction of drug response and new repurposing opportunities, and we demonstrate that they can be used off-the-shelf in downstream machine learning tasks without loss of performance with respect to using the original data. The Bioteque thus offers a thoroughly processed, tractable, and highly optimized assembly of the biomedical knowledge available in the public domain.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Building the Bioteque knowledge graph (KG).
a Metagraph of the Bioteque, showing all the entities and the most representative associations (metaedges) between them. b Circos plot representation of the KG, showing the relationships between nodes. c Treeplot showing the number of datasets used to construct each metaedge. d Total number of nodes (x-axis) and edges (y-axis) available for each entity type. The size of the circles is proportional to the number of metaedges in which the entities participate. e Number of edges (top row) and overlap (bottom row) between the datasets inside the ‘gene associates with disease’ (GEN-ass-DIS, left) and ‘protein interacts protein’ (GEN-ppi-GEN, right) associations. f Most popular nodes in the KG within the gene (GEN, blue), compound (CPD, red), disease (DIS, purple) and pathway (PWY, green) universe. Dataset associations were de-propagated across the corresponding ontologies (when possible) before computing the popularity of the nodes. A propagated version of this plot is shown in Supplementary Fig. 1.
Fig. 2. Generating the Bioteque embeddings.
a Scheme of the methodology. First, we define the biological entities to be connected and the specific context to be explored. Then a source-target network is derived by traversing all the paths available from the source to the target nodes of a given metapath. The vicinity of each node in the network is then explored by a random walker and embedded in a 128-dimensional space. Finally, embeddings are evaluated and characterized. b Number of unique metapath embeddings linking each entity. In the middle plot, the filled dots indicate the total number of unique metapaths while the empty dots show the total number of metapath-dataset combinations. In the rightmost plot, we show the number of entity-specific datasets used in the metapaths. c Number of metapath-dataset embedding combinations obtained at each metapath length. Solid bars highlight the number of unique metapaths. d Number of nodes within each entity with at least one embedding in the Bioteque resource. Note that during metapath construction, perturbagen (PGN) entities are always mapped to the corresponding perturbed genes. Thus, although used to construct several metapaths, PGN nodes are not explicitly embedded, i.e., they are not the first or last nodes in the metapaths.
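The path-traversal step in panel a can be illustrated with a minimal sketch. It assumes that the source-target network is weighted simply by the number of distinct paths that follow the metapath; the weighting actually used in the Bioteque may differ, and the function name `metapath_network` and the toy node names are hypothetical.

```python
from collections import defaultdict


def metapath_network(edge_sets):
    """Compose consecutive edge lists into a weighted source-target network.

    edge_sets: list of lists of (left, right) pairs, one list per metaedge,
    ordered along the metapath. Returns {(source, target): number_of_paths}.
    Assumption: edge weight = raw path count (illustrative only).
    """
    # Each source node starts by reaching itself through one empty path.
    sources = {left for left, _ in edge_sets[0]}
    reach = {s: {s: 1} for s in sources}

    for edges in edge_sets:
        # Adjacency for the current metaedge.
        adj = defaultdict(list)
        for left, right in edges:
            adj[left].append(right)

        # Extend every partial path by one step along this metaedge.
        next_reach = {}
        for s, frontier in reach.items():
            counts = defaultdict(int)
            for node, n_paths in frontier.items():
                for nxt in adj.get(node, ()):
                    counts[nxt] += n_paths
            if counts:
                next_reach[s] = dict(counts)
        reach = next_reach

    return {(s, t): n for s, targets in reach.items() for t, n in targets.items()}


# Toy CPD-int-GEN-ass-DIS metapath (made-up nodes, not Bioteque data).
cpd_int_gen = [("drugA", "GENE1"), ("drugA", "GENE2"), ("drugB", "GENE2")]
gen_ass_dis = [("GENE1", "dis1"), ("GENE2", "dis1"), ("GENE2", "dis2")]
network = metapath_network([cpd_int_gen, gen_ass_dis])
```

In this toy case, `drugA` reaches `dis1` through two genes, so that edge carries a weight of 2, while the other three compound-disease edges carry a weight of 1. A random walker would then explore this derived network rather than the full knowledge graph.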
Fig. 3. A Bioteque embedding summary card.
a 2D projection (opt-SNE) of the compound (CPD, blue) and disease (DIS, red) embeddings from the metapath ‘compound interacts protein associates with disease’ (CPD-int-GEN-ass-DIS). We highlight clusters of compounds and diseases sharing treatment evidence. We highlight some representative compounds and diseases found in these clusters, together with the drug targets associated with the diseases. b ROC curve validation when reconstructing the original network with the corresponding embeddings. c Visual representation of the embedding vectors of leukaemia (top) and Kaposi’s sarcoma (middle), together with the drug Etoposide (bottom). d Ranking proportion in which the putative CPD (n = 131,648) and DIS (n = 134,997) neighbours are found. Box plots indicate median (middle line), 25th, 75th percentile (box) and max value within the 1.5*75th percentile (whiskers). e Recapitulation of orthogonal associations by using embedding distances. The AUROC (x-axis) summarizes the performance obtained when ranking the orthogonal associations. Drug targets are collected from Drugbank, the Drug Repurposing Hub and PharmacoDB, and gene-disease associations are obtained from Open Targets.
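The AUROC values in panels b and e summarize how well embedding distances rank true associations above other pairs. A minimal sketch of that idea follows, assuming that a smaller distance means a stronger predicted association and that ties count as half; the function name and the toy distances are hypothetical, and the evaluation protocol actually used in the paper is described in its Methods.

```python
def auroc_from_distances(positive_distances, negative_distances):
    """Probability that a true pair is closer than a non-associated pair.

    Equivalent to the area under the ROC curve when pairs are ranked by
    increasing embedding distance. Ties contribute 0.5.
    """
    wins = 0.0
    for pos in positive_distances:
        for neg in negative_distances:
            if pos < neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_distances) * len(negative_distances))


# Toy distances (illustrative values, not taken from the Bioteque).
known_pairs = [0.1, 0.4]        # e.g. compound-disease pairs with evidence
random_pairs = [0.3, 0.8, 0.9]  # e.g. pairs without evidence
auc = auroc_from_distances(known_pairs, random_pairs)
```

Here five of the six positive-negative comparisons favour the known pair, giving an AUROC of 5/6. The pairwise loop is quadratic and only meant for illustration; a rank-based implementation would be preferable for large pair sets.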
Fig. 4. Comparison of embeddings built from different metapaths and datasets.
a Four illustrative examples showing pairs of genes (GEN), compounds (CPD), diseases (DIS) and cell lines (CLL) with similarities or differences depending on the metapaths. The extended nomenclature of each metapath can be found in Supplementary Data 2. b Top metapaths (y-axis) recapitulating (AUROC, x-axis) gene molecular function (MFN, blue) and compound pharmacological class (PHC, red). The coloured bars indicate the proportion of nodes in the metapath that could be assessed (i.e., with annotated molecular function or pharmacological classes). c Gene embedding characterization of three reference PPI datasets, namely STRING, IntAct and OmniPath. We limited the analysis to the common gene universe (9395 genes) between the three sources.
Fig. 5. Analysis of gene expression (GEx) embeddings.
a 2D projection of the raw GEx (left) and the corresponding Bioteque ‘cell has similar gex cell’ (CLL-gex-CLL) embedding (right). Each dot corresponds to one cell line and is coloured by tissue of origin. b Tissue recovery by the raw GEx and the CLL-gex-CLL embedding. c Drug response prediction performance (AUROC) for each drug in the GDSC resource from models trained with either the raw GEx (y-axis) or the CLL-gex-CLL embeddings (x-axis). d Recovering CCLE (left) and GDSC (right) cell-cell (CLL-CLL) similarities (green), cell-gene (CLL-GEN) upregulation (upr) similarities (blue) and CLL-GEN downregulation (dwr) similarities (red) using embedding distances from the GDSC and the CCLE embedding spaces, respectively. e Characterization of the CLL-CLL (left) and GEN–GEN (right) embedding similarities for three metapaths: CLL-gex-CLL (green), CLL-upr-GEN (blue) and CLL-dwr-GEN (red).
Fig. 6. Assessing the novelty of the HuRI-III interactome.
a Embedding distance P values are calculated for each PPI in HuRI-III (x-axis) using the corresponding gene-gene (GEN–GEN) embeddings from a subset of metapaths (y-axis). Note that these P values do not reflect the significance of any statistical test, but indicate the normalized quantile rank position of a given observation in a background distance distribution (“Methods”). Red tones (lower P values) indicate similarity according to a given embedding space. The column and row next to the heatmap show the 10th percentile of the P value distribution for each metapath and the lowest P value for each edge, respectively. In blue, edges are grouped according to four levels of support. On the right are shown the enrichment scores (ES) (capped between 1 and 5 on the y-axis) across P values, the coverage (Cov), and the cumulative recall (Rec) across P values. b (Top) Recovery of HuRI-III edges (recall) and randomly permuted edges (FDR) by ‘protein interacts protein’ (GEN-ppi-GEN) embeddings across the P values (x-axis). The dashed line is placed at the 0.05 FDR (corresponding to a P value of 0.02). (Bottom) Number of HuRI-III interactions recovered by the GEN-ppi-GEN embedding at 0.05 FDR stratified by those covered in the original network (known PPIs), those not available in the network, hence, predicted by the embeddings (new PPIs), and those present in the original network but not covered at the given P value (missing PPIs). c Number of unique HuRI-III edges recovered at 0.05 FDR by the GEN-ppi-GEN and/or the three most supportive metapaths, including ‘gene has cellular components’ (GEN-has-CMP), ‘protein has domain’ (GEN-has-DOM), and ‘gene associates with pathway’ (GEN-ass-PWY). d Shapley force plots corresponding to the prediction of three PPIs with no direct evidence of physical interaction before HuRI-III was released. Red segments are metapath-specific P values that pushed predictions toward a high probability of interaction, while blue segments pulled predictions toward a low probability. The length of the segments is proportional to their impact on the prediction. The final output probability given by the model is found where both forces equalize (shown in white).
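The embedding distance P values in panel a are described as normalized quantile rank positions in a background distance distribution. One way to read that definition is sketched below, assuming the P value is the fraction of background distances that are at most as large as the observed one; the exact normalization is given in the paper's Methods and may differ, and `quantile_rank` and the toy background are hypothetical.

```python
from bisect import bisect_right


def quantile_rank(observed, background):
    """Fraction of background distances that are <= the observed distance.

    Low values mean the observed pair is closer than most background pairs.
    This is an empirical rank, not the P value of a statistical test.
    """
    ordered = sorted(background)
    return bisect_right(ordered, observed) / len(ordered)


# Toy background of ten distances (illustrative values only).
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
p_close = quantile_rank(0.15, background)  # closer than 9 of 10 background pairs
p_far = quantile_rank(0.95, background)    # farther than 9 of 10 background pairs
```

Under this reading, the close pair gets a value of 0.1 and the distant pair 0.9, which matches the caption's convention that lower P values indicate similarity in a given embedding space.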
Fig. 7. Prediction of drug indications and disease treatments from repoDB.
a Cumulative distribution (y-axis) of compounds (top) and diseases (bottom) according to the ranked position (x-axis) of the top predicted disease indication (top) or compound treatment (bottom) for the four tested models. The rankings are shown in percentages and only for the first 10% of compound/disease predictions (corresponding to the top 50 and 80 diseases and compounds, respectively). Dotted lines show the distribution for those compounds or diseases with only one positive indication in repoDB v1. b Classification performance obtained for each compound (n = 38, top plot) and disease (n = 67, bottom plot) with multiple (≥5) new indications reported in repoDB v2. Box plots indicate median (middle line), 25th, 75th percentile (box), and max and min value within the 1.5*25th and 1.5*75th percentile range (whiskers). c Number of different therapeutic areas (top) and disease families (bottom) covered by the predictions of the Long model. We considered a given therapeutic area or disease family to be covered when the model predicted one true indication or treatment (as in panel (a)) for at least 1%, 20%, 40%, 60%, or 80% of its instances.
