SPACE: STRING proteins as complementary embeddings

Dewei Hu¹, Damian Szklarczyk²,³, Christian von Mering²,³, Lars Juhl Jensen¹,⁴

Affiliations

¹ Novo Nordisk Foundation Center for Protein Research, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark.
² Department of Molecular Life Sciences, University of Zurich, Zurich 8057, Switzerland.
³ SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, Lausanne 1015, Switzerland.
⁴ ZS Discovery, ZS Associates, Kongens Lyngby 2800, Denmark.

PMID: 40924541
PMCID: PMC12453690
DOI: 10.1093/bioinformatics/btaf496

Dewei Hu et al. Bioinformatics. 2025 Sep 1;41(9):btaf496. doi: 10.1093/bioinformatics/btaf496.

Abstract

Motivation: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.

Results: We leveraged the STRING database of protein networks and orthology relations for 1322 eukaryotes to generate network-based cross-species protein embeddings. We did this by first creating species-specific network embeddings and subsequently aligning them based on orthology relations to facilitate direct cross-species comparisons. We show that these aligned network embeddings ensure consistency across species without sacrificing quality compared to species-specific network embeddings. We also show that the aligned network embeddings are complementary to sequence embedding techniques, despite the use of sequence-based orthology relations in the alignment process. Finally, we validated the embeddings by using them for two well-established tasks: subcellular localization prediction and protein function prediction. Training logistic regression classifiers on aligned network embeddings and sequence embeddings improved the accuracy over using sequence alone, reaching performance numbers close to state-of-the-art deep-learning methods.

Availability and implementation: The source code and scripts for generating the network-based cross-species protein embeddings are available at https://github.com/deweihu96/SPACE. Precomputed network embeddings and sequence embeddings for all eukaryotic proteins are included in STRING version 12.0 (https://string-db.org/cgi/download).

Figures

**Figure 1.**
SPACE workflow and demonstration of successful cross-species embedding alignment. (a) Overview of the SPACE workflow. The pipeline begins with input from the STRING database in two forms: protein–protein interaction networks and protein sequences. The networks are processed through node2vec to generate 128-dimensional species-specific embeddings. The network alignment process first aligns 48 seed species using the FedCoder method to create a 512-dimensional shared latent space, then uses autoencoders to align each remaining non-seed species to its corresponding taxonomic group (fungi, plants, animals, or protists) in this established latent space. In parallel, sequences are processed through the ProtT5 encoder to generate sequence embeddings. (b) UMAP visualization demonstrating the effectiveness of the cross-species embedding alignment. The plots show aligned network protein embeddings for four evolutionarily diverse seed species (*Homo sapiens*, *Saccharomyces cerevisiae*, *Arabidopsis thaliana*, and *Dictyostelium discoideum*) and one non-seed species (*Rattus norvegicus*). Colored points represent proteins from the named species, while gray points show the background distribution of proteins from other species. The overlapping patterns in the embeddings demonstrate successful alignment, with some regions representing functional associations found throughout eukaryotes and others representing functions specific to particular kingdoms. The unmapped cluster from *R. norvegicus* is composed mainly of olfactory proteins.

**Figure 2.**
Comparison of protein embedding methods across diverse eukaryotic species using KEGG pathways. The plots show receiver operating characteristic (ROC) curves comparing three different embedding approaches: aligned network embeddings (solid lines), node2vec embeddings (dashed lines), and ProtT5 sequence embeddings (dotted lines). Results are presented for 12 representative species divided into four panels: (top left) animal species including *Homo sapiens*, *Drosophila melanogaster*, and *Danio rerio*; (top right) plants including *Arabidopsis thaliana*, *Zea mays*, and *Solanum tuberosum*; (bottom left) fungi including *Saccharomyces cerevisiae*, *Schizosaccharomyces pombe*, and *Cryptococcus neoformans*; and (bottom right) protists including *Dictyostelium discoideum*, *Trypanosoma brucei*, and *Plasmodium falciparum*. For each species, all possible protein pairs where both proteins are annotated in KEGG pathways were evaluated and ranked by their cosine similarity scores. The x-axis shows cumulative false positives, while the y-axis shows cumulative true positives. True positives are defined as protein pairs sharing at least one KEGG pathway, while false positives are pairs without shared pathways. The curves demonstrate that aligned embeddings maintain pathway information comparable to original node2vec embeddings across most species, while both network-based methods consistently outperform sequence embeddings in capturing pathway relationships.
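
The pair-ranking evaluation described in this caption can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' evaluation code: the function names (`cosine`, `cumulative_counts`), the dictionary-based inputs, and the toy embeddings and pathway labels are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cumulative_counts(embeddings, pathways):
    """Rank all annotated protein pairs by cosine similarity.

    embeddings: dict mapping protein ID -> embedding vector (hypothetical input)
    pathways:   dict mapping protein ID -> set of pathway IDs (hypothetical input)
    Returns a list of (cumulative_false_positives, cumulative_true_positives),
    one entry per pair, from the most to the least similar pair.
    """
    # Only proteins with at least one pathway annotation are evaluated.
    proteins = sorted(p for p in embeddings if pathways.get(p))
    scored = []
    for p, q in combinations(proteins, 2):
        shares_pathway = bool(pathways[p] & pathways[q])
        scored.append((cosine(embeddings[p], embeddings[q]), shares_pathway))
    scored.sort(key=lambda item: item[0], reverse=True)

    tp = fp = 0
    curve = []
    for _, is_true_positive in scored:
        if is_true_positive:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve


# Toy data: A/B share one pathway, C/D share another.
toy_embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [0.1, 0.9],
}
toy_pathways = {"A": {"path1"}, "B": {"path1"}, "C": {"path2"}, "D": {"path2"}}
curve = cumulative_counts(toy_embeddings, toy_pathways)
```

Plotting the first element of each tuple on the x-axis and the second on the y-axis gives the kind of cumulative curve the caption describes. For whole proteomes the quadratic number of pairs would call for a vectorized implementation.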

**Figure 3.**
Precision–recall curves of different embeddings and visualization of SPACE embeddings in protein subcellular localization prediction. (a) Precision–recall curves on SwissProt cross-validation set (24 816 proteins across 144 species) comparing SPACE (concatenation of aligned network and ProtT5 sequence embeddings, red), aligned network embeddings (black), and ProtT5 sequence embeddings (gray). (b) Precision–recall curves on Human Protein Atlas (HPA) test set (1646 human proteins), with DeepLoc2 predictions (blue star) included as an additional baseline. The curves demonstrate that SPACE embeddings consistently maintain higher precision across all recall values compared to individual embedding types. (c) UMAP visualization of aligned network embeddings based on their projections onto logistic regression weight vectors for subcellular localization prediction. The distinct clustering patterns demonstrate that the aligned embeddings successfully capture protein localization information across multiple species, with clear separation observed for major cellular compartments such as nucleus, mitochondrion, and cell membrane. Proteins with multiple localizations were excluded from this visualization to ensure clear compartment separation.
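
As a rough sketch of what "concatenation of aligned network and ProtT5 sequence embeddings" followed by a logistic regression classifier might look like, the example below trains a single binary classifier for one compartment on toy vectors. The hand-written gradient descent, the learning rate, the epoch count, the helper names, and the 2-dimensional toy embeddings are all assumptions for illustration. The source does not specify the solver or hyperparameters, and the real embeddings are far higher-dimensional.

```python
from math import exp


def concat(network_vec, sequence_vec):
    """Concatenate a network embedding with a sequence embedding."""
    return list(network_vec) + list(sequence_vec)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit binary logistic regression by batch gradient descent (assumed setup)."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy proteins: (network embedding, sequence embedding, label for one compartment).
toy = [
    ([1.0, 0.0], [1.0, 0.0], 1),
    ([1.0, 0.0], [0.0, 1.0], 1),
    ([0.0, 1.0], [1.0, 0.0], 0),
    ([0.0, 1.0], [0.0, 1.0], 0),
]
X = [concat(net, seq) for net, seq, _ in toy]
y = [label for _, _, label in toy]
weights, bias = fit_logistic(X, y)
probs = [predict_proba(weights, bias, x) for x in X]
```

For multi-label localization, one such classifier per compartment (one-vs-rest) is a common arrangement. Whether that matches the authors' exact setup is not stated in this record.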

**Figure 4.**
Precision–recall curves of different embeddings in protein function prediction. (a) Molecular function, (b) biological process, and (c) cellular component. Each panel compares SPACE (concatenation of aligned network and ProtT5 sequence embeddings, red), aligned network embeddings (black), and ProtT5 sequence embeddings (gray). Stars indicate precision and recall values that yield maximum MicroF1 scores. The curves reveal that SPACE embeddings show particular strength in biological process prediction, while maintaining competitive performance in molecular function and cellular component prediction, highlighting the complementary nature of sequence and network information in capturing different aspects of protein function.
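
The starred operating points correspond to the threshold at which the micro-averaged F1 score peaks. A minimal sketch of that selection is given below, assuming the standard micro-averaging definition (true positives, false positives, and false negatives pooled over all protein-term pairs). The function names, the threshold grid, and the toy labels and scores are hypothetical.

```python
def micro_prf(y_true, y_score, threshold):
    """Micro-averaged precision, recall, and F1 at a given score threshold.

    y_true:  list of rows of 0/1 labels (one row per protein, one column per term)
    y_score: list of rows of predicted scores with the same shape
    """
    tp = fp = fn = 0
    for true_row, score_row in zip(y_true, y_score):
        for truth, score in zip(true_row, score_row):
            predicted = score >= threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_micro_f1(y_true, y_score, thresholds):
    """Return (threshold, precision, recall, f1) with the highest micro-F1.

    On ties, the first threshold in the given order wins.
    """
    results = [(t,) + micro_prf(y_true, y_score, t) for t in thresholds]
    return max(results, key=lambda r: r[3])


# Toy multi-label data: two proteins, three terms.
toy_true = [[1, 0, 1], [0, 1, 0]]
toy_score = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]]
best = best_micro_f1(toy_true, toy_score, [0.1, 0.3, 0.5, 0.7, 0.9])
```

Here `best` holds the threshold together with the precision and recall values that would be marked on the curve.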
