Bioinformatics. 2025 Sep 1;41(9):btaf496.
doi: 10.1093/bioinformatics/btaf496.

SPACE: STRING proteins as complementary embeddings

Dewei Hu et al. Bioinformatics. 2025.

Abstract

Motivation: Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.

Results: We leveraged the STRING database of protein networks and orthology relations for 1322 eukaryotes to generate network-based cross-species protein embeddings. We did this by first creating species-specific network embeddings and subsequently aligning them based on orthology relations to facilitate direct cross-species comparisons. We show that these aligned network embeddings ensure consistency across species without sacrificing quality compared to species-specific network embeddings. We also show that the aligned network embeddings are complementary to sequence embedding techniques, despite the use of sequence-based orthology relations in the alignment process. Finally, we validated the embeddings by using them for two well-established tasks: subcellular localization prediction and protein function prediction. Training logistic regression classifiers on aligned network embeddings and sequence embeddings improved the accuracy over using sequence alone, reaching performance numbers close to state-of-the-art deep-learning methods.

Availability and implementation: The source code and scripts for generating the network-based cross-species protein embeddings are available at https://github.com/deweihu96/SPACE. Precomputed network embeddings and sequence embeddings for all eukaryotic proteins are included in STRING version 12.0 (https://string-db.org/cgi/download).
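
The classification setup summarized in the Results can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it concatenates a network embedding with a sequence embedding for each protein and fits a binary logistic regression by plain gradient descent. The toy vectors, protein names, label meaning, and helper names (`concat`, `train_logreg`, `predict_proba`) are all hypothetical; real embeddings have far more dimensions, and a library implementation would normally replace the hand-written optimizer.

```python
from math import exp

def concat(network_vec, sequence_vec):
    """Join the two embedding types into one feature vector."""
    return list(network_vec) + list(sequence_vec)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, label in zip(X, y):
            err = predict_proba(w, b, x) - label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Hypothetical toy data: 2-d network and 2-d sequence embeddings.
network = {"A": [1.0, 0.0], "B": [0.9, 0.2], "C": [0.0, 1.0], "D": [0.1, 0.8]}
sequence = {"A": [0.5, 0.5], "B": [0.6, 0.4], "C": [0.4, 0.6], "D": [0.5, 0.5]}
labels = {"A": 1, "B": 1, "C": 0, "D": 0}  # e.g. 1 = found in a given compartment

names = sorted(labels)
X = [concat(network[n], sequence[n]) for n in names]
y = [labels[n] for n in names]
w, b = train_logreg(X, y)
probs = {n: predict_proba(w, b, x) for n, x in zip(names, X)}
```

For multi-label tasks such as localization or function prediction, one such classifier per label is a common arrangement; whether the authors used exactly that arrangement is not stated here.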

Figures

Figure 1.
SPACE workflow and demonstration of successful cross-species embedding alignment. (a) Overview of the SPACE workflow. The pipeline begins with input from the STRING database in two forms: protein–protein interaction networks and protein sequences. The networks are processed through node2vec to generate 128-dimensional species-specific embeddings. The network alignment process first aligns 48 seed species using the FedCoder method to create a 512-dimensional shared latent space, then uses autoencoders to align each remaining non-seed species to its corresponding taxonomic group (fungi, plants, animals, or protists) in this established latent space. In parallel, sequences are processed through the ProtT5 encoder to generate sequence embeddings. (b) UMAP visualization demonstrating the effectiveness of cross-species embedding alignment. The plots show aligned network protein embeddings for four evolutionarily diverse seed species (Homo sapiens, Saccharomyces cerevisiae, Arabidopsis thaliana, and Dictyostelium discoideum) and one non-seed species (Rattus norvegicus). Colored points represent proteins from the named species, while gray points show the background distribution of proteins from other species. The overlapping patterns in the embeddings demonstrate successful alignment, with some regions representing functional associations found throughout eukaryotes and others representing functions specific to particular kingdoms. The unmapped cluster from R. norvegicus is composed mainly of olfactory proteins.
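
To make the idea of orthology-based alignment concrete, here is a deliberately simplified sketch. It is not FedCoder and not the autoencoder procedure named in the caption: it swaps in a plain linear map, fitted by least squares with gradient descent, that pulls the embeddings of one species onto the embeddings of their orthologs in a reference space. The function names, the two-dimensional toy vectors, and the choice of a linear map are all assumptions made for illustration.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def fit_linear_alignment(source_vecs, target_vecs, lr=0.1, steps=2000):
    """Learn W so that W @ source is close to target for each ortholog pair.

    Minimizes the mean squared distance between mapped source embeddings
    and their ortholog targets. Input and output dimensions may differ.
    """
    d_out, d_in = len(target_vecs[0]), len(source_vecs[0])
    n = len(source_vecs)
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for src, tgt in zip(source_vecs, target_vecs):
            residual = [p - t for p, t in zip(matvec(W, src), tgt)]
            for i in range(d_out):
                for j in range(d_in):
                    grad[i][j] += 2.0 * residual[i] * src[j] / n
        W = [[W[i][j] - lr * grad[i][j] for j in range(d_in)]
             for i in range(d_out)]
    return W

# Hypothetical ortholog pairs: the "species B" space is the reference
# space rotated by 90 degrees, so an exact linear alignment exists.
species_b_orthologs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference_orthologs = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

W = fit_linear_alignment(species_b_orthologs, reference_orthologs)

# A species-B protein without an ortholog pair is placed in the shared
# space by the same map.
mapped = matvec(W, [2.0, -1.0])
```

The point of the toy case is only that proteins lacking orthologs inherit a position in the shared space from the map learned on ortholog pairs; the nonlinear encoders in the actual pipeline serve the same role with more capacity.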
Figure 2.
Comparison of protein embedding methods across diverse eukaryotic species using KEGG pathways. The plots show receiver operating characteristic (ROC) curves comparing three different embedding approaches: aligned network embeddings (solid lines), node2vec embeddings (dashed lines), and ProtT5 sequence embeddings (dotted lines). Results are presented for 12 representative species divided into four panels: (top left) animal species including Homo sapiens, Drosophila melanogaster, and Danio rerio; (top right) plants including Arabidopsis thaliana, Zea mays, and Solanum tuberosum; (bottom left) fungi including Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Cryptococcus neoformans; and (bottom right) protists including Dictyostelium discoideum, Trypanosoma brucei, and Plasmodium falciparum. For each species, all possible protein pairs where both proteins are annotated in KEGG pathways were evaluated and ranked by their cosine similarity scores. The x-axis shows cumulative false positives, while the y-axis shows cumulative true positives. True positives are defined as protein pairs sharing at least one KEGG pathway, while false positives are pairs without shared pathways. The curves demonstrate that aligned embeddings maintain pathway information comparable to original node2vec embeddings across most species, while both network-based methods consistently outperform sequence embeddings in capturing pathway relationships.
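
The evaluation described in this caption can be sketched as follows, under stated assumptions: every pair of pathway-annotated proteins is scored by cosine similarity, the pairs are ranked from most to least similar, and cumulative true and false positives are counted down the ranking. The function names, protein identifiers, pathway labels, and toy vectors are hypothetical, and details such as tie handling are not taken from the paper.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def cumulative_tp_fp(embeddings, pathways):
    """Return the (cumulative FP, cumulative TP) points of the ranking.

    embeddings: dict protein -> vector
    pathways:   dict protein -> set of pathway identifiers
    Only proteins with at least one pathway annotation are considered.
    """
    proteins = sorted(p for p in embeddings if pathways.get(p))
    scored = []
    for a, b in combinations(proteins, 2):
        shares_pathway = bool(pathways[a] & pathways[b])
        scored.append((cosine(embeddings[a], embeddings[b]), shares_pathway))
    # Most similar pairs first.
    scored.sort(key=lambda item: item[0], reverse=True)
    tp = fp = 0
    curve = [(0, 0)]
    for _, shares_pathway in scored:
        if shares_pathway:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve

# Hypothetical toy input: two pathways, two proteins each.
embeddings = {
    "P1": [1.0, 0.0], "P2": [0.9, 0.1],
    "P3": [0.0, 1.0], "P4": [0.1, 0.9],
}
pathways = {
    "P1": {"path_A"}, "P2": {"path_A"},
    "P3": {"path_B"}, "P4": {"path_B"},
}
curve = cumulative_tp_fp(embeddings, pathways)
```

In this toy case the two same-pathway pairs rank above all four cross-pathway pairs, so the curve rises to both true positives before any false positive is counted.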
Figure 3.
Precision–recall curves of different embeddings and visualization of SPACE embeddings in protein subcellular localization prediction. (a) Precision–recall curves on SwissProt cross-validation set (24 816 proteins across 144 species) comparing SPACE (concatenation of aligned network and ProtT5 sequence embeddings, red), aligned network embeddings (black), and ProtT5 sequence embeddings (gray). (b) Precision–recall curves on Human Protein Atlas (HPA) test set (1646 human proteins), with DeepLoc2 predictions (blue star) included as an additional baseline. The curves demonstrate that SPACE embeddings consistently maintain higher precision across all recall values compared to individual embedding types. (c) UMAP visualization of aligned network embeddings based on their projections onto logistic regression weight vectors for subcellular localization prediction. The distinct clustering patterns demonstrate that the aligned embeddings successfully capture protein localization information across multiple species, with clear separation observed for major cellular compartments such as nucleus, mitochondrion, and cell membrane. Proteins with multiple localizations were excluded from this visualization to ensure clear compartment separation.
Figure 4.
Precision–recall curves of different embeddings in protein function prediction. (a) Molecular function, (b) biological process, and (c) cellular component. Each panel compares SPACE (concatenation of aligned network and ProtT5 sequence embeddings, red), aligned network embeddings (black), and ProtT5 sequence embeddings (gray). Stars indicate precision and recall values that yield maximum MicroF1 scores. The curves reveal that SPACE embeddings show particular strength in biological process prediction, while maintaining competitive performance in molecular function and cellular component prediction, highlighting the complementary nature of sequence and network information in capturing different aspects of protein function.
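
The starred points in these panels correspond to the threshold at which micro-averaged F1 peaks. A minimal sketch of that selection is given below, assuming that micro-averaging pools every (protein, term) prediction into one list before counting; the function names and the toy scores are hypothetical and not taken from the paper.

```python
def micro_prf(pairs, threshold):
    """Precision, recall and F1 over pooled (score, is_true) pairs."""
    tp = sum(1 for score, truth in pairs if score >= threshold and truth)
    fp = sum(1 for score, truth in pairs if score >= threshold and not truth)
    fn = sum(1 for score, truth in pairs if score < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def best_micro_f1(pairs):
    """Scan every observed score as a threshold; keep the best F1."""
    best = None
    for threshold in sorted({score for score, _ in pairs}):
        precision, recall, f1 = micro_prf(pairs, threshold)
        if best is None or f1 > best["f1"]:
            best = {"threshold": threshold, "precision": precision,
                    "recall": recall, "f1": f1}
    return best

# Hypothetical pooled predictions: (predicted score, annotation is true).
pairs = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
best = best_micro_f1(pairs)
```

With these toy scores the best threshold is 0.6, where precision is 0.75, recall is 1.0, and F1 is 6/7.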
