Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference

Frixos Papadopoulos et al.

PLoS One. 2025 Aug 6;20(8):e0325531. doi: 10.1371/journal.pone.0325531. eCollection 2025.

Abstract

Inferring protein function is a fundamental and long-standing problem in biology. Laboratory experiments in this field are often expensive, so large-scale computational protein inference from readily available amino acid sequences is needed to understand in greater detail the mechanisms underlying biological processes in living organisms. Recently, studies have utilised mathematical ideas from natural language processing and self-supervised learning to derive features based on protein sequence information. In language modelling, it has been shown that representations learnt from self-supervised pre-training can capture the semantic information of words well for downstream applications. In this study, we tested sequence-based protein representations, learnt through self-supervised pre-training on a large protein database, on multiple protein inference tasks. We show that simple baseline representations in the form of bag-of-words histograms perform better than those based on self-supervised learning on sequence similarity and protein inference tasks. Through feature selection, we show that the top discriminant features help bag-of-words representations capture important information for data-driven function prediction. These findings could have important implications for self-supervised learning models on protein sequences, and might encourage the consideration of alternative pre-training schemes for learning representations that capture more meaningful biological information from the sequence alone.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Main classification setup and protein representation methods.
LR is evaluated under 10foldCV using the AUC metric for binary classification. The SoT method maps each trigram in the split sequence to its corresponding embedding obtained from SSL pre-training [17], then sums the embeddings to obtain a 100-d representation. Hist-8000 simply counts the occurrences of each trigram in the split sequence to build an 8000-d BoW-like representation. 10foldCV: 10-fold Cross-Validation, LR: Logistic Regression, SoT: Sum-of-learnt-Trigrams, Hist-8000: Histogram-8000, AUC: Area Under the Curve, SSL: Self-Supervised Learning, BoW: Bag-of-Words, 8000-d: 8000-dimensional.
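To make the two representations concrete, the following is a minimal Python sketch of Hist-8000 and SoT as described in the caption. The overlapping trigram split, the `trigram_embeddings` lookup table (standing in for the SSL pre-trained embeddings of [17]), and all function names are illustrative assumptions, not the authors' code.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
TRIGRAMS = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]  # 20^3 = 8000
TRIGRAM_INDEX = {t: i for i, t in enumerate(TRIGRAMS)}

def split_trigrams(sequence):
    """Split a protein sequence into overlapping trigrams (assumed splitting scheme)."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]

def hist_8000(sequence):
    """Hist-8000: count occurrences of each trigram -> 8000-d BoW-like vector."""
    counts = np.zeros(len(TRIGRAMS))
    for tri in split_trigrams(sequence):
        if tri in TRIGRAM_INDEX:  # skip trigrams containing non-standard symbols
            counts[TRIGRAM_INDEX[tri]] += 1
    return counts

def sum_of_trigrams(sequence, trigram_embeddings):
    """SoT: sum the pre-trained 100-d embeddings of every trigram in the sequence."""
    vec = np.zeros(100)
    for tri in split_trigrams(sequence):
        if tri in trigram_embeddings:  # assumed dict: trigram -> 100-d np.ndarray
            vec += trigram_embeddings[tri]
    return vec
```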
Fig 2. Histogram representations outperform Sum-of-learnt-Trigrams representations in protein inference.
Both (a) and (b) are for the adhesins data and the LR classifier; see S1 File (supporting information), section ‘Simple Bag-of-Words outperforms Sum-of-learnt-Trigrams representations for protein inference’, for the rest of the tasks, where the trend is largely the same. Panel (a) shows the mean ROC curves after 10foldCV. In (b), the Hist-N representation (vertical dotted line) is highly accurate (its score falls beyond the 90th percentile) when compared to the distribution of AUCs from 1000 random feature sets. In (a), the x-axis is FPR and the y-axis is TPR; in (b), the y-axis is frequency of score. See section ‘Protein inference problems’ for data sources. Green curve: Hist-8000, red: SoT, black: Hist-N, purple: Hist-SDM12, blue: random classifier. Hist-8000: Histogram-8000, SoT: Sum-of-learnt-Trigrams, ROC: Receiver Operating Characteristic curve, AUC: Area Under the Curve, 10foldCV: 10-fold Cross-Validation, FPR: False Positive Rate, TPR: True Positive Rate, Hist-N: Histogram-N (selected features), Hist-SDM12: Histogram-Structural Derived Matrix-12, LR: Logistic Regression, S1: S1 File (supporting information).
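The random-feature comparison in panel (b) can be sketched as follows; this is an illustrative reconstruction, not the authors' implementation. The inputs `X` (an n-proteins × 8000 Hist-8000 matrix), `y` (binary adhesin labels), and `selected` (indices of the N chosen trigram features) are assumed names; only the draw count of 1000 and the 10-fold CV with ROC-AUC come from the caption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def mean_cv_auc(X, y, feature_idx):
    """Mean ROC-AUC of logistic regression under 10-fold cross-validation."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, feature_idx], y, cv=10, scoring="roc_auc").mean()

def random_feature_aucs(X, y, n_features, n_draws=1000, seed=0):
    """AUC for each of `n_draws` random feature subsets of size `n_features`."""
    rng = np.random.default_rng(seed)
    return np.array([
        mean_cv_auc(X, y, rng.choice(X.shape[1], size=n_features, replace=False))
        for _ in range(n_draws)
    ])

# Percentile of the selected set (Hist-N) within the random distribution:
# pct = (random_feature_aucs(X, y, len(selected)) < mean_cv_auc(X, y, selected)).mean() * 100
```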

References

    1. Berg JM, Tymoczko JL, Gatto GJJ, Stryer L. Biochemistry. Macmillan; 2015.
    2. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35(Database issue):D61-5. doi: 10.1093/nar/gkl842
    3. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-45. doi: 10.1093/nar/gkv1189
    4. Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Cold Spring Harbor Laboratory; 2019. doi: 10.1101/653105
    5. Bepler T, Berger B. Learning the protein language: evolution, structure, and function. Cell Syst. 2021;12(6):654-669.e3. doi: 10.1016/j.cels.2021.05.017
