Bag-of-words is competitive with sum-of-embeddings language-inspired representations on protein inference

Frixos Papadopoulos et al.

PLoS One. 2025 Aug 6;20(8):e0325531. doi: 10.1371/journal.pone.0325531. eCollection 2025.

Abstract

Inferring protein function is a fundamental and long-standing problem in biology. Laboratory experiments in this field are often expensive, so large-scale computational protein inference from readily available amino acid sequences is needed to understand in greater detail the mechanisms underlying biological processes in living organisms. Recently, studies have utilised mathematical ideas from natural language processing and self-supervised learning to derive features based on protein sequence information. In language modelling, it has been shown that representations learnt from self-supervised pre-training can capture the semantic information of words well for downstream applications. In this study, we tested sequence-based protein representations, learnt through self-supervised pre-training on a large protein database, on multiple protein inference tasks. We show that simple baseline representations in the form of bag-of-words histograms perform better than those based on self-supervised learning on sequence similarity and protein inference tasks. Through feature selection, we show that the top discriminant features help bag-of-words representations capture important information for data-driven function prediction. These findings could have important implications for self-supervised learning models on protein sequences, and might encourage the consideration of alternative pre-training schemes for learning representations that capture more meaningful biological information from the sequence alone.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Main classification setup and protein representation methods.
LR is evaluated under 10foldCV using the AUC metric for binary classification. The SoT method maps each trigram in the split sequence to its corresponding embedding obtained from SSL pre-training [17], then sums the embeddings to obtain a 100-d representation. Hist-8000 simply counts the occurrences of each trigram in the split sequence to build an 8000-d BoW-like representation. 10foldCV: 10-fold Cross-Validation, LR: Logistic Regression, SoT: Sum-of-learnt-Trigrams, Hist-8000: Histogram-8000, AUC: Area Under the Curve, SSL: Self-Supervised Learning, BoW: Bag-of-Words, 8000-d: 8000-dimensional.
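To make the two representations concrete, the following is a minimal Python sketch of Hist-8000 and SoT as described in the caption. The overlapping trigram split, the `trigram_embeddings` lookup table (standing in for the SSL pre-trained embeddings of [17]), and all function names are illustrative assumptions, not the authors' code.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
TRIGRAMS = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]  # 20^3 = 8000
TRIGRAM_INDEX = {t: i for i, t in enumerate(TRIGRAMS)}

def split_trigrams(sequence):
    """Split a protein sequence into overlapping trigrams (assumed splitting scheme)."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]

def hist_8000(sequence):
    """Hist-8000: count occurrences of each trigram -> 8000-d BoW-like vector."""
    counts = np.zeros(len(TRIGRAMS))
    for tri in split_trigrams(sequence):
        if tri in TRIGRAM_INDEX:  # skip trigrams containing non-standard symbols
            counts[TRIGRAM_INDEX[tri]] += 1
    return counts

def sum_of_trigrams(sequence, trigram_embeddings):
    """SoT: sum the pre-trained 100-d embeddings of every trigram in the sequence."""
    vec = np.zeros(100)
    for tri in split_trigrams(sequence):
        if tri in trigram_embeddings:  # assumed dict: trigram -> 100-d np.ndarray
            vec += trigram_embeddings[tri]
    return vec
```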
Fig 2. Histogram representations outperform Sum-of-learnt-Trigrams representations in protein inference.
Both (a) and (b) are for the adhesins data and the LR classifier; see S1 File (supporting information), section ‘Simple Bag-of-Words outperforms Sum-of-learnt-Trigrams representations for protein inference’, for the rest of the tasks, where the trend is largely the same. Panel (a) shows the mean ROC curves after 10foldCV. In (b), the Hist-N representation (vertical dotted line) is highly accurate (its score falls beyond the 90th percentile) when compared to the distribution of AUCs from 1000 random feature sets. In (a), the x-axis is FPR and the y-axis is TPR; in (b), the y-axis is frequency of score. See section ‘Protein inference problems’ for data sources. Green curve: Hist-8000, red: SoT, black: Hist-N, purple: Hist-SDM12, blue: random classifier. Hist-8000: Histogram-8000, SoT: Sum-of-learnt-Trigrams, ROC: Receiver Operating Characteristic curve, AUC: Area Under the Curve, 10foldCV: 10-fold Cross-Validation, FPR: False Positive Rate, TPR: True Positive Rate, Hist-N: Histogram-N (selected features), Hist-SDM12: Histogram-Structural Derived Matrix-12, LR: Logistic Regression, S1: S1 File (supporting information).
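The random-feature comparison in panel (b) can be sketched as follows; this is an illustrative reconstruction, not the authors' implementation. The inputs `X` (an n-proteins × 8000 Hist-8000 matrix), `y` (binary adhesin labels), and `selected` (indices of the N chosen trigram features) are assumed names; only the draw count of 1000 and the 10-fold CV with ROC-AUC come from the caption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def mean_cv_auc(X, y, feature_idx):
    """Mean ROC-AUC of logistic regression under 10-fold cross-validation."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, feature_idx], y, cv=10, scoring="roc_auc").mean()

def random_feature_aucs(X, y, n_features, n_draws=1000, seed=0):
    """AUC for each of `n_draws` random feature subsets of size `n_features`."""
    rng = np.random.default_rng(seed)
    return np.array([
        mean_cv_auc(X, y, rng.choice(X.shape[1], size=n_features, replace=False))
        for _ in range(n_draws)
    ])

# Percentile of the selected set (Hist-N) within the random distribution:
# pct = (random_feature_aucs(X, y, len(selected)) < mean_cv_auc(X, y, selected)).mean() * 100
```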

References

    1. Berg JM, Tymoczko JL, Gatto GJJ, Stryer L. Biochemistry. Macmillan; 2015.
    2. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35(Database issue):D61-5. doi: 10.1093/nar/gkl842
    3. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-45. doi: 10.1093/nar/gkv1189
    4. Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Cold Spring Harbor Laboratory; 2019. doi: 10.1101/653105
    5. Bepler T, Berger B. Learning the protein language: evolution, structure, and function. Cell Syst. 2021;12(6):654-669.e3. doi: 10.1016/j.cels.2021.05.017
