BMC Biol. 2025 Aug 8;23(1):250. doi: 10.1186/s12915-025-02348-y.

Prediction of human pathogenic start loss variants based on self-supervised contrastive learning


Jie Liu et al. BMC Biol.

Abstract

Background: Start loss variants are a class of genetic variants that affect the bases of the start codon, disrupting the normal translation initiation process and leading to protein deletions or the production of different proteins. Accurate assessment of the pathogenicity of these variants is crucial for deciphering disease mechanisms and integrating genomics into clinical practice. However, among the tens of thousands of start loss variants in the human genome, only about 1% have been classified as pathogenic or benign. Computational methods that rely solely on small amounts of labeled data often lack sufficient generalization capabilities, restricting their effectiveness in predicting the impact of start loss variants.

Results: Here, we introduce StartCLR, a novel prediction method specifically designed for identifying pathogenic start loss variants. StartCLR captures variant context information from different dimensions by integrating embedding features from diverse DNA language models. Moreover, it employs self-supervised pre-training combined with supervised fine-tuning, enabling the effective utilization of both a large amount of unlabeled data and a small amount of labeled data to enhance prediction accuracy. Our experimental results show that StartCLR exhibits strong generalization and superior prediction performance across different test sets. Notably, when trained exclusively on high-confidence labeled data, StartCLR retains or even improves the prediction accuracy despite the reduced amount of labeled data.
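The abstract describes self-supervised contrastive pre-training on unlabeled variant embeddings followed by supervised fine-tuning, but does not spell out the contrastive objective. A common choice for this kind of setup is a SimCLR-style NT-Xent loss over two augmented views of the same sample; the NumPy sketch below is an illustrative assumption of that general recipe, not StartCLR's actual loss or architecture.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent contrastive loss (illustrative assumption).

    z1, z2: (N, D) embeddings of the same N unlabeled variants under
    two different augmentations. Each row in z1 is pulled toward its
    counterpart in z2 and pushed away from all other rows.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                 # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit-normalize rows
    sim = z @ z.T / temperature                          # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                       # exclude self-similarity
    # the positive for row i is its augmented twin: i+N (first half) or i-N
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_denom = np.log(np.exp(sim).sum(axis=1))          # log-sum over all candidates
    loss = log_denom - sim[np.arange(2 * n), pos]        # -log softmax of the positive
    return loss.mean()
```

During pre-training, only this loss would drive the encoder; the classifier head and labeled data enter later, in the fine-tuning phase described above.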

Conclusions: Collectively, these findings highlight the potential of integrating self-supervised contrastive learning with unlabeled data to mitigate the challenge posed by the scarcity of labeled start loss variants.

Keywords: Fine-tune; Pathogenicity prediction; Self-supervised contrastive learning; Start loss variant.


Conflict of interest statement

Declarations
Ethics approval and consent to participate: Not applicable.
Consent for publication: Not applicable.
Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Framework of StartCLR for assessing the pathogenicity of start loss variants. A Pre-training phase: the encoder is pre-trained using unlabeled data through self-supervised contrastive learning. B Fine-tuning phase: the pre-trained encoder is fine-tuned, and the classifier is trained using labeled data to predict pathogenic start loss variants
Fig. 2
Prediction results of different model architectures on independent test sets. A AUC values of different model architectures on independent test sets. B AUPR values of different model architectures on independent test sets. Random refers to randomly initializing the encoder and training the classifier solely on the labeled fine-tuning dataset. Zero-shot CL uses an encoder pre-trained on unlabeled data, keeps it frozen, and trains only the classifier on the fine-tuning dataset. CL loads the pre-trained encoder and fine-tunes it together with the classifier on the fine-tuning dataset. Supervised trains both the encoder and classifier from scratch using only the fine-tuning dataset
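The four regimes in this caption differ only in how the encoder is initialized and whether its weights are updated during fine-tuning. The toy NumPy sketch below makes that distinction concrete with a hypothetical one-layer tanh "encoder" and logistic classifier; the model, names, and hyperparameters are illustrative, not the paper's architecture.

```python
import numpy as np

def train(encoder_W, clf_w, X, y, lr=0.1, steps=200, freeze_encoder=False):
    """Toy logistic head on a linear+tanh 'encoder' (illustrative only).

    Mapping to the caption's regimes:
    - Random / Supervised: encoder_W starts random, freeze_encoder=False
    - Zero-shot CL: encoder_W comes from pre-training, freeze_encoder=True
    - CL: encoder_W comes from pre-training, freeze_encoder=False
    """
    W, w = encoder_W.copy(), clf_w.copy()
    n = len(y)
    for _ in range(steps):
        h = np.tanh(X @ W)                      # encoder forward pass
        p = 1.0 / (1.0 + np.exp(-(h @ w)))      # classifier (sigmoid) forward pass
        g = p - y                               # dLoss/dlogit for binary cross-entropy
        if not freeze_encoder:
            # backpropagate through tanh into the encoder weights
            grad_W = X.T @ ((g[:, None] * w) * (1.0 - h**2)) / n
            W -= lr * grad_W
        w -= lr * (h.T @ g / n)                 # classifier head is always trained
    return W, w
```

With `freeze_encoder=True` the returned encoder weights are bit-identical to the input, which is exactly the zero-shot regime; all other regimes update both parameter sets.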
Fig. 3
Performance of different data augmentation methods on independent test sets. A AUC values of different data augmentation methods on independent test sets. B AUPR values of different data augmentation methods on independent test sets. Token cutoff sets the entire embedding vector of a single word to zero. Feature cutoff sets a specific embedding feature dimension to zero across all words. Dropout randomly sets individual feature values within the embedding matrix to zero
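All three augmentations in this caption operate on a token-by-feature embedding matrix: token cutoff zeroes a row, feature cutoff zeroes a column, and dropout zeroes individual entries. A minimal NumPy sketch, with function names chosen for illustration (the dropout here does no rescaling, matching the caption's plain description rather than the inverted-dropout convention):

```python
import numpy as np

def token_cutoff(emb, token_idx):
    """Zero the entire embedding vector (row) of one token."""
    out = emb.copy()
    out[token_idx, :] = 0.0
    return out

def feature_cutoff(emb, feature_idx):
    """Zero one embedding feature dimension (column) across all tokens."""
    out = emb.copy()
    out[:, feature_idx] = 0.0
    return out

def dropout(emb, p, rng):
    """Randomly zero individual values of the embedding matrix with probability p."""
    mask = rng.random(emb.shape) >= p
    return emb * mask
```

Each function returns a perturbed copy, so the two "views" needed for contrastive pre-training can be produced by applying an augmentation twice with different randomness.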
Fig. 4
Performance evaluation of different features on independent test sets. A AUC values of different features on independent test set 1. B AUPR values of different features on independent test set 1. C AUC values of different features on independent test set 2. D AUPR values of different features on independent test set 2
Fig. 5
Statistics of missing predictions across different prediction methods on independent test sets. A Statistics of missing predictions for different methods on independent test set 1. B Statistics of missing predictions for different methods on independent test set 2
Fig. 6
Performance comparison of different variant pathogenicity prediction methods on independent test sets. A AUC values for pairwise comparisons of all methods on independent test set 1. B AUPR values for pairwise comparisons of all methods on independent test set 1. C AUC values for pairwise comparisons of all methods on independent test set 2. D AUPR values for pairwise comparisons of all methods on independent test set 2
Fig. 7
Prediction performance of different variant pathogenicity prediction methods on the test subsets where all tools provide results (no missing values). A ROC curves of different methods on test subset 1. B PR curves of different methods on test subset 1. C ROC curves of different methods on test subset 2. D PR curves of different methods on test subset 2
