BMC Biol. 2025 Aug 8;23(1):250. doi: 10.1186/s12915-025-02348-y.

Prediction of human pathogenic start loss variants based on self-supervised contrastive learning


Jie Liu et al. BMC Biol.

Abstract

Background: Start loss variants are a class of genetic variants that affect the bases of the start codon, disrupting the normal translation initiation process and leading to protein deletions or the production of different proteins. Accurate assessment of the pathogenicity of these variants is crucial for deciphering disease mechanisms and integrating genomics into clinical practice. However, among the tens of thousands of start loss variants in the human genome, only about 1% have been classified as pathogenic or benign. Computational methods that rely solely on small amounts of labeled data often lack sufficient generalization capabilities, restricting their effectiveness in predicting the impact of start loss variants.

Results: Here, we introduce StartCLR, a novel prediction method specifically designed for identifying pathogenic start loss variants. StartCLR captures variant context information from different dimensions by integrating embedding features from diverse DNA language models. Moreover, it employs self-supervised pre-training combined with supervised fine-tuning, enabling the effective utilization of both a large amount of unlabeled data and a small amount of labeled data to enhance prediction accuracy. Our experimental results show that StartCLR exhibits strong generalization and superior prediction performance across different test sets. Notably, when trained exclusively on high-confidence labeled data, StartCLR retains or even improves the prediction accuracy despite the reduced amount of labeled data.
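The abstract describes self-supervised contrastive pre-training on unlabeled variant embeddings followed by supervised fine-tuning, but does not spell out the contrastive objective. A common choice for this kind of setup is a SimCLR-style NT-Xent loss over two augmented views of the same sample; the NumPy sketch below is an illustrative assumption of that general recipe, not StartCLR's actual loss or architecture.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent contrastive loss (illustrative assumption).

    z1, z2: (N, D) embeddings of the same N unlabeled variants under
    two different augmentations. Each row in z1 is pulled toward its
    counterpart in z2 and pushed away from all other rows.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                 # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit-normalize rows
    sim = z @ z.T / temperature                          # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                       # exclude self-similarity
    # the positive for row i is its augmented twin: i+N (first half) or i-N
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_denom = np.log(np.exp(sim).sum(axis=1))          # log-sum over all candidates
    loss = log_denom - sim[np.arange(2 * n), pos]        # -log softmax of the positive
    return loss.mean()
```

During pre-training, only this loss would drive the encoder; the classifier head and labeled data enter later, in the fine-tuning phase described above.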

Conclusions: Collectively, these findings highlight the potential of integrating self-supervised contrastive learning with unlabeled data to mitigate the challenge posed by the scarcity of labeled start loss variants.

Keywords: Fine-tune; Pathogenicity prediction; Self-supervised contrastive learning; Start loss variant.


Conflict of interest statement

Declarations
Ethics approval and consent to participate: Not applicable.
Consent for publication: Not applicable.
Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Framework of StartCLR for assessing the pathogenicity of start loss variants. A Pre-training phase: the encoder is pre-trained using unlabeled data through self-supervised contrastive learning. B Fine-tuning phase: the pre-trained encoder is fine-tuned, and the classifier is trained using labeled data to predict pathogenic start loss variants
Fig. 2
Prediction results of different model architectures on independent test sets. A AUC values of different model architectures on independent test sets. B AUPR values of different model architectures on independent test sets. Random refers to randomly initializing the encoder and training the classifier solely on the labeled fine-tuning dataset. Zero-shot CL uses an encoder pre-trained on unlabeled data, keeps it frozen, and trains only the classifier on the fine-tuning dataset. CL loads the pre-trained encoder and fine-tunes it together with the classifier on the fine-tuning dataset. Supervised trains both the encoder and classifier from scratch using only the fine-tuning dataset
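The four regimes in this caption differ only in how the encoder is initialized and whether its weights are updated during fine-tuning. The toy NumPy sketch below makes that distinction concrete with a hypothetical one-layer tanh "encoder" and logistic classifier; the model, names, and hyperparameters are illustrative, not the paper's architecture.

```python
import numpy as np

def train(encoder_W, clf_w, X, y, lr=0.1, steps=200, freeze_encoder=False):
    """Toy logistic head on a linear+tanh 'encoder' (illustrative only).

    Mapping to the caption's regimes:
    - Random / Supervised: encoder_W starts random, freeze_encoder=False
    - Zero-shot CL: encoder_W comes from pre-training, freeze_encoder=True
    - CL: encoder_W comes from pre-training, freeze_encoder=False
    """
    W, w = encoder_W.copy(), clf_w.copy()
    n = len(y)
    for _ in range(steps):
        h = np.tanh(X @ W)                      # encoder forward pass
        p = 1.0 / (1.0 + np.exp(-(h @ w)))      # classifier (sigmoid) forward pass
        g = p - y                               # dLoss/dlogit for binary cross-entropy
        if not freeze_encoder:
            # backpropagate through tanh into the encoder weights
            grad_W = X.T @ ((g[:, None] * w) * (1.0 - h**2)) / n
            W -= lr * grad_W
        w -= lr * (h.T @ g / n)                 # classifier head is always trained
    return W, w
```

With `freeze_encoder=True` the returned encoder weights are bit-identical to the input, which is exactly the zero-shot regime; all other regimes update both parameter sets.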
Fig. 3
Performance of different data augmentation methods on independent test sets. A AUC values of different data augmentation methods on independent test sets. B AUPR values of different data augmentation methods on independent test sets. Token cutoff sets the entire embedding vector of a single word to zero. Feature cutoff sets a specific embedding feature dimension to zero across all words. Dropout randomly sets individual feature values within the embedding matrix to zero
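All three augmentations in this caption operate on a token-by-feature embedding matrix: token cutoff zeroes a row, feature cutoff zeroes a column, and dropout zeroes individual entries. A minimal NumPy sketch, with function names chosen for illustration (the dropout here does no rescaling, matching the caption's plain description rather than the inverted-dropout convention):

```python
import numpy as np

def token_cutoff(emb, token_idx):
    """Zero the entire embedding vector (row) of one token."""
    out = emb.copy()
    out[token_idx, :] = 0.0
    return out

def feature_cutoff(emb, feature_idx):
    """Zero one embedding feature dimension (column) across all tokens."""
    out = emb.copy()
    out[:, feature_idx] = 0.0
    return out

def dropout(emb, p, rng):
    """Randomly zero individual values of the embedding matrix with probability p."""
    mask = rng.random(emb.shape) >= p
    return emb * mask
```

Each function returns a perturbed copy, so the two "views" needed for contrastive pre-training can be produced by applying an augmentation twice with different randomness.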
Fig. 4
Performance evaluation of different features on independent test sets. A AUC values of different features on independent test set 1. B AUPR values of different features on independent test set 1. C AUC values of different features on independent test set 2. D AUPR values of different features on independent test set 2
Fig. 5
Statistics of missing predictions across different prediction methods on independent test sets. A Statistics of missing predictions for different methods on independent test set 1. B Statistics of missing predictions for different methods on independent test set 2
Fig. 6
Performance comparison of different variant pathogenicity prediction methods on independent test sets. A AUC values for pairwise comparisons of all methods on independent test set 1. B AUPR values for pairwise comparisons of all methods on independent test set 1. C AUC values for pairwise comparisons of all methods on independent test set 2. D AUPR values for pairwise comparisons of all methods on independent test set 2
Fig. 7
Prediction performance of different variant pathogenicity prediction methods on the test subsets where all tools provide results (no missing values). A ROC curves of different methods on test subset 1. B PR curves of different methods on test subset 1. C ROC curves of different methods on test subset 2. D PR curves of different methods on test subset 2
