Genome Biol. 2023 Aug 7;24(1):182.

doi: 10.1186/s13059-023-03024-6.

Cross-protein transfer learning substantially improves disease variant prediction

Milind Jagota^#¹, Chengzhong Ye^#², Carlos Albors¹, Ruchir Rastogi¹, Antoine Koehl², Nilah Ioannidis^{1,3,4}, Yun S Song^{5,6,7}

Affiliations

¹ Computer Science Division, University of California, Berkeley, 94720, CA, USA.
² Department of Statistics, University of California, Berkeley, 94720, CA, USA.
³ Chan Zuckerberg Biohub, San Francisco, 94158, CA, USA.
⁴ Center for Computational Biology, University of California, Berkeley, 94720, CA, USA.
⁵ Computer Science Division, University of California, Berkeley, 94720, CA, USA. yss@berkeley.edu.
⁶ Department of Statistics, University of California, Berkeley, 94720, CA, USA. yss@berkeley.edu.
⁷ Center for Computational Biology, University of California, Berkeley, 94720, CA, USA. yss@berkeley.edu.

^# Contributed equally.

PMID: 37550700
PMCID: PMC10408151
DOI: 10.1186/s13059-023-03024-6

Milind Jagota et al. Genome Biol. 2023.

Abstract

Background: Genetic variation in the human genome is a major determinant of individual disease risk, but the vast majority of missense variants have unknown etiological effects. Here, we present a robust learning framework for leveraging saturation mutagenesis experiments to construct accurate computational predictors of proteome-wide missense variant pathogenicity.

Results: We train cross-protein transfer (CPT) models using deep mutational scanning (DMS) data from only five proteins and achieve state-of-the-art performance on clinical variant interpretation for unseen proteins across the human proteome. We also improve predictive accuracy on DMS data from held-out proteins. High sensitivity is crucial for clinical applications and our model CPT-1 particularly excels in this regime. For instance, at 95% sensitivity of detecting human disease variants annotated in ClinVar, CPT-1 improves specificity to 68%, from 27% for ESM-1v and 55% for EVE. Furthermore, for genes not used to train REVEL, a supervised method widely used by clinicians, we show that CPT-1 compares favorably with REVEL. Our framework combines predictive features derived from general protein sequence models, vertebrate sequence alignments, and AlphaFold structures, and it is adaptable to the future inclusion of other sources of information. We find that vertebrate alignments, albeit rather shallow with only 100 genomes, provide a strong signal for variant pathogenicity prediction that is complementary to recent deep learning-based models trained on massive amounts of protein sequence data. We release predictions for all possible missense variants in 90% of human genes.
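The "specificity at 95% sensitivity" figures quoted above can be read as follows: pick the strictest score threshold that still flags at least 95% of pathogenic variants, then measure the fraction of benign variants that fall below it. The sketch below is a minimal illustration of that metric, not the authors' evaluation code. The function name, the convention that higher scores mean "more pathogenic", and the label encoding (1 = pathogenic, 0 = benign) are assumptions made for the example.

```python
import math

def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Return (threshold, specificity) at the strictest threshold that
    reaches the target sensitivity.

    Assumed conventions (not taken from the paper):
      - higher score = more likely pathogenic
      - labels: 1 = pathogenic, 0 = benign
      - a variant is called pathogenic when score >= threshold
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    # Smallest number of pathogenic variants that must be recalled.
    # The small epsilon guards against floating-point overshoot in the product.
    k = max(1, math.ceil(target_sensitivity * len(pos) - 1e-9))
    threshold = pos[k - 1]
    # Specificity: fraction of benign variants scored below the threshold.
    true_negatives = sum(1 for s in neg if s < threshold)
    return threshold, true_negatives / len(neg)
```

Comparing predictors at a fixed sensitivity, rather than by AUROC alone, isolates the operating regime where almost no pathogenic variant is missed.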

Conclusions: Our results demonstrate the utility of mutational scanning data for learning properties of variants that transfer to unseen proteins.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Method overview. We develop computational missense variant effect predictors by training on functional assay data from very few proteins and achieve substantially improved performance over the state-of-the-art. We combine general protein sequence variation (EVE, ESM-1v), sequence variation at local evolutionary timescales (vertebrate alignments), protein structure (AlphaFold2, ProteinMPNN), and amino acid representations. We assess our models on unseen proteins across the human proteome and release predictions for all missense variants in 90% of human genes

**Fig. 2**
CPT-1 achieves state-of-the-art performance on clinical variant and functional assay prediction. A Receiver-operating characteristic (ROC) curves for ESM-1v, EVE, and our transfer model CPT-1 on annotated missense variants in ClinVar. CPT-1 improves the true positive rate at all false positive rates over both baselines and has a significantly higher AUROC. B Specificity in the clinically relevant high-sensitivity regime on ClinVar missense variants. When all models are constrained to recall nearly all pathogenic variants, CPT-1 improves on EVE and ESM-1v by large margins. C Per-gene AUROC on ClinVar missense variants in 886 genes with at least four benign and four pathogenic variants. Interquartile range and median are shown in black; the mean is shown in white. CPT-1 improves or equals the per-gene AUROC on 72% of genes for EVE and 79% of genes for ESM-1v. D CPT-1 outperforms REVEL on proteins that were not trained on by REVEL, demonstrating the value of developing predictors with cross-protein transfer in mind. E We trained regression versions of CPT-1 to predict functional assays (Methods). We show Spearman’s $ρ$ on DMS datasets of human proteins from ProteinGym (full details in Additional file 1: Table S3). The left plot compares CPT-1 to EVE, and the right compares CPT-1 to ESM-1v. In each plot, points above the diagonal line indicate a gene where CPT-1 outperforms the baseline. With the test protein held out in all cases, CPT-1 outperforms EVE on 16 out of 18 proteins and outperforms ESM-1v on 15 out of 18
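Panel C reports AUROC per gene, restricted to genes with at least four benign and four pathogenic variants. A minimal sketch of that kind of computation is given here for illustration. The record layout `(gene, score, label)`, the helper names, and the pairwise (Mann-Whitney) AUROC formula are choices made for this example and are not taken from the paper's code.

```python
from collections import defaultdict

def auroc(scores, labels):
    """AUROC as the probability that a pathogenic variant outscores a
    benign one (ties count one half). Labels: 1 = pathogenic, 0 = benign."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def per_gene_auroc(records, min_per_class=4):
    """records: iterable of (gene, score, label) tuples (hypothetical layout).
    Genes with fewer than `min_per_class` variants in either class are skipped."""
    by_gene = defaultdict(list)
    for gene, score, label in records:
        by_gene[gene].append((score, label))
    result = {}
    for gene, rows in by_gene.items():
        n_pos = sum(1 for _, y in rows if y == 1)
        n_neg = len(rows) - n_pos
        if n_pos >= min_per_class and n_neg >= min_per_class:
            scores, labels = zip(*rows)
            result[gene] = auroc(scores, labels)
    return result
```

The pairwise formula is quadratic in the number of variants per gene, which is acceptable for the small per-gene counts typical of ClinVar; a rank-based implementation would be preferable for large sets.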

**Fig. 3**
Training on DMS is important for CPT-1 performance. A We compared CPT-1 performance to several baselines that do not fully use the DMS data. These baselines were as follows: averaging EVE and ESM-1v, averaging random features (set to the correct sign), and averaging features selected by feature selection. CPT-1 outperforms these baselines, especially in the high-sensitivity regime. This demonstrates the value of a full training procedure on DMS data. B We examined the dependence of CPT-1 performance on the number of training genes used. Each dot indicates a specific choice of training genes, with the mean shown as a black horizontal bar. More training genes always increases average performance, but there is significant variance and performance increases appear to be saturating. We also examined the use of additional, more heterogeneous datasets from ProteinGym, finding that this did not increase performance (Additional file 1: Fig. S2)

**Fig. 4**
Vertebrate alignments are key to improved performance and a powerful baseline. A Specificity in the clinically relevant high-sensitivity regime on ClinVar missense variants. Removing vertebrate alignments from CPT-1 significantly decreases the margin of improvement over baseline. Conservation among 100 vertebrates is a powerful single feature baseline and is competitive with much more complex models in the high-sensitivity regime. Vertebrate alignments are much less powerful in the high specificity regime (Additional file 1: Table S2). B If a missense variant from ClinVar appears in a vertebrate alignment, it is highly likely to be benign. Of the variants that do not occur in any of our studied vertebrates, 39% are benign. Of the variants that occur in a vertebrate, 91% are benign. Of the variants that occur in a mammal (subset of vertebrates), 97% are benign. This signal is not fully leveraged by EVE and ESM-1v due to the sequence redundancy filtering that is employed by both methods and is key to the improved performance of CPT-1
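The signal in panel B amounts to a simple lookup: does the variant amino acid appear at the aligned position in any other species, and how does the benign fraction differ between variants that do and do not? The sketch below illustrates that tabulation on a hypothetical input layout (one string of aligned residues per position, one character per species). The function names and data layout are assumptions made for the example, and the code is not the authors' pipeline.

```python
def occurs_in_column(alignment_column, alt_aa):
    """True if the variant amino acid is observed in any aligned species.

    alignment_column: string with one residue (or '-' for a gap) per species.
    """
    return alt_aa in alignment_column

def benign_fraction_by_occurrence(variants, columns):
    """Benign fraction among variants that do / do not occur in the alignment.

    variants: list of (position, alt_aa, is_benign) tuples (hypothetical layout)
    columns:  dict mapping position -> aligned residues across species
    Returns {True: fraction, False: fraction}; None when a group is empty.
    """
    tallies = {True: [0, 0], False: [0, 0]}  # occurs -> [benign, total]
    for position, alt_aa, is_benign in variants:
        seen = occurs_in_column(columns[position], alt_aa)
        tallies[seen][0] += int(is_benign)
        tallies[seen][1] += 1
    return {seen: (b / t if t else None) for seen, (b, t) in tallies.items()}
```

Restricting `columns` to a subset of species (for example, mammals only) would give the stratified version of the same statistic.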

**Fig. 5**
Insights from AlphaFold structures. A Specificity of CPT-1 in the clinically relevant high-sensitivity regime on ClinVar missense variants. Structural features slightly improve CPT-1 performance even though ProteinMPNN alone has poor performance. B Pathogenic ClinVar variants are more likely to have many contacts in the AlphaFold2 structure for the protein compared to benign variants
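A per-residue contact count of the kind referred to in panel B can be computed directly from structure coordinates. The sketch below counts, for each residue, how many other residues have a Cα atom within a distance cutoff. The use of Cα atoms, the 8 Å cutoff, and the sequence-separation parameter are assumptions chosen for illustration; the caption does not specify how contacts were defined.

```python
import math

def contact_counts(ca_coords, cutoff=8.0, min_separation=1):
    """Number of residues within `cutoff` angstroms of each residue.

    ca_coords: list of (x, y, z) C-alpha coordinates, one per residue.
    cutoff and min_separation are illustrative defaults, not values
    taken from the paper.
    """
    n = len(ca_coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts
```

Raising `min_separation` excludes sequence neighbors, so the count reflects packing in three dimensions rather than chain adjacency.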

**Fig. 6**
Cross-gene imputation. EVE scores are not available for the vast majority of human proteins. To scale our method to the whole human proteome, we imputed EVE scores and other features that depend on a large MSA in genes where they are not available. We assessed the quality of our imputation on genes where EVE scores are available, to measure how well we do compared to using the true values. A, B CPT-1 with imputed EVE still outperforms ESM-1v and the true EVE scores. C, D Imputed EVE scores improve performance of CPT-1 compared to removing them entirely, but there is still a gap to using the true EVE scores
