Nat Commun. 2024 Oct 28;15(1):9284.
doi: 10.1038/s41467-024-53520-x.

Chemoproteogenomic stratification of the missense variant cysteinome

Heta Desai et al. Nat Commun. 2024.

Abstract

Cancer genomes are rife with genetic variants; one key outcome of this variation is widespread gain-of-cysteine mutations. These acquired cysteines can be both driver mutations and sites targeted by precision therapies. However, despite their ubiquity, nearly all acquired cysteines remain unidentified via chemoproteomics; identification is a critical step to enable functional analysis, including assessment of potential druggability and susceptibility to oxidation. Here, we pair cysteine chemoproteomics (a technique that enables proteome-wide pinpointing of functional, redox-sensitive, and potentially druggable residues) with genomics to reveal the hidden landscape of cysteine genetic variation. Our chemoproteogenomics platform integrates chemoproteomic, whole exome, and RNA-seq data with a customized two-stage false discovery rate (FDR) error-controlled proteomic search, which is further enhanced with a user-friendly FragPipe interface. Chemoproteogenomics analysis reveals that cysteine acquisition is a ubiquitous feature of both healthy and cancer genomes that is further elevated in the context of decreased DNA repair. Reference cysteines proximal to missense variants are also found to be pervasive, supporting heretofore untapped opportunities for variant-specific chemical probe development campaigns. As chemoproteogenomics is further distinguished by sample-matched combinatorial variant databases and is compatible with redox proteomics and small molecule screening, we expect widespread utility in guiding proteoform-specific biology and therapeutic discovery.

Conflict of interest statement

K.M.B. is a member of the advisory board at Matchpoint Therapeutics. A.I.N. and F.Y. receive royalties from the University of Michigan for the sale of MSFragger and IonQuant software licenses to commercial entities. All license transactions are managed by the University of Michigan Innovation Partnerships office, and all proceeds are subject to university technology transfer policy. The remaining authors declare no competing interests.

Figures

Fig. 1. Establishing an MSFragger-search pipeline for variant peptide identification.
A Two-stage FDR MSFragger-enabled variant searches: variant databases are generated from non-redundant reference protein sequences that are mutated in silico to incorporate sequencing-derived missense variants, followed by a two-stage FDR MSFragger/PeptideProphet search to identify confident variant-containing peptides. First, raw spectra are searched against a normal reference protein database; confidently matched spectra (passing 1% FDR) are removed, and the remaining spectra are searched against a variant tryptic database. B Chemoproteomics workflow to validate heavy and light biotin. HEK293T cell lysates were labeled with pan-reactive iodoacetamide alkyne (IAA) followed by ‘click’ conjugation onto heavy or light biotin azide enrichment handles in known ratios. Following neutravidin enrichment, samples are digested and subjected to MS/MS analysis. C Heavy to light ratios (H:L) from triplicate datasets (n = 3) comparing identifications from reference and variant searches; mean ratio value indicated, dashed lines indicate ground-truth log2 ratio, statistical significance was calculated using a two-sided Mann-Whitney U test, **p < 0.01, ns p > 0.05 (1:1, p = 0.002; 10:10, p = 0.083; 1:4, p = 0.84; 4:1, p = 0.093; 1:10, p = 0.056; 10:1, p = 0.061). D Retention time difference for heavy and light identified peptides for reference and variant searches; mean value indicated, statistical significance was calculated using a two-sided Mann-Whitney U test, ns p > 0.05 (1:1, p = 0.47; 10:10, p = 0.42; 1:4, p = 0.45; 4:1, p = 0.57; 1:10, p = 0.13; 10:1, p = 0.34). Box plot center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range. Proteomic data is found in Supplementary Data 1 and source data in the Source Data file.
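The two-stage search logic in panel A can be illustrated with a minimal sketch. It assumes a simple target-decoy FDR estimate and a single best peptide-spectrum match (PSM) per spectrum; the `PSM` class and the `search_reference` and `search_variant` callables are hypothetical stand-ins, not the MSFragger or PeptideProphet interfaces, which use their own scoring and statistical models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    spectrum_id: str
    peptide: str
    score: float      # hypothetical search score, higher is better
    is_decoy: bool


def passing_fdr(psms, max_fdr=0.01):
    """Keep target PSMs above the lowest score cutoff whose decoy/target ratio <= max_fdr."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = cutoff = 0
    for i, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimated FDR is still acceptable.
        if targets and decoys / targets <= max_fdr:
            cutoff = i
    return [p for p in ranked[:cutoff] if not p.is_decoy]


def two_stage_search(spectra, search_reference, search_variant, max_fdr=0.01):
    """Stage 1: reference database. Stage 2: variant database on unexplained spectra only."""
    hits = [search_reference(s) for s in spectra]
    stage1 = passing_fdr([h for h in hits if h is not None], max_fdr)
    explained = {p.spectrum_id for p in stage1}
    remaining = [s for s in spectra if s not in explained]
    hits = [search_variant(s) for s in remaining]
    stage2 = passing_fdr([h for h in hits if h is not None], max_fdr)
    return stage1, stage2
```

Removing confidently explained spectra before the variant search keeps reference peptides from competing with, or being mistaken for, variant peptides in the second stage.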
Fig. 2. Acquired cysteines are prevalent across cancer genomes, particularly for high missense burden cell lines.
A The full scope of acquired cysteines in the COSMIC Cell Lines Project (COSMIC-CLP, cancer.sanger.ac.uk/cell_lines) (v96) was analyzed. B 1020 cell lines stratified by the number of gained cysteines and total missense mutations; color indicates cancer type for the top 15 highest missense count cell lines. C Net missense mutations (gained-lost) from COSMIC-CLP (v96). D Top 15 cell lines with highest missense burden from panel (B); linear regression and 95% confidence interval shaded in gray. E Overlap of genes with acquired cysteines in top 15 subsets from panel (B) with Census genes and targets of FDA-approved drugs. F Panel of cell lines used in this study with MMR status (dMMR = deficient mismatch repair, pMMR = proficient mismatch repair). Data is found in Supplementary Data 2 and source data is in the Source Data file.
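The "net missense mutations (gained-lost)" quantity in panel C can be sketched as a per-residue tally. This is a minimal illustration that assumes each missense variant is represented as a (reference, alternate) pair of one-letter amino acid codes; the function name is hypothetical.

```python
from collections import Counter


def net_amino_acid_changes(variants):
    """Return gained minus lost counts per amino acid for (ref, alt) missense pairs."""
    gained = Counter(alt for ref, alt in variants if ref != alt)
    lost = Counter(ref for ref, alt in variants if ref != alt)
    return {aa: gained[aa] - lost[aa] for aa in sorted(set(gained) | set(lost))}
```

A positive value for `C` means that cysteine is acquired more often than it is lost across the variant set.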
Fig. 3. Incorporating variants into sample-specific search databases.
A Sequencing portion of the ‘chemoproteogenomic’ workflow to identify chemoproteomic-detected variants: genomic DNA or RNA extracted from cell lines undergoes sequencing, followed by variant calling using Platypus (v0.8.1) and GATK-Haplotype Caller (v4.1.8.1) for RNA and exomes, respectively, and predicted missense changes were computed. B Total numbers of missense mutations identified from either RNA-seq or WE-seq; stripe vs solid denotes common and rare variants. C Net amino acid changes for all cell lines combined. D Totals of gained and lost cysteine in each cell line separated by rare and common variants; dashed line indicates dMMR cell lines. E Net missense mutations (gained-lost) from dbSNP (4-23-18). F Non-synonymous changes are incorporated into reference protein sequences, and combinations of variants are generated for proteins with fewer than 25 variant sites to make customized FASTA databases. Details are in Methods. Data is found in Supplementary Data 3 and Supplementary Data 4, and source data is in the Source Data file.
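The database construction in panel F can be sketched as follows. This is a hedged illustration: the function names, the FASTA header format, and the `max_combo` cap on combination size are assumptions made for the example, and the paper's Methods define the actual combination rules. Only the cutoff of fewer than 25 variant sites comes from the caption.

```python
from itertools import combinations


def apply_variants(sequence, variants):
    """Apply (position, ref, alt) substitutions (1-based positions) to a protein sequence."""
    residues = list(sequence)
    for pos, ref, alt in variants:
        if residues[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}, found {residues[pos - 1]}")
        residues[pos - 1] = alt
    return "".join(residues)


def variant_proteoforms(protein_id, sequence, variants, max_sites=25, max_combo=2):
    """Single-variant entries always; combinations only below the variant-site cutoff."""
    n_sites = len({pos for pos, _, _ in variants})
    largest = max_combo if n_sites < max_sites else 1   # max_combo is an assumed cap
    entries = {}
    for k in range(1, largest + 1):
        for combo in combinations(sorted(variants), k):
            if len({pos for pos, _, _ in combo}) < k:
                continue  # skip combinations that hit the same site twice
            label = "_".join(f"{ref}{pos}{alt}" for pos, ref, alt in combo)
            entries[f"{protein_id}_{label}"] = apply_variants(sequence, combo)
    return entries


def to_fasta(entries):
    """Serialize {header: sequence} as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in entries.items())
```

Capping the combination size keeps the database from growing exponentially with the number of variant sites, which is the usual motivation for a site cutoff of this kind.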
Fig. 4. Variant peptide identification on tumor cell lines.
A Cell lysates were labeled with pan-reactive iodoacetamide alkyne (IAA) followed by ‘click’ conjugation onto biotin azide enrichment handle. Samples were prepared and acquired using our established SP3-FAIMS chemoproteomic platform: single-pot solid-phase sample preparation (SP3) cleanup, neutravidin enrichment, sequence-specific proteolysis, and LC-MS/MS analysis with a field asymmetric ion mobility (FAIMS) device. Experimental spectra are searched using a custom FASTA for variant identification. The sample set includes a reanalysis of previously reported datasets from Yan et al. (Molt-4, Jurkat, Hec-1B, HCT-15, H661, and H2122 cell lines) with newly acquired datasets (H1437, H358, Caco-2, Mia-PaCa-2, and MeWo cell lines). B Total numbers of unique missense variants identified from either RNA-seq or WE-seq or both after using two-stage MSFragger search and Philosopher validation from duplicate (n = 2) datasets; stripe vs solid denotes common and rare variants, black triangles represent replicate total counts, indicated is sequencing source and type of variant. C Overlap of identified cysteines from variant searches with cysteines in the CysDB database. D Net amino acid changes for all cell lines combined. E Example of cysteines identified from loss-of-arginine/lysine peptides. Data is found in Supplementary Data 4, and source data is in the Source Data file.
Fig. 5. Chemoproteogenomics identifies predicted deleterious sites.
A Scheme of CADD score analysis for two dMMR and non-dMMR cell lines. B Distribution of CADD scores for indicated variant grouping; statistical significance was calculated using a two-sided Mann-Whitney U test, ****p < 0.0001 (Common, p = 3.1e-16; Rare, p = 5e-46). C Empirical cumulative distributions (ECDF) were computed for CADD scores with indicated grouping; statistical significance was calculated using a two-sample Kolmogorov-Smirnov test, ****p < 0.0001 (Common, p = 1.7e-12; Rare, p = 6.4e-34). D CADD score distributions for gain-of-cysteine separated by grouping; statistical significance between gained cysteine values was calculated using a two-sample Kolmogorov-Smirnov test, ****p < 0.0001 (Common, p = 3.1e-6; Rare, p = 6.4e-10). Data is found in Supplementary Data 3, and source data is in the Source Data file.
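The ECDF comparison in panel C rests on the two-sample Kolmogorov-Smirnov statistic, the largest vertical gap between two empirical cumulative distributions. The sketch below computes only that statistic with the standard library; it does not compute a p-value, and the function names are illustrative rather than taken from the paper's analysis code.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F(x) giving the fraction of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum absolute difference between the ECDFs."""
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    # The ECDFs only change at observed values, so checking those points is sufficient.
    return max(abs(f_a(x) - f_b(x)) for x in set(sample_a) | set(sample_b))
```

In practice a statistics library would be used to obtain the p-values reported in the caption; this sketch only shows what the test measures.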
Fig. 6. Chemoproteogenomics identifies SAAVs proximal to likely functional sites.
A Scheme for chemoproteomics data search to identify variants from duplicates (n = 2). B Crystal structure of HMGB1 indicating detected Cys110 and nearby Cys106 (yellow) (PDB: 6CIL). C Proportion of variants belonging to the indicated sites; AS/BS = in or near active site/binding site in genomics data as annotated by UniProtKB or Phosphosite; statistical significance calculated using the two-sample test of proportions, ***p < 0.001, ****p < 0.0001, ns p > 0.05. D Chemoproteogenomic-identified variants in or near active and binding sites with CADD score, common/rare, and cell line dMMR/pMMR annotations. E Amino acid changes at protein methylation sites as identified by Phosphosite from genomics data. F Re-analysis of SP3-Rox oxidation state data in Jurkat cells (n = 6) acquired cysteines and 54 variants proximal to acquired cysteines. G Example of cysteines identified from loss-of-arginine/lysine peptides. H Schematic of highly variable HLA binding pocket containing cysteine with bound peptide. I Coverage of HLA cysteines from this study and in CysDB; color indicates HLA type or multi-mapped cysteines. J Crystal structure of HLA-B 14:02 (PDB: 3BXN) with highlighted Cys67 and Arg P2 position of bound peptide; alignments of Cys91 regions of three HLA-B alleles. K Workflow to visualize HLA cysteine labeling; first, cells were harvested and treated with IAA, followed by lysis, FLAG immunoprecipitation, and click onto rhodamine-azide. L Cys-dependent cell surface labeling of HLA-B alleles with IAA; the band is indicated with a red arrow and the non-specific band with an asterisk (representative of two biological replicates). Data is found in Supplementary Data 3 and Supplementary Data 4, and source data is in the Source Data file.
Fig. 7. Comparison of variants identified from cysteine enrichment and bulk proteomics.
A Workflow for high-pH fractionation of lysates. Cell lysates are treated with DTT and iodoacetamide followed by digestion, high-pH fractionation, and LC-MS/MS analysis. Triplicate high-pH sets (n = 3) for HCT-15 and Molt-4 cells were used. B Total numbers of unique missense variants identified from either RNA-seq or WE-seq or both after using a two-stage MSFragger search of high-pH datasets; black triangles represent replicate total counts. C Overlap of cysteine-containing peptide variants identified from bulk fractionation and cysteine enrichment datasets. D Fold enrichment of amino acids as a ratio of the net amino acid frequency (gain minus loss) to the amino acid frequency in all missense-containing proteins detected in high-pH and cys-enriched datasets. E High-pH detected variants stratified by CADD score and ClinVar clinical significance. F Peptide lengths of reference and variant peptides identified in dataset types; statistical significance using two-sample Kolmogorov-Smirnov tests, ****p < 0.0001. G DE-seq normalized transcript counts for all RNA variants ‘All’, variants detected from cys-enrichment ‘C’, and variants detected from high-pH fractionation ‘H’ in HCT-15 cells; bar indicates the mean value (All vs C, p = 7e-17; C vs H, p = 0.17; All vs H, p < 2e-16). H Label-free quantitation (LFQ) intensities for proteins matched to all RNA variants ‘All’, variants detected from cys-enrichment ‘C’, and variants detected from high-pH fractionation ‘H’ in HCT-15 cells; bar indicates the mean value (All vs C, p < 2e-16; C vs H, p = 0.19; All vs H, p < 2e-16). I Variant allele frequencies (VAF) (total reads/total coverage per site) for RNA-seq variants called in HCT-15 and Molt-4 cells (All vs C, p = 0.74; C vs H, p = 0.053; All vs H, p = 9e-5). For G-I, bar indicates the median; statistical significance was calculated using two-sample Kolmogorov-Smirnov tests, ****p < 0.0001, ns p > 0.05. Data is found in Supplementary Data 5, and source data is in the Source Data file.
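The fold enrichment described in panel D can be read in more than one way, so the sketch below fixes one interpretation as an explicit assumption: the net frequency of an amino acid is (gains minus losses) divided by the number of variants, and the background frequency is its share of all residues in the supplied protein sequences. The function name and input shapes are hypothetical.

```python
from collections import Counter


def fold_enrichment(variants, protein_sequences):
    """Ratio of net (gain minus loss) variant frequency to background residue frequency."""
    net = Counter()
    for ref, alt in variants:
        net[alt] += 1   # residue gained by the substitution
        net[ref] -= 1   # residue lost by the substitution
    background = Counter("".join(protein_sequences))
    total_residues = sum(background.values())
    n_variants = len(variants)
    return {
        aa: (net[aa] / n_variants) / (background[aa] / total_residues)
        for aa in net
        if background[aa]  # skip residues absent from the background
    }
```

Under this reading, a value above zero indicates net gain relative to how common the residue is, and a negative value indicates net loss.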
Fig. 8. Assessing ligandability of variant proximal cysteines and gain-of-cysteines.
A Schematic of activity-based screening of cysteine reactive compounds; cell lysates are labeled with compound or DMSO followed by chase with IAA and ‘click’ conjugation to our isotopically differentiated heavy and light biotin-azide reagents, tryptic digest, LC-MS/MS acquisition, and MSFragger analysis. B Chloroacetamide compound library. C Total quantified variants and total ligandable variants (log2 ratio > 2) identified, stratified by cell line (KB02 data) or compound (HCT-15 cell line). D Correlation of high-confidence variant-containing and reference cysteine ratio values from KB02 data. E Correlation of high-confidence variant-containing and reference cysteine ratio values from SO compound data. F Log2 heavy to light ratio values for variant-containing and reference cysteine peptides. G Subset of gain-of-cysteine peptide variant log2 ratios. Data is found in Supplementary Data 6, and source data is in the Source Data file.
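The ligandability call in panel C (log2 ratio > 2) can be sketched as a simple threshold on the heavy-to-light intensity ratio. The input layout, a mapping from peptide identifier to a (heavy, light) intensity pair, and the function name are assumptions for illustration; the threshold is the one stated in the caption.

```python
import math


def ligandable_peptides(intensities, threshold=2.0):
    """Return peptide IDs whose log2(heavy/light) ratio exceeds the threshold."""
    calls = []
    for peptide_id, (heavy, light) in intensities.items():
        if heavy <= 0 or light <= 0:
            continue  # a ratio is undefined without signal in both channels
        if math.log2(heavy / light) > threshold:
            calls.append(peptide_id)
    return calls
```

A log2 ratio above 2 corresponds to a more than fourfold difference between the two channels.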
Fig. 9. Functional studies of CAND1 and HMGB1.
A WT and G1069C mutant CAND1 proteins bind Cul1 while the G1069C CAND1 mutation perturbs binding. HEK293T cells were co-transfected with FLAG-Cul1 and the given HA-tagged CAND1 protein (WT, G1069C, or G1069W) or control FLAG-GFP. Anti-FLAG resin was used to pull down FLAG-Cul1 from cell lysates along with any complexed proteins. Western blots were incubated with the indicated primary antibodies; * indicates a non-specific HA band. B HMGB1 proteins were tested for the ability to induce a TLR4-mediated immune response using HEK-Blue reporter cell lines (hTLR4 and Null control) and the corresponding PRR assay. Results show mean response ratios (error bars = SD, n = 4 per condition) of hTLR4 and Null cells to increasing concentrations (μg/mL) of WT, R110C, and R110W proteins as indicated over 2 independent experiments. AT = commercially available all-thiol fully reduced HMGB1; diS = commercially available disulfide HMGB1; working concentration of 0.2 μg/mL for both. Reference lines on the graph indicate ECmax (solid line), EC50 (dashed line), and H2O control (dotted line) response ratios to canonical positive control ligand (LPS) specific to the hTLR4 cell line. Significance determined via unpaired two-tailed Student’s t test; ** = p < 0.01, *** = p < 0.001, **** = p < 0.0001. TLR4 200 ng/mL (WT vs R110C, p = 0.009; WT vs R110W, p < 0.0001); TLR4 600 ng/mL (WT vs R110C, p = 0.009; WT vs R110W, p < 0.0001). C Response ratio curve of hTLR4 and Null cells to positive control ligand (LPS). EC values are generated using nonlinear regression (Asymmetric (five parameters), X is concentration). For (A), western blot data are representative of three independent measurements. Data is found in Supplementary Data 6, and source data is in the Source Data file.
