Nature. 2024 Sep;633(8030):662-669.
doi: 10.1038/s41586-024-07935-7. Epub 2024 Sep 11.

Multi-pass, single-molecule nanopore reading of long protein strands


Keisuke Motone et al. Nature. 2024 Sep.

Abstract

The ability to sequence single protein molecules in their native, full-length form would enable a more comprehensive understanding of proteomic diversity. Current technologies, however, are limited in achieving this goal [1,2]. Here, we establish a method for the long-range, single-molecule reading of intact protein strands on a commercial nanopore sensor array. By using the ClpX unfoldase to ratchet proteins through a CsgG nanopore [3,4], we provide single-molecule evidence that ClpX translocates substrates in two-residue steps. This mechanism achieves sensitivity to single amino acids on synthetic protein strands hundreds of amino acids in length, enabling the sequencing of combinations of single-amino-acid substitutions and the mapping of post-translational modifications, such as phosphorylation. To enhance classification accuracy further, we demonstrate the ability to reread individual protein molecules multiple times, and we explore the potential for highly accurate protein barcode sequencing. Furthermore, we develop a biophysical model that can simulate raw nanopore signals a priori on the basis of residue volume and charge, enhancing the interpretation of raw signal data. Finally, we apply these methods to examine full-length, folded protein domains for complete end-to-end analysis. These results provide proof of concept for a platform that has the potential to identify and characterize full-length proteoforms at single-molecule resolution.

Conflict of interest statement

The University of Washington has filed provisional patent applications covering protein rereading (K.M. and J.N.) and sequence-to-signal simulation methods (D.K.-H., M.Q., and J.N.). J.N. is a consultant to Oxford Nanopore Technologies and holds share options in the company. The other authors declare no competing interests.

Figures

Fig. 1. Nanopore protein reading using an unfoldase.
a, Schematic of the cis-based unfoldase approach on the MinION platform. The roman numerals correspond to the ionic current states in b. b, Example trace of protein P1. Deep spikes in the capture state are hypothesized to be transient structural fluctuations of the Smt3 domain in the pore. State iii can be discerned from a transient drop in current when the ClpX solution is initially loaded into the flow cell. c, Ensemble traces for protein P1 (blue, n = 34) and mutants P2 (purple, n = 17), P3 (orange, n = 21) and P4 (red, n = 12). Protein sequences are oriented from C to N, with all mutation regions shown in colour.
Fig. 2. Characterizing single amino acid substitutions and ClpX stepping in PASTORs.
a, PASTOR sequence composition. b, Filtered nanopore current trace of PASTOR-HDKER. Colour boundaries are defined by automated YY segmentation. c, Top, an example PASTOR trace. The red boxes show the manually segmented YY dips. Bottom, the black horizontal lines denote the mean of individual steps. d, Distribution of the mean number of residues per step in each of the YY dips; n = 776 YY dips. e, Step dwell-time distribution. f, Average signal trace for the transformed VRs of each amino acid after Euclidean alignment of all the VRs equidistantly stretched to the same length. The VRs of a charged amino acid are shown as a dashed line (n of VRs and experiments are shown in Extended Data Table 1).
Fig. 3. Single-molecule nanopore sequencing of single amino acid mutations.
a, Pipeline for PASTOR VR classification with machine learning (ML) models. b, Heatmap showing test accuracies in discriminating between all pairs of amino acid VR mutations, averaged over five random forests (n for VRs and experiments is shown in Extended Data Table 1). c, Example sequencing traces in the test set, for PASTOR-HDKER. Transformed ionic current traces are plotted with a box around the variable regions defined by the YY segmenter. The colour intensity of the boxes represents the ranking of the true class in the aminocaller’s prediction for each VR. For the 5-way classification task (top box shading), the classes are the five mutations found in that protein, whereas the 20-way classification task (bottom box shading) considers all possible amino acid classes. In each box, the letter corresponds to the model’s top prediction. Darker shades denote a more-accurate prediction, indicating that the correct label ranked high in the model’s predictions.
Fig. 4. Rereading single protein molecules multiple times with an unfoldase slip sequence.
a, Working model of rereading. b, Top box shows example trace of PASTOR-reread showing three almost-complete reread events (blue trace). Our model’s predicted signal for the PASTOR-reread sequence (pink trace) was aligned to each reread. The fourth VR contains an asparagine mutation, but the corresponding signal level consistently resembles aspartate in all three instances of this PASTOR-reread trace. The modelled sequence was changed to contain an aspartate to reflect the putative PTM. Bottom box shows the approximate region of the strand that is in the nanopore over time.
Fig. 5. Single-molecule mapping of kinase phosphorylation activity.
a, Traces of PASTOR-phos, in which each section (C- and N-terminal linkers, VR V, VR A and VR GLSARRL) is aligned to the lowest DTW-distance phosphorylation state model (Supplementary Fig. 12). Phosphorylation is indicated by the letter ‘P’ in a red circle. YY dips (denoted by pale grey boxes) are aligned to the model of a YY dip; n for traces and experiments is provided in Extended Data Table 3. b, Ensemble traces of VR GLSARRL for each condition. c, Relative frequency for each condition of molecules best matching each proteoform (Supplementary Table 3). CKII conditions are stacked. The phosphorylation count of proteoforms is shown above the bars (proteoform ID1 contains no phosphorylations).
Fig. 6. Processive reading of folded protein domains.
a, Working model of ClpX-mediated processing of folded proteins. The roman numerals correspond to the ionic current states in b. b, Example trace of PASTOR-titin. c, Example traces of titin translocation (state vii), with black horizontal lines denoting the mean of individual putative ClpX steps, found with the Bayesian segmentation algorithm (Methods).
Extended Data Fig. 1. ClpX-mediated translocation.
a, Fraction of ClpX-mediated translocation events observed following capture events in the presence of no ATP (n = 230), 0.5 mM ATP + 0.5 mM ATPγS (n = 180), 4 mM ATP + 0.5 mM ATPγS (n = 27), or 4 mM ATP (n = 16). b, ClpX-mediated translocation time in the presence of 0.5 mM ATP + 0.5 mM ATPγS (n = 7), 4 mM ATP + 0.5 mM ATPγS (n = 9), or 4 mM ATP (n = 8). Error bars denote standard deviations.
Extended Data Fig. 2. Consistency of YY dips and VRs in PASTORs enables scaling of ionic current traces.
a, Mean transformed current levels of YY and VR PASTOR segments. Error bars denote standard deviation. n = 1828 for YY dips and 1525 for VRs. A total of 305 PASTOR traces were analyzed. b, Depiction of the process of scaling signals to the “transformed” current described in Methods.
Extended Data Fig. 3. ClpX stepping behavior.
a, Distribution of the proportion of time (out of the total duration of the signal) spent within the manually segmented YY-dip regions, for n = 305 traces. Mean is 0.318 and median is 0.319. This proportion was used to estimate ClpX’s step size (Methods). b, Number of steps for each of the YY dips without back steps using Bayesian-based YY-segmentation, n = 776. c–e, Stepping behavior statistics when calculated with the t-test segmentation method, n = 456. Note that this n is different from b, because the different segmentation algorithm found different putative backsteps, and so different dips were filtered out (Methods).
Extended Data Fig. 4. Variable regions in PASTOR.
a, Scatter plot of various features of the uncharged VRs, with error bars denoting one standard deviation, center point denoting the mean, and explanation of the features to the right. n of VRs, traces, and experiments for panels a–c are shown in Extended Data Table 1. b, Bar plot of the variance of the max value of the transformed VRs corresponding to each amino acid. c, t-SNE map showing clustering of the pairwise DTW distance between each amino acid, with all amino acids other than D, E, and C being colored by the volume, the negative amino acids colored black, and C being highlighted in orange. d, Plot of all the VRs corresponding to asparagine in normal conditions (left) and in conditions that catalyze the deamidation of asparagine to aspartate (right). Lines are colored teal if the max value of the transformed signal is < 1.3, and purple otherwise. n = 81 for normal conditions and 77 for deamidation conditions. e, t-SNE plot as in c, showing only asparagine and aspartate VRs. Asparagine VRs are colored blue if the max value of the transformed signal is ≥ 1.3, and green otherwise. Asparagine VRs form a distinct cluster from aspartate and putative deamidated asparagine VRs (PERMANOVA p < 1 × 10⁻⁶). Putative deamidated asparagine and aspartate are indistinguishable (PERMANOVA p = 0.8). f, Bar plot displaying mean percent of mutations that have been putatively deamidated or not (same threshold as in d, e) in VRs corresponding to asparagine, across technical replicates with n = 6, 4, 3, and 3 from left to right. Error bars denote standard deviations. g, Distance matrix of the DTW-distances between the aspartate, asparagine, and putative post-translationally modified asparagine to aspartate VRs shown in e.
h, Violin plots showing distribution of the maximum height of transformed VRs, in normal and deamidation catalyzing conditions, for asparagine (N, green), and aspartate (D, purple), and the three other amino acid substitutions in PASTOR-VGDNY (valine, glycine, and tyrosine, brown). Horizontal lines denote min, median, and max. n = 88, 77, 81, 77, 68, and 77 from left to right.
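Several panels here and in later figures compare traces by dynamic time warping (DTW) distance. The following is a minimal sketch of the textbook DTW recurrence, not the authors' implementation: the function name `dtw_distance`, the absolute-difference local cost, and the absence of any windowing or normalization are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Textbook DTW distance between two 1-D sequences.

    Assumption: the local cost is the absolute difference between samples;
    the paper's exact cost, constraints and normalization are not given here.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = smallest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Toy traces (hypothetical values): the second is a time-stretched copy.
print(dtw_distance([0, 1, 2], [0, 1, 1, 2]))
```

Because DTW allows one sample to match several samples in the other trace, two traces that differ only in dwell time at each level can have a distance of zero, which is why it suits signals produced by an enzyme with variable step durations.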
Extended Data Fig. 5. A biophysical model for simulating nanopore ionic current traces directly from protein sequence.
a–d, Description of model signal generation. a, A protein sequence to be modeled. b, Calculation of the volume and charge, scaled, for all amino acids in the window of size 20. c, Parabolic weighting of the values within a window. d, Plotting the value S for each window, by computing the dot product of the parabolic weight array and the window array, to create the full model signal. e, Comparison between an example nanopore ionic current signal of PASTOR-TWAFH (black line) and the modeled signal generated for the same protein sequence (pink line). Model signal shown with the time axis aligned to the experimental trace using DTW. f, Distributions of the DTW distances between the real (experimental) signal traces and the model signals of the same sequence (pink), or between the real signal traces and the model signals of 10,000 random sequences derived from the same amino acid distribution as the real sequence (orange). n of experimental traces ranges from 27 to 55.
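The windowed computation in panels b–d can be sketched as follows. This is a rough illustration under stated assumptions, not the published model: the caption does not say how volume and charge are scaled or combined, so the sketch takes one already-scaled value per residue as input, and the exact parabola (peaking at the window centre, normalized to sum to 1) is a guess. The names `parabolic_weights` and `model_signal` are hypothetical.

```python
def parabolic_weights(window=20):
    """Downward-opening parabola over the window, normalized to sum to 1.

    Assumption: the weights peak at the window centre and stay positive at
    the edges; the caption does not specify the parabola's parameters.
    """
    center = (window - 1) / 2
    half = window / 2
    raw = [1.0 - ((k - center) / half) ** 2 for k in range(window)]
    total = sum(raw)
    return [w / total for w in raw]


def model_signal(residue_values, window=20):
    """Slide the window along the sequence; each output S is the dot product
    of the weight array with the per-residue values inside the window."""
    weights = parabolic_weights(window)
    return [
        sum(w * v for w, v in zip(weights, residue_values[i:i + window]))
        for i in range(len(residue_values) - window + 1)
    ]


# Hypothetical per-residue values: a flat stretch with one bulky residue.
values = [1.0] * 30
values[15] = 3.0
signal = model_signal(values)
print(len(signal), max(signal))
```

With this weighting, a single unusual residue produces a smooth bump that rises and falls as the residue moves through the window, rather than a one-sample spike.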
Extended Data Fig. 6. Classification of single-amino acid mutations with a Random Forest model.
a, Heatmap of each pairwise classification accuracy by a Random Forest model evaluated on the fixed test set, as in main Fig. 3b, with values. b, The accuracy of 5-way classification of the H, D, K, E, and R VRs with various training sizes, to compare the quality of data with different buffer conditions. Each condition was trained and tested on 100 different Random Forest models, each trained on a random train-test split. The extra data was allocated to the testing set. The original buffer data is the data used in Figs. 2, 3, with n in Extended Data Table 1. Both conditions consist of 2 independent runs. The models’ performance was consistent across both standard and elevated salt conditions. c, Accuracy in a 20-way classification when “accuracy” is defined as the correct label being in the top-N most probable classes. The dummy classifier chooses one label at random. Results averaged over 20 models.
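The top-N accuracy used in panel c can be illustrated with a short sketch. The function name `top_n_accuracy`, the dictionary-based input format, and the toy probabilities are assumptions for illustration; they are not taken from the paper's code.

```python
def top_n_accuracy(predicted_probs, true_labels, n):
    """Fraction of examples whose true label is among the n classes the
    model considers most probable.

    predicted_probs: one dict per example mapping class label -> probability
    (a hypothetical input format chosen for this sketch).
    """
    hits = 0
    for probs, truth in zip(predicted_probs, true_labels):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth in ranked[:n]:
            hits += 1
    return hits / len(true_labels)


# Toy example with three classes (hypothetical probabilities).
preds = [
    {"D": 0.6, "E": 0.3, "N": 0.1},
    {"D": 0.5, "E": 0.1, "N": 0.4},
]
truths = ["E", "N"]
print(top_n_accuracy(preds, truths, 1))  # 0.0
print(top_n_accuracy(preds, truths, 2))  # 1.0
```

Top-1 accuracy is the usual classification accuracy; raising N credits the model whenever the correct amino acid is ranked near the top, which is informative when several residues give similar signals.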
Extended Data Fig. 7. Rereading with an unfoldase slip sequence and estimating its impact on barcode sequencing accuracy.
a, Estimated back-slipping distance for ClpX concentrations at 1000 nM (n = 141), 200 nM (n = 609), 40 nM (n = 777), and 8 nM (n = 999). The very first full-length read (Read 1) of each analyte protein molecule was excluded from this analysis. b and c, Number of all reads and full-length reads per PASTOR-reread molecule, respectively. The dotted lines indicate medians for ClpX concentrations at 1000 nM (n = 26), 200 nM (n = 37), 40 nM (n = 23), and 8 nM (n = 20). d, Simulated effect of rereading on 2 (Y, D), 4 (A, W, R, D), 7 (G, Q, W, F, R, D, E), 10 (A, G, V, N, Y, W, F, R, D, E), 14 (C, A, G, T, V, N, Q, M, Y, W, F, R, D, E), 17 (C, S, A, G, T, V, N, Q, M, I, Y, W, F, H, R, D, E), and 20-way (all 20 a.a.) classification tasks, compared to a baseline random classifier. Each value is the average over 100 train-test trials. e, Projected sequencing accuracy of barcode designs using the accuracies from d. Points of the same color represent different amounts of bits allocated to error correcting codes (see Methods).
Extended Data Fig. 8. Reading kinase phosphorylation activity on single protein molecules.
a, Maximum transformed signal value for each trace. Transparency of each scatter point is proportional to the n of traces for that condition. For each of the C-terminal linker, VR V, VR A, and N-terminal linker regions, the CKII incubation conditions’ maximum values were significantly higher than those of the blank and PKA incubation conditions (one-sided Mann-Whitney p < 10⁻⁸ for each, after Bonferroni correction). The GLSARRL region corresponds to the C-terminal third of the VR. The GLSARRL region’s maximum values are significantly higher in the PKA than in the blank incubation condition (one-way Mann-Whitney p = 5 × 10⁻³⁹, after Bonferroni correction) and the two CKII incubation conditions are significantly different from each of the two other conditions (Mann-Whitney p < 10⁻⁵ for each, after Bonferroni correction). b, Interquartile range (IQR) of the number of putative phosphorylations per molecule. No kinase incubation shows fewer phosphorylations than 1 hr incubation (one-way Mann-Whitney p = 4 × 10⁻¹⁶), and 1 hr incubation in CKII shows fewer phosphorylations than 26 hr incubation (one-way Mann-Whitney p = 5 × 10⁻⁶). Center line, box, whiskers, and diamonds represent median, IQR, 1.5 IQR, and outliers, respectively. c, Interquartile ranges of putative linker phosphorylations per molecule for each of the kinase incubation conditions, corresponding to single or double phosphorylations on a linker. Center line, box, whiskers, and diamonds represent median, interquartile range (IQR), 1.5 IQR, and outliers, respectively. CKII 26 hr incubation molecules have significantly more putative single phosphorylations than the CKII 1 hr incubation condition (one-sided, Bonferroni-corrected Mann-Whitney p = 0.002) and the blank and PKA incubation conditions (one-sided, Bonferroni-corrected Mann-Whitney p = 3 × 10⁻²⁸). CKII 26 hr incubation molecules also have significantly more putative double phosphorylations than the CKII 1 hr incubation condition (one-sided, Bonferroni-corrected Mann-Whitney p = 0.01) and the blank and PKA incubation conditions (one-sided, Bonferroni-corrected Mann-Whitney p = 3 × 10⁻⁵⁸). For all panels, n of traces and experiments are given in Extended Data Table 3. *P < 10⁻⁵.
Extended Data Fig. 9. Processive reading of folded protein domains.
a, Example trace of PASTOR-Titin, not zoomed into translocation state (open pore to open pore). Roman numerals correspond to states described in main Fig. 6a. b, Example trace of PASTOR-dTitin. c, Distribution of total unfolding time for Titin (n = 21) and dTitin (n = 14). d, Ensemble traces of state vii of PASTOR-Aβ15 (pink, n = 21), -Aβ42 (purple, n = 15), -Titin (red, n = 20), and -dTitin (orange, n = 12). Protein sequences are shown in the C-to-N direction, and asterisks represent the C47E and C63E mutations between Titin and dTitin. e, t-SNE plot based on pairwise DTW distances for state vii, showing Aβ15 and Aβ42 form a distinct cluster from Titin and dTitin (PERMANOVA p ≤ 1 × 10⁻⁶). Aβ15 vs Aβ42 and Titin vs dTitin states vii are indistinguishable (PERMANOVA p = 0.99 and 0.67, respectively). f, Distance matrix of the DTW-distances between the traces of Aβ15, Aβ42, Titin and dTitin, shown in e. g, Example trace of PASTOR-Aβ42. h, Relationship between protein length and translocation time. State vii dwell time is plotted for Aβ15, Aβ42, Titin, and dTitin, as well as translocation time for the 8 PASTORs with no folded domain insert (n = 672). The dotted line was fitted with the mean dwell times of each protein class (slope corresponds to a translocation rate of 16 ms/aa or 63 aa/sec, R² = 0.998). i, Distributions of the DTW distances of each of the protein translocations for folded domain proteins to model signal(s). In blue, they are compared to the model signal of the protein sequence, and in orange, they are each compared to the model of 10,000 random sequences derived from the same sequence distribution. The protein translocations include the regions corresponding to the folded domain translocation (state vii) and the N-terminal half of the PASTOR YY dips and VRs (state viii).
The signals corresponding to the C-terminal half of the PASTOR YY dips and VRs (state v) and the folded domain unfolding (state vi) are excluded from the analysis, because the model does not predict unfolding patterns (main Fig. 6b). n = 20, 12, 21, and 15 for PASTOR-Titin, PASTOR-dTitin, PASTOR-Aβ15, and PASTOR-Aβ42, respectively.
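The fit in panel h relates translocation time to protein length, and the slope of that line is the time per residue. As a minimal sketch of how such a slope converts to a rate, assuming an ordinary least-squares fit (the caption does not name the fitting method) and using made-up points rather than the paper's data:

```python
def fit_line(lengths_aa, dwell_ms):
    """Ordinary least-squares line through (length, dwell time) points.

    Returns (slope in ms per residue, intercept in ms). Using plain least
    squares is an assumption; the caption only says a line was fitted.
    """
    n = len(lengths_aa)
    mean_x = sum(lengths_aa) / n
    mean_y = sum(dwell_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths_aa)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths_aa, dwell_ms))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residues_per_second(ms_per_residue):
    """Convert a per-residue translocation time into a rate."""
    return 1000.0 / ms_per_residue


# Hypothetical points lying exactly on a 16 ms/aa line (not the paper's data).
slope, intercept = fit_line([100, 200, 400], [1605.0, 3205.0, 6405.0])
print(slope, residues_per_second(slope))
```

A slope of exactly 16 ms/aa corresponds to 62.5 aa/sec, so the reported pair of 16 ms/aa and 63 aa/sec is consistent once both figures are rounded.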
Extended Data Fig. 10. Quantification of ClpX-mediated protein translocations.
a, Yield of nanopore runs on the R9.4.1 flow cell. Quality open pore denotes a pore that was consistently at open pore current at the time of analyte loading. The R9.4.1 flow cell has a maximum of 512 pores available for measurement and the number of quality open pores used for measurement fluctuates depending on the flow cell condition. The initial run method with an analyte concentration of 500 nM (purple) includes PASTOR and PASTOR-phos data. The optimized run method with an analyte concentration of 5 nM (blue) includes PASTOR-HDKER data. The optimized run method with an analyte concentration of 500 nM (green) includes PASTOR-HDKER and PASTOR-phos data. n = 35, 3, and 3, for the original run (500 nM analyte), optimized run (5 nM analyte), and optimized run (500 nM analyte) conditions, respectively. The number of translocations per quality open pore was significantly higher (one-sided t-test p = 5 × 10⁻⁸) for the optimized run method (500 nM analyte) than for the original run method (500 nM analyte), and the other comparisons were non-significant (t-test p for original vs. 5 nM optimized and one-sided t-test p for 5 nM optimized vs. 500 nM optimized both > 0.05). Error bars denote standard deviations. b, SDS-PAGE analysis of purified PASTOR-HDKER protein. The protein band appears at a higher position on the gel than expected from the actual molecular weight of the protein (50.2 kDa) because of its high net negative charge. c, Bulk ClpXP degradation assay on purified PASTOR-HDKER protein. The substrate protein was incubated with an ATP regeneration mix and ClpP in the presence or absence of ClpX. d, Residual PASTOR substrate was quantified based on the peak area of the PASTOR-HDKER protein band normalized by the ClpP protein band on each lane using ImageJ software. Raw gels are shown in Supplementary Figs. 15 and 16.

References

    1. Chandramouli, K. & Qian, P.-Y. Proteomics: challenges, techniques and possibilities to overcome biological sample complexity. Hum. Genomics Proteomics 2009, 239204 (2009).
    2. Dupree, E. J. et al. A critical review of bottom-up proteomics: the good, the bad, and the future of this field. Proteomes 8, 14 (2020).
    3. Van der Verren, S. E. et al. A dual-constriction biological nanopore resolves homonucleotide sequences with high fidelity. Nat. Biotechnol. 38, 1415–1420 (2020).
    4. Dorey, A. & Howorka, S. Nanopore DNA sequencing technologies and their applications towards single-molecule proteomics. Nat. Chem. 16, 314–334 (2024).
    5. Smith, L. M., Kelleher, N. L. & the Consortium for Top Down Proteomics. Proteoform: a single term describing protein complexity. Nat. Methods 10, 186–187 (2013).