Protein identification using Cryo-EM and artificial intelligence guides improved sample purification

Kenneth D Carr¹,², Dane Evan D Zambrano¹,², Connor Weidle¹,², Alex Goodson¹,², Helen E Eisenach¹,², Harley Pyles¹,², Alexis Courbet¹,², Neil P King¹,², Andrew J Borst¹,²

Affiliations

¹ Department of Biochemistry, University of Washington, Seattle, WA 98195, USA.
² Institute for Protein Design, University of Washington, Seattle, WA 98195, USA.

PMID: 39958810
PMCID: PMC11830286
DOI: 10.1016/j.yjsbx.2025.100120

Kenneth D Carr et al. J Struct Biol X. 2025 Jan 21:11:100120. doi: 10.1016/j.yjsbx.2025.100120. eCollection 2025 Jun.

Abstract

Protein purification is essential in protein biochemistry, structural biology, and protein design, enabling the determination of protein structures, the study of biological mechanisms, and the characterization of both natural and de novo designed proteins. However, standard purification strategies often encounter challenges, such as unintended co-purification of contaminants alongside the target protein. This issue is particularly problematic for self-assembling protein nanomaterials, where unexpected geometries may reflect novel assembly states, cross-contamination, or native proteins originating from the expression host. Here, we used an automated structure-to-sequence pipeline to first identify an unknown co-purifying protein found in several purified designed protein samples. By integrating cryo-electron microscopy (Cryo-EM), ModelAngelo's sequence-agnostic model-building, and Protein BLAST, we identified the contaminant as dihydrolipoamide succinyltransferase (DLST). This identification was validated through comparisons with DLST structures in the Protein Data Bank, AlphaFold 3 predictions based on the DLST sequence from our E. coli expression vector, and traditional biochemical methods. The identification informed subsequent modifications to our purification protocol, which successfully excluded DLST from future preparations. To explore the potential broader utility of this approach, we benchmarked four computational methods for DLST identification across varying resolution ranges. This study demonstrates the successful application of a structure-to-sequence protein identification workflow, integrating Cryo-EM, ModelAngelo, Protein BLAST, and AlphaFold 3 predictions, to identify and ultimately help guide the removal of DLST from sample purification efforts. It highlights the potential of combining Cryo-EM with AI-driven tools for accurate protein identification and addressing purification challenges across diverse contexts in protein science.

Keywords: AlphaFold 3; Automated Model Building; Contamination; Cryo-EM; Cryo-Electron Microscopy; DLST; Dihydrolipoamide Succinyltransferase; Dihydrolipoyllysine-residue succinyltransferase; E. coli; Hmmsearch; ModelAngelo; Multiple Sequence Alignment; PDB; Protein BLAST; Protein Data Bank; Protein Purification; Structure Prediction; TCA Cycle; Tricarboxylic Acid Cycle; Western Blot.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1**
Characterization of an unknown co-eluting protein via electron microscopy. (A) SDS-PAGE of the IMAC eluate for the two designed protein components prior to mixing for assembly into the target designed octahedral nanoparticle. There is a clearly defined band in lane 1 for one of the two components of the designed nanoparticle. The second component, shown in lane 2, is present in lower amounts and is less pure. (B) SEC trace of the assembled nanoparticle. The peak corresponding to the nanoparticle is highlighted in stripes to represent the co-elution of the contaminant with the on-target nanoparticle. (C) DLS trace of the SEC-purified sample, highlighted in stripes to indicate that the contaminant and on-target nanoparticle diameters are not efficiently resolved by DLS. The buffer (dotted line) was run as a control and contains detergent micelles found at 6.30 nm in diameter. (D) A portion of a ns-EM micrograph of the heterogeneous sample. The pink box represents the designed nanoparticle and the blue box represents the contaminant nanoparticle. (E) 2D ns-EM class averages of the on-target and contaminant nanoparticle species. A total of 1,131 on-target nanoparticles were processed and are represented here by two 2D class averages. A total of 30,927 particles were processed for the contaminant species and are represented here by six 2D class averages. Particle numbers reflect all particles of each species in the dataset, not only those of the displayed classes. (F) 3D ns-EM map along the 2-, 3-, and 4-fold axes of symmetry of the contaminant nanoparticle generated using CryoSPARC v4.4. (G) 2D Cryo-EM class averages and (H) 3D Cryo-EM reconstruction viewed along the 2-, 3-, and 4-fold axes of symmetry of the contaminant nanoparticle using CryoSPARC v4.5. (Blue = contaminant protein; Pink = on-target two-component nanoparticle). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

**Fig. 2**
Structure-to-sequence workflow for the unambiguous identification of DLST. (A) An overview of the 10-step structure-to-sequence workflow used to identify DLST and build the atomic model used for structural analysis. (B) The sequence-agnostic ModelAngelo output using our 2.51 Å Cryo-EM map. Chains of 1 to 10 residues (pink), 11 to 100 residues (yellow), and longer than 100 residues (blue) are displayed in sphere view along the octahedral 3-fold axis of symmetry. An accompanying pie chart displays the percentage of all residues belonging to chains in each length range. (C-D) Published DLST crystal structure PDB:1SCZ (Schormann et al., 2004) (C) and AlphaFold 3 prediction model of DLST UniProt sequence A0A140NDX4 (Abramson et al., 2024) (D), each docked into the Cryo-EM density map of the unknown co-eluting protein. (E-F) A single subunit of our Cryo-EM model aligned to 1SCZ (RMSD 0.512 Å) (E) and the AlphaFold 3 model (RMSD 0.518 Å) (F). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
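The chain-length breakdown described for panel (B) can be illustrated with a short Python sketch. This is not the authors' code: the function name `residue_share_by_chain_length` and the example chain lengths are hypothetical, and only the three length ranges (1 to 10, 11 to 100, and longer than 100 residues) come from the legend.

```python
def residue_share_by_chain_length(chain_lengths):
    """Return the percentage of all residues that fall in chains of each length range.

    `chain_lengths` is a list with one integer per built chain (residue count).
    The three ranges follow the figure legend; everything else is illustrative.
    """
    residues_per_bin = {"1-10": 0, "11-100": 0, ">100": 0}
    for length in chain_lengths:
        if length < 1:
            raise ValueError("chain length must be a positive integer")
        if length <= 10:
            residues_per_bin["1-10"] += length
        elif length <= 100:
            residues_per_bin["11-100"] += length
        else:
            residues_per_bin[">100"] += length

    total = sum(residues_per_bin.values())
    if total == 0:
        raise ValueError("no residues supplied")
    # Percentages are weighted by residue count, not by number of chains.
    return {name: 100.0 * count / total for name, count in residues_per_bin.items()}


if __name__ == "__main__":
    # Hypothetical chain lengths, not data from the paper.
    print(residue_share_by_chain_length([5, 5, 40, 150]))
```

Weighting by residues rather than by chain count means that a few long, well-built chains dominate the chart even when many short fragments are present.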

**Fig. 3**
Modifications to the protein purification protocol result in increased sample purity. (A) Representative ns-EM micrograph utilizing the optimized purification protocol. (B) Corresponding 2D ns-EM class averages of the on-target nanoparticle and DLST, with corresponding particle numbers listed for each. (C) Pie charts showing the relative abundance of the on-target nanoparticle and DLST as processed using the original purification protocol and the improved purification protocol. (Blue = contaminant protein, DLST; Pink = on-target two-component nanoparticle). (D) Cropped anti-DLST Western Blot. IMAC soluble (S), flowthrough (FT), wash (W) and elution (E) fractions were run on SDS-PAGE followed by Western Blot. The LiCor Chameleon® 700 Pre-stained Protein Ladder (L) was used. Blue labeled arrows indicate the molecular weight of a single subunit of DLST. A full annotated Western Blot as well as the accompanying SDS-PAGE gel can be referenced in Sup. Fig. 4 and Sup. Table S4. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

**Fig. 4**
Impact of Cryo-EM resolution and search method on accurate DLST identification. The final Cryo-EM map was low-pass filtered to 3.00 Å, 8.00 Å, 9.00 Å, and each 0.25 Å interval between 4.00 Å and 7.00 Å. Each map was used as the input for the computational steps of our workflow, comparing the efficacy of three identification approaches: Protein BLAST using a consensus sequence, Protein BLAST using the longest 10 chains, and hmmsearch using HMM profiles generated by ModelAngelo for the longest 10 chains. Chains with 1–10 residues are shown in pink, 11–100 residues in yellow, and chains longer than 100 residues in blue. (A) Low-pass filtered Cryo-EM maps and models between 4.00 Å and 6.75 Å. Additional maps and models spanning a broader resolution range can be found in Sup. Fig. S6. (B) Line graph illustrating the percentage of residues organized by chain length for each generated model. Raw data used for the line graph can be found in Sup. Table S5. (C) Line graph comparing the efficacy of the four identification methods across resolutions. Each data point represents the average percent identification accuracy for DLST using the specified method, with error bars representing the 95% confidence interval for methods that had more than one data point per resolution. Asterisks (*) mark the lowest-resolution data that returned any results for that search method. A dashed line marks the range between 6.50 Å and 7.00 Å on the hmmsearch (ModelAngelo) data to indicate that no results were returned for that method at 6.75 Å; consequently, 6.50 Å is the lowest resolution that achieved a non-zero score for that method. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
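As a rough illustration of the benchmark design in this legend, the Python sketch below enumerates the low-pass filter targets and summarizes per-chain scores as a mean with a 95% confidence interval. It is a sketch under stated assumptions: the helper names are hypothetical, the example scores are invented, and the normal approximation for the interval is an assumption because the legend does not say how the interval was calculated.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def lowpass_resolutions():
    """Resolutions (in Å) listed in the legend: 3.00, 4.00-7.00 in 0.25 steps, 8.00, 9.00."""
    fine_steps = [4.0 + 0.25 * i for i in range(13)]  # 4.00, 4.25, ..., 7.00
    return [3.0] + fine_steps + [8.0, 9.0]


def mean_with_ci95(scores):
    """Return (mean, half_width) for a list of percent-accuracy scores.

    Assumption: a normal-approximation interval (z of about 1.96). With a single
    data point no interval is defined, so half_width is None, mirroring the
    legend's note that error bars apply only with more than one data point.
    """
    if not scores:
        raise ValueError("no scores supplied")
    if len(scores) == 1:
        return scores[0], None
    z = NormalDist().inv_cdf(0.975)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return mean(scores), half_width


if __name__ == "__main__":
    print(lowpass_resolutions())
    # Invented scores for illustration only.
    print(mean_with_ci95([90.0, 100.0]))
```

For small samples such as ten chains per resolution, a Student's t interval would be wider than this normal approximation, so the choice matters when reading error bars.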

References

1. Abramson J., Adler J., Dunger J., Evans R., Green T., Pritzel A., Ronneberger O., Willmore L., Ballard A.J., Bambrick J., Bodenstein S.W., Evans D.A., Hung C.-C., O’Neill M., Reiman D., Tunyasuvunakool K., Wu Z., Žemgulytė A., Arvaniti E., Beattie C., Bertolli O., Bridgland A., Cherepanov A., Congreve M., Cowen-Rivers A.I., Cowie A., Figurnov M., Fuchs F.B., Gladman H., Jain R., Khan Y.A., Low C.M.R., Perlin K., Potapenko A., Savy P., Singh S., Stecula A., Thillaisundaram A., Tong C., Yakneen S., Zhong E.D., Zielinski M., Žídek A., Bapst V., Kohli P., Jaderberg M., Hassabis D., Jumper J.M. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. doi: 10.1038/s41586-024-07487-w.
2. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
3. Andi B., Soares A.S., Shi W., Fuchs M.R., McSweeney S., Liu Q. Structure of the dihydrolipoamide succinyltransferase catalytic domain from Escherichia coli in a novel crystal form: a tale of a common protein crystallization contaminant. Acta Crystallogr. F Struct. Biol. Commun. 2019;75:616–624. doi: 10.1107/S2053230X19011488.
4. Arakawa T., Ejima D., Tsumoto K., Obeyama N., Tanaka Y., Kita Y., Timasheff S.N. Suppression of protein interactions by arginine: a proposed mechanism of the arginine effects. Biophys. Chem. 2007;127:1–8. doi: 10.1016/j.bpc.2006.12.007.
5. Bale J.B., Gonen S., Liu Y., Sheffler W., Ellis D., Thomas C., Cascio D., Yeates T.O., Gonen T., King N.P., Baker D. Accurate design of megadalton-scale two-component icosahedral protein complexes. Science. 2016;353:389–394. doi: 10.1126/science.aaf8818.

Grants and funding

R01 GM129325/GM/NIGMS NIH HHS/United States
