findMySequence: a neural-network-based approach for identification of unknown proteins in X-ray crystallography and cryo-EM

Grzegorz Chojnowski¹, Adam J Simpkin², Diego A Leonardo³, Wolfram Seifert-Davila⁴, Dan E Vivas-Ruiz⁵, Ronan M Keegan⁶, Daniel J Rigden²

Affiliations

¹ European Molecular Biology Laboratory, Hamburg Unit, Notkestrasse 85, 22607 Hamburg, Germany.
² Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom.
³ São Carlos Institute of Physics, University of São Paulo, Avenida João Dagnone 1100, São Carlos, SP 13563-120, Brazil.
⁴ European Molecular Biology Laboratory, Meyerhofstraße 1, 69117 Heidelberg, Germany.
⁵ Laboratorio de Biología Molecular, Facultad de Ciencias Biológicas, Universidad Nacional Mayor de San Marcos, Avenida Venezuela Cdra 34 S/N, Ciudad Universitaria, Lima, Peru.
⁶ Rutherford Appleton Laboratory, Research Complex at Harwell, UKRI-STFC, Didcot OX11 0FA, United Kingdom.

PMID: 35059213
PMCID: PMC8733886
DOI: 10.1107/S2052252521011088

IUCrJ. 2021 Dec 1;9(Pt 1):86-97. doi: 10.1107/S2052252521011088. eCollection 2022 Jan 1.


Abstract

Although experimental protein-structure determination usually targets known proteins, chains of unknown sequence are often encountered. They can be purified from natural sources, appear as an unexpected fragment of a well characterized protein or appear as a contaminant. Regardless of the source of the problem, the unknown protein always requires characterization. Here, an automated pipeline is presented for the identification of protein sequences from cryo-EM reconstructions and crystallographic data. The method's application to characterize the crystal structure of an unknown protein purified from a snake venom is presented. It is also shown that the approach can be successfully applied to the identification of protein sequences and validation of sequence assignments in cryo-EM protein structures.

Keywords: SIMBAD; bioinformatics; cryo-EM; findMySequence; neural networks; protein sequences; protein structures; structure determination.
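
The identification step described in the abstract can be illustrated with a toy example. The sketch below is not the findMySequence implementation: it assumes, purely for illustration, that a classifier has already produced a residue-type probability for every position of a traced fragment, and it ranks candidate sequences by a simple log-odds score. The names `peaked_profile`, `log_odds_score` and `best_match`, the two candidate sequences and the 0.6 peak probability are all hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background (an assumption)


def peaked_profile(fragment, peak=0.6):
    """Build a mock per-position probability profile centred on `fragment`.

    Stands in for classifier output; real probabilities would come from the map.
    """
    rest = (1.0 - peak) / (len(AMINO_ACIDS) - 1)
    return [{aa: (peak if aa == res else rest) for aa in AMINO_ACIDS}
            for res in fragment]


def log_odds_score(profile, sequence, offset):
    """Sum log(p / background) for the window of `sequence` starting at `offset`."""
    return sum(math.log(probs[sequence[offset + i]] / BACKGROUND)
               for i, probs in enumerate(profile))


def best_match(profile, database):
    """Return (name, offset, score) of the best ungapped window in `database`."""
    best = None
    for name, seq in database.items():
        for offset in range(len(seq) - len(profile) + 1):
            score = log_odds_score(profile, seq, offset)
            if best is None or score > best[2]:
                best = (name, offset, score)
    return best


# Hypothetical two-entry "database" and a fragment taken from candidate_A.
database = {
    "candidate_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "candidate_B": "GSHMLEDPVAGWTNQKLC",
}
profile = peaked_profile("AKQRQISF")
name, offset, score = best_match(profile, database)
print(name, offset, round(score, 2))
```

An ungapped window scan keeps the example short. The Figure 4 caption indicates that the published pipeline relies on *HMMsearch* alignment scores, which also handle gaps and large databases.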

Figures

**Figure 1**
A schematic representation of the *findMySequence* usage workflow. Key steps are grouped in dashed rectangles: (a) structure solution and model building, (b) model interpretation, (c) sequence-database queries, and (d) sequence assignment and model building. All steps except model tracing (a) are integrated in the software and performed automatically.

**Figure 2**
Sequence-identification benchmarks for crystal structure models solved with MR using *Phaser*. Sequence identity of an identified sequence to the target sequence as a function of (a) the R_free value of an MR solution rebuilt using *ARP/wARP* without an input sequence and (b) the sequence identity of the MR search model to the target structure. The continuous and dashed curves are logistic regression estimates of the probability that an identified sequence will have at least 80% sequence identity to the target sequence.

**Figure 3**
Sequence-identification benchmarks for 909 cryo-EM models of ribosomal proteins. (a) Comparison of the method's performance in identifying models built *de novo* against small (proteomes) and large (PDB100) sequence databases. (b) Comparison of the method's performance for models built *de novo* and those based on refined deposited coordinates. Histograms of the median local resolution of the test-set proteins are shown in grey (in arbitrary units). The continuous curves are logistic regression estimates of the probability that an identified sequence will have at least 80% sequence identity to the target sequence.

**Figure 4**
Sequence-identification and assignment benchmarks for EM models. (a) Identity of a sequence identified for models built *de novo* using *ARP/wARP* as a function of the *HMMsearch* best-single-domain sequence-alignment score. (b) Identity of a sequence assigned to continuous fragments of deposited EM models as a function of the sequence-assignment score (p value) for protein-fragment lengths of 10, 50 and 100 residues selected at random from test-set models. The continuous curves on the plots are logistic regression estimates of the probability that an identified sequence will have at least 80% sequence identity to the reference model. The orange circles represent three reference chains with register errors that were not used for the logistic regression calculations.

**Figure 5**
Fragment of an S21 protein model from *E. coli* 70S ribosome at 3.0 Å resolution (PDB ID/EMDB ID 5we4/8814). (a) In the deposited coordinates, many side chains outside a well resolved map and a proline inside a regular alpha helix may raise suspicion. (b) After sequence re-assignment and side-chain rebuilding with *findMySequence*, the map features are better explained by the model. Only residue range 34–44 in chain u and a corresponding map are shown for clarity.

**Figure 6**
Consecutive steps of crystal structure determination and sequence identification of a protein with hemolytic activity purified from *B. atrox* venom. (a) Initial MR solution after 30 *REFMAC5* refinement cycles with jelly-body restraints. The same fragment in (b) the *ARP/wARP* model re-traced without an input sequence and used as an input for *findMySequence*, and in (c) the final model. The 2F_o − F_c maps are contoured at the 1σ level above the mean. The free atoms used for sparse electron-density map representation in *ARP/wARP* are shown as grey spheres. Water molecules are shown as red spheres.

**Figure 7**
A model of the Voa1 assembly factor and a corresponding cryo-EM reconstruction at 3.5 Å resolution (PDB and EMDB IDs 6c6l and 7348, respectively) (Roh *et al.*, 2018). Only a residue range of 217–247 is shown for clarity.
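
The captions of Figures 2 to 4 describe logistic regression estimates of the probability that an identified sequence has at least 80% identity to the target. A minimal sketch of such a fit is given below, assuming a single predictor (here labelled as an R_free value) and a plain gradient-ascent optimiser. The data points are synthetic and purely illustrative, and the function names are hypothetical. The source does not state how the published curves were fitted.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=1.0, n_iter=5000):
    """Fit P(y=1 | x) = sigmoid(a + b*x) by gradient ascent on the mean log-likelihood."""
    mean_x = sum(xs) / len(xs)
    centred = [x - mean_x for x in xs]  # centring keeps the two updates well conditioned
    a = b = 0.0
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for xc, y in zip(centred, ys):
            err = y - sigmoid(a + b * xc)
            grad_a += err
            grad_b += err * xc
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)
    return a - b * mean_x, b  # intercept and slope on the original x scale


# Synthetic, illustrative data only (not taken from the paper).
r_free = [0.25, 0.28, 0.30, 0.33, 0.36, 0.40, 0.42, 0.45, 0.50, 0.55]
identity = [98, 95, 91, 85, 60, 88, 35, 20, 15, 10]  # % identity to the target
success = [1 if pct >= 80 else 0 for pct in identity]  # the 80% threshold from the captions

intercept, slope = fit_logistic(r_free, success)


def p_success(x: float) -> float:
    """Estimated probability of reaching at least 80% identity at predictor value x."""
    return sigmoid(intercept + slope * x)


for x in (0.25, 0.40, 0.55):
    print(f"R_free = {x:.2f}: P(identity >= 80%) ~ {p_success(x):.2f}")
```

Binarising the outcome at the 80% threshold is what turns the scatter of identities into a success probability curve. In practice a library routine such as one from scikit-learn or statsmodels would normally replace the hand-written optimiser.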

