IUCrJ. 2021 Dec 1;9(Pt 1):86-97.
doi: 10.1107/S2052252521011088. eCollection 2022 Jan 1.

findMySequence: a neural-network-based approach for identification of unknown proteins in X-ray crystallography and cryo-EM

Grzegorz Chojnowski et al. IUCrJ.

Abstract

Although experimental protein-structure determination usually targets known proteins, chains of unknown sequence are often encountered. They can be purified from natural sources, appear as an unexpected fragment of a well characterized protein or appear as a contaminant. Regardless of the source of the problem, the unknown protein always requires characterization. Here, an automated pipeline is presented for the identification of protein sequences from cryo-EM reconstructions and crystallographic data. The method's application to characterize the crystal structure of an unknown protein purified from a snake venom is presented. It is also shown that the approach can be successfully applied to the identification of protein sequences and validation of sequence assignments in cryo-EM protein structures.

Keywords: SIMBAD; bioinformatics; cryo-EM; findMySequence; neural networks; protein sequences; protein structures; structure determination.

Figures

Figure 1
A schematic representation of the findMySequence usage workflow. Key steps are grouped in dashed rectangles: (a) structure solution and model building, (b) model interpretation, (c) sequence-database queries, and (d) sequence assignment and model building. All steps except model tracing (a) are integrated in the software and performed automatically.
Figure 2
Sequence-identification benchmarks for crystal structure models solved with MR using Phaser. Sequence identity of an identified sequence to the target sequence as a function of (a) the R-free factor value of an MR solution rebuilt using ARP/wARP without an input sequence and (b) the sequence identity of an MR search model to the target structure. The continuous and dashed curves are logistic regression estimates of the probability that an identified sequence will have at least 80% sequence identity to the target sequence.
Figure 3
Sequence-identification benchmarks for 909 cryo-EM models of ribosomal proteins. (a) Comparison of the method performances for an identification of models built de novo against small (proteomes) and large (PDB100) sequence databases. (b) Comparison of the method performances for models built de novo and those based on refined deposited coordinates. Histograms of the median local resolution of the test-set proteins are shown in grey (in arbitrary units). The continuous curves are logistic regression estimates of a probability that an identified sequence will have at least 80% sequence identity to the target sequence.
Figure 4
Sequence-identification and assignment benchmarks for EM models. (a) Identity of a sequence identified for models built de novo using ARP/wARP as a function of HMMsearch best-single-domain sequence-alignment score. (b) Identity of a sequence assigned to continuous fragments of deposited EM models as a function of the sequence-assignment score (p value) for protein-fragment lengths of 10, 50 and 100 residues selected at random from test-set models. The continuous curves on the plots are logistic regression estimates of a probability that an identified sequence will have at least 80% sequence identity to the reference model. The orange circles represent three reference chains with register error that were not used for the logistic regression calculations.
Figure 5
Fragment of an S21 protein model from E. coli 70S ribosome at 3.0 Å resolution (PDB ID/EMDB ID 5we4/8814). (a) In the deposited coordinates, many side chains outside a well resolved map and a proline inside a regular alpha helix may raise suspicion. (b) After sequence re-assignment and side-chain rebuilding with findMySequence, the map features are better explained by the model. Only residue range 34–44 in chain u and a corresponding map are shown for clarity.
Figure 6
Consecutive steps of crystal structure determination and sequence identification of a protein with hemolytic activity purified from B. atrox venom. (a) Initial MR solution after 30 REFMAC5 refinement cycles with jelly-body restraints. The same fragment in (b) the ARP/wARP model re-traced without an input sequence and used as an input for findMySequence, and in (c) the final model. The 2Fo-Fc maps are contoured at the 1σ level above the mean. The free atoms used for sparse electron-density map representation in ARP/wARP are shown as grey spheres. Water molecules are shown as red spheres.
Figure 7
A model of the Voa1 assembly factor and a corresponding cryo-EM reconstruction at 3.5 Å resolution (PDB and EMDB IDs 6c6l and 7348, respectively) (Roh et al., 2018). Only a residue range of 217–247 is shown for clarity.
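
The captions for Figures 2 to 4 describe curves that are logistic regression estimates of the probability that an identified sequence has at least 80% identity to the target. The following is a minimal sketch of how such a curve could be fitted, not the authors' implementation. The data points, the function names `fit_logistic` and `predict`, and the learning-rate settings are all hypothetical and chosen only for illustration.

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1 | x) = 1 / (1 + exp(-(a + b*x))) by gradient ascent
    on the mean log-likelihood. Returns the intercept a and slope b."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p          # derivative w.r.t. intercept
            grad_b += (y - p) * x    # derivative w.r.t. slope
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def predict(a, b, x):
    """Estimated probability of success at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Hypothetical toy data: (predictor score, sequence identity in %).
# These values are invented for illustration, not taken from the paper.
toy = [(0.5, 20), (1.0, 35), (1.5, 30), (2.0, 85),
       (2.5, 60), (3.0, 90), (3.5, 95), (4.0, 100)]
scores = [s for s, _ in toy]
# Binary outcome: 1 if the identified sequence reaches 80% identity.
labels = [1 if identity >= 80 else 0 for _, identity in toy]

a, b = fit_logistic(scores, labels)
for s in (1.0, 2.0, 3.0, 4.0):
    print(f"score={s:.1f}  P(identity >= 80%) = {predict(a, b, s):.2f}")
```

The binary outcome is obtained by thresholding the sequence identity at 80%, which matches the success criterion stated in the captions. The predictor on the x axis depends on the figure (for example an R-free value, a local resolution or an alignment score).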

References

    1. Abergel, C. (2013). Acta Cryst. D69, 2167–2173.
    1. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Nucleic Acids Res. 25, 3389–3402.
    1. Amazonas, D. R., Portes-Junior, J. A., Nishiyama-Jr, M. Y., Nicolau, C. A., Chalkidis, H. M., Mourão, R. H. V., Grazziotin, F. G., Rokyta, D. R., Gibbs, H. L., Valente, R. H., Junqueira-de-Azevedo, I. L. M. & Moura-da-Silva, A. M. (2018). J. Proteomics, 181, 60–72.
    1. Battye, T. G. G., Kontogiannis, L., Johnson, O., Powell, H. R. & Leslie, A. G. W. (2011). Acta Cryst. D67, 271–281.
    1. Beckham, K. S. H., Ritter, C., Chojnowski, G., Ziemianowicz, D. S., Mullapudi, E., Rettel, M., Savitski, M. M., Mortensen, S. A., Kosinski, J. & Wilmanns, M. (2021). Sci. Adv. 7, eabg9923.