Sci Data. 2024 Jan 4;11(1):30.

doi: 10.1038/s41597-023-02879-5.

The 100-protein NMR spectra dataset: A resource for biomolecular NMR data analysis

Affiliations

¹ Institute of Molecular Physical Science, ETH Zurich, 8093, Zurich, Switzerland. piotr.klukowski@phys.chem.ethz.ch.
² Institute of Biochemistry, ETH Zurich, 8093, Zurich, Switzerland.
³ Institute of Biotechnology, University of Helsinki, 00100, Helsinki, Finland.
⁴ Institute of Molecular Physical Science, ETH Zurich, 8093, Zurich, Switzerland.
⁵ Department of Chemistry and Chemical Biology, and Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY, 12180, USA.
⁶ Institute of Molecular Physical Science, ETH Zurich, 8093, Zurich, Switzerland. roland.riek@phys.chem.ethz.ch.
⁷ Institute of Molecular Physical Science, ETH Zurich, 8093, Zurich, Switzerland. peter.guentert@phys.chem.ethz.ch.
⁸ Institute of Biophysical Chemistry, Goethe University, 60438, Frankfurt am Main, Germany. peter.guentert@phys.chem.ethz.ch.
⁹ Department of Chemistry, Tokyo Metropolitan University, Hachioji, 192-0397, Tokyo, Japan. peter.guentert@phys.chem.ethz.ch.

PMID: 38177162
PMCID: PMC10767026
DOI: 10.1038/s41597-023-02879-5

Piotr Klukowski et al. Sci Data. 2024.

Abstract

Multidimensional NMR spectra are the basis for studying proteins by NMR spectroscopy and are crucial for developing and evaluating methods for biomolecular NMR data analysis. Nevertheless, in contrast to derived data such as chemical shift assignments in the BMRB and protein structures in the PDB, these primary data are in general not publicly archived. To change this unsatisfactory situation, we present a standardized set of solution NMR data comprising 1329 two- to four-dimensional (2D–4D) NMR spectra together with associated reference annotations (chemical shift assignments, structures) and derived annotations (peak lists, restraints for structure calculation, etc.). With the 100-protein NMR spectra dataset, which was originally compiled for the development of the ARTINA deep learning-based spectra analysis method, 100 protein structures can be reproduced from their original experimental data. The dataset is expected to help the development of computational methods for NMR spectroscopy, in particular machine learning approaches, and to enable consistent and objective comparisons of these methods.

Conflict of interest statement

G.T.M. is a founder of Nexomics Biosciences, Inc. This does not represent a conflict of interest for this study. The other authors declare no competing interests.

Figures

**Fig. 1**
NMR spectra analysis workflow associated with the Dataset. Each protein record contains the protein sequence and a set of 2D–4D spectra, which undergo visual spectrum analysis (peak picking), yielding the coordinates of signals in the NMR spectra. Subsequently, identified signals are assigned to atoms in the protein sequence (chemical shift assignment). Assignments can then be used, for instance, to obtain interatomic distance restraints and to determine the three-dimensional protein structure. The Dataset documents all steps of this analysis for 100 proteins with (i) experimental, (ii) experimentally derived, and (iii) in-silico data, as indicated in the diagram.

**Fig. 2**
Overview of the Dataset comprising 100 proteins and 1329 spectra. Proteins are ordered by sequence length and identified by their PDB code or abbreviated name if not deposited in the PDB (left). The left panel shows the spectra available for each protein. Where multiple spectra are available for a given spectrum type, they typically have been acquired by separate measurement of aliphatic and aromatic ¹³C nuclei or with H₂O and D₂O solvent. The middle panel shows the completeness of the chemical shift assignments deposited in the BMRB. The right panel shows secondary structure elements and well-defined regions plotted versus the residue number. Well-defined regions, which are used for all RMSD calculations, were determined by CYRANGE.

**Fig. 3**
Statistics for data records in the Dataset. **(a)** Distribution of experiment types, spectrum dimensionality (2D–4D), and spectrometer frequency. **(b)** Distribution of number of data points and chemical shift ranges (ppm) across different dimensions for common 3D experiment types in the Dataset (Supplementary Table 4). For each spectrum type, the bottom row features histograms that represent the number of spectra with the specified number of data points in the given dimension, as indicated by the dimension label in the lower left corner. The upper row provides information about the chemical shift range in the spectrum file and the distribution of expected peaks in each dimension. The red line gives the number of spectra for which a given chemical shift value falls within the experimental spectral width in the given dimension. Similarly, the green line represents the number of spectra for which a given chemical shift value coincides (within tolerance) with at least one expected peak position based on the (unfolded) chemical shift assignments from the BMRB. Where the green line exceeds the red line, it indicates that folding is typically applied along that spectral axis.
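
The counting behind the red and green lines described in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the dictionary layout (`lo`, `hi`, `expected`), the function names, and the default tolerance are assumptions made for illustration, and folding is modeled as simple circular aliasing by whole spectral widths, ignoring any sign changes that depend on the acquisition scheme.

```python
def fold_into_window(shift_ppm, lo_ppm, hi_ppm):
    """Alias a shift into [lo_ppm, hi_ppm) by whole spectral widths (simplified model)."""
    width = hi_ppm - lo_ppm
    return lo_ppm + (shift_ppm - lo_ppm) % width


def coverage_counts(grid_ppm, spectra, tol_ppm=0.1):
    """For each grid value, count spectra (a) whose spectral window contains it
    ("red line") and (b) that have an expected, unfolded peak within tol_ppm of it
    ("green line"). The tolerance value is an assumption, not taken from the paper."""
    counts = []
    for x in grid_ppm:
        in_window = sum(1 for s in spectra if s["lo"] <= x <= s["hi"])
        has_peak = sum(
            1 for s in spectra
            if any(abs(x - p) <= tol_ppm for p in s["expected"])
        )
        counts.append((in_window, has_peak))
    return counts


# Hypothetical single spectrum: window 100-130 ppm, one expected peak outside it.
spectra = [{"lo": 100.0, "hi": 130.0, "expected": [105.0, 135.0]}]
counts = coverage_counts([105.0, 135.0], spectra)
folded = fold_into_window(135.0, 100.0, 130.0)
```

In this toy case the second grid value has an expected peak but lies outside the spectral window, which is the "green exceeds red" situation the caption attributes to folding; under the simplified aliasing model that peak would appear at the folded position inside the window.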

**Fig. 4**
Contour plot of a [¹H,¹³C]-HSQC spectrum for the protein 1VDY. Positions of peaks back-calculated from the chemical shifts deposited in the BMRB are indicated by blue crosses.
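
Back-calculating peak positions from assigned chemical shifts, as mentioned in this caption, can be sketched as follows. This is a minimal Python illustration rather than the procedure used for the figure: the function name, the `(residue, atom)` key layout, the bond list, and all shift values are hypothetical. The only assumption about the experiment is that a [¹H,¹³C]-HSQC peak appears at the shifts of a carbon and a proton directly bonded to it.

```python
def expected_hsqc_peaks(shifts, ch_bonds):
    """Return (1H ppm, 13C ppm) positions for each directly bonded C-H pair
    for which both chemical shifts are assigned."""
    peaks = []
    for carbon, proton in ch_bonds:
        if carbon in shifts and proton in shifts:
            peaks.append((shifts[proton], shifts[carbon]))
    return peaks


# Hypothetical assignments keyed by (residue number, atom name).
shifts = {(5, "CA"): 52.3, (5, "HA"): 4.35, (5, "CB"): 19.1}
# Hypothetical C-H bond list; HB is unassigned here, so no CB peak is produced.
ch_bonds = [((5, "CA"), (5, "HA")), ((5, "CB"), (5, "HB"))]
peaks = expected_hsqc_peaks(shifts, ch_bonds)
```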

**Fig. 5**
Data validation by automated spectrum analysis with ARTINA. The three panels show, for 100 proteins, the backbone RMSD between the ARTINA structure and the NMR structure deposited in the PDB, as well as the accuracy of the backbone and sidechain assignment by ARTINA relative to the assignments deposited in the BMRB. Proteins presented in bar plots are sorted clockwise by sequence length. Box plots present the distribution and the median of these quantities.
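
One way to quantify assignment accuracy relative to reference assignments, as reported in this figure, is the fraction of reference shifts that are reproduced within a tolerance. The Python sketch below is an illustration under stated assumptions, not the evaluation code behind the figure: the function name, the `(residue, atom)` key layout, the tolerances (0.03 ppm for ¹H, 0.3 ppm for heavy atoms), and the example values are all assumed.

```python
def assignment_accuracy(predicted, reference, tol_h=0.03, tol_heavy=0.3):
    """Fraction of reference chemical shifts matched by a predicted shift
    within tolerance. Keys are (residue number, atom name); values are ppm.
    Atoms missing from `predicted` count as incorrect."""
    if not reference:
        raise ValueError("reference assignment is empty")
    correct = 0
    for key, ref_ppm in reference.items():
        pred_ppm = predicted.get(key)
        if pred_ppm is None:
            continue
        # Atom names starting with "H" are treated as protons (assumed naming).
        tol = tol_h if key[1].startswith("H") else tol_heavy
        if abs(pred_ppm - ref_ppm) <= tol:
            correct += 1
    return correct / len(reference)


# Hypothetical reference and predicted assignments.
reference = {(1, "H"): 8.20, (1, "N"): 120.0, (1, "CA"): 55.0, (2, "H"): 7.90}
predicted = {(1, "H"): 8.21, (1, "N"): 120.5, (1, "CA"): 55.1}
accuracy = assignment_accuracy(predicted, reference)
```

Here two of the four reference shifts are matched (the amide nitrogen is off by 0.5 ppm and one proton is unassigned), so the accuracy is 0.5. Restricting the keys to backbone or sidechain atoms would give the separate backbone and sidechain figures.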

Cited by

Super-resolution triple-resonance NMR spectroscopy for the sequential assignment of proteins.
Gampp O, Wenchel L, Güntert P, Klukowski P, Riek R. Sci Adv. 2025 Aug 15;11(33):eadv6246. doi: 10.1126/sciadv.adv6246. PMID: 40815649.

Grants and funding

R35 GM141818/GM/NIGMS NIH HHS/United States