Sci Data. 2024 Jan 4;11(1):30.
doi: 10.1038/s41597-023-02879-5.

The 100-protein NMR spectra dataset: A resource for biomolecular NMR data analysis

Piotr Klukowski et al. Sci Data. 2024.

Abstract

Multidimensional NMR spectra are the basis for studying proteins by NMR spectroscopy and crucial for the development and evaluation of methods for biomolecular NMR data analysis. Nevertheless, in contrast to derived data such as chemical shift assignments in the BMRB and protein structures in the PDB databases, this primary data is in general not publicly archived. To change this unsatisfactory situation, we present a standardized set of solution NMR data comprising 1329 2-4-dimensional NMR spectra and associated reference (chemical shift assignments, structures) and derived (peak lists, restraints for structure calculation, etc.) annotations. With the 100-protein NMR spectra dataset that was originally compiled for the development of the ARTINA deep learning-based spectra analysis method, 100 protein structures can be reproduced from their original experimental data. The 100-protein NMR spectra dataset is expected to help the development of computational methods for NMR spectroscopy, in particular machine learning approaches, and enable consistent and objective comparisons of these methods.

Conflict of interest statement

G.T.M. is a founder of Nexomics Biosciences, Inc. This does not represent a conflict of interest for this study. The other authors declare no competing interests.

Figures

Fig. 1
NMR spectra analysis workflow associated with the Dataset. Each protein record contains the protein sequence and a set of 2D–4D spectra, which undergo visual spectrum analysis (peak picking), yielding the coordinates of signals in the NMR spectra. Subsequently, identified signals are assigned to atoms in the protein sequence (chemical shift assignment). Assignments can then be used, for instance, to obtain interatomic distance restraints and to determine the three-dimensional protein structure. The Dataset documents all steps of this analysis for 100 proteins with (i) experimental, (ii) experimentally derived, and (iii) in-silico data, as indicated in the diagram.
Fig. 2
Overview of the Dataset comprising 100 proteins and 1329 spectra. Proteins are ordered by sequence length and identified by their PDB code or abbreviated name if not deposited in the PDB (left). The left panel shows the spectra available for each protein. Where multiple spectra are available for a given spectrum type, they typically have been acquired by separate measurement of aliphatic and aromatic 13C nuclei or with H2O and D2O solvent. The middle panel shows the completeness of the chemical shift assignments deposited in the BMRB. The right panel shows secondary structure elements and well-defined regions plotted versus the residue number. Well-defined regions, which are used for all RMSD calculations, were determined by CYRANGE.
Fig. 3
Statistics for data records in the Dataset. (a) Distribution of experiment types, spectrum dimensionality (2D–4D), and spectrometer frequency. (b) Distribution of number of data points and chemical shift ranges (ppm) across different dimensions for common 3D experiment types in the Dataset (Supplementary Table 4). For each spectrum type, the bottom row features histograms that represent the number of spectra with the specified number of data points in the given dimension, as indicated by the dimension label in the lower left corner. The upper row provides information about the chemical shift range in the spectrum file and the distribution of expected peaks in each dimension. The red line gives the number of spectra for which a given chemical shift value falls within the experimental spectral width in the given dimension. Similarly, the green line represents the number of spectra for which a given chemical shift value coincides (within tolerance) with at least one expected peak position based on the (unfolded) chemical shift assignments from the BMRB. Where the green line exceeds the red line, it indicates that folding is typically applied along that spectral axis.
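The red-line and green-line counts described in this caption can be illustrated with a small sketch. It assumes each spectrum dimension is summarized by a (low, high) ppm window and a list of expected shifts, and it uses a simple wrap-around (aliasing) model for folding. The function names, the tolerance, and all numbers below are hypothetical and are not taken from the Dataset.

```python
def fold_into_window(shift_ppm, low_ppm, high_ppm):
    """Alias a chemical shift into the window [low_ppm, high_ppm).

    Assumption: simple wrap-around aliasing. Some acquisition modes
    mirror folded peaks instead; that case is not modelled here.
    """
    width = high_ppm - low_ppm
    if width <= 0:
        raise ValueError("high_ppm must be greater than low_ppm")
    return low_ppm + (shift_ppm - low_ppm) % width


def spectra_covering(windows, value_ppm):
    """Red-line style count: spectra whose spectral width contains value_ppm."""
    return sum(1 for low, high in windows if low <= value_ppm <= high)


def spectra_with_expected_peak(expected_shifts, value_ppm, tol_ppm):
    """Green-line style count: spectra with an expected (unfolded) peak
    within tol_ppm of value_ppm."""
    return sum(
        1
        for shifts in expected_shifts
        if any(abs(s - value_ppm) <= tol_ppm for s in shifts)
    )


# Illustrative values only: two spectra, one dimension each.
windows = [(10.0, 70.0), (0.0, 80.0)]      # spectral windows in ppm
expected = [[75.0, 40.0], [20.0, 74.9]]    # unfolded expected shifts in ppm

covering = spectra_covering(windows, 75.0)
with_peak = spectra_with_expected_peak(expected, 75.0, 0.2)
folded = fold_into_window(75.0, 10.0, 70.0)
```

In this toy case more spectra expect a peak near 75 ppm than cover 75 ppm, which is the situation the caption associates with folding; under the wrap-around assumption, the peak would appear at the folded position inside the narrower window.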
Fig. 4
Contour plot of a [1H,13C]-HSQC spectrum for the protein 1VDY. Positions of peaks back-calculated from the chemical shifts deposited in the BMRB are indicated by blue crosses.
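A minimal sketch of how peak positions might be back-calculated from chemical shifts for a [1H,13C]-HSQC: each expected peak pairs the shift of a proton with the shift of the carbon it is bonded to. The data layout (a dictionary keyed by residue number and atom name, plus an explicit list of bonded pairs) is an assumption made for illustration, and the shift values are invented rather than taken from the BMRB entry for 1VDY.

```python
def hsqc_peaks(shifts, bonded_pairs):
    """Return expected (1H ppm, 13C ppm) peak positions.

    shifts:       {(residue_number, atom_name): shift_ppm}
    bonded_pairs: [(residue_number, proton_name, carbon_name), ...]
    Pairs with a missing proton or carbon assignment are skipped.
    """
    peaks = []
    for residue, proton, carbon in bonded_pairs:
        h_shift = shifts.get((residue, proton))
        c_shift = shifts.get((residue, carbon))
        if h_shift is None or c_shift is None:
            continue  # unassigned atom: no expected peak
        peaks.append((h_shift, c_shift))
    return peaks


# Hypothetical assignments for one alanine at position 5 (HB left unassigned).
shifts = {(5, "HA"): 4.25, (5, "CA"): 52.5, (5, "CB"): 19.0}
bonded_pairs = [(5, "HA", "CA"), (5, "HB", "CB")]

peaks = hsqc_peaks(shifts, bonded_pairs)
```

Positions computed this way correspond to unfolded shifts; overlaying them on a spectrum that uses folding would additionally require mapping each coordinate into the spectral window.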
Fig. 5
Data validation by automated spectrum analysis with ARTINA. The three panels show, for 100 proteins, the backbone RMSD between the ARTINA structure and the NMR structure deposited in the PDB, as well as the accuracy of the backbone and sidechain assignment by ARTINA relative to the assignments deposited in the BMRB. Proteins presented in bar plots are sorted clockwise by sequence length. Box plots present the distribution and the median of these quantities.
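The two kinds of quantity in this caption can be sketched as follows. This is not the ARTINA evaluation code: the RMSD function assumes the two coordinate sets are already superimposed and restricted to the same backbone atoms in well-defined regions, and the per-nucleus tolerances used for assignment accuracy are placeholder values chosen for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) tuples. Assumption: structures are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def assignment_accuracy(reference, predicted, tolerances):
    """Fraction of reference assignments reproduced within tolerance.

    reference, predicted: {(residue_number, atom_name): shift_ppm}
    tolerances:           {element_letter: tolerance_ppm}; the element is
                          taken from the first letter of the atom name
                          (a simplifying assumption).
    """
    if not reference:
        raise ValueError("reference assignments must not be empty")
    matched = 0
    for key, ref_shift in reference.items():
        pred_shift = predicted.get(key)
        if pred_shift is None:
            continue  # atom not assigned by the automated method
        if abs(pred_shift - ref_shift) <= tolerances[key[1][0]]:
            matched += 1
    return matched / len(reference)


# Illustrative values only.
deviation = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                 [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])

reference = {(5, "H"): 8.20, (5, "N"): 120.0, (6, "H"): 7.90}
predicted = {(5, "H"): 8.21, (5, "N"): 121.0}
accuracy = assignment_accuracy(reference, predicted, {"H": 0.03, "N": 0.4})
```

In the toy example only one of the three reference shifts is reproduced within tolerance, so the accuracy is one third; unassigned atoms count against the method because the denominator is the number of reference assignments.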
