Sci Data. 2023 Apr 12;10(1):204.
doi: 10.1038/s41597-023-02101-6.

Unified access to up-to-date residue-level annotations from UniProtKB and other biological databases for PDB data

Preeti Choudhary et al. Sci Data.

Abstract

More than 61,000 proteins have up-to-date correspondence between their amino acid sequence (UniProtKB) and their 3D structures (PDB), enabled by the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource. SIFTS incorporates residue-level annotations from many other biological resources. SIFTS data is available in XML, CSV and TSV formats and through the PDBe REST API, but it has always been maintained separately from the structure data (PDBx/mmCIF files) in the PDB archive. Here, we extended the wwPDB PDBx/mmCIF data dictionary with additional categories to accommodate SIFTS data and added the UniProtKB, Pfam, SCOP2, and CATH residue-level annotations directly into the PDBx/mmCIF files from the PDB archive. With the integrated UniProtKB annotations, these files now provide consistent residue numbering across different PDB entries, allowing easy comparison of structure models. The extended dictionary yields a more consistent, standardised metadata description without altering the core PDB information. This development provides up-to-date cross-reference information at the residue level, resulting in better data interoperability and supporting improved data analysis and visualisation.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Schematic overview of the core SIFTS pipeline and the additional process for exporting data into PDBx/mmCIF files. The figure illustrates the components of the core SIFTS pipeline, shown in yellow, and the corresponding outputs, shown in green. The core SIFTS process generates several outputs, including the SIFTS database and XML, CSV, and TSV files. The additional process adds SIFTS data to updated PDBx/mmCIF files. The grey components denote data resources that are external to the SIFTS pipeline.
Fig. 2
The PDBx/mmCIF extension incorporates mappings from various data resources. SIFTS annotations mapping PDB residues to various data resources are shown both per-segment (top) and per-residue (bottom). All the new SIFTS-specific or modified PDBx/mmCIF categories are shown in grey boxes. The new SIFTS-specific PDBx/mmCIF categories introduced to show per-segment annotations from UniProtKB and all the other external data resources (Pfam, SCOP2, CATH) are “_pdbx_sifts_unp_segments” and “_pdbx_sifts_xref_db_segments” respectively. “_pdbx_sifts_xref_db” is another new SIFTS-specific PDBx/mmCIF category introduced to show per-residue annotations. We also modified the “_atom_site” category to indicate the best mapped UniProtKB sequence.
Fig. 3
Single placeholder in PDBx/mmCIF files to find all the annotations associated with any residue from external databases. This figure shows the “_pdbx_sifts_xref_db” category for PDB 4daj. This critical new data category can describe residue-level cross-references to external databases. The items specific to the UniProtKB database and other cross-reference databases are marked in beige and green coloured boxes respectively.
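The per-residue lookup described in this caption can be sketched as follows. This is a minimal illustration only: the rows are invented rather than taken from PDB 4daj, and every key name other than `seq_id` (for example `asym_id`, `unp_acc`, `xref_db_name`, `xref_db_acc`) is an assumption about how a parsed “_pdbx_sifts_xref_db” row might look, not a statement of the dictionary's item names.

```python
from collections import defaultdict

# Invented example rows; accessions are placeholders, not real identifiers.
# Key names other than "seq_id" are assumed for illustration.
xref_rows = [
    {"asym_id": "A", "seq_id": 5, "unp_acc": "UNP_EXAMPLE",
     "xref_db_name": "Pfam", "xref_db_acc": "PFAM_EXAMPLE"},
    {"asym_id": "A", "seq_id": 5, "unp_acc": "UNP_EXAMPLE",
     "xref_db_name": "CATH", "xref_db_acc": "CATH_EXAMPLE"},
    {"asym_id": "A", "seq_id": 6, "unp_acc": "UNP_EXAMPLE",
     "xref_db_name": "Pfam", "xref_db_acc": "PFAM_EXAMPLE"},
]


def index_by_residue(rows):
    """Map (chain, residue number) to the set of (database, accession) pairs."""
    index = defaultdict(set)
    for row in rows:
        key = (row["asym_id"], row["seq_id"])
        index[key].add(("UniProtKB", row["unp_acc"]))
        index[key].add((row["xref_db_name"], row["xref_db_acc"]))
    return dict(index)


annotations = index_by_residue(xref_rows)
# All cross-references recorded for residue 5 of chain A in the toy data.
print(sorted(annotations[("A", 5)]))
```

Keying the index on the chain and residue number mirrors the idea of a single place to look up every annotation for a residue.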
Fig. 4
Category relationship diagram including new SIFTS specific PDBx/mmCIF categories. New SIFTS specific PDBx/mmCIF data categories are shown along with their data items. All the common data items amongst these new data categories are highlighted and their relationship is shown. Further, the relationship of the data items representing PDB residue numbers - “.seq_id”, “.seq_id_start” or “.seq_id_end” in these new data categories to existing data categories is shown.
Fig. 5
Distinguishing between multiple instances of the same protein in the PDBx/mmCIF file. The data item “.instance_id” enables users to identify the two copies of the same protein, Streptavidin (UniProtKB accession P22629), in the dimeric Streptavidin structure (PDB 6s50).
Fig. 6
Identification of split domains from PDBx/mmCIF file. The “_pdbx_sifts_xref_db_segments” category in the PDBx/mmCIF file of PDB 4daj helps to clearly identify discontinuous domains. The two halves of the M3 receptor domain are indicated by the same “.instance_id” but different “.segment_id”.
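One way to detect such split domains programmatically is to group segment rows by domain and “.instance_id” and flag any group with more than one “.segment_id”. The sketch below assumes the category has already been parsed into dictionaries. The values are invented and do not reproduce PDB 4daj, and `xref_db_acc` is an assumed key name; only `instance_id`, `segment_id`, `seq_id_start` and `seq_id_end` come from the captions.

```python
from collections import defaultdict

# Invented rows for illustration; "xref_db_acc" is an assumed key name.
segment_rows = [
    {"xref_db_acc": "DOM_A", "instance_id": "1", "segment_id": "1",
     "seq_id_start": 10, "seq_id_end": 120},
    {"xref_db_acc": "DOM_A", "instance_id": "1", "segment_id": "2",
     "seq_id_start": 300, "seq_id_end": 380},
    {"xref_db_acc": "DOM_B", "instance_id": "1", "segment_id": "1",
     "seq_id_start": 130, "seq_id_end": 290},
]


def group_segments(rows):
    """Collect residue ranges per (domain accession, instance_id)."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["xref_db_acc"], row["instance_id"])
        grouped[key].append((row["seq_id_start"], row["seq_id_end"]))
    return {key: sorted(ranges) for key, ranges in grouped.items()}


def split_domains(rows):
    """Return only the domain instances made of more than one segment."""
    return {k: v for k, v in group_segments(rows).items() if len(v) > 1}


print(split_domains(segment_rows))
```

In the toy data only `DOM_A` is reported, because its single instance is spread over two residue ranges.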
Fig. 7
Superposition of protein structures using Mol*. The superposed apo and holo forms of human PTP1B protein are shown in green and beige, respectively, in Mol*. The WDP loop is in the open conformation (light green) in the apo form (PDB 2HNP). Upon binding to various substrates or inhibitors, this loop adopts the closed conformation (pink), covering the catalytic site. The inhibitor bound in PDB 1Q6P is shown in surface representation. The average RMSD between the four superposed structures, as computed by Mol*, is 1.40 Å. As seen in the tool-tip (bottom right in the figure), residue W179 from PDB 3CWE and other residues in the inhibitor-bound entries 3CWE and 1Q6P have different author numbering from the unbound and substrate-bound forms (PDB 2HNP and 1PTY). The UniProtKB numbering in the PDBx/mmCIF file provides a common reference frame for residue correspondence and supports superposition based on UniProtKB in Mol*.
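The role of UniProtKB numbering as a common reference frame can be illustrated with a small sketch that pairs residues from two models by their UniProtKB position rather than their author numbering. The residue records below are invented and are not read from 2HNP, 3CWE or any other entry. The key names `auth_seq_id` and `unp_num` are assumptions about how the parsed atom records might be represented.

```python
# Invented residue records: the two models number the same residues differently
# in author numbering but share UniProtKB positions. Key names are assumed.
model_a = [
    {"auth_seq_id": 179, "unp_num": 179},
    {"auth_seq_id": 180, "unp_num": 180},
    {"auth_seq_id": 181, "unp_num": None},  # residue without a UniProtKB mapping
]
model_b = [
    {"auth_seq_id": 679, "unp_num": 179},
    {"auth_seq_id": 680, "unp_num": 180},
    {"auth_seq_id": 682, "unp_num": 182},
]


def by_uniprot(residues):
    """Index residues by UniProtKB position, skipping unmapped residues."""
    return {r["unp_num"]: r for r in residues if r["unp_num"] is not None}


def matched_pairs(res_a, res_b):
    """Pair author residue numbers that map to the same UniProtKB position."""
    a, b = by_uniprot(res_a), by_uniprot(res_b)
    shared = sorted(a.keys() & b.keys())
    return [(a[n]["auth_seq_id"], b[n]["auth_seq_id"]) for n in shared]


print(matched_pairs(model_a, model_b))
```

The resulting pairs are the kind of residue correspondence a superposition routine needs as input. The superposition itself is not shown here.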
Fig. 8
The 2D visualisation components are interactively linked with 3D visualisation components on PDBe entry pages. Various 2D and 3D visualisation components seen on PDBe entry pages are interactively linked with each other. Here we show visualisation data for Mannose-1-phosphate guanyltransferase (PDB 7d72). (A) shows a 2D sequence feature viewer (ProtVista) and (B) shows a 2D topology viewer, along with (C) showing the 3D viewer, Mol*. As users select any residue (here ligand-binding residue ASP218 is selected) in ProtVista, it is automatically highlighted in Mol* and vice-versa. Users can also highlight a range of residues (e.g. domains) in any of these viewers. Here, we show the Pfam domain highlighted in red in the 2D topology viewer.
