Unified access to up-to-date residue-level annotations from UniProtKB and other biological databases for PDB data

Preeti Choudhary¹, Stephen Anyango², John Berrisford²,³, James Tolchard²,⁴, Mihaly Varadi², Sameer Velankar²

Affiliations

¹ Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. cypreeti@ebi.ac.uk.
² Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
³ AstraZeneca, Biomedical Campus, 1 Francis Crick Ave, Trumpington, Cambridge, CB2 0AA, UK.
⁴ Claude Bernard University, Villeurbanne, Lyon, 69100, France.

PMID: 37045837
PMCID: PMC10097656
DOI: 10.1038/s41597-023-02101-6

Preeti Choudhary et al. Sci Data. 2023 Apr 12;10(1):204. doi: 10.1038/s41597-023-02101-6.

Abstract

More than 61,000 proteins have up-to-date correspondence between their amino acid sequence (UniProtKB) and their 3D structures (PDB), enabled by the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource. SIFTS also incorporates residue-level annotations from many other biological resources. SIFTS data are available in XML, CSV and TSV formats and through the PDBe REST API, but have always been maintained separately from the structure data (PDBx/mmCIF files) in the PDB archive. Here, we extended the wwPDB PDBx/mmCIF data dictionary with additional categories to accommodate SIFTS data and added the UniProtKB, Pfam, SCOP2, and CATH residue-level annotations directly into the PDBx/mmCIF files from the PDB archive. With the integrated UniProtKB annotations, these files now provide consistent residue numbering across different PDB entries, allowing easy comparison of structure models. The extended dictionary yields a more consistent, standardised metadata description without altering the core PDB information. This development provides up-to-date cross-reference information at the residue level, resulting in better data interoperability and supporting improved data analysis and visualisation.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Schematic overview of the core SIFTS pipeline and the additional process for exporting data into PDBx/mmCIF files. The components of the core SIFTS pipeline are shown in yellow and the corresponding outputs in green. The core SIFTS process generates various outputs, including the SIFTS database and XML, CSV, and TSV files. The additional process adds SIFTS data to the updated PDBx/mmCIF files. Grey components denote data resources that are external to the SIFTS pipeline.

**Fig. 2**
The PDBx/mmCIF extension incorporates mappings from various data resources. SIFTS annotations mapping PDB residues to various data resources are shown both per-segment (top) and per-residue (bottom). All the new SIFTS-specific or modified PDBx/mmCIF categories are shown in grey boxes. The new SIFTS-specific PDBx/mmCIF categories introduced to show per-segment annotations from UniProtKB and all the other external data resources (Pfam, SCOP2, CATH) are “_pdbx_sifts_unp_segments” and “_pdbx_sifts_xref_db_segments” respectively. “_pdbx_sifts_xref_db” is another new SIFTS-specific PDBx/mmCIF category introduced to show per-residue annotations. We also modified the “_atom_site” category to indicate the best mapped UniProtKB sequence.

**Fig. 3**
Single placeholder in PDBx/mmCIF files to find all the annotations associated with any residue from external databases. This figure shows the “_pdbx_sifts_xref_db” category for PDB 4daj. This critical new data category can describe residue-level cross-references to external databases. The items specific to the UniProtKB database and other cross-reference databases are marked in beige and green coloured boxes respectively.

**Fig. 4**
Category relationship diagram including the new SIFTS-specific PDBx/mmCIF categories. The new SIFTS-specific PDBx/mmCIF data categories are shown along with their data items. The data items common to these new categories are highlighted and their relationships are shown. The relationship between the data items representing PDB residue numbers (“.seq_id”, “.seq_id_start” or “.seq_id_end”) in these new data categories and the existing data categories is also shown.

**Fig. 5**
Distinguishing between multiple instances of the same protein in the PDBx/mmCIF file. The data item “.instance_id” enables users to identify the two copies of the same protein, Streptavidin (UniProtKB accession P22629), in the dimeric Streptavidin structure (PDB 6s50).

**Fig. 6**
Identification of split domains from PDBx/mmCIF file. The “_pdbx_sifts_xref_db_segments” category in the PDBx/mmCIF file of PDB 4daj helps to clearly identify discontinuous domains. The two halves of the M3 receptor domain are indicated by the same “.instance_id” but different “.segment_id”.
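
The rule described in this caption (same “.instance_id”, different “.segment_id”) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the rows of “_pdbx_sifts_xref_db_segments” have already been read into dictionaries keyed by item name, that the category carries “.seq_id_start” and “.seq_id_end” alongside the two identifiers, and the residue ranges below are invented rather than taken from PDB 4daj.

```python
from collections import defaultdict

# Hypothetical rows standing in for "_pdbx_sifts_xref_db_segments".
# The residue ranges are invented for illustration only.
rows = [
    {"instance_id": "1", "segment_id": "1", "seq_id_start": 10, "seq_id_end": 120},
    {"instance_id": "1", "segment_id": "2", "seq_id_start": 300, "seq_id_end": 380},
    {"instance_id": "2", "segment_id": "1", "seq_id_start": 130, "seq_id_end": 290},
]


def group_segments(rows):
    """Collect (start, end) residue ranges per instance_id, sorted by start."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["instance_id"]].append(
            (row["seq_id_start"], row["seq_id_end"])
        )
    return {key: sorted(segments) for key, segments in grouped.items()}


def split_instances(rows):
    """Return instance_ids that are described by more than one segment."""
    return sorted(
        key for key, segments in group_segments(rows).items() if len(segments) > 1
    )


domains = group_segments(rows)
discontinuous = split_instances(rows)
print(discontinuous)
```

An instance that appears with a single segment is a contiguous domain; one that appears with several segments is discontinuous, and its ranges can be concatenated to recover the full domain.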

**Fig. 7**
Superposition of protein structures using Mol*. The superposed apo and holo forms of human PTP1B protein are shown in green and beige, respectively, in Mol*. The WPD loop is in the open conformation (light green) in the apo form (PDB 2HNP). Upon binding to various substrates or inhibitors, the WPD loop adopts the closed conformation (pink), covering the catalytic site. The inhibitor bound in PDB 1Q6P is shown in surface representation. The average RMSD between the four superposed structures, as computed by Mol*, is 1.40 Å. As seen in the tool-tip (bottom-right in the figure), residue W179 from PDB 3CWE and other residues in the inhibitor-bound PDBs 3CWE and 1Q6P have different author numbering from the unbound and substrate-bound forms (PDB 2HNP and 1PTY). The UniProtKB numbering in the PDBx/mmCIF file provides a common reference frame for residue correspondence and supports superposition based on UniProtKB numbering in Mol*.
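
The idea of using UniProtKB numbering as a common reference frame can be sketched as follows. This is an illustrative Python snippet only: it assumes each model has already been reduced to a mapping from UniProtKB residue number to author residue label, the function name is hypothetical, and the numbers are invented rather than taken from the PTP1B entries above.

```python
def common_residues(map_a, map_b):
    """Pair residues of two models that share a UniProtKB residue number.

    Each argument maps a UniProtKB residue number to the author residue
    label used in that model. Returns (uniprot_number, label_a, label_b)
    tuples in ascending UniProtKB order.
    """
    shared = sorted(map_a.keys() & map_b.keys())
    return [(number, map_a[number], map_b[number]) for number in shared]


# Invented example: the second model uses an offset author numbering.
model_a = {177: "177", 178: "178", 179: "179"}
model_b = {178: "678", 179: "679", 180: "680"}

pairs = common_residues(model_a, model_b)
print(pairs)
```

Only the residues present in both models are paired, so the resulting list is a natural input for a superposition that does not depend on author numbering.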

**Fig. 8**
The 2D visualisation components are interactively linked with 3D visualisation components on PDBe entry pages. Various 2D and 3D visualisation components seen on PDBe entry pages are interactively linked with each other. Here we show visualisation data for Mannose-1-phosphate guanyltransferase (PDB 7d72). (A) shows a 2D sequence feature viewer (ProtVista) and (B) shows a 2D topology viewer, along with (C) showing the 3D viewer, Mol*. As users select any residue (here ligand-binding residue ASP218 is selected) in ProtVista, it is automatically highlighted in Mol* and vice-versa. Users can also highlight a range of residues (e.g. domains) in any of these viewers. Here, we show the Pfam domain highlighted in red in the 2D topology viewer.
