Database (Oxford). 2024 Jan 17:2024:baad093.
doi: 10.1093/database/baad093.

An optimized relational database for querying structural patterns in proteins

Renzo Angles et al. Database (Oxford).

Abstract

A database is an essential component in almost any software system, and its creation involves more than just data modeling and schema design. It also includes query optimization and tuning. This paper focuses on a web system called GSP4PDB, which is used for searching structural patterns in proteins. The system utilizes a normalized relational database, which has proven to be inefficient even for simple queries. This article discusses the optimization of the GSP4PDB database by implementing two techniques: denormalization and indexing. The empirical evaluation described in the article shows that combining these techniques enhances the efficiency of the database when querying both real and artificial graph-based structural patterns.
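
The two techniques named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the table and column names (`ligand`, `amino`, `distance`, `ligand_amino`), the sample rows and the index name are hypothetical and are not the GSP4PDB schema. The sketch also uses SQLite through Python's standard `sqlite3` module so that it runs without a server, whereas the system described here uses PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical normalized schema: a pattern query must join three tables.
cur.executescript("""
CREATE TABLE ligand (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE amino (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE distance (
    ligand_id INTEGER REFERENCES ligand(id),
    amino_id INTEGER REFERENCES amino(id),
    dist REAL
);
""")
# Made-up sample rows, for illustration only.
cur.executemany("INSERT INTO ligand VALUES (?, ?)", [(1, "ZN"), (2, "MG")])
cur.executemany("INSERT INTO amino VALUES (?, ?)",
                [(1, "CYS"), (2, "HIS"), (3, "ALA")])
cur.executemany("INSERT INTO distance VALUES (?, ?, ?)",
                [(1, 1, 2.3), (1, 2, 2.1), (1, 3, 6.5), (2, 2, 2.2)])

NORMALIZED_QUERY = """
SELECT l.name, a.symbol, d.dist
FROM distance d
JOIN ligand l ON l.id = d.ligand_id
JOIN amino a ON a.id = d.amino_id
WHERE l.name = ? AND a.symbol = ? AND d.dist <= ?
"""

# Denormalization: materialize the join once into a single wide table.
# Indexing: add a composite index on the columns used in the WHERE clause.
cur.executescript("""
CREATE TABLE ligand_amino AS
SELECT l.name AS ligand_name, a.symbol AS amino_symbol, d.dist AS dist
FROM distance d
JOIN ligand l ON l.id = d.ligand_id
JOIN amino a ON a.id = d.amino_id;
CREATE INDEX idx_ligand_amino
    ON ligand_amino (ligand_name, amino_symbol, dist);
""")

DENORMALIZED_QUERY = """
SELECT ligand_name, amino_symbol, dist
FROM ligand_amino
WHERE ligand_name = ? AND amino_symbol = ? AND dist <= ?
"""

# Ligand ... distance ... Amino pattern: ZN within 3.0 of a CYS.
params = ("ZN", "CYS", 3.0)
normalized_rows = cur.execute(NORMALIZED_QUERY, params).fetchall()
denormalized_rows = cur.execute(DENORMALIZED_QUERY, params).fetchall()

# The query plan shows how the engine intends to evaluate the query.
plan = cur.execute("EXPLAIN QUERY PLAN " + DENORMALIZED_QUERY, params).fetchall()
print(normalized_rows, denormalized_rows, plan)
```

Both queries return the same rows; the denormalized form avoids the joins at query time at the cost of redundant storage, and the composite index lets the engine locate matching rows without scanning the whole table. Whether this pays off on real data depends on table sizes and the query mix, which is what the empirical evaluation in the article measures.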

Figures

Figure 1.
Three-dimensional representation of the zinc finger pattern (20). Alt Text: A zinc finger pattern.
Figure 2.
Example of a graph-based structural pattern. It shows a zinc ligand connected to two specific amino acids (CYS and HIS) plus an undefined amino acid. Alt Text: A graph-based structural pattern.
Figure 3.
Navigation bar and design area of GSP4PDB. The navigation bar contains the components that can be used to draw the graph-based structural pattern in the design area. Alt Text: Navigation bar and design area of GSP4PDB.
Figure 4.
Output area of GSP4PDB. Each solution is a match of the graph-based structural pattern in a specific protein. Alt Text: Output area of GSP4PDB.
Figure 5.
Entity-relationship diagram of the protein data used by GSP4PDB. It shows the entities, relationships and attributes identified and used to create the PostgreSQL database. Alt Text: Entity-relationship diagram of the GSP4PDB database.
Figure 6.
Structure (or relational schema) of the database for storing protein information. For each table, we show attributes (first row), data types (second row) and a sample data tuple (third row). Primary keys and foreign keys are marked as [PK] and [FK], respectively. Alt Text: Relational schema of the GSP4PDB database.
Figure 7.
SQL query template for a subgraph pattern of the form Ligand ⋯ distance ⋯ Amino. The parameters of the template are written in square brackets (e.g. [AMINO_OID]). Alt Text: SQL query template for a subgraph pattern.
Figure 8.
Graph-based structural patterns representing real protein–ligand structural patterns. Alt Text: Real graph-based structural patterns.
Figure 9.
Comparison of runtimes obtained for real graph-based structural patterns. Note that RPX = real graph pattern X, DB1 = denormalized database, DB2 = denormalized and indexed database, M1 = Machine 1 (8-GB RAM) and M2 = Machine 2 (92-GB RAM). Alt Text: The runtimes for real graph-based structural patterns.
Figure 10.
Generic graph-based structural patterns. These were used to create many artificial graph patterns. Alt Text: Generic graph-based structural patterns.

References

    1. Dhifli Abdoulaye W. (2015) PGR: a novel graph repository of protein 3D-structures. J. Data Mining in Genomics & Proteomics, 6, 1–4.
    2. Anders G. and Nicola M. (2011) Managing the Protein Data Bank with DB2 pureXML. IBM developerWorks, Technical Library.
    3. Angles R. and Arenas M. (2018) A graph-based approach for querying protein-ligand structural patterns. In: Lecture Notes in Bioinformatics, 10813, Springer, Cham, pp. 235–244.
    4. Angles R., Arenas-Salinas M., García R. et al. (2020) GSP4PDB: A web tool to visualize, search and explore protein-ligand structural patterns. BMC Bioinform., 21, 1–15.
    5. Aslam N., Nadeem A. and Ellahi Babar M. et al. (2016) RPDB: A relational databank of protein structures. Pak. J. Agric. Sci., 53, 129–134.
