Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Mar 7:2011:bar003.
doi: 10.1093/database/bar003. Print 2011.

Quality assurance for the query and distribution systems of the RCSB Protein Data Bank

Affiliations

Quality assurance for the query and distribution systems of the RCSB Protein Data Bank

Wolfgang F Bluhm et al. Database (Oxford). .

Abstract

The RCSB Protein Data Bank (RCSB PDB, www.pdb.org) is a key online resource for structural biology and related scientific disciplines. The website is used on average by 165,000 unique visitors per month, and more than 2000 other websites link to it. The amount and complexity of PDB data as well as the expectations on its usage are growing rapidly. Therefore, ensuring the reliability and robustness of the RCSB PDB query and distribution systems are crucially important and increasingly challenging. This article describes quality assurance for the RCSB PDB website at several distinct levels, including: (i) hardware redundancy and failover, (ii) testing protocols for weekly database updates, (iii) testing and release procedures for major software updates and (iv) miscellaneous monitoring and troubleshooting tools and practices. As such it provides suggestions for how other websites might be operated.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
Schematic representation of hardware redundancy and DNS failover. There are four clusters in three separate geographical locations: San Diego Supercomputer Center (SDSC) and Skaggs School of Pharmacy and Pharmaceutical Sciences (SSPPS), both at the University of California, San Diego (UCSD), and Rutgers, the State University of New Jersey. Each cluster contains multiple load balanced Web and FTP servers. A third party DNS provider is used to manage the DNS entries for the website (www.pdb.org) and the FTP site (ftp.wwpdb.org) including failover in case the primary cluster fails.
Figure 2.
Figure 2.
Staggered weekly update schedule of the Web and FTP servers. The overall aim is to balance the need for advanced staging of the update (red and orange) with as much failover to current data (green) as possible at any given time. The update cycle begins on Friday with the second SDSC cluster (SDSC 2). Two more clusters are updated on Monday and Tuesday. On Wednesday at 00:00 UTC, the update is made public by switching the DNS entry between the two SDSC clusters (thick outlines). A few hours are allowed for the DNS change to propagate until the update on the final and now out of date cluster (blue) is started. Green with thick outlines shows “live” clusters serving data to the public. Other clusters in green have the same current content. Clusters in red are being updated. Orange denotes a cluster with a finished update that contains “staged” data not yet available for public release. Blue shows a cluster whose data are out of date compared with the live public site.
Figure 3.
Figure 3.
Screenshot of fully executed Selenium IDE test suite in a Firefox browser. We have developed an extensive suite of testing scripts with several goals such as checking key elements of the user interface, verifying the integrity of the weekly update, and comparing results between multiple servers. The top-left panel shows the list of test scripts. The top-middle panel shows as an example the script for verifying that keyword searches are up to date. It selects the first entry ID from the weekly release, extracts a keyword from its title, and then performs a search for this keyword and asserts that the search results include the given entry ID. The top-right panel contains the controls for executing tests and shows the test results. The bottom panel is a regular browser window that shows the web pages being loaded by the scripts.
Figure 4.
Figure 4.
Screenshot of Cacti monitoring tool (www.cacti.net). Parameters such as internal and external traffic, CPU and memory usage, thread counts and Tomcat session counts are collected every few minutes and presented in graphical form for each server or cluster. The time window starts after the cluster had just been reinstalled with a new quarterly software release and shows the server load during successive Selenium tests on each server. Server names are redacted for security reasons.

References

    1. Berman HM, Westbrook J, Feng Z, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. - PMC - PubMed
    1. Berman H, Henrick K, Nakamura H. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. 2003;10:980. - PubMed
    1. Rose PW, Beran B, Bi C, et al. The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. 2011;39:D392–D401. - PMC - PubMed
    1. Prlić A, Bliven S, Rose PW, et al. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics. 2010;26:2983–2985. - PMC - PubMed
    1. Prlić A, Martinez MA, Dimitropoulos D, et al. Integration of open access literature into the RCSB Protein Data Bank using BioLit. BMC Bioinformatics. 2010;11:220. - PMC - PubMed

Publication types