Growth of novel protein structural data

Michael Levitt¹

PMID: 17360626
PMCID: PMC1802002
DOI: 10.1073/pnas.0611678104

Michael Levitt. Proc Natl Acad Sci U S A. 2007 Feb 27;104(9):3183-8. doi: 10.1073/pnas.0611678104. Epub 2007 Feb 20.

Affiliation

¹ Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305-5126, USA. michael.levitt@stanford.edu


Abstract

Contrary to popular assumption, the rate of growth of structural data has slowed, and the Protein Data Bank (PDB) has not been growing exponentially since 1995. Reaching such a dramatic conclusion requires careful measurement of growth of novel structures, which can be achieved by clustering entry sequences, or by using a novel index to down-weight entries with a higher number of sequence neighbors. These measures agree, and growth rates are very similar for entire PDB files, clusters, and weighted chains. The overall sizes of Structural Classification of Proteins (SCOP) categories (number of families, superfamilies, and folds) appear to be directly proportional to the number of deposited PDB files. Using our weighted chain count, which is most correlated to the change in the size of each SCOP category in any time period, shows that the rate of increase of SCOP categories is actually slowing down. This enables the final size of each of these SCOP categories to be predicted without examining or comparing protein structures. In the last 3 years, structures solved by structural genomics (SG) initiatives, especially the United States National Institutes of Health Protein Structure Initiative, have begun to redress the slowing growth of the PDB. Structures solved by SG are 3.8 times less sequence-redundant than typical PDB structures. Since mid-2004, SG programs have contributed half the novel structures measured by weighted chain counts. Our analysis does not rely on visual inspection of coordinate sets: it is done automatically, providing an accurate, up-to-date measure of the growth of novel protein structural data.

Conflict of interest statement

The author declares no conflict of interest.

Figures

**Fig. 1.**
The numbers of clusters, N_C^ID (olive) and N_Cas^ID (which includes asymmetrical links; light green), fall as the %ID threshold is lowered. The weighted chain count, N_W^ID (magenta), has a very similar dependence on the %ID. All measures depend nonlinearly on the %ID but can be fitted well by a fifth-order polynomial that passes through the point (0, 1); at 0% ID, there is 1 cluster and the total weighted chain count is 1. When asymmetrical links are included, the agreement with N_W^ID is poorer, so these links are generally excluded. Both N_C^ID and N_Cas^ID depend linearly on N_W^ID, with N_Cas^ID = 0.954N_W^ID − 225 and N_C^ID = 0.974N_W^ID + 421 (*Inset*).
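The boundary condition in this caption (a total weighted chain count of 1 at 0% ID) can be illustrated with a small sketch. The weighting used here, 1/(1 + number of sequence neighbors), is an assumption chosen because it reproduces that boundary condition; the abstract says only that entries with more sequence neighbors are down-weighted and does not give the formula. The `identity` function is a hypothetical toy stand-in for a real sequence alignment, and the example chains are invented.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Toy percent identity by position (hypothetical; not a real alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def weighted_chain_count(chains, threshold):
    """Sum of 1 / (1 + neighbors) over chains (assumed weighting scheme).

    Two chains are neighbors when their identity is at or above `threshold`.
    """
    neighbors = [0] * len(chains)
    for i, j in combinations(range(len(chains)), 2):
        if identity(chains[i], chains[j]) >= threshold:
            neighbors[i] += 1
            neighbors[j] += 1
    return sum(1.0 / (1 + n) for n in neighbors)


# Invented example chains, for illustration only.
chains = ["AAAA", "AAAT", "GGGG"]
for threshold in (0, 75, 100):
    print(threshold, weighted_chain_count(chains, threshold))
```

Under this assumed weighting, every chain is a neighbor of every other at a 0% threshold, so the weights sum to 1 regardless of how many chains there are; at a 100% threshold, distinct chains each count fully.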

**Fig. 2.**
Structural data growth since the inception of the PDB. (a) Growth of protein structural data deposited in the PDB since its inception in 1972. The average growth rate is 28.0% per year for PDB files (N_PDB; gray) and 28.4% for nonidentical chains (N_CHA; brown). Clustering or down-weighting sequence redundancy at the 25% identity level gives a lower growth rate, very similar for three measures: 24.4% for clustered chains (N_C²⁵; olive), 23.0% for weighted chains (N_W²⁵; magenta), and 23.4% for number of weighted residues (M_W²⁵; lavender; divided by 183, the average number of residues in a PDB chain today). These growth rates are extremely rapid, with a doubling of the data in less than three and a half years (22.5% annual growth rate). The growth as predicted by the Dickerson equation (18) is shown as a dashed blue line. Small differences are masked on the log scale used for the y axis. Thus, N_PDB and N_CHA seem very similar, but N_CHA = 1.12 × N_PDB. The novelty ratio (ΔN_W²⁵/ΔN_CHA; orange) is measured on the right-hand axis scale. (b) The annual growth rate smoothed over the previous four quarters fluctuates greatly. Each of the three growth rates, g[N_PDB] (gray), g[N_C²⁵] (olive), and g[N_W²⁵] (magenta), has three peaks. After 1995, there is a steady decline in rate of growth approximated by the green line (see text). Today (August 2006), all three growth rates are half their values in 1995. For exponential growth, the rates should be constant at 22.0%: this is clearly not the case. Growth using deposit dates is complicated by PDB files that are on hold. The coordinates and often the sequences of these entries are not available until their release date. Thus, for a fixed deposit date cutoff, the number of PDB files that can be analyzed (sequence is needed) changes with time until the entries have all been released. This makes the most recent data incomplete. We study this effect by including the on-hold files and recalculating the growth rate of PDB files. 
This increases the growth rate of the most recent entries (last 2 years in *b Inset*), but the effect is small, raising the growth rate by 4.4, 3.7, 2.5, 1.8, 0.8, and 0.2 percentage points as one goes back six quarters from October 2006. This problem does not occur with release dates [supporting information (SI) Fig. 7b Inset].
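The link between an annual growth rate and a doubling time quoted in this caption follows from the standard compound-growth relation. The sketch below is only that textbook arithmetic, not the author's analysis; the function names are illustrative.

```python
import math


def doubling_time_years(annual_growth_percent: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_percent / 100.0)


def annual_growth_percent(doubling_years: float) -> float:
    """Constant annual growth rate (in percent) that doubles in the given time."""
    return (2 ** (1.0 / doubling_years) - 1) * 100.0


# The caption's 22.0% constant rate for exponential growth.
print(round(doubling_time_years(22.0), 2))
# Rate implied by doubling in exactly three and a half years.
print(round(annual_growth_percent(3.5), 1))
```

At a constant 22.0% per year the data double in just under three and a half years, which matches the caption's description of the doubling time.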

**Fig. 3.**
Sizes of SCOP categories depend on number of weighted chains and number of PDB files. (a) Sizes of the SCOP categories (families in blue, superfamilies in green, and folds in red) vary linearly (correlation >0.998) with the number of files released in the PDB (N_PDB). Large circles show the historical data for sizes from 13 releases of SCOP since October 1997 (see http://scop.mrc-lmb.cam.ac.uk/scop/count.html). The plot of N_W²⁵ vs. N_PDB bends upwards (magenta). (b) Variation of the sizes of the three SCOP categories with weighted chain count at 25% identity, N_W²⁵. The curves for the real data (blue, green, and red) bend downwards and are well fitted (small black diamonds) by a saturating function N = α × N_W²⁵/(1 + α × N_W²⁵/N^max), where α is a scale factor and N^max is the asymptotic saturation value, the maximum possible number. We find (α, N^max) = (0.707, 11811), (0.517, 3412), and (0.356, 1613) for families, superfamilies, and folds, respectively. Large diamonds are an extrapolation to the current PDB.
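The saturating function in this caption can be evaluated directly from the quoted (α, N^max) pairs. This is a minimal sketch: the function and dictionary names are illustrative, and the N_W²⁵ values passed in are arbitrary inputs rather than data from the paper.

```python
def scop_size(n_w25: float, alpha: float, n_max: float) -> float:
    """N = alpha * N_W25 / (1 + alpha * N_W25 / N_max), as given in the caption."""
    x = alpha * n_w25
    return x / (1.0 + x / n_max)


# (alpha, N_max) pairs quoted in the caption.
PARAMS = {
    "families": (0.707, 11811),
    "superfamilies": (0.517, 3412),
    "folds": (0.356, 1613),
}

# Arbitrary illustrative weighted chain counts, not values from the paper.
for n_w25 in (1_000, 10_000, 100_000):
    sizes = {name: round(scop_size(n_w25, a, m)) for name, (a, m) in PARAMS.items()}
    print(n_w25, sizes)
```

For small N_W²⁵ the function grows almost linearly with slope α, and for very large N_W²⁵ it approaches N^max. That limiting behavior is what allows the final size of each SCOP category to be predicted.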

**Fig. 4.**
Growth of sizes of SCOP categories with and without structural genomic data since 2000. (a) Influence of the worldwide SG initiative on the growth of protein structural data. With the SG data, growth is more rapid but even this growth rate has fallen recently (Fig. 2b). The white dots show the real SCOP count before October 2004; the solid lines and filled circles show estimates of the growth of families, superfamilies, and folds from the growth in N_W²⁵ or N_W(SG−)²⁵. Release dates are used here. (b) Percentage of nonredundant sequence data (N_W²⁵) in PDB files deposited in the preceding year that comes or does not come from SG. Before the year 2000, almost all data were unrelated to SG projects; since 2004, almost half the data have come from SG.


References

1. Kendrew JC, Dickerson RE, Strandberg BE, Hart RG, Davies DR, Phillips DC, Shore VC. Nature. 1960;185:422–427.
2. Kendrew JC. Sci Am. 1961;205:96–111.
3. Levitt M, Chothia C. Nature. 1976;261:552–558.
4. Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG. Nucl Acids Res. 2004;32:D226–D229.
5. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M. Eur J Biochem. 1977;80:319–324.
