BMC Struct Biol. 2006 Mar 20;6:6.
doi: 10.1186/1472-6807-6-6.

Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds


Ruslan I Sadreyev et al. BMC Struct Biol. 2006.

Abstract

Background: As tertiary structure is currently available for only a fraction of known protein families, it is important to assess which parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with solved structure and address the following questions. Do these domains represent an unbiased random sample of all sequence families? Do targets solved by structural genomics initiatives (SGI) provide such a sample? What are the approximate total numbers of structure-based superfamilies and folds among soluble globular domains?

Results: To make these assessments, we combine two approaches: (i) sequence analysis and homology-based structure prediction for proteins from complete genomes; and (ii) monitoring how the set of assigned structures changes over time as experimentally solved structures accumulate. In the Clusters of Orthologous Groups (COG) database, we map the growing population of structurally characterized domain families onto the network of sequence-based connections between domains. This mapping reveals a systematic bias: target families for structure determination tend to be located in highly populated areas of sequence space. In contrast, the subset of domains whose structure is initially inferred by SGI is similar to a random sample from the whole population. To account for the observed bias, we propose a new non-parametric approach to estimating the total numbers of structural superfamilies and folds, which does not rely on a specific model of the sampling process. Based on the dynamics of robust distribution-based parameters in the growing set of structure predictions, we estimate the total numbers of superfamilies and folds among soluble globular proteins in the COG database.

Conclusion: The set of currently solved protein structures allows for structure prediction in approximately a third of sequence-based domain families. The choice of targets for structure determination is biased towards domains with many sequence-based homologs. The growing SGI output should further reduce this bias in the future. The total numbers of structural superfamilies and folds in the COG database are estimated as approximately 4000 and approximately 1700, respectively. These numbers are, respectively, four and three times higher than the numbers of superfamilies and folds that can currently be assigned to COG proteins.


Figures

Figure 1
Clustering and structure prediction for sequence domains. A. Formation of SMOGs. Individual proteins in each COG are split into sequence-based domains using the ADDA database. The resulting sequence segments are grouped by sequence similarity within each COG; these groups from different COGs are then further clustered by complete linkage. The resulting clusters comprise sequence modules from orthologous groups of proteins (SMOGs), which are used as elementary units for structure assignment and sequence-based clustering (see Methods for details). B. Structure prediction in SMOG sequences. The main steps of the procedure are labeled on the right. First, individual SMOG segments are compared to sequences and profiles for SCOP representatives from ASTRAL. Using alignments between members of the same SMOG, structure assignments at the SCOP superfamily level are propagated to the regions in the SMOG segments that are not directly linked to SCOP domains. These initial assignments are used to split SMOG segments into smaller fragments, generate PSI-BLAST profiles for these fragments, and perform PSI-BLAST searches against the database of SCOP domain sequences. These searches improve the precision of the initial assignments and produce additional assignments. In a given SMOG, regions with the same superfamily assignment are clustered with other regions of this SMOG, based on PSI-BLAST alignments of SMOG sequences to each other. These clusters are referred to as DOGs (see Methods for details). C. Formation of links between SMOGs. SMOGs 1 and 2 are linked based on the fraction W of queries from SMOG 1 that detect sequences from SMOG 2 at E-value cutoff E. In the example shown, W = 3/5 = 0.6. If all individual hits have E-values lower than E, the link will be formed for W cutoffs lower than 0.6 (e.g. W = 0.5), but not for higher cutoffs (e.g. W = 1.0).
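The linking rule in panel C can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the input layout (one best E-value per query, or `None` when nothing is detected), and the choice to form a link when W equals the cutoff exactly are all assumptions made for illustration.

```python
def link_fraction(best_evalues, e_cutoff):
    """Fraction W of SMOG 1 queries that detect a SMOG 2 sequence.

    best_evalues: one entry per query in SMOG 1 (hypothetical layout);
    each entry is the best E-value against SMOG 2, or None for no hit.
    """
    if not best_evalues:
        return 0.0
    detected = sum(1 for e in best_evalues if e is not None and e < e_cutoff)
    return detected / len(best_evalues)


def is_linked(best_evalues, e_cutoff, w_cutoff):
    """Form a link when W reaches the W cutoff.

    Assumption: W equal to the cutoff counts as a link; the caption only
    covers cutoffs strictly below or above the observed W.
    """
    return link_fraction(best_evalues, e_cutoff) >= w_cutoff


# Five queries from SMOG 1, three of which detect SMOG 2 (made-up E-values).
hits = [1e-5, 1e-4, None, 2e-6, None]
w = link_fraction(hits, e_cutoff=1e-3)
print(w)                                              # 0.6
print(is_linked(hits, e_cutoff=1e-3, w_cutoff=0.5))   # True
print(is_linked(hits, e_cutoff=1e-3, w_cutoff=1.0))   # False
```

Raising the W cutoff makes links rarer, so the same set of hits can yield a sparser or denser SMOG network depending on the chosen stringency.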
Figure 2
Total number of assigned SCOP superfamilies and folds as functions of the number of solved SMOGs. Each point represents a year, from 1995 to the present. See text for details.
Figure 3
Highly connected SMOGs are more likely to be solved. Distributions of node degree for all SMOGs, compared to the population of solved SMOGs and to the SMOGs solved by structural genomics initiatives. A. Distributions for cumulative SMOG populations solved by a certain year (shown as absolute SMOG numbers on the log-log scale). SMOGs that are linked to the structures determined by year 1995 (blue), by year 2004 (red), and to the structures determined by structural genomics initiatives (green) are compared to the whole SMOG set (black). The inset shows the fraction of solved SMOGs for different bins of node degree. B. Distributions for SMOGs solved in a certain year and for SMOGs initially solved by SGI (shown as frequencies). SMOGs that were linked for the first time to a structure solved in 1995 (blue), 2003 (red), and to a structure produced by SGI (green) are compared to the whole SMOG set (black). The number of SMOGs in each population is indicated in parentheses.
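The comparison behind the inset of panel A can be sketched as follows, assuming the SMOG network is available as a list of undirected links. The node names, bin boundaries, and function names are hypothetical and chosen only to show the calculation.

```python
def node_degrees(nodes, links):
    """Count links per node in an undirected network."""
    degree = {n: 0 for n in nodes}
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
    return degree


def solved_fraction_by_degree(degree, solved, bins):
    """Fraction of solved nodes in each half-open degree bin [lo, hi)."""
    fractions = {}
    for lo, hi in bins:
        members = [n for n, d in degree.items() if lo <= d < hi]
        if members:  # skip empty bins rather than divide by zero
            fractions[(lo, hi)] = sum(n in solved for n in members) / len(members)
    return fractions


# Toy network: A is the hub, E and F are isolated.
nodes = ["A", "B", "C", "D", "E", "F"]
links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
solved = {"A", "B"}

degree = node_degrees(nodes, links)
print(solved_fraction_by_degree(degree, solved, bins=[(0, 2), (2, 4)]))
```

In this toy case the higher-degree bin has the larger solved fraction, which is the kind of trend the figure reports for real SMOGs.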
Figure 4
Distributions of number of superfamilies in a SMOG cluster and of number of clusters with a given superfamily. Distributions of number of superfamilies in a SMOG cluster (fm) shown as log-log plots, for various linkage stringencies (A) and for different sizes of the population of solved SMOGs over the years (B). Distributions of number of clusters with a given superfamily assigned (gn), for various linkage stringencies (C) and for different sizes of the population of solved SMOGs over the years (D). To illustrate the sharpness of the distributions, power-law approximations of the continuous parts are shown as lines, with their exponents (γ) indicated in graph legends. In B and D, the lines for different years are very close, and only a single approximation is shown, for the most recent population of solved SMOGs. See text for details.
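One common way to obtain a power-law approximation on a log-log plot is a least-squares line through the log-transformed points. The caption does not state which fitting method or sign convention for γ was used, so the sketch below is an assumption: it simply returns the slope of the fitted line, and the input values are synthetic.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    Points with non-positive x or y are dropped because they have no logarithm.
    """
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive points")
    n = len(pts)
    mean_x = sum(p[0] for p in pts) / n
    mean_y = sum(p[1] for p in pts) / n
    sxx = sum((lx - mean_x) ** 2 for lx, _ in pts)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in pts)
    if sxx == 0:
        raise ValueError("all x values are identical")
    return sxy / sxx


# Synthetic counts that follow y = 100 * x**-2 exactly.
xs = [1, 2, 4, 8]
ys = [100 * x ** -2 for x in xs]
print(round(loglog_slope(xs, ys), 6))  # -2.0
```

A steeper (more negative) slope corresponds to a sharper distribution, in which large values are rare.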
Figure 5
Average numbers of superfamilies assigned to a SMOG cluster and of clusters corresponding to a superfamily. Average number of superfamilies assigned to a SMOG cluster (A) and average number of clusters corresponding to a superfamily (B), for various linkage stringencies, plotted as functions of the number of solved SMOGs. Points in the graphs represent consecutive years, starting from 1995. Linkage stringencies are indicated as the cutoffs for E-value and W (fraction of queries in SMOG 1 that provide PSI-BLAST detection of sequences from SMOG 2). Graphs for separate SMOGs are marked as "No clustering".
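The two averages plotted in panels A and B can be computed from a mapping of SMOG clusters to their assigned superfamilies. The sketch below assumes that only clusters with at least one assignment enter the averages; the source does not state this, and the cluster and superfamily labels are hypothetical.

```python
def average_counts(cluster_superfamilies):
    """Return (mean superfamilies per cluster, mean clusters per superfamily).

    cluster_superfamilies: dict mapping a cluster id to a set of superfamily ids.
    Assumption: clusters without any assignment are excluded from both averages.
    """
    assigned = {c: sfs for c, sfs in cluster_superfamilies.items() if sfs}
    if not assigned:
        raise ValueError("no cluster has a superfamily assignment")

    per_cluster = sum(len(sfs) for sfs in assigned.values()) / len(assigned)

    # Invert the mapping: superfamily -> set of clusters that contain it.
    clusters_of = {}
    for cluster, sfs in assigned.items():
        for sf in sfs:
            clusters_of.setdefault(sf, set()).add(cluster)
    per_superfamily = sum(len(cs) for cs in clusters_of.values()) / len(clusters_of)

    return per_cluster, per_superfamily


example = {"c1": {"sfA", "sfB"}, "c2": {"sfA"}, "c3": set()}
print(average_counts(example))  # (1.5, 1.5)
```

Recomputing both values for each yearly snapshot of solved SMOGs, and for each linkage stringency, would give one curve per stringency as in the figure.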


