Assessing strategies for improved superfamily recognition

Ian Sillitoe¹, Mark Dibley, James Bray, Sarah Addou, Christine Orengo

PMID: 15937274
PMCID: PMC2253352
DOI: 10.1110/ps.041056105

Comparative Study

Ian Sillitoe et al. Protein Sci. 2005 Jul;14(7):1800-10. doi: 10.1110/ps.041056105. Epub 2005 Jun 3.

Affiliation

¹ Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, UK.

Abstract

There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (approximately 13,000 nonredundant structures solved to date), several powerful sequence-based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence-based protocols employing hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single-seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D-HMM library, CATH-ISL, increased the coverage to 86%. The single-seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss-Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.

Figures

**Figure 1.**
Coverage-vs.-error plot for assessing the performance of the 1D-HMM model library compared to pairwise intermediate sequence search. The CATHfull data set was used for benchmarking. Scanning query sequences against the CATH-ISL using BLAST, solid line; 1D-HMM-S35 library built from CATH v2.4 S35 reps, dashed line; 1D-HMM-S95 model library built from CATH v2.4 S95 reps, dotted line.

**Figure 2.**
Coverage-vs.-error plots contrasting the performance measured for the 1D-HMM model library built from CATH v2.4 using two different benchmark data sets. Scanning query sequences against the CATH-ISL by BLAST (CATHfull benchmark data set), solid line; performance of the 1D-HMM-S35 models (CATHsingle benchmark data set), dashed line; performance of the 1D-HMM-S35 models (CATHfull benchmark data set), dotted line.

**Figure 3.**
Coverage-vs.-error plot comparing the performance of the combined 1D-HMM-S35 and 3D-HMM library with the 1D-HMM-S35 library. Performance of the 1D-HMM-S35 model library, solid line; performance of the 1D-HMM-S95 model library, dashed line; performance of the combined 1D-HMM-S35 and 3D-HMM model libraries, dotted line. All HMM model libraries were built using CATH v1.7. Performance was assessed using the CATHremote data set.

**Figure 4.**
Exploring the cumulative effectiveness of scanning against a combined library of 1D-HMM-S35 and 3D-HMMs. All HMMs were built using CATH v2.4. Performance of all model libraries was assessed using the CATHremote data set. Performance of the 3D-HMM library, dashed line; that of the 1D-HMM-S35 library, solid line; that of the combined 1D-HMM-S35 and 3D-HMM libraries, dotted line.

**Figure 5.**
Contrasting the accuracy of sequence alignments generated by aligning query sequences against 1D-HMM models or 3D-HMMs. Plot shows the average percentage alignment quality for 1D- and 3D-HMMs.

**Figure 6.**
Increase in performance obtained by scanning an updated CATH 1D-HMM-S35 model library and an expanded CATH 1D-HMM-ISL model library. Performance of CATH 1D-HMM-S35 model library built from CATH v2.4, solid line; that of CATH 1D-HMM-S35 model library built from CATH v2.5.1, dashed line; that of expanded 1D-HMM-ISL model library built from CATH v2.4, dotted line.

**Figure 7.**
The proportion of sequences from nine selected complete genomes (three from each kingdom) that can be assigned to CATH domain families by scanning the sequences against the 1D-HMM-S35 library is shown in gray. Additional matches recognized by scanning against the 3D-HMM library are shown in black; additional matches recognized by scanning against the 1D-HMM-ISL library, white. All HMM libraries were built using CATH v2.4.

**Figure 8.**
Comparing the coverage-vs.-contact score plots for homologs (dotted lines), fold-relatives (dashed lines) and nonrelatives (solid lines) using multiple SSG templates from the cytokine superfamily (CATH 1.20.160.30), generated from four SSAP cluster cutoffs (70, 75, 80, 85).
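
The coverage-vs.-error plots in Figures 1 to 4 and 6 can be illustrated with a minimal sketch of how such a curve might be computed from a ranked list of benchmark hits. This record does not give the paper's exact definitions, so the ones used here are assumptions: coverage is taken as the fraction of known homologs recovered, and error as the fraction of accepted hits that are false. The function names `coverage_vs_error` and `coverage_at_error` and the example scores are hypothetical and are not taken from the authors' protocol.

```python
from typing import List, Tuple


def coverage_vs_error(
    hits: List[Tuple[float, bool]], total_true: int
) -> List[Tuple[float, float]]:
    """Return (error, coverage) points for a ranked list of hits.

    hits       -- (e_value, is_true_homolog) pairs; lower E-value is better.
    total_true -- number of true homologs in the benchmark (assumed known).
    Assumed definitions: error = FP / (TP + FP), coverage = TP / total_true.
    """
    if total_true <= 0:
        raise ValueError("total_true must be positive")
    points = []
    true_pos = 0
    false_pos = 0
    # Walk the hits from most to least significant, relaxing the cutoff
    # by one hit at each step.
    for _e_value, is_true in sorted(hits, key=lambda hit: hit[0]):
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        error = false_pos / (true_pos + false_pos)
        coverage = true_pos / total_true
        points.append((error, coverage))
    return points


def coverage_at_error(points: List[Tuple[float, float]], max_error: float) -> float:
    """Highest coverage among points whose error does not exceed max_error."""
    eligible = [coverage for error, coverage in points if error <= max_error]
    return max(eligible) if eligible else 0.0


# Hypothetical benchmark: four true homologs exist, three of them are hit.
example_hits = [(1e-10, True), (1e-5, True), (1e-3, False), (0.1, True)]
curve = coverage_vs_error(example_hits, total_true=4)
```

Comparing two libraries on the same benchmark would then amount to comparing `coverage_at_error` for each curve at a fixed error level, which is one way to read a single summary figure off a plot of this kind.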
