PLoS Comput Biol. 2007 Nov;3(11):e232.
doi: 10.1371/journal.pcbi.0030232.

CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures

Oliver C Redfern et al. PLoS Comput Biol. 2007 Nov.

Abstract

We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure-based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified.

We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues.

Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.
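
The iterative protocol described in the abstract (fast scan for candidate folds, residue-level alignment, scoring, excision of the verified domain, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Match`, `assign_domains`, `fast_scan`, `align_and_score` and `cutoff` are hypothetical, the graph-theory scan and the double-dynamic programming alignment are replaced by caller-supplied stand-in functions, and a single numeric score with a fixed cutoff stands in for the support vector machine.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Match:
    fold_id: str               # identifier of the matched library fold (hypothetical)
    residues: FrozenSet[int]   # query residues covered by the match
    score: float               # assumption: higher means a more confident match


def assign_domains(
    chain: Iterable[int],
    fast_scan: Callable[[Set[int]], List[str]],
    align_and_score: Callable[[str, Set[int]], Optional[Match]],
    cutoff: float,
) -> List[Match]:
    """Assign domains one at a time, excising each before searching again."""
    remaining = set(chain)
    assigned: List[Match] = []
    while remaining:
        # Step 1: cheap scan for candidate folds in the unassigned residues.
        matches = []
        for fold_id in fast_scan(remaining):
            # Step 2: slower residue-level alignment and scoring per candidate.
            match = align_and_score(fold_id, remaining)
            if match is not None:
                matches.append(match)
        # Step 3: keep the best-scoring match only if it passes the cutoff.
        best = max(matches, key=lambda m: m.score, default=None)
        if best is None or best.score < cutoff:
            break
        covered = best.residues & remaining
        if not covered:
            break
        # Step 4: record the domain, excise its residues, and iterate.
        assigned.append(Match(best.fold_id, frozenset(covered), best.score))
        remaining -= covered
    return assigned


# Toy stand-ins (invented data, for illustration only).
TOY_LIBRARY = {
    "fold_A": (frozenset(range(1, 101)), 0.9),
    "fold_B": (frozenset(range(101, 181)), 0.8),
    "fold_C": (frozenset(range(50, 151)), 0.3),
}


def toy_scan(remaining: Set[int]) -> List[str]:
    # A fold is a candidate only if its whole region is still unassigned.
    return [f for f, (res, _) in TOY_LIBRARY.items() if res <= remaining]


def toy_align(fold_id: str, remaining: Set[int]) -> Optional[Match]:
    res, score = TOY_LIBRARY[fold_id]
    return Match(fold_id, res, score)


result = assign_domains(range(1, 201), toy_scan, toy_align, cutoff=0.5)
print([(m.fold_id, min(m.residues), max(m.residues)) for m in result])
# prints [('fold_A', 1, 100), ('fold_B', 101, 180)]
```

In the toy run, residues 181 to 200 stay unassigned because no library fold matches them, which mirrors the case where part of a chain has no recognisable fold. Residues are stored as a set rather than a single range so that a discontiguous domain can be excised without special handling.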

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Percentage of Multidomain Chains with a Given Number of Component Domains
Figure 2. Example of a Multidomain Protein (PDB: 1cg2) Chain Containing a Discontiguous Domain
Domain two (blue) is inserted between two segments of domain one (red).
Figure 3. ROC (True Positive Rate Versus False Positive Rate) Curve Plotted for Different Structural Comparison Methods Based on the SAS, Where a Positive Match Represents a True CATH–SCOP Fold Match
TPR, true positive rate; FPR, false positive rate.
Figure 4. Graph of the Percentage of Correct Folds Matched Against the Ranked Native Score for the CATH–SCOP Dataset
Figure 5. Comparison of Alignment Quality of Domains Adopting the Same CATH Fold Using Two Geometric Scoring Schemes
(A) Percentage of correct fold pairs for a given SAS threshold. (B) Percentage of correct fold pairs for a given SiMax threshold.
Figure 6. Average Number of Aligned Residues per SAS
Figure 7. Graph Showing How the Alignments of Each Method Compared with Manually Validated BAliBASE Alignments
The highest curve (equivalently, the curve with the greatest area underneath) represents the method that agrees most closely with the manually curated BAliBASE alignments.
Figure 8. Comparison of GT and DDP Scores with SVM Score for Assigning Domains to Multidomain Chains
Figure 9. Percentage of Domains Assigned (Blue) and Percentage of Domain Boundaries within Ten Residues of Verified Boundaries (Pink) at a Range of SVM Score Cutoffs
Figure 10. Domain Coverage Versus Quality of Domain Boundaries
Figure 11. Percentage of Domains with Correct Domain Boundaries (within Ten Residues) When Varying the Number of Representatives Taken from Each Superfamily in the Targeted Fold Groups
Figure 12. Graph of the Percentage of Correct (within Ten Residues) Domain Boundaries against the Sequence Identity between the Assigned Region and the Matched Domain
Figure 13. Superposition of the Catalase HPII (PDB 1iph; First Domain of Chain A) as It Is Classified in the CATH Database and Its Match to Bovine Beta-Lactoglobulin (PDB 1beb; Chain A; Coloured Red), the Closest Relative Identified by CATHEDRAL
Figure 14. Flowchart of CATHEDRAL Algorithm for Assigning Folds and Domain Boundaries to Protein Chains
