Comparative Study
Nucleic Acids Res. 2010 Jan;38(3):720-37.
doi: 10.1093/nar/gkp1049. Epub 2009 Nov 18.

GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains

David A Lee et al. Nucleic Acids Res. 2010 Jan.

Abstract

GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile-profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.
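The agglomerative clustering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the GeMMA implementation: `toy_distance` is a hypothetical stand-in for the profile-profile comparison score, the `cutoff` parameter merely plays the role of a similarity cut-off, and no multiple sequence alignment is performed. The sketch assumes all sequences have the same length.

```python
from itertools import combinations


def toy_distance(cluster_a, cluster_b):
    """Hypothetical stand-in for a profile-profile comparison score.

    Returns the mean fraction of mismatched positions over all
    cross-cluster pairs of equal-length sequences (lower = more similar).
    """
    total = 0.0
    pairs = 0
    for a in cluster_a:
        for b in cluster_b:
            total += sum(x != y for x, y in zip(a, b)) / len(a)
            pairs += 1
    return total / pairs


def agglomerate(sequences, cutoff, distance=toy_distance):
    """Greedy agglomerative clustering with a stopping cut-off.

    Starts from singleton clusters and repeatedly merges the closest
    pair until the closest pair is farther apart than `cutoff`.
    """
    clusters = [[s] for s in sequences]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        if distance(clusters[i], clusters[j]) > cutoff:
            break  # nothing left that is similar enough to merge
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    print(agglomerate(["AAAA", "AAAT", "GGGG", "GGGC"], cutoff=0.3))
```

The cut-off determines where merging stops and therefore how fine-grained the resulting groups are. In this toy example a cut-off of 0.3 keeps the two groups of near-identical sequences apart.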

Figures

Figure 1.
A flow chart outlining the basic GeMMA method. This low-throughput approach is referred to as ‘Full Scale’ or ‘FS-GeMMA’.
Figure 2.
A flow chart outlining the high-throughput or HT-GeMMA method. Steps within the grey box are executed on the nodes of a compute cluster. Pre-clustering is used to reduce the number of clusters in the initial HT-GeMMA iteration.
Figure 3.
GeMMA purity, edit distance, VI distance and performance scores at a range of E-value cut-offs for (a) whole protein sequences in the SFLD benchmark, and (b) predicted conserved CATH domain sequences in the SFLD-Gene3D benchmark.
Figure 4.
Purity, edit distance and VI distance for GeMMA with generalized (leave-one-out approach) and superfamily-specific E-value cut-offs and for SCI-PHY in the SFLD benchmark. Values for edit distance and VI distance for unclustered sequences are the initial values that are used in the calculation of the performance score. For unclustered sequences purity always has a value of zero.
Figure 5.
Purity, edit distance and VI distance for FS-GeMMA and HT-GeMMA as the SFLD benchmark is progressively extended from SFLD whole protein to SFLD-Gene3D domain to Gene3D domain sequences. SFLD functional annotations are used throughout with no extra annotations being used in the Gene3D benchmark. Note that the high-throughput method HT-GeMMA is necessary to analyse the (large) Gene3D benchmark sets.
Figure 6.
Distribution of performance scores for GeMMA and SCI-PHY in the Pfam benchmark.
Figure 7.
Average difference in performance scores between GeMMA and SCI-PHY in the Pfam benchmark (GeMMA score minus SCI-PHY score) versus (a) family size and (b) family diversity. Family diversity is calculated as the number of 30% sequence identity multi-linkage clusters in the family.
Figure 8.
The number of annotation types found in Pfam families and the resultant SCI-PHY and GeMMA subfamilies in the Pfam benchmark. Annotation types were counted as the number of four-level EC numbers. Only families and subfamilies containing up to eight different types of annotation are shown.
Figure 9.
Inheritance of functional annotations within the Pfam benchmark families. This shows the post-transfer annotation coverage achievable using Gene3D S60 clusters (multi-linkage sequence clusters at 60% sequence identity) and SCI-PHY and GeMMA subfamilies, respectively.
Figure 10.
Comparative modelling coverage of 11 superfamilies of predicted CATH domains chosen for structural genomics target selection by the Midwest Center for Structural Genomics. Coverage achieved within GeMMA subfamilies is compared to that within Gene3D S30 clusters (multi-linkage clusters at 30% sequence identity). The numbers above the columns are the percentage of good models as determined using the GA341 score incorporated in Modeller.
Figure 11.
Illustration of the strategy employed to speed up HT-GeMMA. This uses a worked example described in the ‘Appendix’ section. Steps in the HT-GeMMA method are listed on the left and may be identified in the flow chart giving an overview of the method in Figure 2.

