A fast and automated solution for accurately resolving protein domain architectures
- PMID: 20118117
- DOI: 10.1093/bioinformatics/btq034
Abstract
Motivation: Accurate prediction of the domain content and arrangement in multi-domain proteins (which make up >65% of the large-scale protein databases) provides a valuable tool for function prediction, comparative genomics and studies of molecular evolution. However, scanning a multi-domain protein against a database of domain sequence profiles can often produce conflicting and overlapping matches. We have developed a novel method that employs heaviest weighted clique-finding (HCF), which we show significantly outperforms standard published approaches based on successively assigning the best non-overlapping match (Best Match Cascade, BMC).
Results: We created a benchmark data set of structural domain assignments in the CATH database and a corresponding set of Hidden Markov Model-based domain predictions. Using these, we demonstrate that by considering all possible combinations of matches using the HCF approach, we achieve much higher prediction accuracy than the standard BMC method. We also show that it is essential to allow overlapping domain matches to a query in order to identify correct domain assignments. Furthermore, we introduce a straightforward and effective protocol for resolving any overlapping assignments and producing a single set of non-overlapping predicted domains.
Availability and implementation: The new approach will be used to determine multi-domain architectures (MDAs) for UniProt and Ensembl, and made available via the Gene3D website: http://gene3d.biochem.ucl.ac.uk/Gene3D/. The software has been implemented in C++ and compiled for Linux; source code and binaries can be found at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/
Contact: yeats@biochem.ucl.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
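Illustrative sketch: the difference between the two strategies named in the abstract can be shown on a toy example. The Python below is not the authors' C++ implementation. It assumes each match is a single contiguous segment with a positive score, that two matches are compatible when they share at most `max_overlap` residues, and that the weight of a combination is the sum of its scores. The names `Hit`, `best_match_cascade` and `heaviest_clique` are hypothetical, and the clique search is a plain exhaustive enumeration chosen for clarity rather than speed.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One domain match on the query (hypothetical record layout)."""
    name: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # assumed positive; higher is better


def overlap(a: Hit, b: Hit) -> int:
    """Number of residues shared by two matches."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def compatible(a: Hit, b: Hit, max_overlap: int = 0) -> bool:
    """Assumption: matches are compatible if they share <= max_overlap residues."""
    return overlap(a, b) <= max_overlap


def best_match_cascade(hits: List[Hit], max_overlap: int = 0) -> List[Hit]:
    """Greedy BMC: take matches by descending score, skipping any conflict."""
    chosen: List[Hit] = []
    for h in sorted(hits, key=lambda x: x.score, reverse=True):
        if all(compatible(h, c, max_overlap) for c in chosen):
            chosen.append(h)
    return sorted(chosen, key=lambda x: x.start)


def heaviest_clique(hits: List[Hit], max_overlap: int = 0) -> List[Hit]:
    """Enumerate every set of mutually compatible matches (the cliques of
    the compatibility graph) and keep the one with the largest summed score."""
    best_weight = 0.0
    best_set: List[Hit] = []

    def extend(clique: List[Hit], weight: float, candidates: List[Hit]) -> None:
        nonlocal best_weight, best_set
        if weight > best_weight:
            best_weight, best_set = weight, list(clique)
        for i, h in enumerate(candidates):
            # Only matches compatible with h can still join the clique.
            rest = [c for c in candidates[i + 1:] if compatible(h, c, max_overlap)]
            extend(clique + [h], weight + h.score, rest)

    extend([], 0.0, list(hits))
    return sorted(best_set, key=lambda x: x.start)


if __name__ == "__main__":
    # Toy data (invented): one long match conflicts with two shorter ones.
    hits = [Hit("A", 1, 100, 50.0), Hit("B", 1, 60, 40.0), Hit("C", 61, 120, 40.0)]
    print([h.name for h in best_match_cascade(hits)])  # ['A']
    print([h.name for h in heaviest_clique(hits)])     # ['B', 'C']
```

In this invented case the greedy cascade commits to the single highest-scoring match and rejects both shorter ones, whereas the exhaustive search finds that the two shorter, mutually compatible matches together carry more weight. Raising `max_overlap` above zero is one simple way to tolerate overlapping matches during selection; any remaining overlaps would still need to be resolved afterwards.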