On the quality of tree-based protein classification

Comparative Study

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

Betty Lazareva-Ulitsky¹, Karen Diemer, Paul D Thomas

Affiliation

¹ Computational Biology Department, Applied Biosystems, Foster City, CA 94404, USA. betty.lazareva@fc.celera.com

PMID: 15647305
DOI: 10.1093/bioinformatics/bti244

Abstract

Motivation: Phylogenetic analysis of protein sequences is widely used in protein function classification and delineation of subfamilies within larger families. In addition, the recent increase in the number of protein sequence entries with controlled vocabulary terms describing function (e.g. the Gene Ontology) suggests that it may be possible to overlay these terms onto phylogenetic trees to automatically locate functional divergence events in protein family evolution. Phylogenetic analysis of large datasets requires fast algorithms; and even 'fast', approximate distance matrix-based phylogenetic algorithms are slow on large datasets since they involve calculating maximum likelihood estimates of pairwise evolutionary distances. There have been many attempts to classify protein sequences on the family and subfamily level without reconstructing phylogenetic trees, but using hierarchical clustering with simpler distance measures, which also produce trees or dendrograms. How can these trees be compared in their ability to accurately classify protein sequences?

Results: Given a 'reference classification' or 'group membership labels' for a set of related protein sequences as well as a tree describing their relationships (e.g. a phylogenetic tree), we propose a method for dividing the tree into monophyletic or paraphyletic groups so as to optimize the correspondence between the reference groups and the tree-derived groups. We call the achieved optimal correspondence the 'accuracy of a tree-based classification (TBC)', which measures the ability of a tree to separate proteins of similar function into monophyletic or paraphyletic groups. We apply this measure to compare classical NJ and UPGMA phylogenetic trees with the trees obtained from hierarchical clustering using different protein similarity measures. Our preliminary analysis on a set of expert-curated protein families and alignments suggests that there is no uniformly superior algorithm, and that simple protein similarity measures combined with hierarchical clustering produce trees with reasonable and often the most accurate TBC. We used our measure to help us to design TIPS, a tree-building algorithm, based on agglomerative clustering with a similarity measure derived from profile scoring. TIPS is comparable with phylogenetic algorithms in terms of classification accuracy and is much faster on large protein families. Due to its time scalability and acceptable accuracy, TIPS is being used in the large-scale PANTHER protein classification project. The trees produced by different algorithms for different protein families can be viewed at http://panther.appliedbiosystems.com/pub/tree_quality/trees.jsp. For every tree and every level of classification granularity we provide the optimal TBC along with the reference classification.
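The idea of cutting a tree into groups so as to best match a reference classification can be illustrated with a small sketch. The abstract does not specify the correspondence score or the search procedure, so everything below is an assumption made for illustration: the tree is a binary tree of nested tuples, only monophyletic groups (whole clades) are considered, the score is the fraction of sequences carrying the majority reference label of their group, and the optimum for a fixed number of groups `k` is found by dynamic programming. The names `leaves`, `best_scores` and `tbc_accuracy` are hypothetical and are not taken from the authors' script.

```python
from collections import Counter


def leaves(node):
    """Return the leaf names under a node (leaf = str, internal node = 2-tuple)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def best_scores(node, labels, k_max):
    """Map k -> best number of 'correct' leaves when this subtree is cut
    into exactly k clades (k <= k_max).

    Assumption: a leaf is 'correct' if it carries the majority reference
    label of the clade it ends up in. This is an illustrative score only.
    """
    members = leaves(node)
    # k = 1: keep the whole subtree as a single group.
    scores = {1: max(Counter(labels[m] for m in members).values())}
    if isinstance(node, str):
        return scores
    left = best_scores(node[0], labels, k_max)
    right = best_scores(node[1], labels, k_max)
    # k > 1: split the k groups between the two child subtrees.
    for i, left_score in left.items():
        for j, right_score in right.items():
            k = i + j
            if k <= k_max:
                scores[k] = max(scores.get(k, 0), left_score + right_score)
    return scores


def tbc_accuracy(tree, labels, k):
    """Best achievable fraction of correct leaves using exactly k clades."""
    scores = best_scores(tree, labels, k)
    if k not in scores:
        raise ValueError("tree cannot be cut into %d groups" % k)
    return scores[k] / len(leaves(tree))


# Toy example (invented data): two reference groups, A and B.
tree = ((("a1", "a2"), "b1"), ("b2", ("b3", "a3")))
labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}

for k in (2, 3, 6):
    print(k, tbc_accuracy(tree, labels, k))
```

On this toy tree, two groups can place at most 4 of 6 sequences with their majority label, and three groups at most 5 of 6. Reporting the optimum per value of `k` loosely mirrors the abstract's "every level of classification granularity", although the published measure may differ in both the score and the handling of paraphyletic groups.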

Availability: The script that evaluates the accuracy of TBC is available at http://panther.appliedbiosystems.com/pub/tree_quality/index.jsp
