PLoS Comput Biol. 2007 Aug;3(8):e160.

doi: 10.1371/journal.pcbi.0030160.

Automated protein subfamily identification and classification

Duncan P Brown¹, Nandini Krishnamurthy, Kimmen Sjölander

PMID: 17708678
PMCID: PMC1950344
DOI: 10.1371/journal.pcbi.0030160

Duncan P Brown et al. PLoS Comput Biol. 2007 Aug.

Affiliation

¹ Department of Bioengineering, University of California, Berkeley, California, United States of America.

Abstract

Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. 
Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

**Figure 1. Comparison of Family and Subfamily HMM Performance on Remote Homolog Detection**
Blue: family HMM results. Red: subfamily HMM results. (A) Coverage (x-axis) is plotted against e-value (y-axis). Coverage (or recall) is the fraction of homologous pairs (i.e., from the same SCOP superfamily) that receive a score of equal or greater significance. The e-value curves converge at a coverage of 0.79, the same coverage at which false positives first arise. This corresponds to an e-value of approximately 0.01. (B) ROC curve for family and subfamily HMMs, weighted by superfamily size. Subfamily HMMs receive an AUC of 0.947; family HMMs receive 0.943. (C) ROC curve for unweighted data. Subfamily HMMs and family HMMs have AUCs of 0.758 and 0.740, respectively. Together, these data show that while subfamily HMMs do not detect more homologs at a given false positive rate, they do find many more homologs at a given significance cutoff.
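The two quantities in this caption, coverage at a significance cutoff and area under the ROC curve, can be illustrated with a minimal Python sketch. The function names and the toy e-values are hypothetical and not taken from the paper; the AUC here is the plain unweighted rank statistic, so it corresponds to panel (C) rather than to the superfamily-weighted curve in panel (B).

```python
def coverage_at_cutoff(evalues, is_homolog, cutoff):
    """Fraction of homologous pairs scoring at or below the e-value cutoff."""
    total = sum(1 for h in is_homolog if h)
    found = sum(1 for e, h in zip(evalues, is_homolog) if h and e <= cutoff)
    return found / total if total else 0.0


def roc_auc(evalues, is_homolog):
    """Unweighted AUC: chance that a random homologous pair has a lower
    (more significant) e-value than a random non-homologous pair.
    Ties count as half a win."""
    pos = [e for e, h in zip(evalues, is_homolog) if h]
    neg = [e for e, h in zip(evalues, is_homolog) if not h]
    if not pos or not neg:
        raise ValueError("need at least one homolog and one non-homolog")
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three homologous pairs, two non-homologous pairs.
evalues = [1e-10, 1e-3, 0.5, 1e-5, 2.0]
is_homolog = [True, True, True, False, False]

cov = coverage_at_cutoff(evalues, is_homolog, cutoff=0.01)
auc = roc_auc(evalues, is_homolog)
```

The sketch shows why the two measures can disagree: tightening or loosening the cutoff changes coverage, while the AUC depends only on how homologs rank relative to non-homologs.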

**Figure 2. Logistic Regression for Novel Subtype Identification**
The logistic regression fit for an example subfamily is shown. True subfamily members (X) and other family members (+) are shown, together with the fitted curve. When the two classes cannot be completely separated, as in this case, we see a smooth transition in the probability of subfamily membership.
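A fit of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the input is assumed to be a single HMM-derived score per sequence (the abstract does not specify the feature), the toy scores are invented, and plain batch gradient descent stands in for whatever fitting procedure the paper used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(scores, labels, lr=0.1, epochs=5000):
    """Fit P(member | score) = sigmoid(w * score + b) by gradient descent
    on the mean log-loss. Returns the slope w and intercept b."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(scores, labels):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical scores: subfamily members (label 1) and other family
# members (label 0). The classes overlap, so the fitted curve is smooth.
scores = [5.0, 4.0, 3.5, 2.0, 3.0, 1.5, 1.0, 0.0, -1.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

w, b = fit_logistic(scores, labels)
p_high = sigmoid(w * 5.0 + b)   # probability for a clearly high score
p_low = sigmoid(w * -1.0 + b)   # probability for a clearly low score
```

Because one member scores below one non-member in the toy data, no threshold separates the classes perfectly, and the fitted probability changes gradually with the score, as the caption describes.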

**Figure 3. Novel Subtype Identification and Classification Accuracy as a Function of the Threshold on Subfamily Membership Probability**
(A) The red line shows the fraction of novel subfamilies correctly detected; the blue line shows the fraction of subfamily members correctly classified in leave-one-out experiments. Novelty detection is quite robust to the threshold setting, achieving an 80% success rate even at the lowest threshold (0.01). (B) The fraction of sequences classified to an incorrect subfamily during leave-one-out experiments. While low to begin with, the false-positive error drops dramatically with the imposition of even a small threshold. A threshold of 0.10 probability of subfamily membership seems to be optimal; the false-positive classification rate is just 0.3%, while overall subfamily classification and novel subtype detection accuracy are both 88%. The x-axis shows the logistic regression probability threshold for subfamily membership assignment.
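The role of the threshold can be sketched as a simple decision rule. The function name, the subfamily labels, and the choice to take the subfamily with the highest membership probability are assumptions made for illustration; only the 0.10 default comes from the caption.

```python
def assign_subfamily(probabilities, threshold=0.10):
    """Return the subfamily with the highest membership probability,
    or None (candidate novel subtype) if that probability falls below
    the threshold.

    `probabilities` maps a subfamily name to the logistic regression
    probability that the query sequence belongs to it.
    """
    if not probabilities:
        return None
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= threshold else None


# Hypothetical subfamily names and probabilities.
classified = assign_subfamily({"subfam_A": 0.62, "subfam_B": 0.05})
novel = assign_subfamily({"subfam_A": 0.04, "subfam_B": 0.02})
```

Raising the threshold sends more sequences to the novel category, which lowers false-positive classifications at the expense of classification coverage; that is the trade-off plotted in the two panels.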

**Figure 4. The Encoding Cost as a Function of the SCI-PHY Iteration for the Secretin Family**
We subtract the encoding cost of the null hypothesis (that all sequences belong in a single subfamily) from the cost of encoding the subclass alignments at each iteration of the algorithm (y-axis: Cost_iteration − Cost_null). At program commencement, the number of subclasses equals the number of sequences and the encoding cost is high. The encoding cost curve decreases steadily to a minimum when similar sequences are joined and then increases as subtrees with different amino acid preferences are joined. The point in the agglomeration for which the encoding cost is minimal is used to determine a cut of the tree into subtrees, defining the SCI-PHY subfamily decomposition. If the minimum occurs when the encoding cost is zero, then all sequences are placed in a single class (i.e., no subfamilies are predicted). Negative “Encoding Cost” values indicate savings relative to the null hypothesis, and provide support for a division of the sequences into two or more subfamilies.
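The cut-selection rule described in this caption can be sketched as follows. The function name and the cost values are hypothetical, and the sketch takes the per-iteration encoding costs as given; it does not compute them.

```python
def choose_cut(costs, null_cost):
    """Pick the agglomeration step with the lowest relative encoding cost.

    `costs[i]` is the encoding cost after iteration i, and `null_cost` is
    the cost when all sequences form a single subfamily. Returns
    (iteration, relative_cost), or None when no iteration beats the null
    hypothesis, meaning no subfamilies are predicted.
    """
    relative = [c - null_cost for c in costs]
    best = min(range(len(relative)), key=relative.__getitem__)
    if relative[best] >= 0:
        return None
    return best, relative[best]


# Hypothetical cost curves. In each, the last iteration merges everything
# into one class, so its relative cost is zero.
with_subfamilies = choose_cut([120.0, 80.0, 40.0, 55.0, 100.0], null_cost=100.0)
single_class = choose_cut([150.0, 120.0, 100.0], null_cost=100.0)
```

In the first curve the minimum is negative, so the tree is cut at that iteration; in the second the curve never drops below zero, so all sequences stay in one class.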

**Figure 5. Discordance between Subfamily Membership and Top-Scoring SHMM Can Be Indicative of Misalignments**
Sequence Q8S220, a singleton subfamily, was classified to its sibling subfamily, N2581. We show a comparison of the sequence as aligned in the original MSA (Q8S220-orig) and after alignment to SHMM N2581 (Q8S220-N2581). The consensus sequence for SHMM N2581 is also shown (N2581-consensus). After realignment, much of the sequence has been shifted, and several motifs now clearly match the N2581 consensus sequence (red boxes).

Grants and funding

R01 HG002769/HG/NHGRI NIH HHS/United States
