BMC Genomics. 2007 Mar 20;8:78.
doi: 10.1186/1471-2164-8-78.

Predicting protein function by machine learning on amino acid sequences--a critical evaluation

Ali Al-Shahib et al. BMC Genomics. 2007.

Abstract

Background: Predicting the function of newly discovered proteins by simply inspecting their amino acid sequence is one of the major challenges of post-genomic computational biology, especially when done without recourse to experimentation or homology information. Machine learning classifiers are able to discriminate between proteins belonging to different functional classes. Until now, however, it has been unclear if this ability would be transferable to proteins of unknown function, which may show distinct biases compared to experimentally more tractable proteins.

Results: Here we show that proteins with known and unknown function do indeed differ significantly. We then show that proteins from different bacterial species differ to an even larger, and very surprising, extent, but that functional classifiers nonetheless generalize successfully across species boundaries. We also show that in the case of highly specialized proteomes, classifiers from a different, but more conventional, species may in fact outperform the endogenous species-specific classifier.

Conclusion: We conclude that there is a very good prospect of successfully predicting the function of as yet uncharacterized proteins using machine learning classifiers trained on proteins of known function.

Figures

Figure 1
Function prediction classifier performance. Performance for seven human pathogens and thirteen functional classes is shown as % AUC; values larger than 50 indicate a better-than-random classifier. In four of the seven species, prediction results are significantly better than random across all classes. Only on the three small genomes (T. pallidum, U. urealyticum and M. genitalium) is performance much weaker. Specific classes (co-factor metabolism, cellular processes, DNA metabolism) show particularly good performance. The functional class that is most easily distinguished from the others contains 'transport and binding proteins'; this good performance is probably due to the characteristic hydrophobic motifs in the transmembrane and binding regions of these proteins. Colors indicate the AUC values, ranging from 0 (dark blue) to 100% (dark red). The same color scale is used for all figures in this paper.
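
The % AUC scale used in this caption can be illustrated with a minimal sketch. It assumes a classifier that assigns each protein a numeric score, and it computes the AUC as the fraction of (positive, negative) pairs in which the positive protein scores higher, with ties counted as half. The function name `auc_percent` and the example scores are illustrative only and do not come from the paper.

```python
def auc_percent(scores_pos, scores_neg):
    """Return the AUC in percent.

    The AUC equals the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative
    example (ties count as half a win).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return 100.0 * wins / (len(scores_pos) * len(scores_neg))


# Made-up scores for proteins inside and outside one functional class.
in_class = [0.9, 0.6, 0.4]
out_of_class = [0.5, 0.3]
print(round(auc_percent(in_class, out_of_class), 1))  # 83.3
```

On this scale, 50 corresponds to random guessing, 100 to a perfect ranking, and values below 50 to a ranking that is worse than random.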
Figure 2
Discrimination between proteins of known and unknown function. The results of five random splits of test and training set are shown, and for comparison the lower two rows show the median performance of the function prediction classifier and the 'transport and binding' classifier for each species. For each species except M. genitalium, the average performance on the known-vs.-unknown task is better than on the function prediction task. In the case of T. pallidum, known and unknown proteins can be distinguished with almost perfect performance.
Figure 3
Species-species discrimination. The AUC for classifiers trained to distinguish between proteins from each species pair is shown (median of five replicates). With the exception of H. ducreyi vs. S. agalactiae and H. ducreyi vs. M. genitalium, all comparisons yield excellent classification performance. This means that proteins from different source organisms can be distinguished with surprising accuracy based solely on amino acid sequence features. The unrooted tree to the left shows the phylogenetic relationships of the seven bacterial species, based on 16S rRNA analysis.
Figure 4
Classifier transfer across species boundaries. The median AUC for 13 functional classes is shown. The 'training' species is shown in the rows, the 'test' species in the columns. It can be seen that classifiers perform almost as well on a 'foreign' species as they do on the species they were originally trained on (diagonal). Performance is worst for the classifiers from T. pallidum, U. urealyticum and M. genitalium, and in these three cases the classifiers from the other four species give significantly better performance than those from the original species (sign test, p < 0.001).
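
The sign test mentioned in this caption can be sketched as follows. The caption does not state how the comparisons were paired, so this sketch assumes one paired AUC difference per comparison (foreign classifier minus endogenous classifier) and uses a two-sided exact binomial test. The function name `sign_test_p` and the input values are hypothetical.

```python
from math import comb


def sign_test_p(differences):
    """Two-sided exact sign test on paired differences.

    Zero differences are dropped. Under the null hypothesis, positive
    and negative signs are equally likely, so the number of positive
    signs follows a Binomial(n, 0.5) distribution.
    """
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(1 for d in nonzero if d > 0)
    tail = min(positives, n - positives)
    one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * one_sided)


# Hypothetical case: the foreign classifier wins in all 13 comparisons.
print(sign_test_p([1.0] * 13))  # 0.000244140625
```

Under these assumptions, thirteen wins out of thirteen comparisons already give a p-value below 0.001.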
Figure 5
Feature concordance between species. The feature lists selected for function prediction in each species using the Wilcoxon filter as described were analyzed for concordance. The feature selection procedure generates sorted lists of features. The agreement between these lists can be calculated using a rank correlation method, for example Kendall's Coefficient of Concordance. A good correlation (reflected in a small p-value) indicates that the same features are high in the list of selected features. The p-values of Kendall's Coefficient of Concordance for each pairwise comparison are shown. The feature lists for the first five species show high correlation, while those of the two mycoplasmal species differ significantly. This may explain the difference in performance on these two species. Note that the matrix is not symmetrical, because different features will be removed by the redundancy filtering step depending on which species is used as a reference.
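
As a rough illustration of the concordance measure named in this caption, the sketch below computes Kendall's W for sorted feature lists. It assumes that all lists contain the same features and have no ties, which is a simplification of the redundancy-filtered lists described above. The helper names and the feature names are hypothetical.

```python
def ranks_from_sorted_list(sorted_features, all_features):
    """Convert a sorted feature list into ranks (1 = top of the list)."""
    position = {f: i + 1 for i, f in enumerate(sorted_features)}
    return [position[f] for f in all_features]


def kendalls_w(rankings):
    """Kendall's Coefficient of Concordance for m rankings of n items.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
    deviations of the per-item rank totals from their mean.
    Assumes no tied ranks.
    """
    m = len(rankings)
    n = len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical feature names, ranked for two species.
features = ["hydrophobicity", "length", "charge", "pI"]
species_a = ["hydrophobicity", "length", "charge", "pI"]
species_b = ["hydrophobicity", "charge", "length", "pI"]
w = kendalls_w([ranks_from_sorted_list(species_a, features),
                ranks_from_sorted_list(species_b, features)])
print(w)  # 0.9
```

W ranges from 0 (no agreement) to 1 (identical rankings). A p-value is commonly approximated by comparing m(n - 1)W with a chi-square distribution on n - 1 degrees of freedom, for example via `scipy.stats.chi2.sf`; that approximation is poor for lists as short as the four-feature toy example here.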
Figure 6
Summary of predictive performance and expected performance on proteins of unknown function. The first 13 columns show the AUCs for each functional class in each of the 7 × 7 species-species transfers. The order of functional classes and species is the same as in Figure 1. The 14th column shows the corresponding species-species discrimination AUCs (from Figure 3) and the 15th column the distinction between known and unknown proteins for the species from which the classifier is derived (from Figure 2). To predict the expected performance on proteins of unknown function, find the species-species contrast that corresponds most closely to the known-unknown contrast of interest. The corresponding function prediction AUCs should give a reasonable estimate for the expected performance. It can be seen here that functional classes that are easily distinguished within a species will also successfully transfer between species, and such predictors (e.g. 'transport and binding', column 13) will also yield reliable results on the proteins of unknown function.
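
The lookup procedure described in this caption can be sketched as a nearest-value search. This is only one reading of "corresponds most closely" (smallest absolute difference in AUC), and every number below is made up for illustration rather than taken from the figure.

```python
def expected_function_auc(known_unknown_auc, transfers):
    """Estimate function-prediction AUC on proteins of unknown function.

    `transfers` is a list of pairs:
        (species_species_discrimination_auc, function_prediction_auc)
    one per species-species transfer. The estimate is the function
    prediction AUC of the transfer whose species-species contrast is
    closest to the known-unknown contrast of interest.
    """
    closest = min(transfers, key=lambda t: abs(t[0] - known_unknown_auc))
    return closest[1]


# Made-up values: (species-species AUC, function prediction AUC).
transfers = [(95.0, 70.0), (80.0, 75.0), (60.0, 82.0)]
print(expected_function_auc(78.0, transfers))  # 75.0
```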
