BMC Bioinformatics. 2009 Oct 29;10:361.
doi: 10.1186/1471-2105-10-361.

Predicting sulfotyrosine sites using the random forest algorithm with significantly improved prediction accuracy

Zheng Rong Yang. BMC Bioinformatics.

Abstract

Background: Tyrosine sulfation is one of the most important posttranslational modifications. Because of its relevance to the development of various diseases, tyrosine sulfation has become a target for drug design. To facilitate efficient drug design, accurate prediction of sulfotyrosine sites is desirable. A predictor published seven years ago has been very successful, with a claimed prediction accuracy of 98%. However, it has a particularly low sensitivity when predicting sulfotyrosine sites in some newly sequenced proteins.

Results: A new approach has been developed for predicting sulfotyrosine sites using the random forest algorithm, chosen after a careful evaluation of seven machine learning algorithms. Peptides are formed from consecutive residues symmetrically flanking tyrosine sites and are then encoded using an amino acid hydrophobicity scale. Compared with the previous predictor on the same blind data, this new approach has increased the sensitivity by 22%, the specificity by 3%, and the total prediction accuracy by 10%. Meanwhile, both negative and positive predictive powers have been increased by 9%. In addition, the random forest model can rank the residues flanking tyrosine sites, providing more information for further investigation of the tyrosine sulfation mechanism. A web tool has been implemented at http://ecsb.ex.ac.uk/sulfotyrosine for public use.
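The peptide construction and encoding steps described in the Results can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `tyrosine_windows` and `encode` are hypothetical, the Kyte-Doolittle hydropathy values stand in for whichever hydrophobicity scale the paper uses, padding with "X" at the sequence termini is an assumption, and dropping the central tyrosine from the feature vector is inferred from the Figure 6 caption (a 10-mer spans N5 to C5).

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in scale
# (assumption: the paper's own scale may differ).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def tyrosine_windows(sequence, flank, pad="X"):
    """Yield (index, peptide) for every tyrosine in `sequence`.

    Each peptide holds `flank` residues on either side of the tyrosine.
    Positions beyond the termini are filled with `pad` (an assumption).
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "Y":
            # In `padded`, the tyrosine sits at i + flank, so the window
            # starts at i and spans 2 * flank + 1 characters.
            yield i, padded[i:i + 2 * flank + 1]


def encode(peptide, scale=KYTE_DOOLITTLE, unknown=0.0):
    """Map the flanking residues of a peptide to hydrophobicity values.

    The central tyrosine is skipped because it is identical in every
    peptide; residues missing from `scale` (such as padding) get `unknown`.
    """
    centre = len(peptide) // 2
    flanks = peptide[:centre] + peptide[centre + 1:]
    return [scale.get(residue, unknown) for residue in flanks]


if __name__ == "__main__":
    # flank=5 gives a "10-mer" in the sense of ten flanking residues.
    for position, peptide in tyrosine_windows("MKADEYGKLYVNP", flank=5):
        print(position, peptide, encode(peptide))
```

The resulting fixed-length numeric vectors, one per tyrosine site, are the kind of input a random forest classifier can be trained on.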

Conclusion: The random forest algorithm delivers a better model for predicting sulfotyrosine sites than the Hidden Markov Model, the support vector machine, artificial neural networks, and others. This success shows that the random forest algorithm combined with an amino acid hydrophobicity scale encoding can be a good candidate for peptide classification.

Figures

Figure 1
RF ROC curves for the 10-mer, 20-mer and 30-mer data sets. The horizontal axes are the false alarm rates (1 - specificity) and the vertical axes are the sensitivity. Each specific threshold for discriminating between positive (true sulfotyrosine sites) and negative (unconfirmed sulfotyrosine sites) data points yields one pair of these two values, i.e., 1 - specificity and sensitivity. Each pair of values is represented by a point in this two-dimensional space, and each curve is made by connecting all these points. A model is said to be robust if its ROC curve is close to the top left corner. The area under a ROC curve is a quantitative indicator of this robustness.
Figure 2
SVM ROC curves for the 10-mer, 20-mer and 30-mer data sets. The horizontal axes are the false alarm rates (1 - specificity) and the vertical axes are the sensitivity. Each specific threshold for discriminating between positive (true sulfotyrosine sites) and negative (unconfirmed sulfotyrosine sites) data points yields one pair of these two values, i.e., 1 - specificity and sensitivity. Each pair of values is represented by a point in this two-dimensional space, and each curve is made by connecting all these points. A model is said to be robust if its ROC curve is close to the top left corner. The area under a ROC curve is a quantitative indicator of this robustness.
Figure 3
The correlation between the 10-mer model predictions (horizontal axis) and the 20-mer model predictions (vertical axis) for the blind data set.
Figure 4
The correlation between the 10-mer model predictions (horizontal axis) and the 30-mer model predictions (vertical axis) for the blind data set.
Figure 5
The correlation between the 20-mer model predictions (horizontal axis) and the 30-mer model predictions (vertical axis) for the blind data set.
Figure 6
The ranking results of residues in three RF models. The horizontal axis represents residue positions in peptides. The upper panel is for the 10-mer data, with residue positions ranging from N5 to C5. The middle panel is for the 20-mer data, hence 20 bars. The lower panel is for the 30-mer data, hence 30 bars. The vertical axis shows the mean decrease in Gini for each position.
Figure 7
The conditional density functions at the N1 and C1 residues, respectively. The horizontal axes represent the Cornette scale values and the vertical axes represent the density values. The density functions are estimated by the kernel approach with the R stats package under its default parameter settings. The graph shows that the density functions at N1 have a larger separation between the two classes, whereas the separation is smaller at C1, which does not have a high rank value in the RF models. Note that negative means unconfirmed sulfotyrosine whilst positive means experimentally verified sulfotyrosine.
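
The ROC construction described in the captions of Figures 1 and 2 (one point per threshold, with the area under the curve as a summary) can be sketched as below. This is an illustrative sketch rather than the code behind the figures: the function names `roc_points` and `auc` are hypothetical, the scores and labels in the example are made up, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return ROC points as (false alarm rate, sensitivity) pairs.

    `scores` are model outputs (higher means more likely positive) and
    `labels` are 1 for positive, 0 for negative. Both classes must be
    present, otherwise a rate would require division by zero.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sweep the threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        # Tied scores share one threshold, so emit a point only after
        # the last item of each group of equal scores.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    # Made-up scores for four sites: two positive, two negative.
    curve = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
    print(curve)
    print(auc(curve))
```

A curve that hugs the top left corner gives an area close to 1, while a model that ranks sites at random gives an area near 0.5.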
