Tuning intrinsic disorder predictors for virus proteins

Gal Almog¹, Abayomi S Olabode¹, Art F Y Poon¹,²,³

Affiliations

¹ Department of Pathology & Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044 London, Ontario, Canada, N6A 5C1.
² Department of Applied Mathematics, Western University, Middlesex College Room 255, 1151 Richmond Street London, Ontario, Canada, N6A 5B7.
³ Department of Microbiology & Immunology, Western University, 1151 Richmond Street London, Ontario, Canada, N6A 3K.

PMID: 33614158
PMCID: PMC7882063
DOI: 10.1093/ve/veaa106

Gal Almog et al. Virus Evol. 2021 Jan 25;7(1):veaa106. doi: 10.1093/ve/veaa106. eCollection 2021 Jan.

Abstract

Many virus-encoded proteins have intrinsically disordered regions that lack a stable, folded three-dimensional structure. These disordered proteins often play important functional roles in virus replication, such as down-regulating host defense mechanisms. With the widespread availability of next-generation sequencing, the number of new virus genomes with predicted open reading frames is rapidly outpacing our capacity for directly characterizing protein structures through crystallography. Hence, computational methods for structural prediction play an important role. A large number of predictors focus on the problem of classifying residues into ordered and disordered regions, and these methods tend to be validated on a diverse training set of proteins from eukaryotes, prokaryotes, and viruses. In this study, we investigate whether some predictors outperform others in the context of virus proteins and compare our findings with data from non-viral proteins. We evaluate the prediction accuracy of 21 methods, many of which are only available as web applications, on a curated set of 126 proteins encoded by viruses. Furthermore, we apply a random forest classifier to the outputs of these predictors. Based on cross-validation experiments, this ensemble approach confers a substantial improvement in accuracy, e.g., a mean 36 per cent gain in Matthews correlation coefficient. Lastly, we apply the random forest predictor to severe acute respiratory syndrome coronavirus 2 ORF6, an accessory gene that encodes a short (61 AA) and moderately disordered protein that inhibits the host innate immune response. We show that disorder prediction methods perform differently for viral and non-viral proteins, and that an ensemble approach can yield more robust and accurate predictions.
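The Matthews correlation coefficient (MCC) mentioned above summarizes a binary confusion matrix in a single value between -1 and 1. The following is a minimal sketch of how it could be computed for per-residue labels, assuming 1 marks a disordered residue and 0 an ordered one. The function name and the toy labels are hypothetical and are not taken from the study.

```python
import math

def matthews_corrcoef(truth, pred):
    """MCC for two equal-length sequences of 0/1 labels (1 = disordered)."""
    if len(truth) != len(pred):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # disordered, called disordered
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ordered, called ordered
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ordered, called disordered
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # disordered, called ordered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention (an assumption here): report 0 when a marginal is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical toy example: eight residues, three truly disordered.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(truth, pred)
print(round(mcc, 4))
```

Because MCC uses all four cells of the confusion matrix, it remains informative when ordered residues greatly outnumber disordered ones, which is a likely reason for preferring it over plain accuracy in this setting.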

Keywords: ensemble classifier; intrinsically disordered proteins; machine learning; protein disorder prediction; virus proteins.

Figures

**Figure 1.**
Performance of predictors on viral data set. (A) Scatterplot of MCC and AUC values for 21 predictors applied to the viral protein data set. (B) Slopegraph comparing the MCC values for 13 predictors applied to both non-viral and viral data sets. Because the three variants of the ESpritz model obtained identical MCC values, the corresponding labels were merged. Two labels (PONDRFIT, PONDR.VSL2) were displaced to prevent overlaps on the left and right sides, respectively.

**Figure 2.**
Principal components analysis plot of the root mean squared errors (RMSEs) for 13 disorder predictors on viral (red, triangles) and non-viral (blue, circles) protein sequences. The percentages of total variance explained by the first two principal components are indicated in parentheses in the respective axis labels.
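As a rough illustration of the RMSE values that feed the principal components analysis, the sketch below computes the root mean squared error between a predictor's per-residue disorder scores and binary reference labels. The caption does not state exactly how the errors were aggregated, so the pairing of scores in [0, 1] with 0/1 labels, the function name, and the toy values are all assumptions.

```python
import math

def rmse(scores, labels):
    """Root mean squared error between predicted scores and reference labels."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(s - y) ** 2 for s, y in zip(scores, labels)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical scores for four residues (1 = disordered in the reference).
scores = [0.9, 0.8, 0.4, 0.1]
labels = [1, 1, 0, 0]
error = rmse(scores, labels)
print(round(error, 4))
```

Collecting one such value per predictor for each protein would give a matrix with one column per predictor, which is the kind of input a principal components analysis operates on.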

**Figure 3.**
Box plot of the average decrease in Gini impurity by each feature in the random forest, for 10 random runs of the random forest model. The vertical line indicates the median, and the box spans the interquartile range (IQR; from the first quartile, Q1, to the third quartile, Q3). The left whisker extends to the first datum greater than Q1 − 1.5 × IQR and the right whisker extends to the last datum smaller than Q3 + 1.5 × IQR. Individual points are outliers that lie outside this range.
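The whisker rule in this caption can be sketched as follows. This is a minimal illustration, not the plotting code used for the figure: the quartile method (`inclusive` interpolation from Python's `statistics` module) is an assumption, since plotting libraries differ in how they estimate Q1 and Q3, and the data values are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Return quartiles, whisker ends, and outliers using the 1.5 x IQR rule."""
    data = sorted(values)
    # Assumed quartile estimator; other tools may interpolate differently.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Strict comparisons follow the caption's wording ("greater than", "smaller than").
    inside = [x for x in data if lower_fence < x < upper_fence]
    outliers = [x for x in data if not (lower_fence < x < upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "left_whisker": inside[0],    # first datum above the lower fence
        "right_whisker": inside[-1],  # last datum below the upper fence
        "outliers": outliers,
    }

# Hypothetical values with one extreme point.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 50])
print(summary)
```

Note that the whiskers end at actual data points rather than at the fences themselves, so a whisker can be shorter than 1.5 × IQR.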

**Figure 4.**
Disorder predictions for novel ORF6 in SARS-CoV-2. The first row represents the random forest model predictions, with subsequent rows corresponding to individual predictors. The entire protein length is represented on the x-axis, where each grid cell corresponds to one amino acid. Red squares indicate disordered predictions and blue squares indicate ordered predictions.
