Bioinformatics. 2020 Dec 30;36(Suppl_2):i754-i761.
doi: 10.1093/bioinformatics/btaa808.

APOD: accurate sequence-based predictor of disordered flexible linkers

Zhenling Peng et al. Bioinformatics.

Abstract

Motivation: Disordered flexible linkers (DFLs) are abundant and functionally important intrinsically disordered regions that connect protein domains and structural elements within domains and that facilitate disorder-based allosteric regulation. Although computational estimates suggest that thousands of proteins have DFLs, they have been annotated experimentally in <200 proteins. This substantial annotation gap can be reduced with the help of accurate computational predictors. The sole predictor of DFLs, DFLpred, trades off accuracy for shorter runtime by excluding relevant but computationally costly predictive inputs. Moreover, it relies on local, window-based information and does not consider useful protein-level characteristics.

Results: We conceptualize, design and test APOD (Accurate Predictor Of DFLs), the first highly accurate predictor that utilizes both local- and protein-level inputs that quantify propensity for disorder, sequence composition, sequence conservation and selected putative structural properties. APOD offers significantly more accurate predictions than its faster predecessor, DFLpred, and several alternative ways to predict DFLs. These improvements stem from the use of a more comprehensive set of inputs that cover the protein-level information and from the application of a more sophisticated predictive model, a well-parametrized support vector machine. APOD achieves an area under the curve (AUC) of 0.82 (a 28% improvement over DFLpred) and a Matthews correlation coefficient (MCC) of 0.42 (a 180% increase over DFLpred) when tested on an independent, low-similarity test dataset. APOD is therefore a suitable choice for accurate, small-scale prediction of DFLs.
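The two quality measures quoted above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `mcc` and `auc`, the toy labels and scores, and the 0.5 cut-off used to binarize the scores are all assumptions made for illustration. The AUC is computed by its rank interpretation, the fraction of (DFL, non-DFL) residue pairs in which the DFL residue receives the higher propensity score.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = DFL residue)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): four residues with propensity scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
binary = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

toy_auc = auc(labels, scores)
toy_mcc = mcc(labels, binary)
```

The AUC depends only on the ranking of the propensity scores, while the MCC depends on the threshold chosen to binarize them, so the two measures can diverge for the same predictor.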

Availability and implementation: https://yanglab.nankai.edu.cn/APOD/.

Figures

Fig. 1.
Architecture of the APOD predictor. We generate the local window-based features from the part of the sequence profile highlighted in the gray box, where AAC means the amino acid composition, SS and RSA denote the features generated from the putative secondary structure and relative solvent accessibility, respectively, CONS stands for the sequence conservation-related features and SWdis represents the disorder-related features extracted using a sliding window. The protein-level features are highlighted using the black box and they cover the DisCon
Fig. 2.
Predictive quality of the LR and the SVM models on the training dataset TR166. The predictive quality is measured by the average AUC, which quantifies the quality of the propensity scores (A), and the average MCC, which measures the quality of the binary predictions (B). We report the average of the results over the five folds of the 5-fold cross-validation on TR166. The highest average AUC and MCC are highlighted using the black diamonds
Fig. 3.
The distributions of the absolute PBC values and the disorder values in the training dataset TR166. The boxes represent the 20th (bottom of the box), 50th (median) and 80th (top of the box) percentiles of the PBC/DisCon values, while the red error bars give the maximal and the minimal values. (A) The PBC values across features in specific feature groups that include AAC (amino acid composition); SS+RSA (features generated from putative secondary structure and relative solvent accessibility); CONS (features generated from sequence conservation); SWdis (features extracted from the putative disorder based on sliding window) and DisCon (protein-level features generated from the putative disorder). (B) The distributions of the DisCon values across DFL proteins [proteins with the DFL region(s)], non-DFL proteins and the complete training dataset
Fig. 4.
The ablation analysis of the APOD predictor on the training dataset TR166. We compare the complete APOD model with versions that rely on parametrized SVM models, each utilizing a single feature group. (A) The AUC values and (B) the MCC values collected based on the 5-fold cross-validation on TR166. The considered feature groups include AAC (amino acid composition); SS+RSA (features generated from putative secondary structure and relative solvent accessibility); CONS (features generated from sequence conservation); SWdis (features extracted from the putative disorder based on sliding window); DisCon (protein-level features generated from the putative disorder); and Dis that combines SWdis and DisCon features (shown in gray)
Fig. 5.
Comparative assessment of APOD and DFLpred on the test dataset TE82. (A) The MCC, AUC, precision (Pre) and recall (Rec) values. (B) The corresponding ROC curves
Fig. 6.
ROC curves of APOD and the four selected indirect predictors of DFLs that secure AUC > 0.51 on the test dataset.
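
The local, window-based amino acid composition (AAC) features named in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the APOD implementation: the function name `window_composition`, the default window size of 5 and the truncation of the window at the sequence termini are assumptions, since the page does not state them.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_composition(sequence, center, window_size=5):
    """Fraction of each amino acid in a window centered on one residue.

    The window is truncated at the sequence termini (an assumption),
    so the fractions are normalized by the actual window length.
    """
    if not 0 <= center < len(sequence):
        raise IndexError("center must index a residue in the sequence")
    half = window_size // 2
    start = max(0, center - half)
    end = min(len(sequence), center + half + 1)
    window = sequence[start:end]
    counts = Counter(window)
    return {aa: counts.get(aa, 0) / len(window) for aa in AMINO_ACIDS}


# Toy sequence (hypothetical): one feature vector per residue.
toy_sequence = "AAGGA"
features = [window_composition(toy_sequence, i) for i in range(len(toy_sequence))]
middle = features[2]  # window covers the whole 5-residue sequence
```

Each residue gets its own 20-value vector, which is what makes such features local; a protein-level feature, by contrast, would be computed once over the whole sequence and shared by all of its residues.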
