Sci Rep. 2017 Oct 11;7(1):12961.

doi: 10.1038/s41598-017-13210-9.

Amyloidogenic motifs revealed by n-gram analysis

Michał Burdukiewicz¹, Piotr Sobczyk², Stefan Rödiger³, Anna Duda-Madej⁴, Paweł Mackiewicz¹, Małgorzata Kotulska⁵

Affiliations

¹ Department of Genomics, University of Wrocław, Wrocław, Poland.
² Faculty of Pure and Applied Mathematics, Wrocław University of Science and Technology, Wrocław, Poland.
³ Institute of Biotechnology, Brandenburg University of Technology Cottbus-Senftenberg, Senftenberg, Germany.
⁴ Department of Microbiology, Wrocław Medical University, Wrocław, Poland.
⁵ Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wrocław University of Science and Technology, Wrocław, Poland. malgorzata.kotulska@pwr.edu.pl.

PMID: 29021608
PMCID: PMC5636826
DOI: 10.1038/s41598-017-13210-9

Michał Burdukiewicz et al. Sci Rep. 2017.

Abstract

Amyloids are proteins associated with several clinical disorders, including Alzheimer's and Creutzfeldt-Jakob diseases. Despite their diversity, all amyloid proteins can undergo aggregation initiated by short segments called hot spots. To find the patterns defining the hot spots, we trained predictors of amyloidogenicity using n-grams and random forest classifiers. Since amyloidogenicity may depend not on the exact sequence of amino acids but on their more general properties, we tested 524,284 reduced amino acid alphabets of different lengths (three to six letters) to find the alphabet providing the best performance in cross-validation. The predictor based on this alphabet, called AmyloGram, was benchmarked against the most popular tools for the detection of amyloid peptides using an external data set and obtained the highest values of performance measures (AUC: 0.90, MCC: 0.63). Our results showed sequential patterns in the amyloids which are strongly correlated with hydrophobicity, a tendency to form β-sheets, and lower flexibility of amino acid residues. Among the most informative n-grams of AmyloGram, we identified 15 that were previously confirmed experimentally. AmyloGram is available as a web server at http://smorfland.uni.wroc.pl/shiny/AmyloGram/ and as the R package AmyloGram. R scripts and data used to produce the results of this manuscript are available at http://github.com/michbur/AmyloGramAnalysis.
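
The two performance measures quoted in the abstract can be illustrated with a minimal Python sketch. It is not taken from the AmyloGram code (which is written in R); the function names `mcc` and `auc` and the toy inputs are assumptions made for illustration. MCC is computed from confusion-matrix counts, and AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Both functions are quadratic or constant-time toys meant for reading; a real benchmark would normally rely on an established statistics library.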

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Figure 1**
The scheme of reduced alphabet generation and n-gram extraction from the studied peptide sequences. (A) Generation of 18,535 unique amino acid encodings using all possible combinations of 17 selected physicochemical properties. Amino acids (AA) are clustered into groups (ID) using a combination of various physicochemical properties (P1, P2, P3, P4, …). (B) Extraction of n-grams. (1) Extraction of overlapping hexapeptides from peptides with known amyloidogenicity status. (2) Encoding of the amino acids of the hexapeptides into the corresponding groups (reduced alphabet) using the alphabets generated in (A). (3) Extraction of encoded n-grams of different types: continuous, with a length of 1 to 3 residues; gapped 2-grams, with a gap of 1 to 3 residues; and gapped 3-grams, with a single gap between residues (not all possibilities are shown). (4) Selection of informative n-grams using the Quick Permutation Test (QuiPT). (5) Cross-validation of encodings using a random forest classifier trained on the informative n-grams.
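
Steps (1) to (3) of panel B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AmyloGram implementation (which is an R package): the four-group alphabet `EXAMPLE_GROUPS`, the function names, and the `_` gap symbol are all hypothetical, and the grouping is arbitrary rather than the best-performing alphabet from the study.

```python
# Hypothetical 4-letter reduced alphabet covering the 20 amino acids.
# The grouping is illustrative only, not the alphabet selected in the paper.
EXAMPLE_GROUPS = {
    "1": "GPST",
    "2": "AILMVFWYC",
    "3": "DENQ",
    "4": "HKR",
}
AA_TO_GROUP = {aa: g for g, aas in EXAMPLE_GROUPS.items() for aa in aas}


def hexapeptides(seq, size=6):
    """Step 1: all overlapping windows of `size` residues."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def encode(seq):
    """Step 2: replace each residue by the label of its group."""
    return "".join(AA_TO_GROUP[aa] for aa in seq)


def ngrams(encoded, gaps):
    """Step 3: extract n-grams with the given gaps between elements.

    `gaps` holds one gap length per pair of consecutive elements:
    () gives 1-grams, (0,) continuous 2-grams, (2,) 2-grams with two
    skipped positions, (1, 0) 3-grams with a single one-residue gap.
    Skipped positions are written as "_".
    """
    offsets = [0]
    for g in gaps:
        offsets.append(offsets[-1] + g + 1)
    span = offsets[-1] + 1  # number of positions covered by one n-gram
    result = []
    for start in range(len(encoded) - span + 1):
        parts = [encoded[start]]
        for k in range(1, len(offsets)):
            parts.append("_" * gaps[k - 1] + encoded[start + offsets[k]])
        result.append("".join(parts))
    return result
```

For example, `encode("GAVDKR")` yields `"122344"` under this alphabet, and `ngrams("122344", (1,))` yields the gapped 2-grams `["1_2", "2_3", "2_4", "3_4"]`. Feature selection with QuiPT and the random forest classifier (steps 4 and 5) are outside the scope of this sketch.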

**Figure 2**
Distribution of mean AUC values of classifiers with various encodings for every possible combination of training and testing data sets, including different sequence lengths. The left and right ends of the boxes correspond to the 0.25 and 0.75 quantiles. The bar inside each box represents the median. The gray circles correspond to the encodings with an AUC outside the 0.95 confidence interval.

**Figure 3**
The frequency of important n-grams used by the best-performing classifier in amyloid and non-amyloid sequences. The amino acids allowed at a given position of an n-gram are listed inside the brackets. X denotes any amino acid. The frequency was computed as the total number of occurrences divided by the number of possible n-grams of the same length. Open and closed circles denote experimentally validated n-grams occurring in motifs found in amyloidogenic and non-amyloidogenic sequences, respectively.
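
The frequency described in this caption can be sketched as follows. The interpretation is an assumption: "the number of possible n-grams" is read here as the number of positions at which an n-gram of that span could start across all sequences. The function name `motif_frequency` and the motif representation (a list of sets of allowed residues, with `None` standing for X) are hypothetical.

```python
def motif_frequency(sequences, motif):
    """Occurrences of `motif` divided by the number of candidate positions.

    `motif` is a list with one entry per position: a set of allowed
    amino acids, or None for X (any amino acid).
    """
    span = len(motif)
    hits = 0
    positions = 0
    for seq in sequences:
        for start in range(len(seq) - span + 1):
            positions += 1
            window = seq[start:start + span]
            if all(allowed is None or aa in allowed
                   for aa, allowed in zip(window, motif)):
                hits += 1
    return hits / positions if positions else 0.0
```

Computing this value separately for amyloid and non-amyloid sequences gives the pair of frequencies compared for each n-gram.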

**Figure 4**
Similarity and AUC of the reduced alphabets studied in the cross-validation. Classifiers most similar to the best-performing classifier have the highest AUC values. The color of each square is proportional to the number of alphabets in its area.
