Bioinform Adv. 2023 Oct 16;3(1):vbad151.

doi: 10.1093/bioadv/vbad151. eCollection 2023.

NetAllergen, a random forest model integrating MHC-II presentation propensity for improved allergenicity prediction

Yuchen Li¹, Peter Wad Sackett¹, Morten Nielsen¹,², Carolina Barra¹

Affiliations

¹ Department of Health Technology, Technical University of Denmark, Kgs. Lyngby, Copenhagen 2800, Denmark.
² Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, San Martin 1650, Argentina.

PMID: 37901344
PMCID: PMC10603389
DOI: 10.1093/bioadv/vbad151

Yuchen Li et al. Bioinform Adv. 2023.

Abstract

Motivation: Allergy is a pathological immune reaction towards innocuous protein antigens. Although only a small fraction of plant and animal proteins induce allergy, atopic disorders affect millions of children and adults and cost healthcare systems billions worldwide. In silico predictors can aid in the development of more innocuous food sources. Previous allergenicity predictors used sequence similarity, common structural domains, and amino acid physicochemical features. However, these predictors rely strongly on sequence similarity to known allergens and fail to predict protein allergenicity accurately when similarity diminishes.

Results: To overcome these limitations, we collected allergens from AllergenOnline, a curated database of IgE-inducing allergens, carefully removed allergen redundancy with a novel protein partitioning pipeline, and developed a new allergen prediction method that introduces MHC presentation propensity as a novel feature. NetAllergen outperformed a sequence similarity-based BLAST baseline and the previous allergenicity predictor AlgPred 2 when similarity to known allergens is limited.

Availability and implementation: The web service NetAllergen and the datasets are available at https://services.healthtech.dtu.dk/services/NetAllergen-1.0/.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Data collection. (A) Positive data collection. Allergens were collected from the AllergenOnline database and filtered by length and by similarity using the Hobohm 1 algorithm. (B) Clustering. Clusters were generated using the MST algorithm, with distances derived from all-against-all BLAST E-values. (C) Negative data collection. Non-allergen sequences (blue) were collected from NCBI and added in a 5:1 ratio to positive allergens (red), searching within the same species as each allergen. When more than five non-allergens met the criteria of matching keywords, length (within a 20% range of the allergen), and similarity (E-values), a random subset of the candidates was selected. See Section 2 for details of data collection.
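The Hobohm 1 step named in panel A is a greedy redundancy filter: sequences are visited in a fixed order and a sequence is kept only if it is not similar to any sequence already kept. The sketch below illustrates that idea on toy strings. The similarity function `fraction_identical`, the 0.8 threshold, and the longest-first ordering are illustrative assumptions, not the measure or settings used by the authors.

```python
from typing import Callable, List


def fraction_identical(a: str, b: str) -> float:
    """Toy similarity: matching positions divided by the longer length.

    Hypothetical stand-in for an alignment-based score.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def hobohm1(seqs: List[str], similar: Callable[[str, str], bool]) -> List[str]:
    """Greedy Hobohm 1 selection: one pass over the ordered list, keeping a
    sequence only if it is not similar to any sequence kept so far."""
    kept: List[str] = []
    # Longest-first ordering is an assumption; sorted() is stable for ties.
    for seq in sorted(seqs, key=len, reverse=True):
        if not any(similar(seq, k) for k in kept):
            kept.append(seq)
    return kept


# Toy input: two near-duplicates, one truncated copy, one unrelated sequence.
sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "GSHMLEDPVA", "MKTAYIAK"]
representatives = hobohm1(
    sequences, lambda a, b: fraction_identical(a, b) >= 0.8
)
print(representatives)
```

Because the pass is greedy, the result depends on the visiting order, which is why the ordering is fixed before filtering.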

**Figure 2.**
Distributions and individual-feature performance of selected features for allergens and non-allergens. A t-test was used to assess whether the difference between allergens (red, A) and non-allergens (blue, NA) was significant. The AUC was calculated from the feature values and the target labels. ‘Small’ corresponds to the relative abundance of small amino acids (A, C, D, G, N, P, S, T, V) in a given protein sequence. Likewise, D and R correspond to the frequencies of aspartic acid and arginine per protein.

**Figure 3.**
Model development pipeline of NetAllergen. (A) Partitioning. The clusters obtained from MST clustering were randomly combined into five partitions, and this was repeated with 10 different random seeds. Every allergen (red bar) and its five associated negatives (blue bars) share the same partition. (B) Cross-validation and ensemble prediction. A nested cross-validation setup was used, with 5 folds in the outer layer and 4 folds in the inner layer. The final ensemble prediction is the mean of the 10 model predictions. (C) Internal redundancy tuning. The AUROC was evaluated on the minimal common subset (the dataset with a filter of 5). (D) Evaluation. The ensemble AUC (60F) was significantly higher than that of the individual partitions, except for P2, where the difference was not significant (P-value .152).
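The partitioning in panel A keeps related sequences together: whole clusters, including each allergen and its associated negatives, are assigned to one of five partitions, and the assignment is repeated for several random seeds. A minimal sketch of that idea follows. The function `partition_clusters`, the toy cluster contents, and the greedy size balancing are assumptions made for illustration; the caption does not state how partition sizes were balanced.

```python
import random
from typing import Dict, List


def partition_clusters(
    clusters: Dict[str, List[str]], n_parts: int, seed: int
) -> List[List[str]]:
    """Assign whole clusters to partitions so that no cluster is split."""
    rng = random.Random(seed)
    cluster_ids = sorted(clusters)  # fixed order before shuffling
    rng.shuffle(cluster_ids)
    parts: List[List[str]] = [[] for _ in range(n_parts)]
    for cid in cluster_ids:
        # Greedy balancing (an assumption): fill the currently smallest partition.
        smallest = min(parts, key=len)
        smallest.extend(clusters[cid])
    return parts


# Toy clusters: each holds 1 to 3 allergens, each with five negatives.
clusters: Dict[str, List[str]] = {}
for i in range(10):
    members: List[str] = []
    for a in range(i % 3 + 1):
        members.append(f"c{i}_allergen{a}")
        members.extend(f"c{i}_allergen{a}_neg{j}" for j in range(5))
    clusters[f"c{i}"] = members

# One five-way partitioning per seed, mirroring the 10 seeds in the caption.
partitions_by_seed = {
    seed: partition_clusters(clusters, n_parts=5, seed=seed) for seed in range(10)
}
print([len(p) for p in partitions_by_seed[0]])
```

Splitting by cluster rather than by sequence prevents near-identical proteins from appearing in both training and test partitions, which would inflate cross-validated performance.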

**Figure 4.**
Median feature importance for RF models. Feature importance is measured as the mean decrease in Gini impurity in each random forest, and medians are calculated across cross-validation partitions and 10 random seeds, for (A) the 20-feature (20F) model and (B) the 60F model. Individual features were grouped into physicochemical properties (PCP, blue), hydropathy (HYD, pink), structural information (NSP, yellow), MHC-II presentation propensity (MHC, red), amino acid compositions (COMP, green), and evolutionary information (AC, purple). AvgRSA, average relative solvent accessibility.
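The importance measure in this caption is built from the decrease in Gini impurity produced by individual splits. The sketch below computes that decrease for a single binary split on toy labels; the function names and the example labels are hypothetical. A random forest implementation aggregates such decreases over every split that uses a given feature and over all trees, which this sketch does not attempt.

```python
from typing import List


def gini(labels: List[int]) -> float:
    """Gini impurity of a node with binary labels (1 = allergen, 0 = non-allergen)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def impurity_decrease(parent: List[int], left: List[int], right: List[int]) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Toy split: a threshold on some feature sends all allergens to the left child.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]
decrease = impurity_decrease(parent, left, right)
print(decrease)
```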

**Figure 5.**
Model performance on the AlgPred2 evaluation dataset for subsets with decreasing similarity thresholds. The sequences were sorted in descending order of similarity to allergens in the training dataset. The vertical line marks the change in performance and splits the curves into regions of higher and lower similarity. Toward the right end of the curves, the number of sequences decreases with decreasing similarity, so the evaluation becomes noisier and less robust. (A) AUC. (B) AUC0.1.

**Figure 6.**
Model performance with different similarity thresholds on the new evaluation dataset. Similarity is expressed as −log(E-value), obtained from the baseline model (BLAST). The search database consisted of the positive sequences from our dataset and the AlgPred2 dataset. Toward the right end of the curves, the shrinking data size adds noise and makes the curves less robust. (A) AUC. (B) AUC0.1.
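Figures 5 and 6 report both AUC and AUC0.1. As a rough illustration, AUC0.1 is taken here to be the area under the ROC curve restricted to false-positive rates up to 0.1, divided by 0.1 so that a perfect classifier scores 1. That normalization, the function names, and the toy scores are assumptions; the captions do not define the metric.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """ROC curve as (FPR, TPR) points, one per distinct score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: List[Tuple[float, float]], max_fpr: float = 1.0) -> float:
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate TPR linearly at the cut-off.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy predictions: 4 allergens (1) and 10 non-allergens (0).
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50,
          0.40, 0.30, 0.20, 0.15, 0.10, 0.05, 0.02]

curve = roc_points(labels, scores)
full_auc = auc(curve)
auc_01 = auc(curve, max_fpr=0.1) / 0.1  # normalized partial AUC (assumed convention)
print(full_auc, auc_01)
```

The partial metric focuses on the high-specificity region, so a model can have a high overall AUC and still score modestly on AUC0.1 if its top-ranked predictions include false positives.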

Cited by

Multimodal deep learning for allergenic proteins prediction.
Yu L, Luo Y, Wu S, Chen S, Xue L, Jing R, Luo J. BMC Biol. 2025 Jul 31;23(1):232. doi: 10.1186/s12915-025-02347-z. PMID: 40745646.
SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects.
Ferrer Florensa A, Almagro Armenteros JJ, Nielsen H, Aarestrup FM, Clausen PTLC. NAR Genom Bioinform. 2024 Aug 16;6(3):lqae106. doi: 10.1093/nargab/lqae106. eCollection 2024 Sep. PMID: 39157582.
The receiver operating characteristic curve accurately assesses imbalanced datasets.
Richardson E, Trevizani R, Greenbaum JA, Carter H, Nielsen M, Peters B. Patterns (N Y). 2024 May 31;5(6):100994. doi: 10.1016/j.patter.2024.100994. eCollection 2024 Jun 14. PMID: 39005487.
AutoEpiCollect, a Novel Machine Learning-Based GUI Software for Vaccine Design: Application to Pan-Cancer Vaccine Design Targeting PIK3CA Neoantigens.
Samudrala M, Dhaveji S, Savsani K, Dakshanamurthy S. Bioengineering (Basel). 2024 Mar 27;11(4):322. doi: 10.3390/bioengineering11040322. PMID: 38671743.
