Bioinform Adv. 2023 Oct 16;3(1):vbad151.
doi: 10.1093/bioadv/vbad151. eCollection 2023.

NetAllergen, a random forest model integrating MHC-II presentation propensity for improved allergenicity prediction

Yuchen Li et al. Bioinform Adv. 2023.

Abstract

Motivation: Allergy is a pathological immune reaction towards innocuous protein antigens. Although only a narrow fraction of plant or animal proteins induce allergy, atopic disorders affect millions of children and adults and cost billions in healthcare systems worldwide. In silico predictors can aid in the development of more innocuous food sources. Previous allergenicity predictors used sequence similarity, common structural domains, and amino acid physicochemical features. However, these predictors strongly rely on sequence similarity to known allergens and fail to predict protein allergenicity accurately when similarity diminishes.

Results: To overcome these limitations, we collected allergens from AllergenOnline, a curated database of IgE-inducing allergens, carefully removed allergen redundancy with a novel protein partitioning pipeline, and developed a new allergen prediction method, introducing MHC presentation propensity as a novel feature. NetAllergen outperformed both a sequence similarity-based BLAST baseline and the previous allergenicity predictor AlgPred 2 when similarity to known allergens is limited.

Availability and implementation: The web service NetAllergen and the datasets are available at https://services.healthtech.dtu.dk/services/NetAllergen-1.0/.

Conflict of interest statement

None declared.

Figures

Figure 1. Data collection. (A) Positive data collection. Allergens were collected from the AllergenOnline database and filtered by length and by similarity using the Hobohm 1 algorithm. (B) Clustering. Clusters were generated using the MST algorithm, with distances derived from BLAST all-against-all E-values. (C) Negative data collection. Non-allergen sequences (blue) were collected from NCBI and added in a 5:1 ratio to positive allergens (red), searching within the same species as each allergen. When more than five non-allergens met the selection criteria, namely matching keywords, length (within a 20% range of the allergen), and similarity (E-values), a random subset of the candidates was selected. See Section 2 for details of data collection.
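
The Hobohm 1 filtering step in panel (A) can be sketched as a greedy pass that keeps a sequence only if it is not too similar to any sequence already kept. This is a minimal illustration, not the authors' pipeline: the `toy_similarity` function (a `difflib` ratio) and the `threshold` value are assumptions standing in for whatever alignment-based similarity measure and cutoff were actually used.

```python
from difflib import SequenceMatcher


def toy_similarity(a, b):
    """Stand-in similarity in [0, 1]; an assumption, not a BLAST-derived score."""
    return SequenceMatcher(None, a, b).ratio()


def hobohm1(sequences, threshold=0.8, similarity=toy_similarity):
    """Greedy Hobohm 1 redundancy reduction.

    Sequences are visited in the given order. A sequence is kept only if its
    similarity to every previously kept sequence is below `threshold`.
    """
    kept = []
    for seq in sequences:
        if all(similarity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    seqs = ["ACDEFGHIKL", "ACDEFGHIKM", "WWWWYYYYPP"]
    # The second sequence differs from the first by one residue and is dropped.
    print(hobohm1(seqs))
```

Because the pass is greedy, the result depends on the input order; sorting the sequences beforehand (for example by length) is a common choice, but the ordering used in the paper is not stated here.
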
Figure 2. Distributions and individual-feature performance of selected features for allergens and non-allergens. A t-test was used to test for a significant difference between allergens (red, A) and non-allergens (blue, NA). The AUC was calculated from the feature values and the targets. ‘Small’ corresponds to the relative abundance of small amino acids (A, C, D, G, N, P, S, T, V) in a given protein sequence. Likewise, D and R correspond to the frequencies of aspartic acid and arginine per protein.
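
To make the 'Small' feature and the single-feature AUC concrete, the sketch below computes the fraction of small residues in a sequence and then scores one feature against binary targets. The AUC is computed as the probability that a randomly chosen positive has a higher feature value than a randomly chosen negative (ties count one half). The function names are illustrative assumptions, and the example values are invented for demonstration only.

```python
# Small amino acids as listed in the caption of Figure 2.
SMALL = set("ACDGNPSTV")


def small_fraction(seq):
    """Relative abundance of small amino acids in a protein sequence."""
    return sum(aa in SMALL for aa in seq) / len(seq)


def single_feature_auc(values, labels):
    """AUC of one feature against binary labels (1 = allergen, 0 = non-allergen).

    Equivalent to the normalized Mann-Whitney U statistic: the share of
    positive/negative pairs in which the positive has the larger value.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(small_fraction("ACDW"))  # three of four residues are small
    print(single_feature_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```
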
Figure 3. Model development pipeline of NetAllergen. (A) Partitioning. The clusters obtained from MST clustering were randomly combined into five partitions, and this was repeated with 10 different random seeds. Every allergen (red bar) and its five associated negatives (blue bars) share the same partition. (B) Cross-validation and ensemble prediction. A nested cross-validation setup was used, with 5 folds in the outer layer and 4 folds in the inner layer. The final ensemble prediction is the mean of the 10 model predictions. (C) Internal redundancy tuning. The AUROC was evaluated on the minimal common subset (the dataset with a filter of 5). (D) Evaluation. The ensemble AUC (60F) was significantly higher than that of the individual partitions, except for P2, where the difference was not significant (P = .152).
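
The partitioning and ensembling ideas in panels (A) and (B) can be sketched as follows. Whole clusters are assigned to partitions so that related sequences never end up on both sides of a train/test split, and the ensemble score is the mean over models. This is only an illustration under stated assumptions: the round-robin assignment after shuffling and the names `partition_clusters` and `ensemble_mean` are hypothetical, and the paper's actual balancing strategy may differ.

```python
import random


def partition_clusters(clusters, n_partitions=5, seed=0):
    """Assign every member of a cluster to the same partition.

    `clusters` maps a cluster id to a list of sequence ids (an allergen and
    its associated negatives would sit in the same cluster). Clusters are
    shuffled with the given seed and dealt out round-robin, which is an
    assumption made for this sketch.
    """
    rng = random.Random(seed)
    cluster_ids = sorted(clusters)
    rng.shuffle(cluster_ids)
    assignment = {}
    for i, cid in enumerate(cluster_ids):
        for member in clusters[cid]:
            assignment[member] = i % n_partitions
    return assignment


def ensemble_mean(predictions):
    """Average per-sequence scores across models (one inner list per model)."""
    return [sum(col) / len(col) for col in zip(*predictions)]


if __name__ == "__main__":
    clusters = {"c1": ["a1", "n1", "n2"], "c2": ["a2"], "c3": ["a3", "n3"]}
    for seed in range(3):  # the paper repeats the partitioning with 10 seeds
        print(seed, partition_clusters(clusters, seed=seed))
    print(ensemble_mean([[0.2, 0.8], [0.4, 0.6]]))
```
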
Figure 4. Median feature importance for RF models. Importance is measured in each random forest as the mean decrease in Gini impurity, and medians are calculated across cross-validation partitions and 10 random seeds for (A) the 20-feature (20F) model and (B) the 60F model. Individual features were grouped into physicochemical properties (PCP, blue), hydropathy (HYD, pink), structural information (NSP, yellow), MHC-II presentation propensity (MHC, red), amino acid compositions (COMP, green), and evolutionary information (AC, purple). AvgRSA, average relative solvent accessibility.
Figure 5. Model performance on subsets of the AlgPred2 evaluation dataset with decreasing similarity thresholds. The sequences were sorted in descending order of similarity to allergens in the training dataset. The vertical line marks where performance changes and splits the curves into regions of higher and lower similarity. At the rightmost part of the curves, the number of sequences falls as similarity decreases, which makes the evaluation noisier and less robust. (A) AUC. (B) AUC0.1.
Figure 6. Model performance at different similarity thresholds on the new evaluation dataset. Similarity is represented by −log(E-value), obtained from the baseline model (BLAST). The search database consisted of the positive sequences from our dataset and the AlgPred2 dataset. The shrinking data size makes the rightmost part of the curves noisier and less robust. (A) AUC. (B) AUC0.1.
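
AUC0.1 in panels (B) of Figures 5 and 6 refers to a partial AUC restricted to low false-positive rates. The sketch below integrates the ROC curve from FPR 0 to 0.1 and divides by 0.1 so that a perfect ranking scores 1. The normalization is an assumption made for this illustration, since the captions do not state which convention the paper uses, and `partial_auc` is a hypothetical helper name.

```python
def partial_auc(scores, labels, max_fpr=0.1):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    `labels` are 1 for positives and 0 for negatives. Tied scores are
    grouped into a single ROC point, and the curve is integrated with the
    trapezoidal rule, interpolating linearly at the FPR cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])

    # Build ROC points (fpr, tpr), one per distinct score threshold.
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the segment at the cutoff by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr


if __name__ == "__main__":
    print(partial_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

Setting `max_fpr=1.0` recovers the ordinary AUC, which makes it easy to compare the two metrics on the same predictions.
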
