Reliable prediction of T-cell epitopes using neural networks with novel sequence representations

Morten Nielsen¹, Claus Lundegaard, Peder Worning, Sanne Lise Lauemøller, Kasper Lamberth, Søren Buus, Søren Brunak, Ole Lund

PMID: 12717023
PMCID: PMC2323871
DOI: 10.1110/ps.0239403

Morten Nielsen et al. Protein Sci. 2003 May.

Protein Sci. 2003 May;12(5):1007-17.

Affiliation

¹ Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, DK-2800 Lyngby, Denmark. mniel@cbs.dtu.dk

Abstract

In this paper we describe an improved neural network method to predict T-cell class I epitopes. A novel input representation has been developed consisting of a combination of sparse encoding, Blosum encoding, and input derived from hidden Markov models. We demonstrate that the combination of several neural networks derived using different sequence-encoding schemes has a performance superior to neural networks derived using a single sequence-encoding scheme. The new method is shown to have a performance that is substantially higher than that of other methods. By use of mutual information calculations we show that peptides that bind to the HLA A*0204 complex display a signal of higher-order sequence correlations. Neural networks are ideally suited to integrate such higher-order correlations when predicting the binding affinity. It is this feature, combined with the use of several neural networks derived from different and novel sequence-encoding schemes and the ability of the neural network to be trained on data consisting of continuous binding affinities, that gives the new method an improved performance. The difference in predictive performance between the neural network methods and that of the matrix-driven methods is found to be most significant for peptides that bind strongly to the HLA molecule, confirming that the signal of higher-order sequence correlation is most strongly present in high-binding peptides. Finally, we use the method to predict T-cell epitopes for the genome of hepatitis C virus and discuss possible applications of the prediction method to guide the process of rational vaccine design.
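As an illustration of the input representations named in the abstract, the following is a minimal sketch of sparse encoding, read here as the usual one-hot scheme with 20 inputs per residue. The abstract does not give the details, so the residue order, the example peptide, and the function names `sparse_encode` and `matrix_encode` are assumptions for illustration only. A Blosum-style encoding can be sketched the same way by replacing each one-hot row with a row of scores from a substitution matrix supplied by the caller; no real Blosum50 values are reproduced here.

```python
# Hypothetical sketch: not the authors' implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this order is an arbitrary choice


def sparse_encode(peptide):
    """One-hot encode a peptide: 20 inputs per residue, exactly one of them 1.0."""
    vector = []
    for residue in peptide:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unknown residue: {residue!r}")
        row = [0.0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1.0
        vector.extend(row)
    return vector


def matrix_encode(peptide, rows):
    """Encode each residue by a caller-supplied row of 20 scores.

    `rows` maps a residue letter to a list of scores, for example one row of
    a substitution matrix. With identity rows this reduces to sparse encoding.
    """
    vector = []
    for residue in peptide:
        vector.extend(float(score) for score in rows[residue])
    return vector


# Illustrative peptide, assumed to be a 9-mer for this example.
encoded = sparse_encode("ILKEPVHGV")

# Identity rows, used only to show that matrix_encode generalizes sparse_encode.
identity_rows = {a: [1.0 if a == b else 0.0 for b in AMINO_ACIDS] for a in AMINO_ACIDS}
same = matrix_encode("ILKEPVHGV", identity_rows) == encoded
print(len(encoded), sum(encoded), same)
```

For a 9-residue peptide the sparse vector has 9 × 20 = 180 inputs, of which exactly nine are set.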

Figures

**Figure 1.**
Mutual information matrices calculated for two different data sets. (A,C) The mutual information matrix calculated for a data set consisting of 313 peptides derived from the Rammensee data set combined with peptides from the Buus data set with a binding affinity stronger than 500 nM. (B,D) The mutual information matrix calculated for a set of 313 random peptides extracted from the *Mycobacterium tuberculosis* genome. In the *upper* row the mutual information plot is calculated using the conventional 20-letter amino acid alphabet. In the *lower* row the calculation is repeated using the six-letter amino acid alphabet defined in the text.
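A mutual information matrix of the kind shown in Figure 1 can be sketched as follows, using the standard definition M(i,j) = Σ P(a,b) log[P(a,b) / (P(a) P(b))] over residue pairs (a,b) observed at positions i and j. This is a minimal sketch under stated assumptions: the logarithm base (2, giving bits), the function name, and the toy peptides are illustrative choices, and the six-letter alphabet mentioned in the caption is not reproduced because it is defined only in the full text.

```python
import math
from collections import Counter


def mutual_information_matrix(peptides):
    """Return M[i][j], the mutual information (in bits) between positions i and j.

    Probabilities are plain observed frequencies; no pseudocounts or
    small-sample corrections are applied in this sketch.
    """
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    n = len(peptides)
    # Single-position residue counts, computed once per position.
    single = [Counter(p[i] for p in peptides) for i in range(length)]
    matrix = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            pairs = Counter((p[i], p[j]) for p in peptides)
            mi = 0.0
            for (a, b), count in pairs.items():
                p_ab = count / n
                p_a = single[i][a] / n
                p_b = single[j][b] / n
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
            matrix[i][j] = mi
    return matrix


# Toy data: positions 0 and 1 always agree, position 2 never varies.
toy = ["AAK", "AAK", "LLK", "LLK"]
m = mutual_information_matrix(toy)
print(m[0][1], m[0][2])
```

Because frequencies estimated from a few hundred peptides give nonzero mutual information even for unrelated positions, a same-size set of random peptides, as in panels (B,D), serves as a natural baseline for comparison.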

**Figure 2.**
Sensitivity/PPV plot calculated using a classification binding affinity threshold of 500 nM for a series of linear combinations of the two neural network methods corresponding to Blosum50 and sparse sequence encoding, respectively. The curves were calculated by use of the bootstrap method (Press et al. 1989) using 500 data set realizations. (A) 428 peptides in the test/train data set; (B) 100 peptides in the evaluation set. In (A) we determine the optimal performance to be the thick blue curve, corresponding to a combination of the two neural network methods with 70% weight on the Blosum50 encoded prediction and 30% weight on the sparse encoded prediction. This set of weights also results in close to optimal performance in the *lower* graph (B). Inserts to the graphs show the corresponding ROC curves.

**Figure 3.**
Scatter plot of the predicted score versus the measured binding affinity for the 528 peptides in the Buus data set. The figure shows the performance for four different prediction methods. The insert to each figure shows an enlargement of the part of the plot that corresponds to a binding affinity stronger than 500 nM. (A) Rammensee matrix method, (B) Hidden Markov Model trained on sequences in the Rammensee data set, (C) Neural Network trained with sparse sequence encoding, and (D) Comb-II neural network method. The straight-line fits to the data in (C) and (D) have slopes and intercepts of 0.989, −0.029 and 0.979, −0.027, respectively.

**Figure 4.**
Sensitivity/PPV curves calculated from the 528-peptide data set. Six methods are shown in the graphs: Rammensee, Matrix method by Rammensee (Rammensee et al. 1999); HMM, hidden Markov Model trained on data from the Rammensee database; SEQ, neural network with sparse sequence encoding; BL50, neural network with Blosum50 sequence encoding; Comb-I, combination of neural network trained with sparse and Blosum50 sequence encoding, respectively; and Comb-II, combination of neural network with sparse, Blosum50 and hidden Markov model sequence encoding. (A) The curves for a classification affinity threshold of 50 nM. (B) The curves corresponding to a classification affinity threshold of 500 nM. The sensitivity/PPV curves were calculated as described in Figure 2 using 528 data set realizations. The insert to the graphs shows the ROC curves defined in the text. The value given with the label to each of the curves in the insert is the area under the ROC curve.
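The area under the ROC curve quoted in the inserts, and the bootstrap resampling used for the curves, can be sketched as follows. This is a hedged illustration and not the authors' procedure: the AUC is computed with the standard pairwise (rank-based) definition, the bootstrap simply resamples peptides with replacement, and the function names, the seed, and the toy values are assumptions.

```python
import random


def roc_auc(scores, is_binder):
    """AUC as the fraction of (binder, non-binder) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, b in zip(scores, is_binder) if b]
    negatives = [s for s, b in zip(scores, is_binder) if not b]
    if not positives or not negatives:
        raise ValueError("need at least one binder and one non-binder")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def bootstrap_auc(scores, is_binder, realizations=500, seed=0):
    """AUC for each bootstrap realization (peptides resampled with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    for _ in range(realizations):
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [is_binder[i] for i in idx]
        if all(labels) or not any(labels):
            continue  # AUC is undefined when a resample contains one class only
        aucs.append(roc_auc([scores[i] for i in idx], labels))
    return aucs


# Invented toy values: two binders, two non-binders.
perfect = roc_auc([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
chance = roc_auc([0.9, 0.2, 0.8, 0.3], [True, True, False, False])
samples = bootstrap_auc([0.9, 0.2, 0.8, 0.3], [True, True, False, False])
print(perfect, chance, sum(samples) / len(samples))
```

The spread of the values in `samples` gives a rough sense of how much a reported AUC could vary with the composition of the data set.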

References

1. Adams, H.P. and Koziol, J.A. 1995. Prediction of binding to MHC class I molecules. J. Immunol. Methods 185: 181–190.
2. Altuvia, Y., Schueler, O., and Margalit, H. 1995. Ranking potential binding peptides to MHC molecules by a computational threading approach. J. Mol. Biol. 249: 244–250.
3. Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45–48.
4. Baldi, P. and Brunak, S. 2001. Bioinformatics. The machine learning approach, 2nd ed. The MIT Press, Cambridge, MA.
5. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., and Wheeler, D.L. 2002. GenBank. Nucleic Acids Res. 30: 17–20.
