Nat Commun. 2023 Dec 2;14(1):7974.
doi: 10.1038/s41467-023-43010-x.

Accurate de novo peptide sequencing using fully convolutional neural networks

Kaiyuan Liu et al. Nat Commun.

Abstract

De novo peptide sequencing, which does not rely on a comprehensive target sequence database, provides a way to identify novel peptides from tandem mass spectra. However, current de novo sequencing algorithms suffer from low accuracy and coverage, which hinders their application in proteomics. In this paper, we present PepNet, a fully convolutional neural network for high-accuracy de novo peptide sequencing. PepNet takes an MS/MS spectrum (represented as a high-dimensional vector) as input and outputs the optimal peptide sequence along with its confidence score. The PepNet model is trained using a total of 3 million high-energy collisional dissociation MS/MS spectra from multiple human peptide spectral libraries. Evaluation results show that PepNet significantly outperforms current best-performing de novo sequencing algorithms (e.g., PointNovo and DeepNovo) in both peptide-level accuracy and positional-level accuracy. PepNet can sequence a large fraction of spectra that were not identified by database search engines, and thus could be used as a complementary tool to database search engines for peptide identification in proteomics. In addition, PepNet runs around 3x and 7x faster than PointNovo and DeepNovo on GPUs, respectively, making it more suitable for the analysis of large-scale proteomics data.
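The abstract describes the model input as an MS/MS spectrum represented as a high-dimensional vector. One common way to build such a vector is to bin peaks by m/z, and the sketch below illustrates only that general idea. The function name, the 0.1 m/z bin width, the 2000 m/z upper limit, and the normalization to the most intense bin are all assumptions made for illustration; they are not taken from the paper and may differ from PepNet's actual preprocessing.

```python
def spectrum_to_vector(peaks, max_mz=2000.0, bin_width=0.1):
    """Bin (m/z, intensity) peaks into a fixed-length vector.

    Hypothetical helper: bin width, m/z range, and normalization are
    illustrative assumptions, not PepNet's documented settings.
    """
    n_bins = int(round(max_mz / bin_width))
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        # Ignore peaks outside the assumed m/z range.
        if mz < 0 or mz >= max_mz:
            continue
        idx = min(int(mz / bin_width), n_bins - 1)
        # Keep the strongest peak that falls into each bin.
        vector[idx] = max(vector[idx], intensity)
    top = max(vector)
    if top > 0:
        # Scale so the most intense bin equals 1.0.
        vector = [v / top for v in vector]
    return vector


example_peaks = [(100.05, 50.0), (100.07, 20.0), (500.25, 100.0), (2500.0, 10.0)]
example_vector = spectrum_to_vector(example_peaks)
```

With these assumed settings, every spectrum maps to a vector of the same length regardless of how many peaks it contains, which is the property a convolutional encoder needs from its input.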

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. The Neural Network Architecture of PepNet.
The PepNet network uses a series of temporal convolutional network (TCN) and down-sampling layers to encode the input MS/MS spectrum, from which the global and local information in the spectrum is fused into a single feature tensor and then decoded into the peptide sequence.
Fig. 2. The accuracy and the Precision-Coverage curves of PepNet, PointNovo, and DeepNovo on the charge 2+ (upper half) and charge 3+ (lower half) spectra in the human proteomics dataset.
Here, the “Filtered Peptide Accuracy” refers to the peptide-level accuracy on the sequenced peptides after removing the sequencing results with unmatched precursor masses (i.e., over 10 ppm). The dotted lines represent the precision levels of 0.95 and 0.99, respectively. a Accuracy on charge 2+ spectra, b Precision-Coverage curves on charge 2+ spectra, c Accuracy on charge 3+ spectra, d Precision-Coverage curves on charge 3+ spectra.
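The Filtered Peptide Accuracy defined in this caption can be illustrated with a short sketch: discard predictions whose mass deviates from the observed precursor mass by more than 10 ppm, then compute peptide-level accuracy on what remains. The record layout (`predicted`, `truth`, `observed_mass`, `predicted_mass`) and both function names are hypothetical, and exact string equality is used here as a simplifying assumption about what counts as a correct peptide.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass deviation in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def filtered_peptide_accuracy(results, tolerance_ppm=10.0):
    """Peptide-level accuracy after dropping mass-mismatched predictions.

    Each result is a dict with hypothetical keys: 'predicted', 'truth',
    'observed_mass' (precursor), and 'predicted_mass' (of the prediction).
    """
    kept = [
        r for r in results
        if abs(ppm_error(r["observed_mass"], r["predicted_mass"])) <= tolerance_ppm
    ]
    if not kept:
        return 0.0
    # Assumption: a peptide is correct only if the full sequence matches.
    correct = sum(1 for r in kept if r["predicted"] == r["truth"])
    return correct / len(kept)


example_results = [
    {"predicted": "PEPTIDEK", "truth": "PEPTIDEK",
     "observed_mass": 1000.000, "predicted_mass": 1000.005},  # about 5 ppm, kept
    {"predicted": "AAAK", "truth": "AAGK",
     "observed_mass": 1000.000, "predicted_mass": 1000.002},  # about 2 ppm, kept
    {"predicted": "GGGR", "truth": "GGGR",
     "observed_mass": 1000.000, "predicted_mass": 1001.000},  # about 1000 ppm, dropped
]
example_accuracy = filtered_peptide_accuracy(example_results)
```

Note that the filter needs no ground truth: it uses only the observed precursor mass, so it can also be applied to spectra with no database identification.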
Fig. 3. Impact of peptide length on sequencing accuracy.
The positional accuracy of PepNet, PointNovo, and DeepNovo on peptides of different lengths for the spectra of charge 2+ (a) and charge 3+ (b) in the human proteomics dataset.
Fig. 4. Sequencing accuracies of peptides from non-human organisms.
The performance of PepNet, PointNovo, and DeepNovo is shown on the spectra of charge 2+ (a) and charge 3+ (b) in the proteomics datasets acquired from different non-human organisms.
Fig. 5. Numbers of sequenced spectra and peptides.
The numbers of spectra and unique peptides sequenced de novo by PepNet are shown in comparison with those by PointNovo and DeepNovo on spectra of charge 2+ (a) and charge 3+ (b).
Fig. 6. Peptides sequenced on unidentified spectra.
The numbers of unique peptides sequenced by PepNet on the unidentified spectra are compared with those sequenced by PointNovo and DeepNovo, along with their matches to proteins in UniProt (identical or with one mutation), for spectra of charge 2+ (a) and charge 3+ (b).
Fig. 7. The composition of the sequencing results.
The composition of the sequenced peptides on the spectra of charge 2+ and charge 3+ is shown in (a) and (b), respectively. Here, the pull-out parts represent sequenced spectra with a matched precursor mass (≤10 ppm) and a quality score at or above the 95% precision cutoff.
Fig. 8. The similarity between the experimental and predicted spectra of sequenced peptides.
The distributions of the similarities between the experimental spectra and the spectra predicted by PredFull for the sequenced peptides of charge 2+ and charge 3+ are shown in (a) and (b), respectively.
Fig. 9. Peptide sequencing accuracies on DIA spectra.
The sequencing accuracy (a) and the Precision-Coverage curve (b) of PepNet, PointNovo, and DeepNovo-DIA are compared on a dataset of DIA-derived MS/MS spectra. The Filtered Peptide Accuracy refers to the peptide-level accuracy on the sequenced peptides after removing those with unmatched precursor masses (i.e., over 10 ppm).
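A Precision-Coverage curve of the kind named in the captions is commonly built by ranking predictions by confidence score and, at each cutoff, recording the fraction of spectra retained (coverage) and the fraction of retained predictions that are correct (precision). The sketch below follows that common construction, which is an assumption here since the captions do not spell out the procedure; both function names are hypothetical.

```python
def precision_coverage_curve(scored):
    """Return (coverage, precision) points from (score, is_correct) pairs.

    Assumed construction: rank by score, then sweep the cutoff from the
    most confident prediction to the least confident one.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    curve = []
    correct = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        curve.append((k / len(ranked), correct / k))
    return curve


def coverage_at_precision(curve, min_precision):
    """Largest coverage whose precision is still at least min_precision."""
    eligible = [cov for cov, prec in curve if prec >= min_precision]
    return max(eligible) if eligible else 0.0


example_scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
example_curve = precision_coverage_curve(example_scored)
example_cov_95 = coverage_at_precision(example_curve, 0.95)
```

Reading the coverage at a fixed precision, such as the 0.95 and 0.99 levels marked in Fig. 2, gives a single number for comparing tools at the same error rate.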
Fig. 10. The impact of retained number of peaks on the performance of PepNet.
The positional and peptide-level accuracies of PepNet are shown for charge 2+ (a) and charge 3+ (b) input spectra in the testing dataset when different numbers of the most intense peaks are retained.
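Retaining only the most intense peaks, as varied in Fig. 10, can be sketched as a simple filter over (m/z, intensity) pairs. The function name and the choice to return the surviving peaks in m/z order are assumptions for illustration; the caption does not state how ties or ordering are handled.

```python
def retain_top_peaks(peaks, n_peaks):
    """Keep the n_peaks most intense (m/z, intensity) pairs.

    Hypothetical helper: the result is re-sorted by m/z, which is an
    illustrative choice rather than a documented PepNet behavior.
    """
    if n_peaks <= 0:
        return []
    # Rank by intensity (second element), strongest first.
    strongest = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n_peaks]
    # Restore m/z order so downstream code sees a normal peak list.
    return sorted(strongest, key=lambda peak: peak[0])


example_spectrum = [(100.0, 5.0), (200.0, 50.0), (300.0, 1.0), (400.0, 20.0)]
example_top2 = retain_top_peaks(example_spectrum, 2)
```

Sweeping `n_peaks` over a range of values and re-running the sequencer on each filtered spectrum is one way to reproduce the kind of comparison the figure reports.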

