Nat Commun. 2025 Apr 14;16(1):3530.
doi: 10.1038/s41467-025-58866-4.

DIA-BERT: pre-trained end-to-end transformer models for enhanced DIA proteomics data analysis

Zhiwei Liu et al. Nat Commun.

Abstract

Data-independent acquisition mass spectrometry (DIA-MS) has become increasingly pivotal in quantitative proteomics. In this study, we present DIA-BERT, a software tool that harnesses a transformer-based pre-trained artificial intelligence (AI) model for analyzing DIA proteomics data. The identification model was trained using over 276 million high-quality peptide precursors extracted from existing DIA-MS files, while the quantification model was trained on 34 million peptide precursors from synthetic DIA-MS files. When compared to DIA-NN, DIA-BERT demonstrated a 51% increase in protein identifications and 22% more peptide precursors on average across five human cancer sample sets (cervical cancer, pancreatic adenocarcinoma, myosarcoma, gallbladder cancer, and gastric carcinoma), achieving high quantitative accuracy. This study underscores the potential of leveraging pre-trained models and synthetic datasets to enhance the analysis of DIA proteomics.

Conflict of interest statement

Competing interests: T.G. is the founder of Westlake Omics (Hangzhou) Biotechnology Co., Ltd., while P.L. is staff of this company. The remaining authors declare no competing interests.

Figures

Fig. 1. DIA-BERT workflow and its performance on different human cancer datasets.
a The workflow of DIA-BERT. b Comparison of peptide and protein identifications between DIA-BERT and DIA-NN (library-based mode) in human proteome datasets. A two-sided paired Student’s t-test was performed without adjustment. The p values for peptide precursor and protein, respectively, are: 0.0711 and 0.0060 (pancreatic adenocarcinoma), 0.1655 and 0.0011 (cervical cancer), 0.0025 and 0.0009 (myosarcoma), 0.0091 and 0.0064 (gallbladder cancer), and 0.0263 and 0.0108 (gastric carcinoma). c The overlap of identified human tissue peptide precursors or proteins using DIA-BERT and DIA-NN (library-based mode). d Comparison of peptide and protein identifications between DIA-BERT and DIA-NN (library-free mode) in human proteome datasets. A two-sided paired Student’s t-test was performed without adjustment. The p values for peptide precursor and protein, respectively, are: 0.0275 and 0.0046 (pancreatic adenocarcinoma), 0.0018 and 0.0005 (cervical cancer), 0.0020 and 0.0009 (myosarcoma), 0.0002 and 0.0022 (gallbladder cancer), and 0.0028 and 0.0056 (gastric carcinoma). Data (b, d) are presented as mean values ± SD. The statistics (b, d) are derived from three biological replicates per cancer type. Source data are provided as a Source Data file. *p < 0.05; **p < 0.01; ***p < 0.001. For (b–d), brownish-orange represents DIA-BERT; light green represents DIA-NN.
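The caption above reports two-sided paired Student's t-tests on three biological replicates per cancer type. As a minimal sketch of that kind of test (not the authors' analysis code), the function below handles exactly three pairs, where the t-distribution with 2 degrees of freedom has a closed-form CDF. The function name and the identification counts in the usage lines are hypothetical placeholders, not values from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_test_n3(x, y):
    """Two-sided paired Student's t-test for exactly three pairs (df = 2).

    Hypothetical helper for illustration only. Returns (t statistic, p value).
    """
    if len(x) != 3 or len(y) != 3:
        raise ValueError("this sketch only supports three paired replicates")
    # Work on the per-replicate differences, as a paired test does.
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(3))
    # For df = 2 the CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)),
    # so the two-sided p value is 1 - |t| / sqrt(t^2 + 2).
    p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return t, p


# Placeholder counts (NOT from the paper): tool A vs tool B on three replicates.
tool_a = [10, 12, 14]
tool_b = [9, 10, 11]
t_stat, p_value = paired_t_test_n3(tool_a, tool_b)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```

With only three replicates the test has 2 degrees of freedom, so |t| must exceed roughly 4.30 before p drops below 0.05; this helps explain how sizeable mean differences can still yield p values above that threshold.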
Fig. 2. The quantification performance of DIA-BERT.
a–d Quantification precision was evaluated using a proteome dataset comprising three different species. Yeast and C. elegans peptide mixtures were spiked into a human peptide sample at two different ratios (A and B), with three replicate injections per condition. a The peptide precursor quantification by DIA-BERT. b The protein quantification by DIA-BERT. c The peptide precursor quantification by DIA-NN. d The protein quantification by DIA-NN. The quantification results are visualized as scatterplots in the left three panels and as boxplots in the right panel (boxes: interquartile range; whiskers: 1.5× interquartile range). The n numbers for peptide precursor ratios obtained from DIA-BERT and DIA-NN reports, respectively, are: 83,270 and 78,468 (human), 9322 and 10,188 (yeast), and 15,903 and 8857 (C. elegans). For protein ratios obtained from DIA-BERT and DIA-NN reports, respectively, n numbers are: 5976 and 5546 (human), 1463 and 1487 (yeast), and 2717 and 1089 (C. elegans). Orange represents yeast; green represents human; sky-blue represents C. elegans. e, f Quantification benchmarking in the mouse-yeast dataset was performed using a single peptide preparation (yeast), which was spiked into a mouse peptide preparation at six different proportions, with five replicate injections for each. e The peptide precursor and protein quantification by DIA-BERT. f The peptide precursor and protein quantification by DIA-NN. The quantification results are visualized using boxplots (boxes: interquartile range; whiskers: 1.5× interquartile range). The n numbers for peptide precursor ratios obtained from DIA-BERT and DIA-NN reports, respectively, are: 11,560 and 16,032 (mouse protein ratio 1:4), 24,908 and 30,415 (1:2), 25,226 and 31,118 (2:3), 29,833 and 35,901 (1:1), 35,006 and 42,084 (3:2), and 36,119 and 42,560 (2:1). For protein ratios obtained from DIA-BERT and DIA-NN reports, respectively, n numbers are: 2494 and 2759 (mouse protein ratio 1:4), 3891 and 4015 (1:2), 3925 and 4094 (2:3), 4307 and 4405 (1:1), 4578 and 4579 (3:2), and 4622 and 4599 (2:1). For panels (e, f), different colors represent different mouse protein ratios: light blue (1:4), light green (1:2), olive green (2:3), yellow (1:1), red (3:2), and brown (2:1). The center of boxplots (a–f) represents the median value. Source data are provided as a Source Data file.
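The boxplots described above use the interquartile range for the box, the median for the center line, and whiskers at 1.5× the interquartile range. A minimal sketch of how those summary values can be computed is shown below. It assumes the common convention that whiskers stop at the most extreme data point inside the 1.5× IQR fences, and it assumes linear-interpolation quartiles; the caption does not specify either choice. The function name and the input values are hypothetical.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Return box and whisker positions for a list of numbers.

    Assumptions (not stated in the caption): quartiles use linear
    interpolation over the data, and whiskers end at the furthest
    data point within 1.5 * IQR of the box edges.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Placeholder values (NOT from the paper), e.g. log2 quantity ratios.
example = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(boxplot_stats(example))
```

In this placeholder input the value 100 lies beyond the upper fence, so it is reported as an outlier and the upper whisker stops at 8.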
