Sci Data. 2022 Jun 14;9(1):335.
doi: 10.1038/s41597-022-01380-9.

Implementing the reuse of public DIA proteomics datasets: from the PRIDE database to Expression Atlas

Mathias Walzer et al. Sci Data. 2022.

Abstract

The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Unlike Data Dependent Acquisition datasets, the re-use of DIA datasets has been rather limited to date, despite its high potential, due to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Graphical representation of the DIA data reanalysis pipeline, which consists of four parts. (a) Data curation: metadata annotation protocols and dataset acquisition. (b) SWATH-MS data analysis: Nextflow workflow covering data conversion, SWATH-MS window management, data quality assessment and control (QA/QC), OpenSWATH analysis, FDR calculation and measurement alignment. (c) Statistical analysis: Nextflow workflow for MSstats analysis, normalisation and result filtering. (d) Data integration: data preparation, accession mapping and integration into Expression Atlas.
Fig. 2
(a) Number of detected proteins per dataset at different FDR levels in the data reanalysis. (b) Protein detection results after 1% protein FDR threshold filtering. Original data refers to the protein numbers stated in the respective publication, reported at 1% protein FDR unless indicated otherwise. Reanalysis numbers are provided unfiltered and with the consistency filter applied (at least 50% of a protein's peptide fragment targets have to be detected within a study group). *Proteins from datasets PXD000672 and PXD004873 were reported in the original publications at a 0.1% protein-level FDR only. For dataset PXD004589, a 0.1% peptide-level FDR was reported.
Fig. 3
Violin plots showing the results of the group-wise CV comparisons: (a) PXD003497 reanalysis; (b) PXD003497 original data; (c) PXD004873 reanalysis; (d) PXD004873 original data; (e) PXD014194 reanalysis; (f) PXD014194 original data. As can be seen from the similar sizes and shapes of the violin plots, the CVs across the datasets are largely concordant.
Fig. 4
Correlation analysis of reported log2 protein intensities from technical replicate pairs: (a) PXD003497 reanalysis; (b) PXD003497 original data; (c) PXD004873 reanalysis; (d) PXD004873 original data; (e) PXD014194 reanalysis; (f) PXD014194 original data. The first items of pairs are on the x-axis and second items are on the y-axis. Each point represents a protein. The point density is indicated by the colour gradient, with black showing the lowest density. The higher the density the lighter the colour becomes.
Fig. 5
Volcano plots corresponding to the differential expression analysis for dataset PXD014943: (a) extranodal diffuse large B-cell lymphoma (eDLBCL) versus primary central nervous system lymphoma (PCNSL); (b) intravascular lymphoma (IVL) versus eDLBCL. For dataset PXD004691: (c) normal tissue (fresh frozen) versus PrC (fresh frozen); (d) normal tissue (paraffin embedded) versus tumour tissue (paraffin embedded). For dataset PXD000672: (e) benign tissue samples versus clear cell RCC; (f) clear cell RCC versus papillary RCC. Each point represents the fold change (FC) of one protein in the given comparison. Proteins with a significant FC are highlighted in colour; dashed lines indicate the fold-change cutoff of 2 and the (adjusted) p-value cutoff of 0.05.
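
The two cutoffs named in the Fig. 5 caption (fold change of 2, adjusted p-value of 0.05) can be illustrated with a minimal Python sketch. It is not the pipeline's own code. The protein names and values are hypothetical, the function names `classify` and `volcano_point` are invented for illustration, and it assumes that fold changes are given on a log2 scale (so a cutoff of 2 corresponds to |log2 FC| = 1) and that a protein counts as significant only when it passes both cutoffs.

```python
import math

FC_CUTOFF = 2.0   # fold-change cutoff stated in the caption
P_CUTOFF = 0.05   # adjusted p-value cutoff stated in the caption


def classify(log2_fc, adj_p, fc_cutoff=FC_CUTOFF, p_cutoff=P_CUTOFF):
    """Label one protein as 'up', 'down' or 'ns' (not significant).

    Assumption: both cutoffs must be passed, with the p-value compared
    strictly (adj_p < p_cutoff) and the fold change inclusively.
    """
    if adj_p >= p_cutoff or abs(log2_fc) < math.log2(fc_cutoff):
        return "ns"
    return "up" if log2_fc > 0 else "down"


def volcano_point(log2_fc, adj_p):
    """Return the (x, y) position of a protein on a volcano plot."""
    return log2_fc, -math.log10(adj_p)


# Hypothetical example values: name -> (log2 fold change, adjusted p-value)
proteins = {
    "P1": (1.5, 0.001),   # large positive change, significant
    "P2": (-2.0, 0.01),   # large negative change, significant
    "P3": (0.4, 0.0001),  # significant p-value, but change too small
    "P4": (3.0, 0.2),     # large change, but p-value too high
}

labels = {name: classify(fc, p) for name, (fc, p) in proteins.items()}
for name, label in labels.items():
    print(name, label)
```

In this sketch the dashed vertical lines of a volcano plot would sit at x = ±log2(2) = ±1 and the dashed horizontal line at y = -log10(0.05); only points beyond both lines receive a colour.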
