Sci Data. 2022 Jun 14;9(1):335.

doi: 10.1038/s41597-022-01380-9.

Implementing the reuse of public DIA proteomics datasets: from the PRIDE database to Expression Atlas

Affiliations

¹ European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom. walzer@ebi.ac.uk.
² European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom.
³ Division of Evolution, Infection and Genomics, School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Oxford Road, Manchester, M13 9PT, United Kingdom.
⁴ Melandra Limited, 16 Brook Road, Urmston, Manchester, M41 5RY, United Kingdom.
⁵ School of Biological Sciences, Chlorine Gardens, Queen's University Belfast, Belfast, BT9 5DL, United Kingdom.
⁶ European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom. juan@ebi.ac.uk.

PMID: 35701420
PMCID: PMC9197839
DOI: 10.1038/s41597-022-01380-9

Mathias Walzer et al. Sci Data. 2022.

Abstract

The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Unlike Data Dependent Acquisition datasets, the re-use of DIA datasets has been rather limited to date, despite its high potential, due to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Graphical representation of the DIA data reanalysis pipeline, consisting of 4 parts. (a) Data curation: Metadata annotation protocols and dataset acquisition. (b) SWATH-MS data analysis: Nextflow workflow including steps ranging from data conversion, SWATH-MS window management, data quality assessment and control (QA/QC), OpenSWATH analysis, FDR calculation to measurement alignment. (c) Statistical analysis: Nextflow workflow for MSstats analysis, normalisation and result filtering. (d) Data integration: Data preparation, accession mapping and integration into Expression Atlas.

**Fig. 2**
(a) Number of detected proteins per dataset at different FDR levels in the data reanalysis. (b) Protein detection results after 1% protein FDR threshold filtering. Original data refers to the protein numbers stated in the respective publication, reported at 1% protein FDR unless indicated otherwise. Reanalysis numbers are provided unfiltered and with the consistency filter applied (at least 50% of a protein’s peptide fragment targets have to be detected within a study group). *Protein numbers for datasets PXD000672 and PXD004873 were reported in the original publications at 0.1% protein-level FDR only; for dataset PXD004589, numbers were reported at 0.1% peptide-level FDR.
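
The consistency filter described in this caption can be illustrated with a minimal Python sketch. It is not the pipeline's code: the function names, the input layout (study group mapped to per-target detection flags), and the choice to keep a protein when any one study group meets the threshold are assumptions made for illustration, since the caption does not say whether one group or every group must qualify.

```python
def detected_fraction(target_detected):
    """Fraction of a protein's fragment targets detected in one study group."""
    if not target_detected:
        return 0.0
    hits = sum(1 for hit in target_detected.values() if hit)
    return hits / len(target_detected)


def passes_consistency_filter(groups, min_fraction=0.5):
    """Keep a protein if any study group reaches the detection fraction.

    Assumption: "any group" is one possible reading of the caption;
    the 0.5 default mirrors the stated "at least 50%" threshold.
    """
    return any(detected_fraction(t) >= min_fraction for t in groups.values())


# Hypothetical proteins with made-up fragment target identifiers.
protein_a = {
    "control": {"frag1": True, "frag2": True, "frag3": False, "frag4": False},
    "treated": {"frag1": False, "frag2": False, "frag3": False, "frag4": True},
}
protein_b = {
    "control": {"frag1": True, "frag2": False, "frag3": False},
    "treated": {"frag1": False, "frag2": False, "frag3": False},
}

kept = [name for name, groups in (("A", protein_a), ("B", protein_b))
        if passes_consistency_filter(groups)]
```

Here protein A is kept because half of its targets are detected in the control group, while protein B never exceeds one third in either group.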

**Fig. 3**
Violin plots showing the results of the group-wise CV comparisons: (a) PXD003497 reanalysis; (b) PXD003497 original data; (c) PXD004873 reanalysis; (d) PXD004873 original data; (e) PXD014194 reanalysis; (f) PXD014194 original data. As can be seen from the similar sizes and shapes of the violin plots, the CVs across the datasets are largely concordant.
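
For readers unfamiliar with the metric, a group-wise coefficient of variation (CV) can be computed as in the sketch below. This is a generic illustration, not the authors' implementation: it assumes linear-scale (not log-transformed) intensities, uses the sample standard deviation, and the group names and values are invented.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Percent CV = 100 * sample standard deviation / mean."""
    if len(values) < 2:
        raise ValueError("need at least two replicate values")
    centre = mean(values)
    if centre == 0:
        raise ValueError("CV is undefined for a zero mean")
    return 100.0 * stdev(values) / centre


def groupwise_cv(intensities_by_group):
    """Compute the percent CV of one protein separately for each study group."""
    return {group: coefficient_of_variation(values)
            for group, values in intensities_by_group.items()}


# Hypothetical linear-scale intensities of one protein in two groups.
example = {
    "group_1": [90.0, 100.0, 110.0],
    "group_2": [200.0, 200.0, 200.0],
}
cvs = groupwise_cv(example)
```

Collecting such per-protein CVs for every group yields the distributions that a violin plot summarises.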

**Fig. 4**
Correlation analysis of reported log2 protein intensities from technical replicate pairs: (a) PXD003497 reanalysis; (b) PXD003497 original data; (c) PXD004873 reanalysis; (d) PXD004873 original data; (e) PXD014194 reanalysis; (f) PXD014194 original data. The first items of pairs are on the x-axis and second items are on the y-axis. Each point represents a protein. The point density is indicated by the colour gradient, with black showing the lowest density. The higher the density the lighter the colour becomes.

**Fig. 5**
Volcano plots corresponding to the differential expression analysis for dataset PXD014943: (a) extranodal diffuse large B-cell lymphoma (eDLBCL) versus primary central nervous system lymphoma (PCNSL); (b) intravascular lymphoma (IVL) versus eDLBCL. For dataset PXD004691: (c) normal tissue (fresh frozen) versus PrC (fresh frozen); (d) normal tissue (paraffin embedded) versus tumour tissue (paraffin embedded). For dataset PXD000672: (e) benign tissue samples versus clear cell RCC; (f) clear cell RCC versus papillary RCC. Each point represents the fold change (FC) of one protein. Proteins with a significant FC are highlighted in colour; dashed lines indicate the fold-change cutoff of 2 and the (adjusted) p-value cutoff of 0.05.
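
The two cutoffs named in this caption translate into a simple classification rule, sketched below in Python. The function name and labels are hypothetical, and whether the boundaries are inclusive is an assumption; the caption only gives the cutoff values (fold change of 2, adjusted p-value of 0.05).

```python
import math


def classify_protein(fold_change, adjusted_p, fc_cutoff=2.0, p_cutoff=0.05):
    """Label a protein by its position relative to the volcano-plot cutoffs.

    Assumptions: fold_change is a positive linear-scale ratio, the fold-change
    boundary is inclusive, and the p-value boundary is exclusive.
    """
    log2_fc = math.log2(fold_change)
    # A fold-change cutoff of 2 corresponds to |log2 FC| >= 1.
    if adjusted_p < p_cutoff and abs(log2_fc) >= math.log2(fc_cutoff):
        return "up" if log2_fc > 0 else "down"
    return "not significant"


# Hypothetical (fold change, adjusted p-value) pairs.
labels = [
    classify_protein(4.0, 0.01),
    classify_protein(0.25, 0.01),
    classify_protein(1.5, 0.01),
    classify_protein(4.0, 0.20),
]
```

Only proteins that clear both cutoffs would be coloured; a large fold change with a weak p-value, or the reverse, stays in the "not significant" class.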

Cited by

Integrated view and comparative analysis of baseline protein expression in mouse and rat tissues.
Wang S, García-Seisdedos D, Prakash A, Kundu DJ, Collins A, George N, Fexova S, Moreno P, Papatheodorou I, Jones AR, Vizcaíno JA. PLoS Comput Biol. 2022 Jun 17;18(6):e1010174. doi: 10.1371/journal.pcbi.1010174. eCollection 2022 Jun. PMID: 35714157.
The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences.
Perez-Riverol Y, Bai J, Bandla C, García-Seisdedos D, Hewapathirana S, Kamatchinathan S, Kundu DJ, Prakash A, Frericks-Zipper A, Eisenacher M, Walzer M, Wang S, Brazma A, Vizcaíno JA. Nucleic Acids Res. 2022 Jan 7;50(D1):D543-D552. doi: 10.1093/nar/gkab1038. PMID: 34723319.
Computational and Systems Biology Advances to Enable Bioagent Agnostic Signatures.
Lin A, Torres CM, Hobbs EC, Bardhan J, Aley SB, Spencer CT, Taylor KL, Chiang T. Health Secur. 2024 Mar-Apr;22(2):130-139. doi: 10.1089/hs.2023.0076. Epub 2024 Mar 13. PMID: 38483337.
Integrated View of Baseline Protein Expression in Human Tissues.
Prakash A, García-Seisdedos D, Wang S, Kundu DJ, Collins A, George N, Moreno P, Papatheodorou I, Jones AR, Vizcaíno JA. J Proteome Res. 2023 Mar 3;22(3):729-742. doi: 10.1021/acs.jproteome.2c00406. Epub 2022 Dec 28. PMID: 36577097.
PM2.5, component cause of severe metabolically abnormal obesity: An in silico, observational and analytical study.
Lobato S, Castillo-Granada AL, Bucio-Pacheco M, Salomón-Soto VM, Álvarez-Valenzuela R, Meza-Inostroza PM, Villegas-Vizcaíno R. Heliyon. 2024 Apr 3;10(7):e28936. doi: 10.1016/j.heliyon.2024.e28936. eCollection 2024 Apr 15. PMID: 38601536.

Grants and funding

WT_/Wellcome Trust/United Kingdom
