Mol Cell Proteomics. 2015 Mar;14(3):771-81.
doi: 10.1074/mcp.O114.039115. Epub 2014 Dec 11.

mzDB: a file format using multiple indexing strategies for the efficient analysis of large LC-MS/MS and SWATH-MS data sets

David Bouyssié et al. Mol Cell Proteomics. 2015 Mar.

Abstract

The analysis and management of MS data, especially those generated by data-independent MS acquisition, exemplified by SWATH-MS, pose significant challenges for proteomics bioinformatics. The large size and vast amount of information inherent to these data sets need to be properly structured to enable efficient and straightforward extraction of the signals used to identify specific target peptides. Standard XML-based formats are not well suited to large MS data files, for example, those generated by SWATH-MS, and compromise high-throughput data processing and storage. We developed mzDB, an efficient file format for large MS data sets. It relies on the SQLite software library and consists of a standardized and portable server-less single-file database. An optimized 3D indexing approach is adopted, where the LC-MS coordinates (retention time and m/z), along with the precursor m/z for SWATH-MS data, are used to query the database for data extraction. In comparison with XML formats, mzDB saves ∼25% of storage space and improves access times by a factor of 2 to 2000, depending on the particular data access. Similarly, mzDB shows slightly to significantly lower access times than other formats such as mz5. Both C++ and Java implementations, converting raw or XML formats to mzDB and providing access methods, will be released under a permissive license. mzDB can be easily accessed by the SQLite C library and its drivers for all major languages, and browsed with existing dedicated GUIs. The mzDB format described here can boost existing mass spectrometry data analysis pipelines, offering unprecedented performance in terms of efficiency, portability, compactness, and flexibility.
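To illustrate the kind of query the abstract describes, the following is a minimal sketch using Python's built-in sqlite3 driver and SQLite's R*Tree module. It is not the actual mzDB schema: only the table name bounding_box_rtree appears in the paper (Fig. 1), while the column names, the sample boxes, and the helper functions are assumptions made for this example. The sketch indexes two dimensions (m/z and time); for SWATH-MS data a third coordinate pair for the precursor m/z could be added in the same way. It also assumes the SQLite build has the R*Tree extension enabled.

```python
import sqlite3


def build_demo_index(conn):
    """Create a small R*Tree table with made-up bounding boxes.

    Column names are hypothetical; only the table name comes from the paper.
    """
    conn.execute(
        "CREATE VIRTUAL TABLE bounding_box_rtree USING rtree("
        "id, min_mz, max_mz, min_time, max_time)"
    )
    boxes = [
        (1, 400.0, 405.0, 0.0, 15.0),
        (2, 400.0, 405.0, 15.0, 30.0),
        (3, 405.0, 410.0, 0.0, 15.0),
        (4, 500.0, 505.0, 0.0, 15.0),
    ]
    conn.executemany(
        "INSERT INTO bounding_box_rtree VALUES (?, ?, ?, ?, ?)", boxes
    )


def boxes_overlapping(conn, mz_lo, mz_hi, rt_lo, rt_hi):
    """Return ids of bounding boxes that overlap the requested m/z x time region."""
    rows = conn.execute(
        "SELECT id FROM bounding_box_rtree "
        "WHERE max_mz >= ? AND min_mz <= ? "
        "AND max_time >= ? AND min_time <= ? "
        "ORDER BY id",
        (mz_lo, mz_hi, rt_lo, rt_hi),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo_index(conn)
    # Region of interest: m/z 402-406, retention time 5-10 s.
    print(boxes_overlapping(conn, 402.0, 406.0, 5.0, 10.0))
```

In a real file, the ids returned by such a query would then be used to fetch the stored peak data for only the matching boxes, rather than scanning every spectrum.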

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Simplified relational model of the mzDB data format. Most of the table names and content are identical to the main nodes of the mzML PSI standard. Bounding boxes are indexed through three different tables: spectrum, run_slice, and bounding_box_rtree. The “run slice” concept has been introduced by the mzDB format.
Fig. 2.
Data structure of the mzDB file. LC-MS data are divided into grid cells of custom m/z and time widths, called bounding boxes (BBs). Each spectrum is first split into several spectrum slices of a given m/z window. Spectrum slices belonging to the same m/z window and eluting in a given time window are grouped into a BB. A run slice is composed of all BBs having the same m/z window.
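The gridding described in this caption can be sketched in a few lines of Python. This is an illustration only, not the converter's implementation: the function names, the in-memory layout, and the default widths (5 Da and 15 s) are assumptions chosen for the example.

```python
from collections import defaultdict


def split_into_bounding_boxes(spectra, mz_width=5.0, time_width=15.0):
    """Group peaks into grid cells keyed by (m/z window index, time window index).

    spectra: iterable of (retention_time, [(mz, intensity), ...]).
    Returns {(mz_idx, time_idx): {retention_time: [(mz, intensity), ...]}}.
    """
    boxes = defaultdict(lambda: defaultdict(list))
    for rt, peaks in spectra:
        time_idx = int(rt // time_width)
        for mz, intensity in peaks:
            mz_idx = int(mz // mz_width)
            # A spectrum slice is the set of peaks of one spectrum that fall
            # inside one m/z window; slices are stored per retention time.
            boxes[(mz_idx, time_idx)][rt].append((mz, intensity))
    return boxes


def run_slice(boxes, mz_idx):
    """Keys of all bounding boxes sharing one m/z window, ordered by time window."""
    return sorted(key for key in boxes if key[0] == mz_idx)


if __name__ == "__main__":
    spectra = [
        (2.0, [(401.2, 10.0), (403.9, 5.0), (407.5, 7.0)]),
        (20.0, [(402.0, 3.0)]),
    ]
    boxes = split_into_bounding_boxes(spectra)
    print(run_slice(boxes, 80))
```

With these made-up spectra, the first spectrum contributes slices to two m/z windows, and the run slice for the 400-405 window spans two time windows.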
Fig. 3.
Sequential reading times. The time (in seconds) required for reading sequentially all the MS spectra contained in a file was measured after conversion into the three data formats (mzML: green; mz5: red; and mzDB: blue) in uncompressed profile mode. A large number of DDA files of different sizes were used for this test (636 in total), and for each file, the total reading time was plotted against the file size (expressed as the number of data points in the file, that is, the number of m/z-intensity pairs). The speed for sequential reading was expressed using the slope of the linear fit of all the points for each file format (×10⁷) and reported in the bottom table. Both mz5 and mzDB formats clearly outperform mzML, whereas mz5 is only slightly faster than mzDB for sequential reading.
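As a reading aid for the slope metric in this caption, here is a small sketch of an ordinary least-squares fit of reading time against number of data points. Whether the authors fitted with or without an intercept is not stated here, so the intercept is an assumption, and the numbers below are invented for illustration; they are not results from the paper.

```python
def fit_slope(n_points, read_times):
    """Least-squares slope (seconds per data point) of time versus file size."""
    n = len(n_points)
    mean_x = sum(n_points) / n
    mean_y = sum(read_times) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(n_points, read_times))
    sxx = sum((x - mean_x) ** 2 for x in n_points)
    return sxy / sxx


if __name__ == "__main__":
    # Invented example: three files with 1e7, 2e7 and 4e7 data points.
    sizes = [1e7, 2e7, 4e7]
    times = [3.0, 5.0, 9.0]
    # Scale by 1e7 so the value reads as seconds per ten million data points.
    print(fit_slope(sizes, times) * 1e7)
```

A smaller scaled slope means less time spent per data point, so a format with a lower value reads sequentially faster.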
Fig. 4.
Benchmarking of the different data file formats on DDA data. A, Schematic representation of the data accesses used to assess performance. Different kinds of reading and data extraction were performed on a DDA file (1.6 GB), illustrated here as a bidimensional LC-MS map along m/z and RT axes. Test 1 (green): sequential reading, by scan iteration, of all the MS and MS/MS spectra, representing the most classical data access type; Test 2 (purple): extraction of a region encompassing an m/z window of 5 Da over the whole RT range (run slice). In this second test, 100 extractions of this type were performed, for m/z windows centered around 100 randomly selected m/z values, and the total reading time was measured; Test 3 (red): systematic iterative reading of the whole file along the m/z dimension with an m/z window of 5 Da (iteration of run slices); Tests 4 and 5 (blue): targeted extraction of specific regions of the LC-MS map, defined as “small” rectangular regions (60 s and 5 Da windows) or “large” rectangular regions (200 s and 5 Da windows). For tests 4 and 5, 100 different extractions were performed in each case, around randomly chosen m/z and RT values. In the case of mzDB, the data accesses implemented in tests 2 and 3 take advantage of the run slice indexing introduced in the format, whereas tests 4 and 5 take advantage of the R*Tree index for rapid access to the targeted region. B, Benchmark results of the tests for the different formats (mzDB, mz5, native raw, and mzML). Results are expressed as total access time in seconds for the different tests described above, on the four compared file formats. The conversion time (seconds) needed to convert the raw file into mzDB, mz5, and mzML, respectively, is indicated in the first line (uncompressed mode for mz5 and mzML, profile mode for mzDB). The last three columns indicate the ratio in total access time between mzDB and the other formats.
Fig. 5.
Performance comparison of mzDB versus mzXML on SWATH-MS data. Tests were performed on four SWATH-MS files of increasing size (2, 5, 10, and 25 GB), corresponding to samples of different complexity. In each case, the histograms illustrate the total processing time needed to extract 320 XICs (10 per swath) of two different sizes (50 ppm × 60 s or 50 ppm × 200 s), either on the mzDB file (yellow) or on the mzXML one (blue). The times are averages of 10 repetitions. The loading time for each file is also reported.
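To make the XIC dimensions in this caption concrete, the sketch below converts a ppm tolerance and a time width into an extraction window and sums the intensities that fall inside it. It assumes the ppm value is a symmetric tolerance around the target m/z and that the time width is centered on the target retention time; the caption does not specify either, and the function names and sample peaks are invented for the example.

```python
def xic_window(mz, rt, ppm_tol=50.0, rt_width=60.0):
    """Return (mz_lo, mz_hi, rt_lo, rt_hi) for a target m/z and retention time.

    Assumes a symmetric +/- ppm tolerance and a centered time window.
    """
    delta = mz * ppm_tol / 1e6
    half = rt_width / 2
    return (mz - delta, mz + delta, rt - half, rt + half)


def extract_xic(peaks, window):
    """Sum intensity per retention time for peaks inside the window.

    peaks: iterable of (rt, mz, intensity).
    Returns a list of (rt, summed_intensity) sorted by rt.
    """
    mz_lo, mz_hi, rt_lo, rt_hi = window
    xic = {}
    for rt, mz, intensity in peaks:
        if mz_lo <= mz <= mz_hi and rt_lo <= rt <= rt_hi:
            xic[rt] = xic.get(rt, 0.0) + intensity
    return sorted(xic.items())


if __name__ == "__main__":
    peaks = [
        (90.0, 500.01, 10.0),
        (90.0, 500.02, 5.0),
        (95.0, 500.5, 99.0),   # outside the m/z tolerance
        (140.0, 500.0, 7.0),   # outside the time window
        (100.0, 499.98, 2.0),
    ]
    print(extract_xic(peaks, xic_window(500.0, 100.0)))
```

At m/z 500, a 50 ppm tolerance corresponds to 0.025 Da on each side, which is why indexed access to a narrow region matters more than sequential reading for this workload.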
