Comparative Study
2012 May 30;13:115.
doi: 10.1186/1471-2105-13-115.

PyMS: a Python toolkit for processing of gas chromatography-mass spectrometry (GC-MS) data. Application and comparative study of selected tools

Sean O'Callaghan et al. BMC Bioinformatics.

Abstract

Background: Gas chromatography-mass spectrometry (GC-MS) is a technique frequently used in targeted and non-targeted measurements of metabolites. Most existing software tools for processing raw instrument GC-MS data tightly integrate data processing methods with a graphical user interface, facilitating interactive data processing. While interactive processing remains critically important in GC-MS applications, high-throughput studies increasingly dictate the need for command line tools suitable for scripting high-throughput, customized processing pipelines.

Results: PyMS comprises a library of functions for processing of instrument GC-MS data developed in Python. PyMS currently provides a complete set of GC-MS processing functions, including reading of standard data formats (ANDI-MS/NetCDF and JCAMP-DX), noise smoothing, baseline correction, peak detection, peak deconvolution, peak integration, and peak alignment by dynamic programming. A novel common ion single quantitation algorithm allows automated, accurate quantitation of GC-MS electron impact (EI) fragmentation spectra when a large number of experiments are being analyzed. PyMS implements parallel processing for by-row and by-column data processing tasks based on Message Passing Interface (MPI), allowing processing to scale on multiple CPUs in distributed computing environments. A set of specifically designed experiments was performed in-house and used to comparatively evaluate the performance of PyMS and three widely used software packages for GC-MS data processing (AMDIS, AnalyzerPro, and XCMS).

Conclusions: PyMS is a novel software package for the processing of raw GC-MS data, particularly suitable for scripting of customized processing pipelines and for data processing in batch mode. PyMS provides limited graphical capabilities and can be used both for routine data processing and interactive/exploratory data analysis. In real-life GC-MS data processing scenarios PyMS performs as well as or better than leading software packages. We demonstrate data processing scenarios that are simple to implement in PyMS, yet difficult to achieve with many conventional GC-MS data processing software packages. Automated sample processing and quantitation with PyMS can provide substantial time savings compared to more traditional interactive software systems that tightly integrate data processing with the graphical user interface.

Figures

Figure 1
The binning in PyMS. The function "build_intensity_matrix()" bins raw GC-MS data, producing an object of the type "IntensityMatrix" from the object "GCMS_data". The object "GCMS_data", which holds the raw GC-MS data, consists of a list of mass spectral scans taken at equidistant time points along the retention time axis (t, t + Δt, t + 2Δt, …), where each scan is a vector of (m/z, intensity) pairs. Binning creates a two-dimensional table whose cells are filled with the sums of intensities that fall within each m/z bin. The "IntensityMatrix" object contains the binned data table, the vector of retention times that corresponds to the rows of the binned table (t = [t, t + Δt, t + 2Δt, …]), and the vector of m/z values that corresponds to the columns of the binned table (v = [m1, m2, m3, …]); it also defines a number of methods that operate specifically on the "IntensityMatrix" object. Empty cells of the binned table are filled with zeros (zeros not shown).
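The binning step described in this caption can be illustrated with a minimal sketch in plain Python. This is not the PyMS implementation and does not use the PyMS API: the function name `bin_scans`, the `bin_width` parameter, the nearest-multiple binning rule, and the input layout (a list of `(retention_time, [(mz, intensity), ...])` tuples) are all assumptions made for illustration.

```python
from math import floor


def bin_scans(scans, bin_width=1.0):
    """Bin raw scans into a two-dimensional intensity table (illustrative).

    scans: list of (retention_time, [(mz, intensity), ...]) tuples containing
           at least one (mz, intensity) pair overall (assumed input layout).
    Returns (times, mz_bins, matrix), where matrix[i][j] is the summed
    intensity of scan i that falls into m/z bin j.
    """
    def bin_index(mz):
        # Assign each m/z to the nearest multiple of bin_width.
        return floor(mz / bin_width + 0.5)

    indices = [bin_index(mz) for _, pairs in scans for mz, _ in pairs]
    lo, hi = min(indices), max(indices)
    n_bins = hi - lo + 1

    times = [rt for rt, _ in scans]
    mz_bins = [(lo + j) * bin_width for j in range(n_bins)]
    # Cells with no signal stay at zero, as in the caption.
    matrix = [[0.0] * n_bins for _ in scans]
    for row, (_, pairs) in zip(matrix, scans):
        for mz, intensity in pairs:
            row[bin_index(mz) - lo] += intensity
    return times, mz_bins, matrix


# Two hypothetical scans taken one second apart.
example_scans = [
    (0.0, [(50.1, 10.0), (50.4, 5.0), (52.0, 1.0)]),
    (1.0, [(51.2, 7.0)]),
]
times, mz_bins, matrix = bin_scans(example_scans)
```

Rows of `matrix` follow the retention time vector and columns follow the m/z vector, mirroring the layout the caption describes.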
Figure 2
The single ion quantitation algorithm as implemented in PyMS. Shown is a hypothetical alignment table with three peak positions (peak UIDs "149-61-82.3-499.8", "155-101-52.5-561.2", and "161-11-49.8-433.2"), with nine individual peaks detected in three different experiments (shown as columns). For each individual peak, PyMS keeps track of the full mass spectrum and all m/z ions from the peak begin and end scans. In the single ion quantitation algorithm, the N most intense ions are extracted for each peak (by default, N = 5). For each peak position, a single common ion present in all peaks is found. If multiple such ions exist, the first one found is used. This procedure aims to find at least one ion that can be used for consistent quantitation at this peak position across all experiments. In the example shown, for the first and second peak positions the selected common ions are m/z = 149 and m/z = 134, respectively (underlined). For the third peak position (UID = "161-11-49.8-433.2"), none of the ions present in both experiments #1 and #2 are found in the top five ions of experiment #3, suggesting a misalignment at this peak position (peak cell highlighted).
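The common ion selection described in this caption can be sketched as follows. This is a hedged illustration, not the PyMS code: the function names `top_ions` and `common_ion`, the representation of a spectrum as a dict mapping m/z to intensity, and the search order (most intense ion of the first experiment first) are assumptions; the caption only states that the first common ion found is used.

```python
def top_ions(spectrum, n=5):
    """Return the n most intense m/z values of a spectrum, most intense first.

    spectrum: dict mapping m/z -> intensity (assumed representation).
    """
    return sorted(spectrum, key=spectrum.get, reverse=True)[:n]


def common_ion(peaks, n=5):
    """Find one ion shared by the top-n ions of every peak at a peak position.

    peaks: list of spectra, one per experiment, all aligned to one position.
    Returns the first common m/z found, or None when no ion is shared,
    which the caption interprets as a possible misalignment.
    """
    if not peaks:
        return None
    tops = [top_ions(spectrum, n) for spectrum in peaks]
    # Assumed search order: walk the first experiment's ions by intensity.
    for mz in tops[0]:
        if all(mz in other for other in tops[1:]):
            return mz
    return None


# Hypothetical spectra for one peak position in three experiments.
aligned = [
    {149: 100, 73: 80, 147: 60, 205: 40, 117: 20, 55: 5},
    {73: 90, 149: 85, 133: 50, 221: 30, 59: 10},
    {149: 70, 207: 60, 281: 50, 44: 40, 45: 30, 46: 20, 73: 2},
]
quant_ion = common_ion(aligned)
```

In this made-up example m/z 73 is among the five most intense ions in the first two experiments but not in the third, so m/z 149 is the ion selected for quantitation.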
Figure 3
The experiment with low complexity sample matrix and well resolved peaks. A segment of a GC-MS experiment recorded on a control mix with 45 reference compounds. The total ion chromatogram (TIC) is shown as a dotted line, and individual ion chromatograms (ICs) in the m/z range 50-550 are overlaid as solid lines. The true multi-component GC-MS peaks, identified manually by an experienced operator, are shown as filled triangles at the bottom of the graph. The peaks detected by four software packages are shown in the upper part of the figure ('*': AMDIS, '+': XCMS, 'x': PyMS, 'o': AnalyzerPro). The segment chosen for this analysis is sparsely populated with peaks, which are all well resolved. All four software packages performed comparably to the human operator. Clustering of individual single ion peaks around true multi-component GC-MS peaks is evident from the peaks detected by XCMS. For the list of detected signals see Additional file 1.
Figure 4
The experiment with low complexity sample matrix and moderate peak overlap. A segment of a GC-MS experiment recorded on a control mix with 45 reference compounds. The total ion chromatogram (TIC) is shown as a dotted line, and individual ion chromatograms (ICs) in the m/z range 50-550 are overlaid as solid lines. The true multi-component GC-MS peaks, identified manually by an experienced operator, are shown as filled triangles at the bottom of the graph. The peaks detected by four software packages are shown in the upper part of the figure ('*': AMDIS, '+': XCMS, 'x': PyMS, 'o': AnalyzerPro). This elution segment includes a few closely eluting peaks in the retention time range 1895-1920 s. For example, the range 1910-1920 s contains three distinct co-eluting compounds that give the appearance of two TIC peaks. Both PyMS and AnalyzerPro deconvoluted these three peaks successfully, and also detected the other peaks in the chromatogram comparably to the human operator. AMDIS significantly overestimated the number of peaks in this segment. By default, XCMS looks for single ion peaks rather than GC-MS type multi-ion peaks, and therefore detected many such peaks in this segment. For the list of detected signals see Additional file 2.
Figure 5
The experiment with complex biological matrix and moderate peak overlap. A GC-MS experiment was recorded on foetal calf serum spiked with a mix of 17 known compounds. The total ion chromatogram (TIC) is shown as a dotted line, and individual ion chromatograms (ICs) in the m/z range 50-550 are overlaid as solid lines. The true multi-component GC-MS peaks, identified manually by an experienced operator, are shown as filled triangles at the bottom of the graph. The peaks detected by four software packages are shown in the upper part of the figure ('*': AMDIS, '+': XCMS, 'x': PyMS, 'o': AnalyzerPro). The region shown includes two of the spiked compounds, aspartic acid (the large peak near 524 s) and malic acid (the large peak near 540 s). Two smaller peaks elute close to malic acid, providing a good test for deconvolution. Both PyMS and AnalyzerPro performed similarly to an experienced operator in the identification of peaks. For the list of detected signals see Additional file 3.
Figure 6
The experiment with complex biological matrix and heavy peak overlap. A GC-MS experiment was recorded on foetal calf serum spiked with a mix of 17 known compounds. The total ion chromatogram (TIC) is shown as a dotted line, and individual ion chromatograms (ICs) in the m/z range 50-550 are overlaid as solid lines. The true multi-component GC-MS peaks, identified manually by an experienced operator, are shown as filled triangles at the bottom of the graph. The broad peak observed in the retention time range 464-469 s is an overloaded urea peak. The four packages each picked a different number of peaks, with AMDIS and XCMS picking many more peaks than either PyMS or AnalyzerPro. Both PyMS and AnalyzerPro performed similarly to an experienced operator, with some false positive peaks reported. In the area near 476 s both packages reported several weak signals that were not annotated by the human operator. For the list of detected signals see Additional file 4.
Figure 7
Relative quantitation by AnalyzerPro and PyMS. Shown is a comparison of raw areas calculated by AnalyzerPro (x-axis) and PyMS (y-axis) for the data and segment depicted in Figure 3, including signal peaks correctly identified by both programs. The segment chosen for this analysis involves largely well-resolved peaks, and therefore provides a good test of relative quantitation, free of interference from potential deconvolution errors. For each program, its own internal algorithm was used to calculate peak boundaries and total peak areas. A good linear relationship between the raw areas reported by the two software packages was observed. The correlation coefficient for the data shown is 0.998.
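A correlation coefficient of the kind reported in this caption can be computed from paired peak areas as sketched below. The caption does not state which coefficient was used; the sketch assumes the Pearson product-moment coefficient, and the function name `pearson_r` and the example areas are hypothetical (they are not the paper's data).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes at least two points and that neither sequence is constant
    (a constant sequence would make the denominator zero).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical raw areas for the same peaks from two programs.
areas_program_a = [1.0e5, 2.5e5, 4.0e5, 9.0e5]
areas_program_b = [1.1e5, 2.4e5, 4.2e5, 8.8e5]
r = pearson_r(areas_program_a, areas_program_b)
```

A value close to 1 indicates that the two sets of raw areas are nearly proportional up to an offset, which is what matters for relative quantitation.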
Figure 8
Absolute quantitation with PyMS and AnalyzerPro. Samples of foetal calf serum were spiked with decreasing concentrations of a mix of 17 reference compounds. Shown is the quantitation of peak areas by PyMS and AnalyzerPro for four different reference compounds (methionine, trehalose, aspartic acid, and valine) across four different GC-MS experiments, spiked with 6, 12.5, 25, and 50 μl of reference compounds (the dose level of spiked reference compounds is shown as a sample label on the x-axis). To account for metabolites naturally occurring in foetal calf serum, the areas were normalized to the lowest concentration of each compound. Good agreement between expected and observed absolute quantitation was observed for both AnalyzerPro and PyMS.
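The normalization mentioned in this caption can be sketched as dividing each compound's peak areas by its area at the lowest spiked dose. This is an assumed reading of "normalized to the lowest concentration"; the function name `normalize_to_lowest_dose` and the example areas are hypothetical, and only the dose levels (6, 12.5, 25, and 50 μl) come from the caption.

```python
def normalize_to_lowest_dose(areas_by_dose):
    """Divide each peak area by the area measured at the lowest dose.

    areas_by_dose: dict mapping spiked dose (in μl) -> peak area for one
    compound (assumed layout). The area at the lowest dose must be non-zero.
    """
    lowest_dose = min(areas_by_dose)
    reference_area = areas_by_dose[lowest_dose]
    return {dose: area / reference_area for dose, area in areas_by_dose.items()}


# Made-up areas for one compound at the four dose levels from the caption.
example_areas = {6: 200.0, 12.5: 400.0, 25: 800.0, 50: 1600.0}
normalized = normalize_to_lowest_dose(example_areas)
```

After normalization the lowest dose always maps to 1.0, so the curves for different compounds and programs can be compared on a common scale.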
