Mol Cell Proteomics. 2017 Feb;16(2):255-264.
doi: 10.1074/mcp.M116.062588. Epub 2016 Dec 9.

A Multivariate Mixture Model to Estimate the Accuracy of Glycosaminoglycan Identifications Made by Tandem Mass Spectrometry (MS/MS) and Database Search

Yulun Chiu et al. Mol Cell Proteomics. 2017 Feb.

Abstract

We present a statistical model to estimate the accuracy of derivatized heparin and heparan sulfate (HS) glycosaminoglycan (GAG) assignments to tandem mass (MS/MS) spectra made by the first published database search application, GAG-ID. Employing a multivariate expectation-maximization algorithm, this statistical model distinguishes correct from ambiguous and incorrect database search results when computing the probability that heparin/HS GAG assignments to spectra are correct based upon database search scores. Using GAG-ID search results for spectra generated from a defined mixture of 21 synthesized tetrasaccharide sequences as well as seven spectra of longer defined oligosaccharides, we demonstrate that the computed probabilities are accurate and have high power to discriminate between correctly, ambiguously, and incorrectly assigned heparin/HS GAGs. This analysis makes it possible to filter large MS/MS database search results with predictable false identification error rates.
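The filtering step mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes each search result already carries a model-computed probability of being correct, and it assumes the expected false identification rate of a filtered set is the mean of (1 - probability) over the kept results, a common convention for probability-based filtering that the abstract itself does not spell out. The function name, spectrum identifiers, and probabilities below are hypothetical.

```python
def filter_by_probability(results, min_probability):
    """Keep results at or above the threshold and estimate their error rate.

    `results` is a list of (spectrum_id, probability_correct) pairs.
    The error estimate is the mean of (1 - p) over the kept results,
    which is an assumption of this sketch, not a detail from the paper.
    """
    kept = [(sid, p) for sid, p in results if p >= min_probability]
    if not kept:
        return kept, 0.0
    expected_errors = sum(1.0 - p for _, p in kept)
    return kept, expected_errors / len(kept)


# Hypothetical spectra and probabilities, for illustration only.
example_results = [("s1", 0.99), ("s2", 0.95), ("s3", 0.60), ("s4", 0.10)]
kept, estimated_error_rate = filter_by_probability(example_results, 0.90)
print(len(kept), round(estimated_error_rate, 3))
```

Raising the threshold keeps fewer results but lowers the estimated error rate, which is what makes the error rate of a filtered set predictable.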

Figures

Fig. 1.
Screenshot of GAG-ID interface and report. The GAG result was generated from MS/MS spectra in .mgf format, searched with the given parameters against the GAG-DB. (a) The interface of GAG-ID data submission. Several parameters are required, including project name, tolerance for MS and MS/MS, database, HS length, modifications and input peak list (.mgf format). (b) Results. The MS/MS search results, including experimental m/z (EXP_MZ), theoretical m/z (DB_MZ), difference between experimental and theoretical m/z (Diff MZ), charge (Z), score, Delta deviation (S-ΔDev(%)), total ion count (TIC), retention time (RT(min)), and a clickable link to the summary page. (c) Summary. Isomeric structures matched to that MS/MS spectrum with a score greater than zero, including sequence, number of unique matched ions (UNMIon), a summation of ion ratio matched (SIRatio), and score. (d) SpectraViewer. When a sequence is clicked in the summary, the MS/MS data complete with peak annotation are presented by the spectra viewer. (e) Detail. When a score is clicked in the summary, tabulated details of the matched fragment ions are listed for export by the user (reproduced from reference (9)).
Fig. 2.
Construction of decoy sequences. A tetrasaccharide sequence and its decoy sequences. (a) Target sequence (ABAB). (b) Decoy misplaced sequence 1 (BABA). (c) Decoy misplaced sequence 2 (AABB). (d) Decoy m/z-shift method 1 (A4BAB4). (e) Decoy m/z-shift method 2 (ABAB47). A yellow circle indicates artificially adding n Da at that position, which is balanced by subtracting the same value at the position indicated by a green circle, where n can be any number; for example, n = 4 in Fig. 2d and n = 47 in Fig. 2e.
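The balanced m/z-shift idea in this caption can be sketched in a few lines. The example below is only an illustration: the residue masses are round placeholder numbers rather than real derivatized masses, the function name is hypothetical, and for A4BAB4 it assumes the subtraction is applied to the last glucosamine, which the captions do not state explicitly.

```python
import math

# Placeholder residue masses in Da (illustrative, not real derivatized masses).
URONIC_ACID = 176.0   # "A"
GLUCOSAMINE = 161.0   # "B"


def mass_shift_decoy(residue_masses, add_at, subtract_at, n):
    """Return a decoy whose residues are shifted by +n and -n Da.

    Equal numbers of additions and subtractions keep the total mass
    unchanged, so only the fragment masses differ from the target.
    """
    if len(add_at) != len(subtract_at):
        raise ValueError("additions and subtractions must balance")
    decoy = list(residue_masses)
    for i in add_at:
        decoy[i] += n
    for i in subtract_at:
        decoy[i] -= n
    return decoy


target = [URONIC_ACID, GLUCOSAMINE, URONIC_ACID, GLUCOSAMINE]  # ABAB

# ABAB47: +47 Da on each uronic acid, -47 Da on each glucosamine.
decoy_47 = mass_shift_decoy(target, add_at=[0, 2], subtract_at=[1, 3], n=47.0)

# A4BAB4: +4 Da on the first uronic acid, -4 Da on the last glucosamine
# (the choice of the last glucosamine is an assumption of this sketch).
decoy_4 = mass_shift_decoy(target, add_at=[0], subtract_at=[3], n=4.0)

print(decoy_47, decoy_4)
```

Because the shifts cancel, the decoy has the same total mass as the target while its individual residue masses differ.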
Fig. 3.
Underestimated error rate by TDA. The number of incorrect identifications from GAG-ID plotted by searching against the target sequence (ABAB) database that is missing all correct target sequences (dotted blue line, the uppermost curve) and decoy sequence (BABA, ABAB47, A4BAB4, AABB) databases of identical size to the edited target sequence database (the remaining lines). “A” represents uronic acid; “B” represents glucosamine. ABAB47 indicates that an artificial mass of 47 Da was added to each uronic acid unit and subtracted from each glucosamine unit. A4BAB4 indicates that an artificial mass of 4 Da was added to the uronic acid on the nonreducing end and subtracted from the glucosamine unit.
Fig. 4.
Score distribution among correct, ambiguous and incorrect sequence assignments. Actual distribution results from searching a comprehensive tetrasaccharide database (solid line) and a mixture model (dotted line) derived from GAG-ID scores. The whole dataset distribution is shown in blue, with correct identifications in green, ambiguous identifications in red, and incorrect identifications in black. Ambiguous assignments are defined as correct assignments for which the S-ΔDev is low (<20%).
Fig. 5.
Data and model. Each annotated spectrum is presented as a dot in the graphs. Green dots indicate spectra that were classified by the two-component model as correct assignments, and red dots indicate spectra that were classified as incorrect assignments. Panel (a) shows spectra that were in reality matched correctly, and panel (b) shows spectra that were in reality matched incorrectly.
Fig. 6.
Multivariate EM model. (a) EM model of three normal distributions for the whole acquired spectrum. Three categories were predicted, correct assignment (blue dots), incorrect assignment (red dots), and ambiguous assignment (green dots). (b) Several larger oligosaccharide GAG-ID search results were used to evaluate the EM model. Number 1 indicates correct assignment of dodecamer; 2 indicates correct assignment of decamer; numbers 3–7 indicate ambiguous assignment of dodecamer or decamer.
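A three-component mixture of normal distributions fitted by expectation-maximization can be sketched as follows. This is not the authors' implementation: it assumes two features per spectrum (a score and an S-ΔDev-like value), uses diagonal covariances for brevity, and runs on synthetic data whose cluster centres are invented for the example. All function names are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance at point x."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2 * math.pi * vi)
    return density


def posteriors(points, weights, means, variances):
    """E-step: probability that each point belongs to each component."""
    rows = []
    for p in points:
        joint = [w * gaussian_pdf(p, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint) or 1e-300  # guard against underflow to zero
        rows.append([j / total for j in joint])
    return rows


def fit_mixture(points, k=3, iters=100, min_var=1e-6):
    """Fit a k-component diagonal Gaussian mixture by EM."""
    n, dims = len(points), len(points[0])
    # Deterministic start: means taken at evenly spaced ranks of the sorted data.
    ordered = sorted(points)
    means = [list(ordered[(2 * j + 1) * n // (2 * k)]) for j in range(k)]
    centre = [sum(p[i] for p in points) / n for i in range(dims)]
    spread = [sum((p[i] - centre[i]) ** 2 for p in points) / n for i in range(dims)]
    # Shrunken starting variances make the first assignments fairly crisp.
    variances = [[max(s / k ** 2, min_var) for s in spread] for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        resp = posteriors(points, weights, means, variances)
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            for i in range(dims):
                mean = sum(r[j] * p[i] for r, p in zip(resp, points)) / nj
                var = sum(r[j] * (p[i] - mean) ** 2 for r, p in zip(resp, points)) / nj
                means[j][i] = mean
                variances[j][i] = max(var, min_var)
    return weights, means, variances


def synthetic_results(seed=0, per_class=100):
    """Invented (score, deviation) pairs for three well-separated classes."""
    rng = random.Random(seed)
    centres = [(20.0, 10.0), (50.0, 8.0), (80.0, 60.0)]
    return [(rng.gauss(cs, 5.0), rng.gauss(cd, 5.0))
            for cs, cd in centres for _ in range(per_class)]


data = synthetic_results()
weights, means, variances = fit_mixture(data)
top = max(range(3), key=lambda j: means[j][0])  # component with the highest mean score
print(posteriors([(80.0, 60.0)], weights, means, variances)[0][top])
```

The posterior returned for a spectrum is the quantity that would be reported as its probability of belonging to a given class. A full-covariance model would capture correlation between the two features at the cost of more parameters.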
Fig. 7.
Evaluation of EM model. Spectra were assigned into three categories by the multivariate EM algorithm, as indicated by color. For those assigned to the correct category, 627 of 662 spectra are assigned correctly and the false positive rate is 0.05. For those assigned to the incorrect category, 5 of 130 spectra are assigned correctly, and the false negative rate is 0.04. For those assigned to the ambiguous category, 339 of 503 spectra are assigned correctly, and 122 of 503 spectra are assigned correctly on their second hit.
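The rates quoted in this caption follow directly from the counts, as the short check below shows. It reads the false positive rate as the fraction of spectra in the correct category that were not in fact correct, and the false negative rate as the fraction of spectra in the incorrect category that were in fact correct; that reading is inferred from the numbers, and the variable names are illustrative.

```python
def rate(count, total):
    """Fraction of `total` represented by `count`."""
    return count / total


# Counts taken from the caption.
false_positive_rate = rate(662 - 627, 662)   # in "correct" category but wrong
false_negative_rate = rate(5, 130)           # in "incorrect" category but right
ambiguous_first_hit = rate(339, 503)         # ambiguous, correct on first hit
ambiguous_second_hit = rate(122, 503)        # ambiguous, correct on second hit

print(round(false_positive_rate, 2), round(false_negative_rate, 2))
print(round(ambiguous_first_hit, 2), round(ambiguous_second_hit, 2))
```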
