Nat Methods. 2021 Nov;18(11):1304-1316.
doi: 10.1038/s41592-021-01309-x. Epub 2021 Nov 1.

Community evaluation of glycoproteomics informatics solutions reveals high-performance search strategies for serum glycopeptide analysis

Rebeca Kawahara et al. Nat Methods. 2021 Nov.

Abstract

Glycoproteomics is a powerful yet analytically challenging research tool. Software packages aiding the interpretation of complex glycopeptide tandem mass spectra have appeared, but their relative performance remains untested. Conducted through the HUPO Human Glycoproteomics Initiative, this community study, comprising both developers and users of glycoproteomics software, evaluates solutions for system-wide glycopeptide analysis. The same mass spectrometry-based glycoproteomics datasets from human serum were shared with participants and the relative team performance for N- and O-glycopeptide data analysis was comprehensively established by orthogonal performance tests. Although the results were variable, several high-performance glycoproteomics informatics strategies were identified. Deep analysis of the data revealed key performance-associated search parameters and led to recommendations for improved 'high-coverage' and 'high-accuracy' glycoproteomics search solutions. This study concludes that diverse software packages for comprehensive glycopeptide data analysis exist, points to several high-performance search strategies and specifies key variables that will guide future software developments and assist informatics decision-making in glycoproteomics.

Conflict of interest statement

All authors responsible for the study conception/design, data analysis/interpretation and manuscript writing/editing declare no conflict of interest. Participants (teams 1–22) declare a perceived or real financial or academic conflict of interest in the study outcomes, which was mitigated by excluding participants from the analysis and interpretation of data returned by participants and from manuscript editing.

Figures

Fig. 1. Study overview.
a, Two glycoproteomics data files of human serum (Files A and B) were generated and shared with participants. b, Participants comprising both developers (orange) and users (blue, team identifiers indicated) employed diverse search engines to complete the study. c, Teams returned a common reporting template capturing details of the applied search strategy including key search settings (SS1–SS13) and search output (SO1–SO9, Table 1) and their identified glycopeptides. d, Complementary performance tests (N1–N6, O1–O5; Table 2) were used to comprehensively evaluate the ability of teams to identify N- and O-glycopeptides. e, The performance profiles were used to score and rank the developers and users separately. Diverse team-wide and search engine-centric (Byonic-focused) approaches were employed to identify performance-associated variables and high-performance search strategies.
Fig. 2. Glycopeptides reported across teams.
a, Reported N-glycoPSMs (bars), unique source N-glycoproteins (dots) and the N-glycan search space applied (mirror bars) by each team. See key for N-glycan classification. b, Proportion of N-glycopeptides, source N-glycoproteins and N-glycan compositions commonly reported by teams. c, Reported O-glycoPSMs, unique source O-glycoproteins and the O-glycan search space applied by each team. Teams 8 and 9 did not perform O-glycopeptide analysis. See key for O-glycan classification. Multi-feature N- and O-glycans fitting into several of these classes were for this purpose classified in a prioritized order of multi-Fuc–NeuGc–NeuAc; see Supplementary Tables 2 and 3 for data. d, Proportion of O-glycopeptides, source O-glycoproteins and O-glycan compositions commonly reported by teams. The high-confidence ‘consensus’ N- and O-glycopeptides have been made publicly available (GlyConnect Reference ID 2943).
Fig. 3. Team scoring/ranking and identification of performance-associated variables.
a, Heatmap representation of normalized scores (range 0–1) from the N-glycopeptide performance tests (N1–N6, Table 2). See Supplementary Tables 5–16 for performance data. #The top third performing teams (white font) were placed in a high-performance band. The team scoring was later validated (Extended Data Fig. 10). *Performance could not be determined. b, Many variables (search settings, search output) showed associations (negative or positive) with N-glycopeptide performance. See Table 1 for variables. See Supplementary Table 18 for statistics. See d for key to symbols. c, Scores from the O-glycopeptide performance tests (O1–O5). Teams 8 and 9 did not return O-glycopeptide data. d, Many associations between the search variables and O-glycopeptide performance were observed.
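The score normalization and the high-performance band mentioned in the Fig. 3 caption can be illustrated with a minimal sketch. It assumes that scores are scaled so the best performer receives 1, and that the "top third" is rounded up; neither detail is specified in the caption. The team names and raw scores are hypothetical.

```python
import math

def normalize_to_best(raw_scores):
    """Scale raw scores so the best performer gets 1.0 (assumed scheme)."""
    best = max(raw_scores.values())
    if best <= 0:
        raise ValueError("at least one positive score is required")
    return {team: score / best for team, score in raw_scores.items()}

def high_performance_band(scores):
    """Return the top third of teams (rounded up), ranked by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: math.ceil(len(ranked) / 3)]

# Hypothetical raw scores for six made-up teams.
raw = {"team_a": 12.0, "team_b": 30.0, "team_c": 18.0,
       "team_d": 6.0, "team_e": 24.0, "team_f": 3.0}
normalized = normalize_to_best(raw)
band = high_performance_band(normalized)
```

With these invented numbers, two of the six teams fall into the band. A different normalization (for example min-max scaling) would change the scores but not the ranking.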
Fig. 4. Search engine-centric (Byonic-focused) analysis of search strategies for high-performance glycoproteomics data analysis.
a, Overview of the search settings employed by Byonic teams. Default: search strategy used by most teams (yellow). Custom: variations from the default search strategy (green). #Data for SS14, a setting not included in the team reports, were adopted from SO4 data. b, The glycoproteome coverage (unique glycopeptides, File B) varied among Byonic teams. c, Specificity (accuracy) and sensitivity (coverage) scores for (i) N-glycopeptides and (ii) O-glycopeptides for Byonic teams. d, Controlled (in-house) searches for N-glycopeptides using Byonic (File B). Individual search settings were systematically varied (iteration level 1) and output assessed for performance gains (specificity, sensitivity). Search settings showing performance gains (shaded circles/diamonds) without unacceptable costs in specificity (SS13) or search time (SS8/SS10, SS9; see examples in e) (gray stars) were collectively tested for synergistic performance gains (iteration levels 2 and 3, dark green). See e for shared symbol key. e, Byonic-centric O-glycopeptide searches. See d for details. f, Recommended Byonic-centric search strategies for ‘high accuracy’, ‘high coverage’ and ‘balanced’ (between accuracy and coverage) glycoproteomics data analysis. ^The recommended search strategies showed relative performance gains as determined using an independent glycoprotein-centric score (Supplementary Table 19b). Search time and glycoproteome coverage (unique glycopeptides) are also indicated.
Extended Data Fig. 1. Overview of the participating teams and their search strategies grouped according to their status as either developers (orange) or users (blue) of glycoproteomics software.
a. Number and type of teams that registered for and completed the study. Note that a few registered teams did not complete the study; individuals within these non-completing teams and their data (if any) were not included in the study outcome. b. Average number of members in each of the completing teams. Data are represented as mean ± SD (n = 9 developers and n = 13 users). c. The self-reported experience in glycoproteomics of each team. d. Team origin by continent. e. Data files (File A and/or B) handled by the teams. f-g. Type of fragmentation spectra used by teams to identify glycopeptides. h. Search engine(s) and i. pre- and postprocessing tools used for the glycopeptide identification.
Extended Data Fig. 2. Overview of the MS/MS data and charge state distribution of the reported glycopeptides.
a-b. The total number of all recorded HCD-MS/MS scans within Files A-B (striped bars), the total number of m/z 204-containing MS/MS scans (potential glycopeptide MS/MS spectra, black bars) and the total number of glycoPSMs collectively reported from all teams (red bars) over the different fragmentation methods. c-d. Charge state distribution of the reported glycoPSMs from Files A-B (data are plotted as the mean calculated from all teams).
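As an illustration of how "m/z 204-containing" scans could be counted, the sketch below flags spectra containing the HexNAc oxonium ion at m/z 204.0867. The 20 ppm tolerance, the data layout (scan number mapped to a list of fragment m/z values) and the example peaks are assumptions made for illustration, not the procedure used in the study.

```python
HEXNAC_OXONIUM_MZ = 204.0867  # [HexNAc + H]+ oxonium ion

def has_oxonium_ion(peaks_mz, target_mz=HEXNAC_OXONIUM_MZ, tol_ppm=20.0):
    """True if any fragment peak lies within tol_ppm of target_mz."""
    tol = target_mz * tol_ppm / 1e6
    return any(abs(mz - target_mz) <= tol for mz in peaks_mz)

def count_candidate_scans(scans):
    """Count scans whose fragment peak list contains the m/z 204 ion."""
    return sum(1 for peaks in scans.values() if has_oxonium_ion(peaks))

# Hypothetical scans: scan number -> fragment m/z values.
scans = {
    101: [138.0550, 204.0869, 366.1395],
    102: [175.1190, 204.1500, 500.2500],
    103: [204.0866, 274.0921, 292.1027],
}
n_candidates = count_candidate_scans(scans)
```

A real workflow would read peak lists from the raw files and would probably also apply an intensity threshold, which is omitted here.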
Extended Data Fig. 3. Team-centric overview of the search output data from the glycopeptide identification process (SO1-SO9).
Distribution of the a. LC retention time (min), ***P = 4.52 × 10⁻⁶, b. observed glycopeptide m/z, c. observed charge state (z), *P = 1.97 × 10⁻², d. observed precursor selection off-by-X (Da, positive values only), e. observed glycopeptide mass [M + H]⁺ (Da), f. actual mass error of observed glycopeptides (ppm, positive values only), **P = 2.78 × 10⁻³, g. length of observed glycopeptides, ***P = 2.44 × 10⁻⁶, h. glycan mass of observed glycopeptides (M, Da), ***P = 1.03 × 10⁻⁹, i. total N- and O-glycoPSMs reported by the participants, ***P = 1.02 × 10⁻⁴. The mean and SDs of data from all teams are also indicated for each graph. Developer data are plotted in orange and user data points are in blue. Teams reporting data outside the SDs have been labelled. The N-glycopeptide (N-GP, n = 22) data were statistically compared to the O-glycopeptide (O-GP, n = 20) data using unpaired two-sided t-tests where *P < 0.05, **P < 0.01 and ***P < 0.001. See Supplementary Table 4 for data.
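The mass error in panel f is expressed in ppm. A minimal sketch of that calculation follows; the masses are hypothetical, and taking the absolute value is only one plausible reading of "positive values only".

```python
def mass_error_ppm(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6

# Hypothetical glycopeptide masses in Da.
observed = 3000.0150
theoretical = 3000.0000
error = mass_error_ppm(observed, theoretical)

# Assumption: "positive values only" corresponds to the magnitude.
abs_error = abs(error)
```

Here a deviation of 0.015 Da on a 3000 Da glycopeptide corresponds to an error of about 5 ppm.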
Extended Data Fig. 4. The site-specific N-glycosylation of proteins in the investigated serum sample was found to quantitatively match previously reported N-glycoform distributions of the same proteins from normal human serum.
Four high abundance glycoproteins each harboring multiple N-glycosylation sites were selected for this comparison including a. alpha-1-antitrypsin (A1AT, P01009), b. ceruloplasmin (CP, P00450), c. haptoglobin (HP, P00738) and d. immunoglobulin G1 (IgG1, P01857). The glycoproteins selected for this analysis are positive acute phase proteins and hence their serum levels and glycosylation features may be altered as a result of physiological changes. The quantitative glycoprofiling (indicated as “Rel abundance (HGI)”) was manually performed using AUC-based quantitation and compared to robust literature reporting on the relative abundance of site-specific glycoforms from the same proteins. The glycoforms have been labelled according to their generic monosaccharide composition (N, HexNAc; H, Hex; F, dHex; S, NeuAc). Cartoons illustrating likely N-glycan structures have been provided for the high abundance glycoforms. Low abundance glycoforms were listed according to their relative expression level (high to low, see zoom indicated with broken boxes). Black compositions indicate the glycopeptides reported in literature and found in the HGI study; blue compositions indicate glycopeptides reported only in literature; green compositions indicate glycopeptides found only in the HGI study. The relative abundance (in %) of the individual glycoforms was plotted and correlation coefficients (R²) were generated for each N-glycosylation site. The consistently high correlation between the site-specific glycoprofiles generated from the HGI sample and from the literature (R² = 0.85–1.00) validates the use of literature to score and rank the team performance in this study as used for the performance tests N2-N3 and O1-O2 (see Table 2 for details of performance tests).
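The per-site comparison of relative abundances can be sketched as follows, assuming the reported correlation coefficient is the squared Pearson correlation between the two glycoform profiles (the caption does not state the exact definition). The abundance values are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def r_squared(xs, ys):
    """Squared Pearson correlation (assumed meaning of the reported value)."""
    return pearson_r(xs, ys) ** 2

# Hypothetical relative abundances (%) of four glycoforms at one site.
hgi = [60.0, 25.0, 10.0, 5.0]
literature = [55.0, 30.0, 10.0, 5.0]
r2 = r_squared(hgi, literature)
```

Glycoforms seen in only one of the two sources would need an explicit rule (for example an abundance of 0 in the other), which this sketch does not address.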
Extended Data Fig. 5. Glycopeptides carrying NeuGc and multi-Fuc signatures are undetectable or rarely detected in the human serum sample investigated in this study.
a. Extracted ion chromatograms (XICs) were generated at the MS/MS level for well-established diagnostic oxonium ions, including fragment ions reporting on i) HexNAc, ii) NeuAc, and iii) NeuGc. As expected, abundant diagnostic ions were observed for HexNAc and NeuAc, whereas practically no diagnostic ions were observed for NeuGc glycopeptides. The XIC traces have been plotted on the same absolute intensity scale. All fragmentation modes (HCD, EThcD and CID) were considered for this XIC analysis. Only data from File B reported on by all teams were plotted in this figure; File A showed similar patterns (data not shown). b. Example of an HCD-MS/MS spectrum of a NeuAc-containing sialoglycopeptide correctly and incorrectly annotated by teams. Most teams correctly identified that this scan corresponds to a NeuAc glycopeptide as demonstrated by the presence of diagnostic oxonium and B ions for NeuAc, while two teams incorrectly identified the spectrum as a NeuGc-containing glycopeptide despite the absence of diagnostic oxonium and B ions for NeuGc (see insert) and one team incorrectly identified the spectrum as a NeuAc- and Fuc-containing glycopeptide due to the misidentification of Met oxidation. c. XICs were generated at the MS/MS level for well-established diagnostic B ions reporting on different antenna features, including i) sialyl LacNAc, ii) sialyl Lewis x/a, and iii) Lewis x/a. As expected, abundant diagnostic ions were observed for sialyl LacNAc, whereas only very few diagnostic ions were observed for antennary fucosylation features (sialyl Lewis x/a and Lewis x/a). iv) A few diagnostic ions for antenna fucosylation could be observed at very low abundance, which indicated that antenna fucosylation (and thus by extension multi-fucosylated glycopeptides) is present but rarely detected in the studied serum sample. The XIC traces have been plotted on the same absolute intensity scale. All fragmentation modes (HCD, EThcD and CID) were considered for this XIC analysis. Only data from File B reported on by all teams were plotted in this figure; File A showed similar patterns (data not shown). d. Example of an HCD-MS/MS spectrum of a multi-Fuc-containing glycopeptide correctly and incorrectly annotated by teams. Most teams correctly identified that this scan corresponds to a multi-Fuc sialoglycopeptide as indicated by the presence of diagnostic B ions for Lewis x/a (see insert, broken lines) and NeuAc oxonium ions as well as core fucosylated Y1 and Y2 ions, while one team incorrectly identified the spectrum as a tetra-fucosylated asialylated glycopeptide. Note that some teams (for example team 17) reported on several different glycopeptides from the same scan, likely due to conflicting output data from multiple searches of the same data. The monoisotopic precursor ion profile (see insert, full lines) supported that this spectrum corresponds to a difucosylated glycopeptide carrying a single NeuAc. e. Example of an HCD-MS/MS spectrum of a NeuAc-containing glycopeptide correctly and incorrectly annotated by teams. Three teams correctly identified that this scan corresponds to a disialylated (NeuAc) afucosylated glycopeptide as indicated by the presence of diagnostic oxonium and B ions for NeuAc, while three teams incorrectly identified the spectrum as a multi-Fuc sialoglycopeptide despite the lack of core fucosylated Y1 ions and of diagnostic ions for sialyl Lewis x/a or Lewis x/a. The monoisotopic precursor ion profile (see insert, full lines) supported that this spectrum corresponds to a disialylated NeuAc glycopeptide not carrying fucose.
Extended Data Fig. 6. Examples of (in)correctly annotated N-glycopeptides.
a. HCD-MS/MS fragment spectrum of a ‘consensus’ NeuAc-containing sialoglycopeptide correctly annotated by all 16 teams (teams 1, 5, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21) reporting on this particular scan number. Manual annotation confirmed that this spectrum indeed corresponds to the indicated NeuAc-containing N-glycopeptide from human alpha-2-HS-glycoprotein (UniProtKB, P02765) as demonstrated by the presence of diagnostic oxonium and B ions for NeuAc and extensive b- and y-ion peptide backbone fragmentation. Further, the monoisotopic precursor ion profile (see insert) supported the annotation of this spectrum. b. HCD-MS/MS spectrum of a NeuAc-containing core-fucosylated glycopeptide that was incorrectly annotated by several teams. While four teams (teams 10, 17, 20, 21) correctly identified that this spectrum corresponds to an N-glycopeptide from human immunoglobulin heavy constant mu (P01871) carrying a single NeuAc and Fuc as indicated by the presence of diagnostic oxonium and B ions for NeuAc (see insert, broken lines), y-ions confirming Met oxidation and Cys carbamidomethylation, and correct monoisotopic precursor ion profile, four incorrect glycan structures were reported by other teams as indicated. The structural differences between the incorrectly and correctly assigned glycans have been indicated in an attempt to rationalize the misidentification. All teams (except for team 1, who reported a different peptide from a different source protein with an incorrect precursor m/z, data not shown) identified the correct peptide sequence, although the Met oxidation and Cys carbamidomethylation were features that frequently led to incorrect glycopeptide identification. Some teams (for example team 21) reported on several glycopeptides from the same scan, likely due to conflicting output data from multiple searches of the same data. The monoisotopic precursor ion profile (see insert, full lines) and the subsequent EThcD-MS/MS scan (scan #8026, data not shown) supported that this spectrum, in fact, corresponds to the indicated N-glycopeptide carrying Met oxidation and Cys carbamidomethylation as well as an N-glycan displaying a composition corresponding to a complex N-glycan structure with a single NeuAc and Fuc.
Extended Data Fig. 7. Biosynthesis-centric network analysis of the N- and O-glycan compositions of the consensus glycopeptides.
Biosynthesis-centric network analysis of the N- and O-glycan compositions carried by the a. 163 consensus N-glycopeptides and b. 23 consensus O-glycopeptides using GlyConnect Compozitor v1.0.0. Each node corresponds to a glycan composition either reported within the consensus list of glycopeptides arising from this study (blue circles) or manually added to biosynthetically connect the glycan compositions by a single glycan processing step (red circles). Both networks showed a close biosynthetic relationship between the consensus N- and O-glycan structures reported in this study, supporting the correctness of their identification.
Extended Data Fig. 8. Data underpinning the synthetic N-glycopeptide performance test (N1).
a. MS/MS spectra corresponding to the non-adducted synthetic N-glycopeptide (EVFVHPNYSK, Hex5HexNAc4NeuAc2, UniProtKB, P04070) in charge state 3+ and 4+ (nine top spectra) and the K⁺-adducted synthetic N-glycopeptide in charge state 5+ (three bottom spectra) arising from the four fragmentation modes (HCD-, ETciD-, EThcD- and CID-MS/MS) used to generate Files A and B. Green asterisks: Oxonium ions and non-reducing end glycan fragments (B-ions). Blue asterisks: Y-ion series (peptide conjugated with glycan fragment). Red asterisks: Peptide backbone b-/y-/c-/z-ions. Black asterisks: Unfragmented peptide without glycan, unfragmented precursor (peptide with glycan) and charge-reduced precursor. b. Overview of the 12 MS/MS spectra of the synthetic N-glycopeptide (from panel a) that were either correctly identified (green), incorrectly identified (red), or not reported by each team (white). Spectra arising from fragmentation mode(s) not included in the search strategy chosen by each team were not included in the assessment (indicated in grey). c. Structure of the synthetic N-glycopeptide spiked into the human serum sample. d. Performance scores arising from the test determined for each team based on the sensitivity and specificity of the identification of the 12 MS/MS spectra corresponding to the synthetic N-glycopeptide.
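To make the composition notation concrete, the sketch below converts a composition string such as Hex5HexNAc4NeuAc2 into the mass it adds to a peptide, using standard monoisotopic residue masses rounded to four decimals. The parser and its function name are hypothetical helpers written for this illustration; they are not taken from any of the evaluated software.

```python
import re

# Monoisotopic residue masses (Da) of common monosaccharides.
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
    "NeuGc": 307.0903,
}

# "HexNAc" is listed before "Hex" so that it is not read as "Hex".
TOKEN = re.compile(r"(HexNAc|dHex|Hex|NeuAc|NeuGc)(\d+)")

def glycan_residue_mass(composition):
    """Sum residue masses for a composition such as 'Hex5HexNAc4NeuAc2'."""
    tokens = TOKEN.findall(composition)
    # Reject strings that contain anything other than known tokens.
    if "".join(name + count for name, count in tokens) != composition:
        raise ValueError(f"unrecognized composition: {composition!r}")
    return sum(RESIDUE_MASS[name] * int(count) for name, count in tokens)

mass = glycan_residue_mass("Hex5HexNAc4NeuAc2")
```

The result is the sum of residue masses, that is, the mass increment of the glycan when attached to the peptide; the free glycan would be heavier by one water molecule.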
Extended Data Fig. 9. Comparison of the raw (before normalization) performance scores arising from the glycopeptide identifications based on HCD- or EThcD-MS/MS data.
Only glycopeptides unambiguously reported by either HCD- or EThcD-MS/MS data were included in this analysis. a. N-glycan composition (N2), b. source N-glycoprotein (N3), and c. N-glycoproteome coverage (N4) were calculated using HCD-MS/MS glycoPSMs reported by 17 teams and EThcD-MS/MS glycoPSMs reported by 13 teams. d. O-glycan composition (O1), e. source O-glycoprotein (O2) and f. O-glycoproteome coverage (O3) were calculated using HCD-MS/MS glycoPSMs reported by 16 teams and EThcD-MS/MS glycoPSMs reported by 10 teams. Significance was tested between the HCD- and EThcD-MS/MS data for all performance scores using unpaired two-sided t-tests where ** indicates P = 0.0021.
Extended Data Fig. 10. Orthogonal glycoprotein-based scoring to validate the team scoring and ranking.
a. The overall team scores (best performer normalized to 1) from multiple performance tests (N1-N6, orange dots) and the independent glycoprotein-centric scores (black dots, normalized) showed high similarity across teams. b. Pearson correlation analysis confirmed that the overall team scores and the glycoprotein-centric scores correlated across the 22 teams thereby validating the team scoring and ranking (see scorecard, Fig. 3 and Supplementary Table 17 for data).
