Glycobiology. 2021 Dec 18;31(11):1510-1519.
doi: 10.1093/glycob/cwab078.

Enhancing the interoperability of glycan data flow between ChEBI, PubChem and GlyGen

Rahi Navelkar et al. Glycobiology.

Abstract

Glycans play a vital role in health, disease, bioenergy, biomaterials and bio-therapeutics. As a result, there is keen interest in identifying and increasing glycan data in bioinformatics databases like ChEBI and PubChem, and in connecting these data to resources at the EMBL-EBI and NCBI to facilitate access to important annotations at a global level. GlyTouCan is a comprehensive archival database that contains glycans obtained primarily through batch upload from glycan repositories, glycoprotein databases and individual laboratories. In many instances, the glycan structures deposited in GlyTouCan may not be fully defined or may lack supporting experimental evidence and citations. Databases like ChEBI and PubChem, by contrast, were designed to accommodate complete atomistic structures with well-defined chemical linkages, so they cannot easily accommodate the structural ambiguity inherent in glycan databases. Consequently, glycan data need to be organized more coherently to enhance connectivity across the major NCBI, EMBL-EBI and glycoscience databases. This paper outlines a workflow developed in collaboration between GlyGen, ChEBI and PubChem to improve the visibility and connectivity of glycan data across these resources. GlyGen hosts a subset of glycans (~29,000) from the GlyTouCan database. It has submitted valuable glycan annotations to the PubChem database and integrated over 10,500 glycans (including ambiguously defined ones) into the ChEBI database. The integrated glycans were prioritized based on links to PubChem and connectivity to glycoprotein data. The pipeline provides a blueprint for how glycan data can be harmonized between different resources. The current PubChem, ChEBI and GlyTouCan mappings can be downloaded from GlyGen (https://data.glygen.org).

Keywords: database interoperability; glycan annotations; glycoinformatics.

Figures

Fig. 1
Network of the glycan annotations sourced from various databases, which use either ChEBI IDs or GlyTouCan accessions as primary identifiers. Using the ChEBI ID to GlyTouCan accession mapping, GlyGen is able to map across multiple databases, connecting glycans with information on glycosylation, reactions and pathways (e.g. https://glygen.org/glycan/G96881BQ#Cross-References). This network is restricted to human proteins.
Fig. 2
Data flow of glycans across the GlyTouCan, ChEBI and PubChem databases. The figure shows an example of the same glycan (beta-D-Galp-(1->3)-[beta-D-GlcpNAc-(1->6)]-alpha-D-GalpNAc) present in GlyTouCan (G00033MO) and ChEBI (62158) under the respective database identifiers. When submitted to the PubChem database, the GlyTouCan accession and the ChEBI ID are each mapped to a unique PubChem Substance identifier (G00033MO to SID:252289141; CHEBI:62158 to SID:123058952). PubChem's standardization process maps both SIDs to a single compound identifier (CID:52921656). The same CID is used as a cross-reference by both the GlyTouCan and ChEBI databases.
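The identifier flow in Fig. 2 can be illustrated with a minimal Python sketch. The identifiers are the ones quoted in the caption. The dictionary layout and the function names `group_by_cid` and `glytoucan_to_chebi` are assumptions made for illustration and are not part of the GlyGen, ChEBI or PubChem tooling.

```python
from collections import defaultdict

# Identifiers from the Fig. 2 example; the table layout is an assumption.
SOURCE_TO_SID = {
    "G00033MO": "252289141",     # GlyTouCan accession -> PubChem SID
    "CHEBI:62158": "123058952",  # ChEBI ID -> PubChem SID
}
SID_TO_CID = {
    "252289141": "52921656",  # both SIDs standardize to the same CID
    "123058952": "52921656",
}


def group_by_cid(source_to_sid, sid_to_cid):
    """Group source identifiers by the CID their SIDs standardize to."""
    groups = defaultdict(set)
    for source_id, sid in source_to_sid.items():
        cid = sid_to_cid.get(sid)
        if cid is not None:  # skip substances without a standardized compound
            groups[cid].add(source_id)
    return dict(groups)


def glytoucan_to_chebi(groups):
    """Pair each GlyTouCan accession with the ChEBI IDs sharing its CID."""
    pairs = {}
    for members in groups.values():
        chebi_ids = sorted(m for m in members if m.startswith("CHEBI:"))
        for member in members:
            if not member.startswith("CHEBI:") and chebi_ids:
                pairs[member] = chebi_ids
    return pairs


groups = group_by_cid(SOURCE_TO_SID, SID_TO_CID)
print(glytoucan_to_chebi(groups))  # {'G00033MO': ['CHEBI:62158']}
```

The shared CID acts as the join key: neither source database needs to know the other's identifier scheme, only the compound that PubChem assigns after standardization.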
Fig. 3
Overview of the data integration pipeline to map or register the GlyGen glycan set of 29,290 GlyTouCan accessions into the ChEBI database. If a GlyTouCan accession had a PubChem CID and a corresponding ChEBI ID, a cross-reference mapping was generated and the GlyTouCan accession was added to the corresponding ChEBI entry as a cross-reference. GlyTouCan accessions with a PubChem CID but without a ChEBI ID were uploaded to ChEBI using applications like KNIME and ClassyFire. The remaining GlyTouCan accessions, for which a PubChem CID mapping was missing, were manually registered in the ChEBI database.
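The three-way decision described in the Fig. 3 caption can be sketched as a small routing function. This is only an illustration of the branching logic, not the pipeline's actual code. The names `Route` and `route_accession` are hypothetical, and the accessions `GXXXXXAA` and `GXXXXXBB` and the CID `0000001` are made-up placeholders. Only G00033MO, CID 52921656 and CHEBI:62158 come from the source.

```python
from enum import Enum


class Route(Enum):
    CROSS_REFERENCE = "add GlyTouCan cross-reference to existing ChEBI entry"
    AUTOMATED_UPLOAD = "upload to ChEBI with tool support"
    MANUAL_REGISTRATION = "register manually in ChEBI"


def route_accession(accession, cid_by_accession, chebi_by_cid):
    """Pick the integration route for one GlyTouCan accession."""
    cid = cid_by_accession.get(accession)
    if cid is None:
        # No PubChem CID mapping: manual registration.
        return Route.MANUAL_REGISTRATION
    if cid in chebi_by_cid:
        # CID and ChEBI ID both exist: only a cross-reference is needed.
        return Route.CROSS_REFERENCE
    # CID exists but ChEBI has no entry yet: tool-assisted upload.
    return Route.AUTOMATED_UPLOAD


# G00033MO is the Fig. 2 example; the other rows are hypothetical placeholders.
cid_by_accession = {"G00033MO": "52921656", "GXXXXXAA": "0000001"}
chebi_by_cid = {"52921656": "CHEBI:62158"}

for acc in ("G00033MO", "GXXXXXAA", "GXXXXXBB"):
    print(acc, route_accession(acc, cid_by_accession, chebi_by_cid).name)
```

Checking for the CID first mirrors the caption's ordering: the presence of a PubChem CID separates automated handling from manual registration, and the ChEBI lookup then separates mapping from upload.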
