Data Brief. 2024 Apr 9:54:110404.
doi: 10.1016/j.dib.2024.110404. eCollection 2024 Jun.

MilkOligoThesaurus, a dataset of mammalian milk oligosaccharide synonyms

Mathilde Rumeau et al. Data Brief.

Abstract

There is growing interest in milk oligosaccharides (MOs) because of their numerous benefits for newborn health and for health in the long term. A large number of MO structures have been identified in mammalian milk. Although MOs have mostly been described in human milk, oligosaccharide richness, albeit less broad, has also been reported for a wide range of mammalian species. MO structures are particularly difficult to report because they result from the combination of five monosaccharides linked by various glycosidic bonds, forming structurally diverse and complex matrices of linear and branched oligosaccharides. Exploring the literature and extracting relevant information on MO diversity within or across species appears promising for elucidating the structure-function relationships of MOs. Currently, given the complexity of these molecules, the main obstacle to this kind of literature mining is the heterogeneity in the way authors refer to them. Herein, we provide a thesaurus (MilkOligoThesaurus) that includes the names and synonyms of MOs collected from key selected articles on mammalian milk analyses. MilkOligoThesaurus gathers the names of the MOs with a complete description of their monosaccharide composition and structures. When available, each unique MO molecule is linked to its ID in the NCBI PubChem and ChEBI databases. MilkOligoThesaurus is provided in a tabular format. It gathers 245 unique oligosaccharide structures described by 22 features (columns), including the name of the molecule, its abbreviation, the chemical database IDs if available, the monosaccharide composition, chemical information (molecular formula, monoisotopic mass), synonyms, its formula in condensed form and in abbreviated condensed form, the abbreviated systematic name, the systematic name, the isomer group, and the scientific article sources. MilkOligoThesaurus is also provided in the SKOS (Simple Knowledge Organization System) format.
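To illustrate how the monosaccharide composition relates to the molecular formula and monoisotopic mass columns, here is a minimal Python sketch. It is not the authors' procedure. It assumes the building blocks are hexose (glucose or galactose), N-acetylhexosamine, fucose, and N-acetylneuraminic acid, which the abstract does not name, and it assumes an unmodified oligosaccharide in which n residues are joined by n - 1 glycosidic bonds, each releasing one water molecule. The unit labels (`Hex`, `HexNAc`, `Fuc`, `Neu5Ac`) and function names are illustrative choices, not column values from the dataset.

```python
from collections import Counter

# Monoisotopic masses (Da) of the most abundant isotope of each element.
ELEMENT_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
}

# Elemental formulas of the free monosaccharides (assumed building blocks).
UNIT_FORMULA = {
    "Hex": {"C": 6, "H": 12, "O": 6},               # glucose or galactose
    "HexNAc": {"C": 8, "H": 15, "N": 1, "O": 6},    # e.g. N-acetylglucosamine
    "Fuc": {"C": 6, "H": 12, "O": 5},               # fucose
    "Neu5Ac": {"C": 11, "H": 19, "N": 1, "O": 9},   # N-acetylneuraminic acid
}


def molecular_formula(composition):
    """Return element counts for a composition such as {"Hex": 2, "Fuc": 1}."""
    total = Counter()
    for unit, count in composition.items():
        for element, n in UNIT_FORMULA[unit].items():
            total[element] += n * count
    # n residues in a linear or branched chain are joined by n - 1 bonds,
    # and each glycosidic bond releases one H2O.
    n_bonds = sum(composition.values()) - 1
    total["H"] -= 2 * n_bonds
    total["O"] -= n_bonds
    return dict(total)


def monoisotopic_mass(composition):
    """Sum the element masses over the molecular formula."""
    formula = molecular_formula(composition)
    return sum(ELEMENT_MASS[element] * n for element, n in formula.items())


def formula_string(composition):
    """Format the formula in Hill order (C, H, then alphabetical)."""
    formula = molecular_formula(composition)
    order = ["C", "H"] + sorted(e for e in formula if e not in ("C", "H"))
    return "".join(
        f"{e}{formula[e] if formula[e] != 1 else ''}" for e in order if formula.get(e)
    )


if __name__ == "__main__":
    fucosyllactose = {"Hex": 2, "Fuc": 1}  # composition shared by 2'-FL and 3-FL
    print(formula_string(fucosyllactose), f"{monoisotopic_mass(fucosyllactose):.4f}")
    # prints: C18H32O15 488.1741
```

Because isomers share a composition, this calculation cannot distinguish them, which is one reason a table like this needs linkage-aware names and an isomer group in addition to the mass.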
This thesaurus gathers MO naming variations that are not found elsewhere, which makes it a valuable resource for (i) text and data mining, to enable automatic annotation and rapid extraction of milk oligosaccharide data from scientific papers; and (ii) biology researchers who want to search for or decipher the structure of milk oligosaccharides from any of their names, abbreviations, or monosaccharide compositions and linkages.
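As a rough illustration of the text-mining use case, the sketch below builds a synonym index from a small tab-separated table and resolves name variants to a preferred name. Everything specific here is an assumption made for the example: the column names (`name`, `abbreviation`, `synonyms`), the `|` separator between synonyms, the normalization rules, and the two sample rows, which are written by hand and not copied from MilkOligoThesaurus. The real file's 22 columns should be checked before adapting this.

```python
import csv
import io
import re

# Hand-written sample rows in a hypothetical layout (not the real file).
SAMPLE_TSV = "\n".join([
    "name\tabbreviation\tsynonyms",
    "2'-Fucosyllactose\t2'-FL\t2'-O-fucosyllactose|2-fucosyllactose",
    "Lacto-N-tetraose\tLNT\tlacto-N-tetraose|LNT-1",
])


def normalize(label):
    """Fold case, unify prime characters, and drop spaces and hyphens."""
    label = label.replace("\u2032", "'").replace("\u2019", "'")
    return re.sub(r"[\s\-]+", "", label.lower())


def build_index(tsv_text):
    """Map every normalized name, abbreviation, and synonym to the preferred name."""
    index = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        labels = [row["name"], row["abbreviation"]]
        labels += [s for s in row["synonyms"].split("|") if s]
        for label in labels:
            # setdefault keeps the first mapping if two rows share a label.
            index.setdefault(normalize(label), row["name"])
    return index


def lookup(index, mention):
    """Return the preferred name for a mention, or None if it is unknown."""
    return index.get(normalize(mention))


if __name__ == "__main__":
    index = build_index(SAMPLE_TSV)
    for mention in ["2\u2032-FL", "2' fucosyllactose", "lnt", "unknown sugar"]:
        print(repr(mention), "->", lookup(index, mention))
```

Exact-match lookup after normalization is the simplest possible design. Abbreviations that differ only by case or punctuation can collide, so a real annotation pipeline would likely need context or the isomer group to disambiguate.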

Keywords: Chemical nomenclature; Milk oligosaccharide monoisotopic mass; Milk oligosaccharide monosaccharide composition; Normalized milk oligosaccharide name; Oligosaccharide isomer name; Systematic names; Vocabulary extraction.

Figures

Fig. 1: Number of milk oligosaccharides extracted from the eleven sources used to build MilkOligoThesaurus.
Fig. 2: Distribution of the number of names or abbreviations per oligosaccharide in MilkOligoThesaurus.

