2024 Aug 9:2024:baae071.
doi: 10.1093/database/baae071.

The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop

Rezarta Islamaj et al. Database (Oxford).

Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing participants with the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, genes/proteins, chemicals, cell lines, gene variants, and species, as well as the pairwise relationships between them: disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies: diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to the Single Nucleotide Polymorphism Database (dbSNP). Finally, each annotated relationship is flagged as 'novel' or not, depending on whether it is a novel finding or experimental verification in the publication in which it is expressed. This distinction helps differentiate novel findings from other relationships in the same text that provide known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of 400 newly published articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.
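The annotation scheme summarized in the abstract can be sketched as a small data model. The following is an illustrative sketch only, not the corpus's actual file format or the authors' tooling: the type labels, the `Relation` class, the `validate` helper, and the placeholder identifiers are all assumptions introduced here, and only the entity types, vocabularies, entity pairs, and semantic categories come from the abstract.

```python
from __future__ import annotations
from dataclasses import dataclass

# Entity type -> normalization vocabulary, as listed in the abstract.
# The lowercase type labels are hypothetical names chosen for this sketch.
VOCABULARY = {
    "disease": "MeSH",
    "chemical": "MeSH",
    "gene": "NCBI Gene",
    "species": "NCBI Taxonomy",
    "cell_line": "Cellosaurus",
    "variant": "dbSNP",
}

# The eight entity pairs that may carry a relationship (order-insensitive).
RELATION_PAIRS = {
    frozenset(pair)
    for pair in [
        ("disease", "gene"), ("chemical", "gene"), ("disease", "variant"),
        ("gene", "gene"), ("chemical", "disease"), ("chemical", "chemical"),
        ("chemical", "variant"), ("variant", "variant"),
    ]
}

# The eight semantic categories named in the abstract.
CATEGORIES = {
    "positive correlation", "negative correlation", "binding", "conversion",
    "drug interaction", "comparison", "cotreatment", "association",
}


@dataclass(frozen=True)
class Relation:
    """A document-level relationship between two normalized entities."""
    type1: str
    id1: str       # concept identifier in the vocabulary for type1
    type2: str
    id2: str       # concept identifier in the vocabulary for type2
    category: str
    novel: bool    # True if the relation is a novel finding of the article


def validate(rel: Relation) -> list[str]:
    """Return a list of problems; an empty list means the relation fits the scheme."""
    problems = []
    for t in (rel.type1, rel.type2):
        if t not in VOCABULARY:
            problems.append(f"unknown entity type: {t}")
    if frozenset((rel.type1, rel.type2)) not in RELATION_PAIRS:
        problems.append(f"pair not annotated: {rel.type1}-{rel.type2}")
    if rel.category not in CATEGORIES:
        problems.append(f"unknown category: {rel.category}")
    return problems


# Placeholder identifiers, not real concept IDs.
ok = Relation("variant", "VARIANT_ID", "disease", "DISEASE_ID",
              "positive correlation", novel=True)
bad = Relation("species", "TAXON_ID", "gene", "GENE_ID", "binding", novel=False)
print(validate(ok))   # no problems expected
print(validate(bad))  # species-gene is not among the eight annotated pairs
```

Storing each pair as a `frozenset` makes the check independent of argument order, which matches the abstract's description of the relationships as pairwise rather than directed.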

Conflict of interest statement

None declared.

Figures

Figure 1. Overview of the relationship extraction in the biomedical domain. This illustration depicts the importance of factual and correct knowledge discovery based on verifiable facts curated in the knowledge bases.
Figure 2. An example of article annotation data in the BioRED-BC8 corpus.
Figure 3. Illustration of a disease–variant relation (positive correlation) where the two entities do not co-occur in the same sentence.
Figure 4. Entity composition in the BioRED-BC8 training and test datasets. The bar graphs illustrate the compatibility: the corpus composition for each relation type is relatively similar for both the training and testing datasets. The Venn diagrams illustrate that the training and testing datasets are complementary, in that while a proportion of entities are present in both sets of articles, additional new data are available for each type.
Figure 5. Relationship composition in the BioRED-BC8 training and test datasets.

