DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations

Charlotte Nachtegael et al. Database (Oxford). 2024 May 28;2024:baae039. doi: 10.1093/database/baae039.

Abstract

While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support the biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important for understanding disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared towards training tools that help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, providing assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting text fragments were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, was used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, resulting in a dataset of 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model and underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
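As a side note on the reported evaluation metric: the F1 score of 0.84 combines precision and recall on the positive (digenic) class. A minimal sketch of that relationship, using invented confusion counts rather than figures from the paper:

```python
# Illustrative only: how an F1 score relates to precision and
# recall. The counts below are hypothetical, not taken from the
# DUVEL evaluation.

def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 84 true positives, 16 false positives, 16 false negatives
print(round(f1_score(84, 16, 16), 2))  # -> 0.84
```

Because positive instances make up under 10% of DUVEL (794 of 8442), F1 on the positive class is a more informative metric here than raw accuracy.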


Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1.
Schematic representation of the whole annotation process. Unlabelled instances are represented as unfilled circles, labelled instances as filled circles and the machine learning model as gears. The left picture depicts the filtering process of the OLIDAv2 articles, starting with 318 articles and ending with the 85 articles from which the instances composing the DUVEL dataset originate. The right side represents the annotation with AL: 1000 and 500 instances are randomly selected from all 511 635 unlabelled instances for the test set and the initial labelled set, respectively. The labelled set is used to fine-tune the BiomedBERT model, whose performance is evaluated on the test set. The fine-tuned model selects the 500 unlabelled instances that it considers the most informative according to the Margin selection strategy. The selected instances are then labelled and added to the labelled set for a new iteration of fine-tuning, selection and labelling. The process stops once 4500 instances have been selected and labelled.
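The Margin selection step described above picks the instances whose top two predicted class probabilities are closest together, i.e. those the model is least decisive about. A minimal sketch, with toy probabilities standing in for the fine-tuned BiomedBERT outputs:

```python
# Margin-based active learning selection (sketch, not the
# paper's implementation): rank unlabelled instances by the
# gap between their two highest class probabilities and pick
# the k smallest gaps.

def margin_select(probs, k):
    """Return indices of the k instances with the smallest
    margin between their two highest class probabilities."""
    margins = []
    for i, p in enumerate(probs):
        top2 = sorted(p, reverse=True)[:2]
        margins.append((top2[0] - top2[1], i))
    margins.sort()
    return [i for _, i in margins[:k]]

# Toy pool of binary (negative, positive) probabilities.
pool = [
    (0.99, 0.01),  # confident -> large margin, skipped
    (0.55, 0.45),  # uncertain -> small margin, selected
    (0.90, 0.10),
    (0.51, 0.49),  # most uncertain -> selected first
]
print(margin_select(pool, 2))  # -> [3, 1]
```

In the annotation loop, the 500 returned indices would be sent to the human annotators, and the labelled results appended to the training set for the next fine-tuning round.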
Figure 2.
Examples of the different classes. Genes and variants are highlighted in the text. (A) Example of a text for the negative class: the variants and genes belong to two different patients and thus are not involved in a digenic variant relation. (B) Example of a text for the positive class: the text clearly states that the gene–variant pair is co-inherited in patients. The digenic variant combination depicted here is OLI504; more information can be found at https://olida.ibsquare.be/detail/Combination/OLI504/.
Figure 3.
Fraction of positive samples in the training set across the AL iterations. The initial randomly selected labelled set has a positive fraction of 1%, which increases to more than 14% after the first round of AL selection and the addition of the 500 samples. The fraction of positive instances decreases slightly during the next two rounds of AL, then stabilises between 13% and 15% until the end of the AL process.
Figure 4.
F1 score over the AL iterations during the annotation of the DUVEL dataset, evaluated with the DUVEL test set instead of the test set initialised during the AL process. The samples initially selected during the AL process are filtered to exclude those present in the DUVEL test set.
