PLoS One. 2023 Dec 15;18(12):e0292356. doi: 10.1371/journal.pone.0292356. eCollection 2023.

A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction


Charlotte Nachtegael et al. PLoS One. 2023.

Abstract

Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research, generating high-quality labelled data that can be used for the development of innovative predictive methods. However, building fully labelled, high-quality bioRE data sets of adequate size for training state-of-the-art relation extraction models is hindered by an annotation bottleneck, due to limits on the time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improves bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate performance measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation needed to reach performance on par with training on all data by 6% to 38%, depending on the data set, with the Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. Our experiments show the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
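To make the abstract's terminology concrete, the sketch below shows how the two best-performing uncertainty-based acquisition functions (Least-Confident and Margin Sampling) can be computed from a model's predicted class probabilities, and how an area under the learning curve (AULC) can be approximated with the trapezoidal rule. This is a minimal illustration, not the authors' released implementation; the function names and the exact scoring and normalisation used in the paper are assumptions.

import numpy as np

def least_confident_scores(probs):
    """Least-Confident sampling: score = 1 - max class probability.

    probs: array of shape (n_unlabelled, n_classes) with model
    posterior probabilities. Higher score = more uncertain instance.
    """
    return 1.0 - probs.max(axis=1)

def margin_scores(probs):
    """Margin Sampling: score = -(p_top1 - p_top2).

    The smaller the gap between the two most likely classes, the more
    uncertain the model; negating the margin keeps the convention
    'higher score = select first'.
    """
    sorted_probs = np.sort(probs, axis=1)
    return -(sorted_probs[:, -1] - sorted_probs[:, -2])

def select_batch(probs, batch_size, scorer=margin_scores):
    """Indices of the batch_size most informative unlabelled instances."""
    return np.argsort(scorer(probs))[-batch_size:]

def aulc(fractions_labelled, f1_scores):
    """Area under the learning curve via the trapezoidal rule,
    normalised by the x-range so an F1 learning curve stays in [0, 1]."""
    x = np.asarray(fractions_labelled, dtype=float)
    y = np.asarray(f1_scores, dtype=float)
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    return area / (x[-1] - x[0])

For example, aulc([0.1, 0.2, 0.3], [0.60, 0.70, 0.75]) returns roughly 0.69, i.e. the average F1 reached over that portion of the learning curve.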


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Diagram representing an AL scenario.
The data set is split into a test set, used to evaluate the performance of the machine learning model after each AL iteration, a labelled set and an unlabelled set. An active learning iteration is divided into three steps: training, selection and labelling. First, the labelled set is used to train a model, which is then used to select from the unlabelled set the most informative instances to label according to a strategy, for example the instances the model is most uncertain about. Finally, those selected instances are labelled by an oracle (generally a human expert) and added to the labelled set. The AL loop stops once a stopping criterion is reached, such as a specific performance of the trained model on the test set, a number of AL iterations or a specific number of instances to be labelled.
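As a companion to the diagram, here is a minimal sketch of the pool-based AL loop described in the caption (train, select, label, repeated until a stopping criterion is met). The callables train_model, predict_proba, oracle_label and select_batch are hypothetical placeholders standing in for model training, inference, expert annotation and the strategy-specific selection step; they are not part of the released code.

def active_learning_loop(labelled, unlabelled, train_model, predict_proba,
                         oracle_label, select_batch, batch_size,
                         max_iterations):
    """Generic pool-based active learning loop.

    Repeats the three steps from Fig 1 (training, selection, labelling)
    until a stopping criterion is met; here the criterion is simply a
    fixed number of iterations or an exhausted unlabelled pool.
    """
    for _ in range(max_iterations):
        if not unlabelled:
            break
        # Step 1: train a model on the current labelled set.
        model = train_model(labelled)
        # Step 2: score the unlabelled pool and select the most
        # informative instances according to the chosen strategy.
        probs = predict_proba(model, unlabelled)
        picked = select_batch(probs, batch_size)
        # Step 3: query the oracle (generally a human expert) for labels
        # and move the selected instances to the labelled set.
        for i in sorted(picked, reverse=True):
            labelled.append(oracle_label(unlabelled.pop(i)))
    # Return a model trained on the final labelled set.
    return train_model(labelled)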
Fig 2
Fig 2. Distribution of the relative difference between the AL strategies and the Random baseline across the AL iterations.
Y-values greater than 0 indicate that the selection technique is performing better than the Random baseline. Except for BatchBALD, all AL strategies tend to have a positive difference compared to Random. This difference decreases as the size of the data set used for training increases. Results outside 1.5 times the interquartile range from the first and third quartiles are removed for clarity. Boxplots containing the outliers are available in the S1 File.
Fig 3
Fig 3. Examples of balance measures using Shannon Entropy (see Eq 7) across AL iterations.
Results for (A) an unbalanced data set, CDR, and (B) a balanced data set, Nary-DGV. The measures are averaged over each iteration. The same behaviours were observed for the other data sets with the same distribution (S3 File). For each data set, each panel highlights the results for the specific AL strategy and shows the others in grey.
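The exact form of Eq 7 is not reproduced on this page; as an assumption, a normalised Shannon entropy of the label distribution captures the same idea of a balance measure (1.0 for a perfectly balanced set, 0.0 when a single class dominates). A short illustrative sketch:

import math
from collections import Counter

def label_balance(labels):
    """Normalised Shannon entropy of a collection of labels.

    Returns 1.0 for a perfectly balanced set and 0.0 when only one
    class is present. This is one common way to express a balance
    measure; the exact normalisation of Eq 7 in the paper may differ.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    if k <= 1:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)

# Example: a skewed batch (as in CDR) vs. a balanced batch (as in Nary-DGV)
print(label_balance([0] * 9 + [1] * 1))   # ~0.47, unbalanced
print(label_balance([0] * 5 + [1] * 5))   # 1.0, balanced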
Fig 4
Fig 4. Fraction of positive instances in the training set for the unbalanced data sets.
Data sets are as follows: (A) AIMED, (B) BioRED, (C) CDR, (D) ChemProt and (E) DDI. The measures are averaged over each iteration. For each data set, each panel highlights the results for a strategy and greys out the others.
Fig 5
Fig 5. Fraction of positive instances in the training set for the balanced data sets.
Data sets are as follows: (A) Nary-DGV and (B) Nary-DV. The measures are averaged over each iteration. For each data set, each panel highlights the results for a strategy and greys out the others.
