Nat Commun. 2022 Mar 4;13(1):1161. doi: 10.1038/s41467-022-28818-3.

Active label cleaning for improved dataset quality under resource constraints


Mélanie Bernhardt et al. Nat Commun. 2022.

Abstract

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term "active label cleaning". We propose to rank instances according to the estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.
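To make the ranking step concrete, the sketch below scores samples in the way suggested by the CE and AMB quantities shown later in Fig. 4: label noisiness as the cross-entropy of the observed label under a trained model's posterior, and labelling difficulty as the entropy of that posterior. This is a minimal illustrative sketch, not the authors' implementation; the array names and the exact combination of the two scores are assumptions.

```python
import numpy as np

def relabelling_priority(posteriors, observed_onehot, eps=1e-12):
    """Rank samples for re-annotation (highest priority first).

    posteriors:      (N, C) class posteriors from a model trained on noisy data.
    observed_onehot: (N, C) one-hot encoding of the current (noisy) labels.
    """
    p = np.clip(posteriors, eps, 1.0)
    # Label noisiness: cross-entropy of the observed label under the model (CE).
    ce = -np.sum(observed_onehot * np.log(p), axis=1)
    # Sample ambiguity: entropy of the posterior (AMB).
    amb = -np.sum(p * np.log(p), axis=1)
    # Clear noise has high CE and low AMB; penalising ambiguity pushes
    # clearly mislabelled samples ahead of inherently difficult ones.
    return np.argsort(-(ce - amb))

# Toy usage: sample 0 looks clearly mislabelled, so it is ranked first,
# and the top of the ranking is relabelled until the budget runs out.
posteriors = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
observed = np.array([[0, 1], [1, 0], [0, 1]])
print(relabelling_priority(posteriors, observed))  # -> [0 1 2]
```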


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of the proposed active label cleaning.
A dataset with noisy labels is sorted to prioritise clearly mislabelled samples, maximising the number of corrected samples given a fixed relabelling budget.
Fig. 2. Image labelling can become difficult due to ambiguity in input space.
The top row shows the spectrum of ambiguity for cat images sampled from the CIFAR10H dataset. The 2D plot illustrates the two types of mislabelled samples: clear noise and difficult cases. We expect the former to be adjacent to semantically similar samples with a different label, and the latter to be closer to the optimal decision boundary.
Fig. 3. Results of the label cleaning simulation on training datasets.
a NoisyCXR (η = 12.7%); b CIFAR10H (η = 15%). For a given number of collected labels (x-axis), a cost-efficient algorithm should maximise the number of samples that are correctly labelled after cleaning (y-axis). The correctness of acquired labels is measured in terms of accuracy. The area under the curve (AUC) is reported as a summary of the cleaning efficiency of each selector across different relabelling budgets. The upper and lower bounds are set by the oracle (blue) and random sampling (red) strategies. The pink curve in (a) illustrates the practical “model upper bound” on cleaning performance when the selector model is trained solely on clean labels; its performance is bounded by the model's capacity to fit the data. Shaded areas represent ± one standard deviation over 5 random seeds for relabelling.
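As a rough illustration of how such a simulation can be summarised, the sketch below replays a ranked relabelling sequence and reports label accuracy as a function of the budget. It simplifies the paper's setup by assuming a single re-annotation recovers the true label (in practice, ambiguous samples need several annotations); all array names are hypothetical.

```python
import numpy as np

def cleaning_curve(priority, observed, true):
    """Label accuracy after each acquired label, replayed in ranked order.
    Simplification: one relabelling is assumed to yield the true label."""
    labels = observed.copy()
    accuracy = []
    for i in priority:
        labels[i] = true[i]  # simulated expert relabelling
        accuracy.append(np.mean(labels == true))
    return np.array(accuracy)

# Toy example: sample 2 is mislabelled and ranked first by the selector.
true = np.array([0, 1, 1, 0])
observed = np.array([0, 1, 0, 0])
curve = cleaning_curve(np.array([2, 0, 1, 3]), observed, true)
auc = curve.mean()  # area-under-the-curve summary, as reported in Fig. 3
print(curve, auc)   # -> [1. 1. 1. 1.] 1.0
```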
Fig. 4. Ranking of CIFAR10H samples (15% initial noise rate) by the SSL-Linear algorithm.
The top row illustrates a representative subset of images ranked in the top 10th percentile, with the highest priority for relabelling. Similarly, the second and third rows correspond to the 25–50 and 50–75 percentile ranges, respectively. At the bottom, ambiguous examples that fall into the bottom 10% of the list (N = 2241) are shown. Each example is shown together with its true label distribution to highlight the associated labelling difficulty. This can be compared against the label noisiness (cross-entropy; CE) and sample ambiguity (entropy; AMB) scores predicted by the algorithm (see Eq. (2)), shown above each image. As noted earlier, adjudicating the samples shown at the bottom requires a large number of re-annotations to form a consensus. The authors in ref. explore the causes of ambiguity observed in these samples.
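The "true label distribution" of a CIFAR10H image is the empirical distribution of its many independent human annotations, and its entropy is a natural measure of the labelling difficulty this caption refers to. A small sketch, with illustrative inputs (the ~50 annotations per image is indicative of CIFAR10H rather than a figure taken from this paper):

```python
import numpy as np

def label_distribution_entropy(annotations, num_classes=10, eps=1e-12):
    """Empirical label distribution and its entropy for a single image,
    given repeated annotations (CIFAR10H collects roughly 50 per image)."""
    counts = np.bincount(annotations, minlength=num_classes)
    dist = counts / counts.sum()
    entropy = -np.sum(dist * np.log(np.clip(dist, eps, 1.0)))
    return dist, entropy

# An unambiguous image versus a genuinely ambiguous one.
_, easy = label_distribution_entropy(np.array([3] * 50))             # all agree
_, hard = label_distribution_entropy(np.array([3] * 25 + [5] * 25))  # 50/50 split
print(easy, hard)  # -> 0.0, ~0.693 (= ln 2)
```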
Fig. 5. Chest X-ray images selected from the NoisyCXR dataset that do not carry a “pneumonia” label in the NIH dataset.
a Correctly identified noise case, with pneumonia-like opacities shown with bounding boxes. b Wrongly flagged sample with a correct label; here the model confuses lung nodules with pneumonia-like opacities. c A difficult case with a subtle abnormality, where radiologists indicated medium confidence in their diagnosis, as shown by the highlighted region (RSNA study).
Fig. 6. Understanding label noise patterns.
a Different label noise models used in robust learning. The statistical dependence between the input image (X), true label (Y), observed label (Ŷ), and error occurrence (E) is shown with arrows (adapted from Frénay et al.). b, c CIFAR10H class confusion matrices (temperature τ = 2) for all samples (b) and difficult samples only (c).
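The sketch below shows one plausible way to assemble such a class confusion matrix from per-image label distributions; the temperature scaling p^(1/τ) with τ = 2 (which flattens each distribution before averaging) and all names here are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def class_confusion_matrix(label_dists, true_classes, num_classes, tau=2.0):
    """Average temperature-scaled label distributions per true class.

    label_dists:  (N, C) empirical label distribution per image.
    true_classes: (N,) majority-vote (true) class per image.
    tau:          temperature; tau=2 flattens each distribution (assumption).
    """
    scaled = label_dists ** (1.0 / tau)
    scaled = scaled / scaled.sum(axis=1, keepdims=True)
    cm = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        rows = scaled[true_classes == c]
        if len(rows) > 0:
            cm[c] = rows.mean(axis=0)  # row c: labels that class c attracts
    return cm

# Toy usage with 3 classes and 2 images.
dists = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]])
print(class_confusion_matrix(dists, np.array([0, 2]), num_classes=3))
```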

References

    1. Northcutt, C. G., Athalye, A. & Lin, J. Pervasive label errors in ML benchmark test sets, consequences, and benefits. In NeurIPS 2020 Workshop on Security and Data Curation (2020).
    2. Majkowska, A. et al. Chest radiograph interpretation with deep learning models: assessment with radiologist-adjudicated reference standards and population-adjusted evaluation. Radiology 294, 421–431 (2020). doi: 10.1148/radiol.2019191293.
    3. Wang, X. et al. ChestX-Ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3462–3471 (IEEE, 2017).
    4. Beyer, L., Hénaff, O. J., Kolesnikov, A., Zhai, X. & van den Oord, A. Are we done with ImageNet? arXiv preprint, doi: 10.48550/arXiv.2006.07159 (2020).
    5. Peterson, J. C., Battleday, R. M., Griffiths, T. L. & Russakovsky, O. Human uncertainty makes classification more robust. In Proceedings of the IEEE International Conference on Computer Vision, 9617–9626 (IEEE, 2019).
