Sci Rep. 2024 Oct 22;14(1):24842.
doi: 10.1038/s41598-024-76369-y.

Efficient labeling of french mammogram reports with MammoBERT

Nazanin Dehghani et al. Sci Rep. 2024.

Abstract

Recent advances in deep learning and natural language processing (NLP) have broadened opportunities for automatic text processing in the medical field. However, developing models for low-resource languages like French is hampered by limited datasets, often a consequence of legal restrictions. Large-scale training of medical imaging models often requires extracting labels from radiology text reports. Current methods for report labeling rely primarily on sophisticated feature engineering based on medical domain knowledge or on manual annotations by radiologists, both of which can be labor-intensive. In this work, we introduce a BERT-based approach for the efficient labeling of French mammogram image reports. Our method leverages both the expansive scale of existing rule-based systems and the precision of radiologist annotations: the model is first fine-tuned on a limited dataset of radiologist annotations and then trained on annotations generated by a rule-based labeler. Our experimental results show that the final model, MammoBERT, significantly outperforms the rule-based labeler while reducing the need for radiologist annotations during training. This research not only advances the state of the art in medical image report labeling but also offers an efficient and effective solution for large-scale medical imaging model development.

Keywords: Breast cancer detection; Deep learning; Information extraction; Mammography report labeling; Natural language processing.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Architecture of the Mammography Report Labeling Approach. This two-phase method begins with Phase 1, where a pre-trained BERT-based model is initially trained on a small dataset of radiologist annotations (2k). In Phase 2, the model undergoes fine-tuning through an active learning loop, which integrates a combination of manual annotations (2k) selected via uncertainty sampling, and a larger set of automatic rule-based labels (40k) acquired through agreement sampling.
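The two selection steps named in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that uncertainty is measured by the entropy of the model's predicted class probabilities and that "agreement" means the model's top prediction matches the rule-based label. The function names, report ids, and probabilities are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_sample(probs_by_report, budget):
    """Pick the `budget` reports the model is least sure about.

    Assumption: uncertainty = predictive entropy. These reports would be
    sent to a radiologist for manual annotation.
    """
    ranked = sorted(
        probs_by_report,
        key=lambda rid: entropy(probs_by_report[rid]),
        reverse=True,
    )
    return ranked[:budget]


def agreement_sample(probs_by_report, rule_labels):
    """Keep reports whose rule-based label matches the model's top class.

    Assumption: agreement = argmax of the model equals the rule-based label.
    These reports would keep their automatic label for further training.
    """
    agreed = []
    for rid, probs in probs_by_report.items():
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if rid in rule_labels and predicted == rule_labels[rid]:
            agreed.append(rid)
    return agreed


# Toy data: class probabilities per report and labels from a rule-based labeler.
probs = {
    "r1": [0.51, 0.49],
    "r2": [0.90, 0.10],
    "r3": [0.20, 0.80],
    "r4": [0.55, 0.45],
}
rule_labels = {"r1": 0, "r2": 0, "r3": 0, "r4": 0}

to_annotate = uncertainty_sample(probs, budget=2)
auto_labeled = agreement_sample(probs, rule_labels)
```

In this toy run the two near-50/50 reports are routed to manual annotation, while the report on which the model and the rule-based labeler disagree is left out of the automatically labeled set.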
Fig. 2
Schematic Representation of the Two-Level Sub-Classification Model. The first level involves the Surgery_Presence model, which performs binary classification to determine the presence or absence of prior surgery. The second level, applicable only to cases with prior surgery history, involves the Surgery_Laterality model, which further classifies the surgery into three categories: left surgery, right surgery, or bilateral surgeries.
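The cascade described in this caption can be sketched as a small routing function. This is an illustrative sketch only: the two classifiers are passed in as plain callables, and the keyword-based stand-ins below are hypothetical toys, not the Surgery_Presence and Surgery_Laterality models themselves.

```python
from typing import Callable, Dict, Optional

LATERALITY_CLASSES = ("left", "right", "bilateral")


def label_surgery(
    report: str,
    surgery_presence: Callable[[str], bool],
    surgery_laterality: Callable[[str], str],
) -> Dict[str, Optional[object]]:
    """Run the first-level classifier, then the second level only if needed."""
    has_surgery = surgery_presence(report)
    laterality = None
    if has_surgery:
        # The second level applies only to reports with prior surgery.
        laterality = surgery_laterality(report)
        if laterality not in LATERALITY_CLASSES:
            raise ValueError(f"unexpected laterality class: {laterality!r}")
    return {"surgery": has_surgery, "laterality": laterality}


# Hypothetical keyword stand-ins for the two fine-tuned models.
def toy_presence(report: str) -> bool:
    return "mastectomie" in report.lower()


def toy_laterality(report: str) -> str:
    text = report.lower()
    left, right = "gauche" in text, "droite" in text
    if left and right:
        return "bilateral"
    return "left" if left else "right"


no_surgery = label_surgery("Pas d'antécédent chirurgical.", toy_presence, toy_laterality)
left_surgery = label_surgery("Antécédent de mastectomie gauche.", toy_presence, toy_laterality)
```

Keeping the second-level call behind the presence check mirrors the caption: laterality is only predicted for reports that the first level flags as having prior surgery.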
