Efficient labeling of french mammogram reports with MammoBERT

Nazanin Dehghani¹, Vera Saliba-Colombani², Aurélien Chick², Morgane Heng², Grégory Operto², Pierre Fillard²

Affiliations

¹ Therapixel Company, 1 Imp. Reille, 75014, Paris, France. ndehghani@therapixel.com.
² Therapixel Company, 1 Imp. Reille, 75014, Paris, France.

PMID: 39438627
PMCID: PMC11496794
DOI: 10.1038/s41598-024-76369-y

Nazanin Dehghani et al. Sci Rep. 2024 Oct 22;14(1):24842. doi: 10.1038/s41598-024-76369-y.

Abstract

Recent advances in deep learning and natural language processing (NLP) have broadened opportunities for automatic text processing in the medical field. However, developing models for low-resource languages such as French is hampered by limited datasets, often because of legal restrictions. Large-scale training of medical imaging models often requires extracting labels from radiology text reports. Current report-labeling methods rely primarily on sophisticated feature engineering based on medical domain knowledge or on manual annotation by radiologists, both of which can be labor-intensive. In this work, we introduce a BERT-based approach for the efficient labeling of French mammogram reports. Our method leverages both the scale of existing rule-based systems and the precision of radiologist annotations: the model is first fine-tuned on a limited dataset of radiologist annotations and then trained further on annotations generated by a rule-based labeler. Our experiments show that the final model, MammoBERT, significantly outperforms the rule-based labeler while reducing the need for radiologist annotations during training. This research advances the state of the art in medical image report labeling and offers an efficient solution for large-scale medical imaging model development.

Keywords: Breast cancer detection; Deep learning; Information extraction; Mammography report labeling; Natural language processing.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Architecture of the Mammography Report Labeling Approach. This two-phase method begins with Phase 1, where a pre-trained BERT-based model is initially trained on a small dataset of radiologist annotations (~2k). In Phase 2, the model undergoes fine-tuning through an active learning loop, which integrates a combination of manual annotations (~2k) selected via uncertainty sampling, and a larger set of automatic rule-based labels (~40k) acquired through agreement sampling.

**Fig. 2**
Schematic Representation of the Two-Level Sub-Classification Model. The first level involves the Surgery_Presence model, which performs binary classification to determine the presence or absence of prior surgery. The second level, applicable only to cases with prior surgery history, involves the Surgery_Laterality model, which further classifies the surgery into three categories: left surgery, right surgery, or bilateral surgeries.
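
The two-level cascade described in this caption can be sketched as follows. This is only an illustration of the control flow: the keyword-based `toy_presence` and `toy_laterality` functions are hypothetical stand-ins for the Surgery_Presence and Surgery_Laterality classifiers, and the `"no_surgery"` output label is an assumed name, not one from the paper.

```python
def label_surgery(report, presence_model, laterality_model):
    """Run the second-level model only when the first level detects surgery."""
    if not presence_model(report):
        return "no_surgery"  # assumed label name for the negative class
    return laterality_model(report)  # "left", "right", or "bilateral"


def toy_presence(report):
    """Hypothetical stand-in for Surgery_Presence (binary)."""
    text = report.lower()
    return any(word in text for word in ("mastectomie", "tumorectomie"))


def toy_laterality(report):
    """Hypothetical stand-in for Surgery_Laterality (three classes)."""
    text = report.lower()
    left = "gauche" in text
    right = "droit" in text
    if "bilat" in text or (left and right):
        return "bilateral"
    # A three-class model always returns one class; this toy version
    # falls back to "right" when no side is mentioned.
    return "left" if left else "right"


reports = [
    "Antécédent de tumorectomie du sein gauche.",
    "Pas d'antécédent chirurgical.",
    "Mastectomie bilatérale en 2015.",
]
labels = [label_surgery(r, toy_presence, toy_laterality) for r in reports]
print(labels)
```

Splitting the task this way means the laterality classifier only ever sees reports that mention prior surgery, so it never has to represent a "no surgery" class.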
