Med Image Underst Anal. 2022 Jul;13413:187-198. doi: 10.1007/978-3-031-12053-4_14.

Weakly Supervised Captioning of Ultrasound Images


Mohammad Alsharid et al. Med Image Underst Anal. 2022 Jul.

Abstract

Medical image captioning models generate text describing the semantic content of an image, helping non-experts understand and interpret it. We propose a weakly supervised approach that improves the performance of image captioning models on small image-text datasets by leveraging a large anatomically-labelled image classification dataset. Our method uses an encoder-decoder sequence-to-sequence model to generate pseudo-captions (weak labels) for caption-less but anatomically-labelled (class-labelled) images. The augmented dataset is then used to train an image captioning model in a weakly supervised manner. For fetal ultrasound, the proposed augmentation outperforms the baseline on semantics- and syntax-based metrics, with nearly twice the improvement on BLEU-1 and ROUGE-L. Moreover, models trained with the proposed data augmentation are superior to those trained with existing regularization techniques. This work enables seamless automatic annotation of images that lack human-prepared descriptive captions for training image captioning models. Pseudo-captions in the training data are particularly useful for medical image captioning, where obtaining real captions demands significant time and effort from medical experts.
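The augmentation workflow described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: `make_pseudo_caption` stands in for the trained sequence-to-sequence model with a simple template, and the dataset shapes and field names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Example:
    image_id: str
    caption: str      # real caption or pseudo-caption
    is_pseudo: bool   # weak-label flag


def make_pseudo_caption(label: str, nouns: list[str]) -> str:
    """Placeholder for the trained sequence-to-sequence model, which maps an
    anatomical label plus extracted nouns to a pseudo-caption. Here a fixed
    template keeps the sketch runnable."""
    return f"this is the {label} " + " ".join(nouns)


def augment(captioned, class_labelled):
    """Merge the small image-text dataset with weak labels generated for the
    caption-less, class-labelled images."""
    data = [Example(img, cap, False) for img, cap in captioned]
    data += [Example(img, make_pseudo_caption(label, nouns), True)
             for img, (label, nouns) in class_labelled]
    return data


# Illustrative inputs (hypothetical image ids and text).
real = [("img001", "the fetal spine is visible here")]
weak = [("img002", ("spine", ["end", "spine"]))]
train_set = augment(real, weak)   # two examples: one real, one pseudo
```

The captioning model is then trained on `train_set`, treating the pseudo-captioned examples as weak supervision alongside the real image-text pairs.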

Keywords: Data Augmentation; Fetal Ultrasound; Image Captioning.


Figures

Fig. 1
The sequence-to-sequence model architecture translates the input sequence (the anatomical label followed by the extracted nouns) into a pseudo-caption. In the example input sequence, the first ‘spine’ is the label, and ‘end’ and the second ‘spine’ are the extracted nouns. Both the encoder and decoder consist of 100 LSTM units. The word embedding size is 300.
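The encoder-decoder described in the caption can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions, not the authors' code (their framework is not named here): the vocabulary size and token ids are placeholders, while the 300-dim embeddings and 100 LSTM units follow the caption.

```python
import torch
import torch.nn as nn


class Seq2Seq(nn.Module):
    """Encoder-decoder sketch matching the figure: 300-dim word embeddings,
    100 LSTM units in both encoder and decoder. vocab_size is an assumption."""

    def __init__(self, vocab_size=1000, embed_dim=300, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # Encode the label + extracted nouns; pass the final state to the decoder.
        _, state = self.encoder(self.embed(src))
        dec_out, _ = self.decoder(self.embed(tgt), state)
        return self.out(dec_out)   # logits over the vocabulary per target step


model = Seq2Seq()
src = torch.tensor([[1, 2, 1]])      # e.g. ids for 'spine', 'end', 'spine'
tgt = torch.tensor([[3, 4, 5, 6]])   # shifted pseudo-caption tokens
logits = model(src, tgt)             # shape (1, 4, vocab_size)
```

At generation time the decoder would instead be run autoregressively, feeding back its own predictions until an end token is produced.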
Fig. 2
The image captioning model. ‘Max length’ is the maximum number of words a caption for the given anatomical structure may contain. The LSTM-RNN consists of 300 units.
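A merge-style captioning model of the kind the caption describes can be sketched as below. Again a hedged sketch, not the authors' architecture: the image-feature dimension, vocabulary size, and the concatenation scheme are assumptions; only the 300 LSTM units come from the caption.

```python
import torch
import torch.nn as nn


class CaptioningModel(nn.Module):
    """Sketch of an image captioning model: a 300-unit LSTM encodes the
    partial caption, image features are projected to the same width, and
    the two are merged to predict the next word. feat_dim and vocab_size
    are placeholder assumptions."""

    def __init__(self, feat_dim=2048, vocab_size=1000, embed_dim=300, hidden=300):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, img_feats, words):
        txt, _ = self.lstm(self.embed(words))         # (B, T, 300)
        img = self.img_proj(img_feats).unsqueeze(1)   # (B, 1, 300)
        img = img.expand(-1, txt.size(1), -1)         # repeat per time step
        return self.out(torch.cat([txt, img], dim=-1))  # next-word logits


model = CaptioningModel()
feats = torch.randn(2, 2048)            # hypothetical CNN image features
words = torch.randint(0, 1000, (2, 5))  # partial captions, batch of 2
logits = model(feats, words)            # shape (2, 5, vocab_size)
```

Decoding would proceed word by word up to ‘max length’ for the anatomical structure in question.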
Fig. 3
Qualitative results for different images. GT: ground truth as spoken by a sonographer; NP: model trained with no pseudo-captions; WD: model regularized with word dropout; WP: model trained with pseudo-captions (our proposed method).

