Br J Dermatol. 2019 Jul;181(1):155-165.
doi: 10.1111/bjd.17189. Epub 2018 Oct 17.

Diagnostic accuracy of content-based dermatoscopic image retrieval with deep classification features


P Tschandl et al. Br J Dermatol. 2019 Jul.

Abstract

Background: Automated classification of medical images through neural networks can reach high accuracy rates but lacks interpretability.

Objectives: To compare the diagnostic accuracy obtained by using content-based image retrieval (CBIR) to retrieve visually similar dermatoscopic images with corresponding disease labels against predictions made by a neural network.

Methods: A neural network was trained to predict disease classes on dermatoscopic images from three retrospectively collected image datasets containing 888, 2750 and 16 691 images, respectively. Diagnosis predictions were made based on the most commonly occurring diagnosis in visually similar images, or based on the top-1 class prediction of the softmax output from the network. Outcome measures were area under the receiver operating characteristic curve (AUC) for predicting a malignant lesion, multiclass-accuracy and mean average precision (mAP), measured on unseen test images of the corresponding dataset.
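The retrieval-based prediction described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes deep feature vectors (e.g. from the trained network's penultimate layer) are already available as NumPy arrays, ranks retrieval images by cosine similarity, and returns the most commonly occurring diagnosis among the top k.

```python
import numpy as np

def cbir_predict(query_feat, retrieval_feats, retrieval_labels, k=16):
    """Predict a diagnosis as the most frequent label among the k most
    visually similar retrieval images (cosine similarity of deep features)."""
    q = query_feat / np.linalg.norm(query_feat)
    r = retrieval_feats / np.linalg.norm(retrieval_feats, axis=1, keepdims=True)
    sims = r @ q                        # cosine similarity to every retrieval image
    top_k = np.argsort(sims)[::-1][:k]  # indices of the k most similar images
    labels, counts = np.unique(retrieval_labels[top_k], return_counts=True)
    return labels[np.argmax(counts)]    # most commonly occurring diagnosis
```

The softmax alternative from the same paragraph would instead take the argmax of the network's output probabilities for the query image.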

Results: In all three datasets the skin cancer predictions from CBIR (evaluating the 16 most similar images) showed AUC values similar to softmax predictions (0·842, 0·806 and 0·852 vs. 0·830, 0·810 and 0·847, respectively; P > 0·99 for all). Similarly, the multiclass-accuracy of CBIR was comparable with softmax predictions. Compared with softmax predictions, networks trained for detecting only three classes performed better on a dataset with eight classes when using CBIR (mAP 0·184 vs. 0·368 and 0·198 vs. 0·403, respectively).

Conclusions: Presenting visually similar images based on features from a neural network shows comparable accuracy with the softmax probability-based diagnoses of convolutional neural networks. CBIR may be more helpful than a softmax classifier in improving diagnostic accuracy of clinicians in a routine clinical setting.


Figures

Figure 1
Positive examples of three query images (first column) and the corresponding most similar images found by content‐based image retrieval (CBIR). The retrieved images show similar dermatoscopic patterns that in the majority of cases correspond to the correct diagnosis. MEL, melanoma; SCC, squamous cell carcinoma; BCC, basal cell carcinoma.
Figure 2
Measured visual similarity (cosine similarity) of images with the same diagnosis (blue) compared with images of other diagnoses (red) within a dataset. Images of the same diagnosis are rated significantly higher in almost every subgroup, showing that automated measurements of visual similarity can differentiate between diagnoses within a retrieval dataset. Lines are drawn between values for a single query image, and rows denote the dataset used for training, queries and image retrieval. Differences were compared with a paired t‐test or a paired Wilcoxon signed‐rank test (W). ISIC2017, International Skin Imaging Collaboration; bcc, basal cell carcinoma; bkl, seborrhoeic keratoses; df, dermatofibromas; inflammatory, inflammatory lesions including dermatitis, lichen sclerosus, porokeratosis, rosacea, psoriasis, lupus erythematosus, bullous pemphigoid, lichen planus, granulomatous processes and artefacts; mel, melanoma; scc, squamous cell carcinoma. NS, nonsignificant: P > 0·05; **P < 0·01; ***P < 0·001.
Figure 3
Frequency of correct specific diagnoses (accuracy) within each dataset using either softmax‐based predictions (red) or content‐based image retrieval (CBIR) with varying numbers of retrieved similar images (black). Retrieving only a few images already outperforms softmax in the three‐class datasets (EDRA, ISIC2017), whereas in the eight‐class (PRIV) dataset more than 20 images are needed to approximate softmax‐based accuracy. ISIC2017, International Skin Imaging Collaboration.
Figure 4
Receiver operating characteristic curve for detecting melanoma when retrieving 16 similar images with content‐based image retrieval (CBIR) (grey), showing different thresholds for the number of malignant retrieved images required (‘predict melanoma when x of 16 retrieved images are melanomas’), alongside softmax‐based probabilities (red). Network training, query and retrieval images are from EDRA. AUC, area under the receiver operating characteristic curve.
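The thresholding scheme in this caption (‘predict melanoma when x of 16 retrieved images are melanomas’) can be sketched as a sweep over x, producing one operating point per threshold. This is an illustrative sketch, not the authors' code; `roc_points` and its inputs are hypothetical names.

```python
import numpy as np

def roc_points(counts, truths, k=16):
    """Trace ROC operating points for a count-based CBIR classifier.

    counts[i]: number of melanomas among the k images retrieved for query i
    truths[i]: True if query i is actually a melanoma
    """
    pts = []
    for x in range(k + 2):  # thresholds 0..k+1 cover the full curve
        pred = counts >= x  # predict melanoma when >= x of k retrievals are melanoma
        tpr = float(np.mean(pred[truths]))   # sensitivity at this threshold
        fpr = float(np.mean(pred[~truths]))  # 1 - specificity at this threshold
        pts.append((fpr, tpr))
    return pts
```

The AUC reported in the abstract would then be the area under the curve traced by these points.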
Figure 5
Mean average precision (mAP) of a ResNet‐50 network trained on EDRA dataset images. Predictions were made either through softmax probabilities (red line) or class‐frequencies of content‐based image retrieval (CBIR) (black). Softmax predictions perform worst on predicting PRIV dataset images, as the networks are not able to predict five of the eight classes in any case (first two columns, bottom row). CBIR retrieving from EDRA and ISIC2017 suffers from the same shortcomings, but was able to predict better when using PRIV‐source retrieval images (bottom right). In general, CBIR performs best when using retrieval images from the same source as the test images (descending diagonal), and here performed better on new data than softmax predictions. Re‐training the network on those new‐source images (blue) in turn outperformed CBIR again. ISIC2017, International Skin Imaging Collaboration.
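Mean average precision, the outcome measure shown here, can be sketched as the mean over classes of the average precision of a score ranking. This is a generic illustration, not the authors' evaluation code; the class scores could be softmax probabilities or the per-class frequencies among retrieved CBIR images.

```python
import numpy as np

def average_precision(scores, positives):
    """AP for one class: precision averaged over the ranks of positive items.

    scores: per-image score for this class; positives: 1 if the image truly
    belongs to the class, 0 otherwise.
    """
    order = np.argsort(scores)[::-1]          # rank images by descending score
    rel = positives[order]
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)
    precision_at_k = hits / (np.arange(len(rel)) + 1)
    return float(np.sum(precision_at_k * rel) / rel.sum())

def mean_average_precision(score_matrix, y_true, n_classes):
    """score_matrix[i, c]: score of image i for class c (e.g. softmax
    probability, or fraction of retrieved images with diagnosis c)."""
    aps = [average_precision(score_matrix[:, c], (y_true == c).astype(int))
           for c in range(n_classes)]
    return float(np.mean(aps))
```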
Figure 6
Mean cosine similarities of PRIV retrieval images with the same (blue) or a different (red) diagnosis for the corresponding PRIV query images. Cosine similarity is calculated by feature extraction via ResNet‐50 networks trained for classification on different training datasets (rows). Compared with the PRIV‐trained network, those trained on different sources (rows EDRA and ISIC2017) transfer their ability to distinguish specific diagnoses through visual similarity, except for seborrhoeic keratoses (bkl) cases. Lines are drawn between values for the same query image. W, paired Wilcoxon signed‐rank test used instead of paired t‐test; ISIC2017, International Skin Imaging Collaboration; bcc, basal cell carcinoma; df, dermatofibromas; mel, melanoma; scc, squamous cell carcinoma. NS, nonsignificant: P > 0·05; **P < 0·01; ***P < 0·001; grey indicators denote nonadjusted P‐values, as these comparisons were omitted during correction for multiple testing (see statistics section).

