Diagnostics (Basel). 2024 May 22;14(11):1081.
doi: 10.3390/diagnostics14111081.

Improving the Generalizability and Performance of an Ultrasound Deep Learning Model Using Limited Multicenter Data for Lung Sliding Artifact Identification

Derek Wu et al. Diagnostics (Basel).

Abstract

Deep learning (DL) models for medical image classification frequently struggle to generalize to data from outside institutions, and additional clinical data are rarely collected to comprehensively assess and understand model performance amongst subgroups. Following the development of a single-center model to identify the lung sliding artifact on lung ultrasound (LUS), we pursued a validation strategy using external LUS data. As annotated LUS data are relatively scarce compared to other medical imaging data, we adopted a novel technique to optimize the use of limited external data to improve model generalizability. Externally acquired LUS data from three tertiary care centers, totaling 641 clips from 238 patients, were used to assess the baseline generalizability of our lung sliding model. We then employed our novel Threshold-Aware Accumulative Fine-Tuning (TAAFT) method to fine-tune the baseline model and determine the minimum amount of data required to achieve predefined performance goals. A subgroup analysis was also performed, and Grad-CAM++ explanations were examined. The final model was fine-tuned on one-third of the external dataset to achieve 0.917 sensitivity, 0.817 specificity, and 0.920 area under the receiver operating characteristic curve (AUC) on the external validation dataset, exceeding our predefined performance goals. Subgroup analyses identified the LUS characteristics that most challenged the model's performance. Grad-CAM++ saliency maps highlighted clinically relevant regions on M-mode images. We report a multicenter study that exploits limited available external data to improve the generalizability and performance of our lung sliding model while identifying poorly performing subgroups to inform future iterative improvements. This approach may contribute to efficiencies for DL researchers working with smaller quantities of external validation data.

Keywords: POCUS; artificial intelligence; deep learning; explainability; generalizability; lung sliding; lung ultrasound; multicenter; pneumothorax; ultrasound.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Schematic representation of our methods for data preprocessing through to M-mode creation and subsequent model development. (a) Frames in a 3 s LUS clip. (b) Vertical slice selection (red), restricted by pleural line ROI (green). (c) Vertical slicing across all frames. (d) Concatenating slices to form an M-mode image. (e) Obtaining the model’s prediction for the M-mode input image. (f) Final model output representing probability of absent lung sliding.
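The slicing-and-concatenation pipeline in panels (b)–(d) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; `build_m_mode`, the `clip` array, and the chosen `column` index are placeholder names, and the column is assumed to have already been selected within the pleural line ROI.

```python
import numpy as np

def build_m_mode(clip: np.ndarray, column: int) -> np.ndarray:
    """Take the chosen vertical slice (one pixel column) from every frame
    and concatenate the slices left to right, so time runs horizontally."""
    slices = [frame[:, column] for frame in clip]   # one (height,) slice per frame
    return np.stack(slices, axis=1)                 # shape: (height, num_frames)

# e.g. a 3 s clip at 30 fps with 256x256 frames (values here are synthetic)
clip = np.random.rand(90, 256, 256)
m_mode = build_m_mode(clip, column=128)             # column inside the pleural ROI
```

The resulting (height, num_frames) image is what panel (e) feeds to the classifier.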
Figure 2
Dataset splits and fine-tuned models for a single trial of the TAAFT method. Data are incrementally added to the training set used for fine-tuning (green) and removed from the variable-sized validation set (light blue) while maintaining a fixed-size validation set (dark blue). This process continues until the two validation sets are the same. Three new models (M1, M2, and M3) are produced, each being fine-tuned using a different proportion of the dataset and evaluated on each validation set (variable-sized and fixed-size).
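One trial of this accumulative procedure might be sketched as follows. The `fine_tune` and `evaluate` callables, the specific proportions, and the 1/3-sized fixed validation split are placeholders standing in for the paper's actual training routine and schedule; only the sensitivity/specificity goals are taken from the text.

```python
def taaft_trial(external_data, base_model, fine_tune, evaluate,
                proportions=(1/3, 1/2, 2/3), goals=None):
    """Incrementally grow the fine-tuning set and shrink the variable-sized
    validation set until it equals the fixed-size set, returning the smallest
    proportion whose fine-tuned model meets the performance goals."""
    goals = goals or {"sensitivity": 0.901, "specificity": 0.793}
    n = len(external_data)
    fixed_val = external_data[2 * n // 3:]           # fixed-size validation set
    results = []
    for p in proportions:                            # one fine-tuned model per step
        train = external_data[:int(p * n)]           # accumulated fine-tuning data
        variable_val = external_data[int(p * n):]    # shrinks as training grows
        model = fine_tune(base_model, train)
        metrics = evaluate(model, variable_val)
        results.append((p, metrics))
        if all(metrics[m] >= goals[m] for m in goals):
            return p, model, results                 # goals met: stop early
    return None, None, results
```

At the final proportion the variable-sized set coincides with the fixed-size set, mirroring the stopping condition described in the caption.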
Figure 3
Specificity (a) and sensitivity (b) results of the successful five-trial TAAFT fine-tuning experiment. The mean (thick line) and individual (thin line) trial-wise metrics observed on the variable-sized (solid) and fixed-size (dashed) validation sets at each training set proportion (p_train) are shown. The predefined performance goals (sensitivity ≥ 0.901, specificity ≥ 0.793; shaded grey region) are met, on average, on the variable-sized validation set when M0 is fine-tuned on 1/3 of the external dataset.
Figure 4
Receiver operating characteristic (ROC) curves and confusion matrices for the five-trial TAAFT experiment (mean ± standard deviation) and the final model's performance on the variable-sized validation set at the optimal training proportion (p_train = 1/3). (a) ROC curves of the five-trial TAAFT experiment fine-tuned on 1/3 of the external dataset, with a mean AUC of 0.916 (standard deviation represented by the light blue outline), and (b) the corresponding confusion matrix. (c) ROC curve of the final model, yielding an AUC of 0.920, and (d) the corresponding confusion matrix on its variable-sized validation set.
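For reference, AUC equals the probability that a randomly chosen positive (absent lung sliding) example is scored above a randomly chosen negative one, with ties counting one half. A minimal stdlib-only version of this rank interpretation (names and toy values are illustrative, not from the study):

```python
def auc_score(y_true, y_score):
    """Area under the ROC curve via the rank interpretation: the chance a
    random positive outscores a random negative (ties count one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfectly separated toy example scores 1.0
auc_score([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.1])  # -> 1.0
```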
Figure 5
Subgroup analysis results of the final model on its variable-sized validation set. Sensitivity (circles) and specificity (squares) are stratified by (a) imaging preset and (b) institution. The validation set’s subgroup distribution is reflected in the bottom panel of each subplot.
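Stratifying sensitivity and specificity by a subgroup label amounts to keeping one confusion matrix per group. A minimal sketch, assuming `records` pairs each ground-truth label and prediction with its subgroup (e.g. imaging preset or institution); all names are illustrative:

```python
from collections import defaultdict

def stratified_metrics(records):
    """records: iterable of (label, prediction, subgroup), 1 = absent sliding."""
    groups = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for y, y_hat, g in records:                       # one confusion matrix per group
        c = groups[g]
        if y == 1:
            c["tp" if y_hat == 1 else "fn"] += 1
        else:
            c["tn" if y_hat == 0 else "fp"] += 1
    out = {}
    for g, c in groups.items():
        out[g] = {
            "sensitivity": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else None,
            "specificity": c["tn"] / (c["tn"] + c["fp"]) if c["tn"] + c["fp"] else None,
            "n": sum(c.values()),                     # subgroup size, as in the bottom panels
        }
    return out
```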
Figure 6
M-mode and corresponding Grad-CAM++ [23] saliency map images from (a) a true positive (D462) example and (b) a true negative (D117) example taken from the final model's variable-sized validation set. Highly important features relating to the model's prediction are highlighted in red and correspond to the regions clinicians assess for lung sliding.
Figure 7
M-mode and corresponding Grad-CAM++ [27] saliency map image from a false positive prediction. The saliency map highlights the subcutaneous tissue above the pleural line that does not move with respiration, thus mimicking an absent lung sliding pattern. The significant depth at which this LUS clip was acquired likely contributed to the model’s incorrect prediction as well.

References

    1. Kim J., Hong J., Park H. Prospects of deep learning for medical imaging. Precis. Future Med. 2018;2:37–52. doi: 10.23838/pfm.2018.00030.
    2. Shen D., Wu G., Suk H.-I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017;19:221–248. doi: 10.1146/annurev-bioeng-071516-044442.
    3. Duran-Lopez L., Dominguez-Morales J.P., Corral-Jaime J., Diaz S.V., Linares-Barranco A. COVID-XNet: A custom deep learning system to diagnose and locate COVID-19 in chest X-ray images. Appl. Sci. 2020;10:5683. doi: 10.3390/app10165683.
    4. Ozdemir O., Russell R.L., Berlin A.A. A 3D probabilistic deep learning system for detection and diagnosis of lung cancer using low-dose CT scans. IEEE Trans. Med. Imaging. 2019;39:1419–1429. doi: 10.1109/TMI.2019.2947595.
    5. Wang J., Yang X., Cai H., Tan W., Jin C., Li L. Discrimination of breast cancer with microcalcifications on mammography by deep learning. Sci. Rep. 2016;6:27327. doi: 10.1038/srep27327.