Diagnostics (Basel). 2024 May 22;14(11):1081.
doi: 10.3390/diagnostics14111081.

Improving the Generalizability and Performance of an Ultrasound Deep Learning Model Using Limited Multicenter Data for Lung Sliding Artifact Identification

Derek Wu et al. Diagnostics (Basel).

Abstract

Deep learning (DL) models for medical image classification frequently struggle to generalize to data from outside institutions, and additional clinical data are rarely collected to comprehensively assess and understand model performance amongst subgroups. Following the development of a single-center model to identify the lung sliding artifact on lung ultrasound (LUS), we pursued a validation strategy using external LUS data. As annotated LUS data are relatively scarce compared to other medical imaging data, we adopted a novel technique to optimize the use of limited external data to improve model generalizability. Externally acquired LUS data from three tertiary care centers, totaling 641 clips from 238 patients, were used to assess the baseline generalizability of our lung sliding model. We then employed our novel Threshold-Aware Accumulative Fine-Tuning (TAAFT) method to fine-tune the baseline model and determine the minimum amount of data required to achieve predefined performance goals. A subgroup analysis was also performed, and Grad-CAM++ explanations were examined. The final model was fine-tuned on one-third of the external dataset to achieve 0.917 sensitivity, 0.817 specificity, and 0.920 area under the receiver operating characteristic curve (AUC) on the external validation dataset, exceeding our predefined performance goals. Subgroup analyses identified the LUS characteristics that most challenged the model's performance. Grad-CAM++ saliency maps highlighted clinically relevant regions on M-mode images. We report a multicenter study that exploits limited available external data to improve the generalizability and performance of our lung sliding model while identifying poorly performing subgroups to inform future iterative improvements. This approach may contribute to efficiencies for DL researchers working with smaller quantities of external validation data.

Keywords: POCUS; artificial intelligence; deep learning; explainability; generalizability; lung sliding; lung ultrasound; multicenter; pneumothorax; ultrasound.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Schematic representation of our methods for data preprocessing through to M-mode creation and subsequent model development. (a) Frames in a 3 s LUS clip. (b) Vertical slice selection (red), restricted by pleural line ROI (green). (c) Vertical slicing across all frames. (d) Concatenating slices to form an M-mode image. (e) Obtaining the model’s prediction for the M-mode input image. (f) Final model output representing probability of absent lung sliding.
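The slicing-and-concatenation pipeline in panels (b)–(d) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; `build_m_mode`, the `clip` array, and the chosen `column` index are placeholder names, and the column is assumed to have already been selected within the pleural line ROI.

```python
import numpy as np

def build_m_mode(clip: np.ndarray, column: int) -> np.ndarray:
    """Take the chosen vertical slice (one pixel column) from every frame
    and concatenate the slices left to right, so time runs horizontally."""
    slices = [frame[:, column] for frame in clip]   # one (height,) slice per frame
    return np.stack(slices, axis=1)                 # shape: (height, num_frames)

# e.g. a 3 s clip at 30 fps with 256x256 frames (values here are synthetic)
clip = np.random.rand(90, 256, 256)
m_mode = build_m_mode(clip, column=128)             # column inside the pleural ROI
```

The resulting (height, num_frames) image is what panel (e) feeds to the classifier.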
Figure 2
Dataset splits and fine-tuned models for a single trial of the TAAFT method. Data are incrementally added to the training set used for fine-tuning (green) and removed from the variable-sized validation set (light blue) while maintaining a fixed-size validation set (dark blue). This process continues until the two validation sets are the same. Three new models (M1, M2, and M3) are produced, each being fine-tuned using a different proportion of the dataset and evaluated on each validation set (variable-sized and fixed-size).
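One trial of this accumulative procedure might be sketched as follows. The `fine_tune` and `evaluate` callables, the specific proportions, and the 1/3-sized fixed validation split are placeholders standing in for the paper's actual training routine and schedule; only the sensitivity/specificity goals are taken from the text.

```python
def taaft_trial(external_data, base_model, fine_tune, evaluate,
                proportions=(1/3, 1/2, 2/3), goals=None):
    """Incrementally grow the fine-tuning set and shrink the variable-sized
    validation set until it equals the fixed-size set, returning the smallest
    proportion whose fine-tuned model meets the performance goals."""
    goals = goals or {"sensitivity": 0.901, "specificity": 0.793}
    n = len(external_data)
    fixed_val = external_data[2 * n // 3:]           # fixed-size validation set
    results = []
    for p in proportions:                            # one fine-tuned model per step
        train = external_data[:int(p * n)]           # accumulated fine-tuning data
        variable_val = external_data[int(p * n):]    # shrinks as training grows
        model = fine_tune(base_model, train)
        metrics = evaluate(model, variable_val)
        results.append((p, metrics))
        if all(metrics[m] >= goals[m] for m in goals):
            return p, model, results                 # goals met: stop early
    return None, None, results
```

At the final proportion the variable-sized set coincides with the fixed-size set, mirroring the stopping condition described in the caption.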
Figure 3
Specificity (a) and sensitivity (b) results of the successful five-trial TAAFT fine-tuning experiment. The mean (thick line) and individual (thin line) trial-wise metrics observed on the variable-sized (solid) and fixed-size (dashed) validation sets at each training set proportion (p_train) are shown. The predefined performance goals (sensitivity ≥ 0.901, specificity ≥ 0.793; shaded grey region) are met, on average, on the variable-sized validation set when M0 is fine-tuned on 1/3 of the external dataset.
Figure 4
Receiver operating characteristic (ROC) curves and confusion matrices for the five-trial TAAFT experiment (mean ± standard deviation) and the final model's performance on the variable-sized validation set at the optimal training proportion (p_train = 1/3). (a) ROC curves of the five-trial TAAFT experiment fine-tuned on 1/3 of the external dataset, with a mean AUC of 0.916 (standard deviation represented by the light blue outline), and (b) the corresponding confusion matrix. (c) ROC curve of the final model, yielding an AUC of 0.920, and (d) the corresponding confusion matrix on its variable-sized validation set.
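For reference, AUC equals the probability that a randomly chosen positive (absent lung sliding) example is scored above a randomly chosen negative one, with ties counting one half. A minimal stdlib-only version of this rank interpretation (names and toy values are illustrative, not from the study):

```python
def auc_score(y_true, y_score):
    """Area under the ROC curve via the rank interpretation: the chance a
    random positive outscores a random negative (ties count one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfectly separated toy example scores 1.0
auc_score([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.1])  # -> 1.0
```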
Figure 5
Subgroup analysis results of the final model on its variable-sized validation set. Sensitivity (circles) and specificity (squares) are stratified by (a) imaging preset and (b) institution. The validation set’s subgroup distribution is reflected in the bottom panel of each subplot.
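Stratifying sensitivity and specificity by a subgroup label amounts to keeping one confusion matrix per group. A minimal sketch, assuming `records` pairs each ground-truth label and prediction with its subgroup (e.g. imaging preset or institution); all names are illustrative:

```python
from collections import defaultdict

def stratified_metrics(records):
    """records: iterable of (label, prediction, subgroup), 1 = absent sliding."""
    groups = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for y, y_hat, g in records:                       # one confusion matrix per group
        c = groups[g]
        if y == 1:
            c["tp" if y_hat == 1 else "fn"] += 1
        else:
            c["tn" if y_hat == 0 else "fp"] += 1
    out = {}
    for g, c in groups.items():
        out[g] = {
            "sensitivity": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else None,
            "specificity": c["tn"] / (c["tn"] + c["fp"]) if c["tn"] + c["fp"] else None,
            "n": sum(c.values()),                     # subgroup size, as in the bottom panels
        }
    return out
```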
Figure 6
M-mode and corresponding Grad-CAM++ [23] saliency map images from (a) a true positive (D462) example and (b) a true negative (D117) example taken from the final model's variable-sized validation set. Highly important features relating to the model's prediction are highlighted in red and correspond to the regions clinicians assess for lung sliding.
Figure 7
M-mode and corresponding Grad-CAM++ [27] saliency map image from a false positive prediction. The saliency map highlights the subcutaneous tissue above the pleural line that does not move with respiration, thus mimicking an absent lung sliding pattern. The significant depth at which this LUS clip was acquired likely contributed to the model’s incorrect prediction as well.

References

    1. Kim J., Hong J., Park H. Prospects of deep learning for medical imaging. Precis. Future Med. 2018;2:37–52. doi: 10.23838/pfm.2018.00030.
    2. Shen D., Wu G., Suk H.-I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017;19:221–248. doi: 10.1146/annurev-bioeng-071516-044442.
    3. Duran-Lopez L., Dominguez-Morales J.P., Corral-Jaime J., Diaz S.V., Linares-Barranco A. COVID-XNet: A custom deep learning system to diagnose and locate COVID-19 in chest X-ray images. Appl. Sci. 2020;10:5683. doi: 10.3390/app10165683.
    4. Ozdemir O., Russell R.L., Berlin A.A. A 3D probabilistic deep learning system for detection and diagnosis of lung cancer using low-dose CT scans. IEEE Trans. Med. Imaging. 2019;39:1419–1429. doi: 10.1109/TMI.2019.2947595.
    5. Wang J., Yang X., Cai H., Tan W., Jin C., Li L. Discrimination of breast cancer with microcalcifications on mammography by deep learning. Sci. Rep. 2016;6:27327. doi: 10.1038/srep27327.