Nat Biomed Eng. 2025 Apr;9(4):507-520. doi: 10.1038/s41551-024-01257-9. Epub 2024 Oct 1.

Accurate prediction of disease-risk factors from volumetric medical scans by a deep vision model pre-trained with 2D scans


Oren Avram et al. Nat Biomed Eng. 2025 Apr.

Abstract

The application of machine learning to tasks involving volumetric biomedical imaging is constrained by the limited availability of annotated datasets of three-dimensional (3D) scans for model training. Here we report a deep-learning model pre-trained on 2D scans (for which annotated data are relatively abundant) that accurately predicts disease-risk factors from 3D medical-scan modalities. The model, which we named SLIViT (for 'slice integration by vision transformer'), preprocesses a given volumetric scan into 2D images, extracts their feature map and integrates it into a single prediction. We evaluated the model in eight different learning tasks, including classification and regression for six datasets involving four volumetric imaging modalities (computed tomography, magnetic resonance imaging, optical coherence tomography and ultrasound). SLIViT consistently outperformed domain-specific state-of-the-art models and was typically as accurate as clinical specialists who had spent considerable time manually annotating the analysed scans. Automating diagnosis tasks involving volumetric scans may save valuable clinician hours, reduce data acquisition costs and duration, and help expedite medical research and clinical applications.

Conflict of interest statement

Competing interests: E.H. has an affiliation with Optum. S.R.S. has affiliations with AbbVie/Allergan, Alexion, Amgen, Apellis, ARVO, Astellas, Bayer, Biogen, Boehringer Ingelheim, Carl Zeiss Meditec, Centervue, Character, Eyepoint, Heidelberg, iCare, IvericBio, Janssen, Macula Society, Nanoscope, Nidek, NotalVision, Novartis, Optos, OTx, Pfizer, Regeneron, Roche, Samsung Bioepis and Topcon. The other authors declare no competing interests.

Figures

Extended Data Fig. 1 | PR AUC comparison of five models on four single-task AMD-biomarker classification problems when trained on fewer than 700 OCT volumes.
Shown are the PR AUC scores, as an alternative scoring metric, for the OCT experiments shown in Fig. 3. The left panel shows the performance when trained and tested on the Houston Dataset (see Supplementary Table 1). The right panel shows the performance when trained on the Houston Dataset and tested on the SLIVER-net Dataset (see Supplementary Table 2). The dashed lines represent the corresponding biomarker's positive-label prevalence, which is the expected performance of a random model. Box plot whiskers represent a 90% CI.
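As a quick illustration of that baseline (not from the paper): for an uninformative scorer, the PR AUC (average precision) converges to the positive-label prevalence, which a few lines of Python can confirm. The 15% prevalence below is an arbitrary illustrative value.

```python
# Hypothetical check, not the authors' code: a random scorer's PR AUC
# (average precision) approaches the positive-label prevalence.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = rng.random(100_000) < 0.15   # labels with 15% prevalence (illustrative)
y_score = rng.random(100_000)         # uninformative random scores

print(f"prevalence:          {y_true.mean():.3f}")                              # ~0.150
print(f"random-model PR AUC: {average_precision_score(y_true, y_score):.3f}")   # ~0.150
```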
Extended Data Fig. 2 | Precision-recall performance compared with clinical retina specialists' assessments.
Shown are the PR curves (blue) of SLIViT, as an alternative scoring metric, for the OCT experiments shown in Fig. 5. SLIViT was trained using fewer than 700 OCT volumes (Houston Dataset) and tested on an independent dataset (Pasadena Dataset). In each panel, the light-blue shaded area represents a 90% CI for SLIViT's performance, the red dot represents the retina clinical specialists' average performance, and the green asterisks correspond to the individual retina clinical specialists' assessments. Two of the clinical specialists obtained the exact same performance score for IHRF classification.
Extended Data Fig. 3 | SLIViT's performance in a frame-shuffling experiment.
Shown is the distribution of ROC AUC scores of 101 SLIViT models in four single-task classification problems of AMD high-risk factors (DV, IHRF, SDD and hDC) trained on a volumetric-OCT dataset. One model was trained on the OCT dataset in its original form, while the other 100 models were trained on copies of the dataset with randomly shuffled frames. The performance ranks of the former model (Original) within the performance distribution of the latter models (Shuffled) were 22, 34, 56 and 47 for DV, IHRF, SDD and hDC, respectively. The expected performance of a random classifier is 0.5. Box plot whiskers extend to the 5th and the 95th ranked models (out of the 100 shuffled models' performance distribution).
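A minimal sketch of such a permutation-style experiment (hypothetical: `train_and_eval`, `volumes` and `labels` are illustrative stand-ins for a SLIViT training routine and the OCT data, not the paper's code):

```python
# Hypothetical sketch of a frame-shuffling experiment; `train_and_eval`,
# `volumes` (n_samples, n_frames, H, W) and `labels` are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def shuffle_frames(volumes, rng):
    # Independently permute the frame (slice) axis of every volume.
    return np.stack([v[rng.permutation(v.shape[0])] for v in volumes])

original_auc = train_and_eval(volumes, labels)
shuffled_aucs = [train_and_eval(shuffle_frames(volumes, rng), labels)
                 for _ in range(100)]

# Rank of the original-order model within the shuffled-order distribution;
# a middling rank (as in the figure) suggests the model does not rely on
# frame order for these tasks.
rank = int(np.sum(np.array(shuffled_aucs) < original_auc)) + 1
print(f"original model rank: {rank} / {len(shuffled_aucs) + 1}")
```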
Extended Data Fig. 4 | Contribution of ImageNet and OCT B-scan pre-training to OCT-related downstream learning tasks.
Shown are the ROC (left) and PR (right) AUC scores across fine-tuned models for volumetric-OCT classification tasks initialized with five different sets of weights. 'Combined', the proposed SLIViT initialization, is ImageNet weight initialization followed by supervised pre-training on the Kermany Dataset. 'ssCombined' is ImageNet weight initialization followed by self-supervised pre-training on an unlabelled version of the Kermany Dataset. The expected ROC AUC score of a random model is 0.5. The dashed lines represent the corresponding biomarker's positive-label prevalence, which is the expected PR AUC score of a random model. Box plot whiskers represent a 90% CI.
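A minimal sketch of the 'Combined' initialization pipeline (hypothetical: `kermany_2d_loader` stands in for a labelled 2D B-scan data loader; the ConvNeXt backbone mirrors the extractor of Fig. 1 but is illustrative, not the released implementation):

```python
# Hypothetical sketch of the 'Combined' initialization: ImageNet weights,
# then supervised 2D pre-training. `kermany_2d_loader` is an illustrative
# stand-in for a labelled B-scan data loader (4 Kermany classes).
import torch
import torchvision

backbone = torchvision.models.convnext_tiny(weights="IMAGENET1K_V1")  # step 1
backbone.classifier[2] = torch.nn.Linear(768, 4)                      # 4-class head
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for images, labels in kermany_2d_loader:                              # step 2
    optimizer.zero_grad()
    loss_fn(backbone(images), labels).backward()
    optimizer.step()

# Step 3: `backbone.features` now initializes SLIViT's 2D feature-map
# extractor before fine-tuning on a volumetric downstream task.
```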
Extended Data Fig. 5 | Contribution of ImageNet and OCT B-scan pre-training to non-OCT-related downstream learning tasks.
Shown are the performance scores for the volumetric ultrasound and MRI regression tasks (R²) and the volumetric CT classification task (ROC AUC), initialized with five different sets of weights. 'Combined', the proposed SLIViT initialization, is ImageNet weight initialization followed by supervised pre-training on the Kermany Dataset. 'ssCombined' is ImageNet weight initialization followed by self-supervised pre-training on an unlabelled version of the Kermany Dataset. The expected R² and ROC AUC of a random model are 0 and 0.5, respectively. Box plot whiskers represent a 90% CI.
Extended Data Fig. 6 | Feature-similarity analysis between various pre-trained backbone projections.
Shown are nine scatterplots of a similarity analysis (CKA) comparing the projections of a biomedical-imaging dataset induced by different pre-trained backbones. Each panel corresponds to a different pair of pre-trained backbones (upper: biomedical pairs; middle: biomedical and ImageNet pairs; lower: biomedical and random pairs). In each panel, each of the 768 dots represents the similarity score computed for the projections induced by the corresponding filter. A dot is red if it falls within the top 5% of scores (and gray otherwise). The dashed lines show the average score measured for the color-corresponding set of dots.
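For reference, a minimal sketch of linear CKA between two feature matrices (not the authors' code; the figure scores projections per filter, whereas this computes one whole-representation score for illustration):

```python
# Minimal sketch of linear centred kernel alignment (CKA), not the authors'
# code: X and Y hold features for the same n samples from two backbones.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0)   # centre each feature dimension
    Y = Y - Y.mean(axis=0)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 768))                  # projections from backbone A
Y = 0.8 * X + 0.2 * rng.standard_normal((500, 768))  # a correlated backbone B
print(f"CKA(X, Y) = {linear_cka(X, Y):.3f}")  # high, but below...
print(f"CKA(X, X) = {linear_cka(X, X):.3f}")  # ...the identity score of 1.0
```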
Extended Data Fig. 7 | Contribution of 2D biomedical-imaging pre-training to 3D OCT-related downstream learning tasks.
Shown are the ROC AUC scores on four volumetric-OCT single-task classification problems. Four SLIViT models were evaluated in every classification problem. Each SLIViT model was initialized with ImageNet weights and then pre-trained on a 2D biomedical-imaging dataset of a different modality. The considered modalities were CT, X-ray, OCT and Mixed (containing all the images from the CT, X-ray and OCT datasets). SLIVER-net's performance (Domain-specific) is borrowed from Fig. 3. The expected performance of a random model is 0.5. Box plot whiskers represent a 90% CI.
Extended Data Fig. 8 | Contribution of 2D biomedical-imaging pre-training to 3D non-OCT-related downstream learning tasks.
Shown are the performance scores for the volumetric ultrasound and MRI regression tasks (R²) and the volumetric CT classification task (ROC AUC). Four SLIViT models were evaluated in every learning problem. Each SLIViT model was initialized with ImageNet weights and then pre-trained on a 2D biomedical-imaging dataset of a different modality. The considered modalities were CT, X-ray, OCT and Mixed (containing all the images from the CT, X-ray and OCT datasets). The performance scores of the domain-specific methods are borrowed from Fig. 2. The expected R² and ROC AUC of a random model are 0 and 0.5, respectively. Box plot whiskers represent a 90% CI.
Fig. 1 | The SLIViT framework.
The input of SLIViT is a 3D volume of N frames of size H × W. (1) The frames of the volume are resized and vertically tiled into an 'elongated image'. (2) The elongated image is fed into a ConvNeXt-based feature-map extractor that was pre-trained on both natural and biomedical 2D labelled images. (3) An 8N × 8 × 768 (3D) feature map is extracted and partitioned into N patches of size 8 × 8 × 768, each (roughly) representing features extracted from the corresponding original frame. (4) The patches are fed into a ViT-based feature-map integrator followed by a fully connected layer that outputs the prediction for the task in question (see Methods for further details).
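A compact PyTorch sketch may help make the shapes in this caption concrete (an illustrative reading of the caption, not the released implementation; mean-pooling each 8 × 8 × 768 patch into a single 768-d token is a simplification of the paper's patch handling):

```python
# Illustrative sketch of the SLIViT forward pass described in Fig. 1;
# module choices and sizes are assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torchvision

class SLIViTSketch(nn.Module):
    def __init__(self, n_outputs=1, dim=768):
        super().__init__()
        # (2) ConvNeXt-based 2D feature-map extractor; here initialized with
        # ImageNet weights only (the paper also pre-trains on 2D biomedical images).
        convnext = torchvision.models.convnext_tiny(weights="IMAGENET1K_V1")
        self.extractor = convnext.features   # (B, 3, h, w) -> (B, 768, h/32, w/32)
        # (4) ViT-style integrator over the N per-frame tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.integrator = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_outputs)

    def forward(self, volume):               # volume: (B, N, H, W), H = W = 256
        B, N, H, W = volume.shape
        # (1) Vertically tile the N frames into one 'elongated image'.
        elongated = volume.reshape(B, 1, N * H, W).repeat(1, 3, 1, 1)
        # (2)-(3) Extract the 8N x 8 x 768 feature map (for H = W = 256).
        fmap = self.extractor(elongated)     # (B, 768, 8N, 8)
        # Partition into N patches of 8 x 8 x 768; mean-pool each into a token.
        tokens = fmap.reshape(B, 768, N, 8, 8).mean(dim=(3, 4)).transpose(1, 2)
        # (4) Integrate the per-frame tokens and predict.
        return self.head(self.integrator(tokens).mean(dim=1))

model = SLIViTSketch()
pred = model(torch.randn(2, 19, 256, 256))   # a 19-frame toy volume -> (2, 1)
```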
Fig. 2 | Overview of SLIViT's performance across 3D imaging modalities.
The performance scores in four different volumetric-biomedical-imaging learning tasks: eye-disease biomarker diagnosis in 3D OCT scans (classification), heart-function analysis in ultrasound (US) videos (regression), liver fat level imputation in volumetric MRI scans (regression) and lung malignant-cell-aggregation screening in 3D CT scans (classification). The domain-specific methods (hatched) are SLIVER-net, EchoNet, 3D ResNet and UniMiSS for OCT, ultrasound, MRI and CT, respectively. The cross-modality benchmarks are 3D ResNet and UniMiSS, which are fully supervised and self-supervised, respectively (see the relevant experiment's section for additional benchmarking). The expected R² and ROC AUC of a random model are 0 and 0.5, respectively. Box plot whiskers represent a 90% CI.
Fig. 3 | Performance comparison on four tasks of AMD-biomarker classification when trained on fewer than 700 OCT volumes.
The ROC AUC scores of SLIViT, SLIVER-net, 3D ResNet, 3D ViT and UniMiSS on four binary classification problems of AMD high-risk factors (DV, IHRF, SDD and hDC) in two independent 3D OCT datasets. Left: the performance when trained and tested on the Houston Dataset (Supplementary Table 1). Right: the performance when trained on the Houston Dataset and tested on the SLIVER-net Dataset (Supplementary Table 2). The expected performance of a random model is 0.5. Box plot whiskers represent a 90% CI.
Fig. 4 | Performance comparison on cardiac-function prediction tasks when trained on echocardiograms.
The R² scores of SLIViT, 3D ResNet, EchoNet and UniMiSS on heart ejection-fraction prediction. Several SLIViT models were trained, each on a different-sized training subset (sampled from the original training set). The x axis shows the sampled subset size used for training, as a percentage of the original training set (100% corresponds to the full set). Box plot whiskers represent a 90% CI. Notably, SLIViT trained on 25% (n = 1,866) of the original training set obtained accuracy similar to that of the other examined methods trained on 100% (n = 7,465) of it.
Fig. 5 | SLIViT's performance compared with manual assessment by retina clinical specialists.
The ROC curves (blue) of SLIViT trained to predict four AMD high-risk biomarkers (DV, IHRF, SDD and hDC) using fewer than 700 OCT volumes (Houston Dataset) and tested on an independent dataset (Pasadena Dataset). In each panel, the light-blue shaded area represents a 90% CI for SLIViT's performance, and the red dot represents the clinical specialists' average performance. The green asterisks correspond to the individual clinical specialists' manual assessments. Two of the clinical specialists obtained the exact same performance score for IHRF classification. TPR, true positive rate; FPR, false positive rate.

