BAE-ViT: An Efficient Multimodal Vision Transformer for Bone Age Estimation

Jinnian Zhang et al. Tomography. 2024 Dec 13;10(12):2058–2072. doi: 10.3390/tomography10120146.

Abstract

This research introduces BAE-ViT, a specialized vision transformer model developed for bone age estimation (BAE). The model is designed to efficiently merge image and sex data, a capability not present in traditional convolutional neural networks (CNNs). BAE-ViT employs a novel data fusion method that enables detailed interactions between visual and non-visual data: non-visual information is tokenized, and all tokens (visual or non-visual) are concatenated as the input to the model. The model was trained on a large-scale dataset from the 2017 RSNA Pediatric Bone Age Machine Learning Challenge, where it performed strongly, particularly in handling image distortions, compared to existing models. The effectiveness of BAE-ViT was further affirmed through statistical analysis, which demonstrated a strong correlation with the ground-truth labels. This study showcases vision transformers as a viable option for integrating multimodal data in medical imaging, specifically their capacity to incorporate non-visual elements such as sex information. The tokenization method not only performs well in this specific task but also offers a versatile framework for multimodal fusion in other medical imaging applications.
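
As a concrete illustration of the tokenization-and-concatenation idea described above, the following is a minimal PyTorch sketch; every name and dimension here is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenFusionSketch(nn.Module):
    """Tokenize non-visual data (sex) and concatenate it with image patch
    tokens, so the two modalities interact through attention."""
    def __init__(self, embed_dim=192, num_heads=4):
        super().__init__()
        # Visual tokens: 16x16 image patches projected to the embedding width.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Non-visual token: a scalar sex indicator projected by a linear layer.
        self.sex_embed = nn.Linear(1, embed_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                       batch_first=True),
            num_layers=2)
        self.head = nn.Linear(embed_dim, 1)  # bone age regression head

    def forward(self, image, sex):
        # image: (B, 3, 224, 224); sex: (B, 1) coded 0/1
        tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, D)
        sex_token = self.sex_embed(sex).unsqueeze(1)                 # (B, 1, D)
        x = torch.cat([sex_token, tokens], dim=1)  # all tokens, visual or not
        x = self.encoder(x)                        # attention mixes modalities
        return self.head(x.mean(dim=1))            # predicted bone age (months)

model = TokenFusionSketch()
age = model(torch.rand(2, 3, 224, 224), torch.tensor([[0.0], [1.0]]))  # (2, 1)
```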

Keywords: bone age regression; gender embedding; machine learning; multimodal data; vision transformer.


Conflict of interest statement

Professor McMillan maintains a consulting position with Weinberg Medical Physics, LLC; however, no resources from this company were utilized in our study, and no financial ties exist between any author and the firm.

Figures

Figure 1
A comparison of architectures between the regression, ensemble, and embedding models. The red area indicates the intensive feature learning phase, which requires significant computational resources, while the green area indicates the multimodal feature fusion phase, where image features are integrated with non-visual features such as sex. (a) Regression model: this model takes only images as inputs, without using biological sex information. (b) Ensemble model: two branches encode the image and sex information into feature vectors separately. These features are concatenated and fed into linear layers for bone age estimation; the biological sex information is integrated after the image encoder and processed only by the linear layers. (c) Embedding model (proposed): our proposed multimodal vision transformer converts both the image and sex information into tokens. These tokens interact in the transformer blocks through attention mechanisms and are finally projected through a linear layer for predictions.
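
For contrast with the proposed embedding model, a hedged PyTorch sketch of the late-fusion ensemble design in panel (b) might look like the following; the encoder, dimensions, and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LateFusionSketch(nn.Module):
    """Panel (b) style: sex enters only after the image encoder, through
    linear layers, rather than interacting with patch tokens."""
    def __init__(self, image_encoder, feat_dim, sex_dim=32):
        super().__init__()
        self.image_encoder = image_encoder       # any backbone -> (B, feat_dim)
        self.sex_branch = nn.Linear(1, sex_dim)  # encodes sex separately
        self.head = nn.Sequential(
            nn.Linear(feat_dim + sex_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, image, sex):
        img_feat = self.image_encoder(image)            # (B, feat_dim)
        sex_feat = self.sex_branch(sex)                 # (B, sex_dim)
        fused = torch.cat([img_feat, sex_feat], dim=1)  # late concatenation
        return self.head(fused)                         # bone age prediction

# Toy encoder so the sketch runs end to end:
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = LateFusionSketch(encoder, feat_dim=16)
```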
Figure 2
BAE-ViT architecture with biological sex embedding. This diagram illustrates the three-stage design of the BAE-ViT architecture. In the feature fusion phase (green), the model uses patch embedding, convolutions, and MBConv blocks, followed by patch merging to create token sequences. The biological sex information is tokenized through a linear layer and processed by the transformer alongside the visual patch tokens. In the feature learning phase (red), stages 1, 2, and 3 apply transformer blocks with shifted-window attention, processing features at varying scales. The architecture is designed for efficient feature extraction and integrates sex information, facilitating enhanced estimation performance.
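
A schematic PyTorch sketch of this staged layout follows; plain transformer attention stands in for shifted-window attention, MBConv blocks and the token-count reduction of patch merging are omitted, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StagedSketch(nn.Module):
    """Three-stage skeleton: a convolutional stem plus a sex token (fusion
    phase), then three transformer stages at growing widths (learning phase)."""
    def __init__(self, dims=(96, 192, 384)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=8, stride=8)  # patch embed
        self.sex_embed = nn.Linear(1, dims[0])  # sex token via a linear layer
        self.stages = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=d, nhead=d // 48, batch_first=True)
            for d in dims)
        # Stand-in for patch merging: widen channels between stages.
        self.merge = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1))
        self.head = nn.Linear(dims[-1], 1)

    def forward(self, image, sex):
        tokens = self.stem(image).flatten(2).transpose(1, 2)  # (B, N, 96)
        x = torch.cat([self.sex_embed(sex).unsqueeze(1), tokens], dim=1)
        for i, block in enumerate(self.stages):
            x = block(x)
            if i < len(self.merge):
                x = self.merge[i](x)
        return self.head(x.mean(dim=1))
```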
Figure 3
ScoreCAM heatmaps of our proposed BAE-ViT and of ensemble models using CNNs or TinyViT as the image encoder. The left four columns show male subjects with bone ages of 34.2, 89.9, 149.1, and 202.3 months, respectively; the right four columns show female subjects with bone ages of 75.0, 87.5, 118.2, and 162.0 months, respectively. The heatmaps tend to highlight joints within the fingers and hand.
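
Heatmaps like these can be produced with the ScoreCAM method; the sketch below assumes the third-party pytorch-grad-cam package and a stand-in ResNet50 backbone, since the exact target layer is model-specific.

```python
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import ScoreCAM  # pip install grad-cam (assumption)

model = resnet50(weights=None).eval()      # stand-in backbone for illustration
target_layers = [model.layer4[-1]]         # assumed CAM target: last residual block
input_tensor = torch.rand(1, 3, 224, 224)  # a preprocessed hand radiograph
with ScoreCAM(model=model, target_layers=target_layers) as cam:
    heatmap = cam(input_tensor=input_tensor)[0]  # (224, 224) map scaled to [0, 1]
```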
Figure 4
Performance comparison between our proposed BAE-ViT model and the ResNet50 ensemble model. (a) Correlation between actual and predicted bone ages, demonstrating high correlation for both models. (b) Mean bias and standard deviation of bias for the models, with dot-dashed lines representing mean bias and dashed lines indicating the standard deviation. The mean bias, defined as the signed average error, indicates whether the model's predictions are consistently higher or lower than the true values. Both models exhibit mean bias values close to zero (−0.66 for BAE-ViT and −0.70 for ResNet50), suggesting no significant overestimation or underestimation. BAE-ViT shows a slightly smaller mean bias (−0.66 vs. −0.70) and a lower standard deviation of bias (5.40 vs. 5.50) than the ResNet50 ensemble model, indicating superior accuracy and consistency in its predictions.
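
The bias statistics quoted here are straightforward to compute; a minimal sketch, with illustrative values rather than the paper's data:

```python
import numpy as np

def bias_stats(predicted, actual):
    """Return (mean bias, standard deviation of bias); bias is the signed error."""
    bias = np.asarray(predicted) - np.asarray(actual)
    return bias.mean(), bias.std()

# Illustrative values only, not the paper's data:
mean_bias, sd_bias = bias_stats([100.2, 55.1, 80.4], [101.0, 54.0, 82.1])
```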
Figure 5
ResNet50 training and testing loss comparison. This figure presents the loss curves for a ResNet50 model under different training conditions. Lighter solid lines show the training loss, and darker dashed lines represent the testing loss. The graph compares end-to-end training with pretraining on the ImageNet-1k and RSNA datasets. The lower testing loss across all methods could imply a limited diversity in the testing dataset compared to the training dataset.
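
The pretraining conditions compared here can be set up along these lines; a hedged sketch using torchvision's ResNet50, where the RSNA-pretraining path would load its own checkpoint:

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# ImageNet-1k pretraining condition: load pretrained weights, then fine-tune.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 1)  # replace the classifier with a
                                               # single-output regression head
# End-to-end training would instead start from resnet50(weights=None).
```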

