A Novel 3D Convolutional Neural Network-Based Deep Learning Model for Spatiotemporal Feature Mapping for Video Analysis: Feasibility Study for Gastrointestinal Endoscopic Video Classification

Mrinal Kanti Dhar¹, Mou Deb², Poonguzhali Elangovan³, Keerthy Gopalakrishnan³, Divyanshi Sood³, Avneet Kaur³, Charmy Parikh³, Swetha Rapolu³, Gianeshwaree Alias Rachna Panjwani³, Rabiah Aslam Ansari³, Naghmeh Asadimanesh³, Shiva Sankari Karuppiah³, Scott A Helgeson³,⁴, Venkata S Akshintala⁵, Shivaram P Arunachalam³,⁴

Affiliations

¹ Department of Radiology, Mayo Clinic, Rochester, MN 55905, USA.
² Department of Biomedical Informatics and Computational Biology, University of Minnesota, Minneapolis, MN 55455, USA.
³ Digital Engineering & Artificial Intelligence Laboratory (DEAL), Mayo Clinic, Jacksonville, FL 32224, USA.
⁴ Department of Critical Care Medicine, Division of Pulmonary Medicine, Mayo Clinic, Jacksonville, FL 32224, USA.
⁵ Division of Gastroenterology & Hepatology, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD 21218, USA.

PMID: 40710629
PMCID: PMC12295846
DOI: 10.3390/jimaging11070243

J Imaging. 2025 Jul 18;11(7):243. doi: 10.3390/jimaging11070243.

Abstract

Accurate analysis of medical videos remains a major challenge in deep learning (DL) due to the need for effective spatiotemporal feature mapping that captures both spatial detail and temporal dynamics. Despite advances in DL, most existing models in medical AI focus on static images, overlooking critical temporal cues present in video data. To bridge this gap, a novel DL-based framework is proposed for spatiotemporal feature extraction from medical video sequences. As a feasibility use case, this study focuses on gastrointestinal (GI) endoscopic video classification. A 3D convolutional neural network (CNN) is developed to classify upper and lower GI endoscopic videos using the HyperKvasir dataset, which contains 314 lower and 60 upper GI videos. To address data imbalance, 60 matched pairs of videos are randomly selected across 20 experimental runs. Videos are resized to 224 × 224, and the 3D CNN captures spatiotemporal information. A 3D version of the parallel spatial and channel squeeze-and-excitation (P-scSE) module, P-scSE3D, is implemented, and a new block called the residual with parallel attention (RPA) block is proposed by combining P-scSE3D with a residual block. To reduce computational complexity, a (2 + 1)D convolution is used in place of full 3D convolution. The model achieves an average accuracy of 0.933, precision of 0.932, recall of 0.944, F1-score of 0.935, and AUC of 0.933. It is also observed that integrating P-scSE3D increases the F1-score by 7%. This preliminary work opens avenues for exploring various GI endoscopic video-based prospective studies.

Keywords: 3D convolutional neural network; P-scSE3D; deep learning; gastrointestinal endoscopic video classification; HyperKvasir dataset; spatiotemporal features.
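
The abstract's remark that a (2 + 1)D convolution reduces computational complexity relative to full 3D convolution can be illustrated by counting weights. The sketch below is not the authors' implementation: the channel counts, kernel sizes, and the choice of intermediate channel width (`c_mid`) are assumptions made only for illustration, and bias terms are ignored. It compares one full 3D kernel with a factorized pair consisting of a spatial convolution (1 × k × k) followed by a temporal convolution (t × 1 × 1).

```python
def conv3d_params(c_in: int, c_out: int, k_t: int, k_s: int) -> int:
    """Weights in a full 3D convolution with a k_t x k_s x k_s kernel (no bias)."""
    return c_in * c_out * k_t * k_s * k_s


def conv2plus1d_params(c_in: int, c_out: int, k_t: int, k_s: int, c_mid: int) -> int:
    """Weights in a (2+1)D factorization (no bias).

    Spatial step:  1 x k_s x k_s kernel mapping c_in  -> c_mid channels.
    Temporal step: k_t x 1 x 1 kernel mapping c_mid -> c_out channels.
    c_mid is a free design choice; its value here is an assumption.
    """
    spatial = c_in * c_mid * k_s * k_s
    temporal = c_mid * c_out * k_t
    return spatial + temporal


# Hypothetical layer sizes, chosen only for illustration.
full = conv3d_params(c_in=64, c_out=64, k_t=3, k_s=3)
factored = conv2plus1d_params(c_in=64, c_out=64, k_t=3, k_s=3, c_mid=64)

print(full)      # 110592
print(factored)  # 49152
```

With these assumed sizes the factorized layer needs fewer than half the weights of the full 3D layer. The saving depends on `c_mid`: a larger intermediate width narrows or removes the gap, so the figure above should not be read as the reduction achieved in the paper.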

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**Figure 1**
Proposed model. (a) Model architecture, (b) residual with parallel attention (RPA) block, which is the core of the model, and (c) residual block.

**Figure 2**
Parallel spatial and channel squeeze-and-excitation (P-scSE) module [47,50].

**Figure 3**
Representative examples of (a,b) upper GI image frames and (c,d) lower GI image frames from endoscopic videos.

**Figure 4**
Video segment generation. (a) Original video; (b) video split into segments, with zero-padded frames added to keep the segment size fixed; (c) the structure of a segment, in which a frame gap is used to skip some frames.

**Figure 5**
(**top**) ROC curves and (**bottom**) confusion matrices for different test accuracies.

**Figure 6**
Loss and accuracy curves for training and validation for different test accuracies.

**Figure 7**
Explainable AI (XAI) using Guided Grad-CAM. The 1st and 3rd rows show the original video frames; the 2nd and 4th rows show the heatmap blended onto them. Yellow regions indicate where the model focuses most for classification.
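
The segment generation described in the Figure 4 caption (fixed-size segments, zero-padded frames, and a frame gap) could be sketched roughly as follows. This is one possible reading of the caption, not the authors' code: the function name `make_segments`, the parameter names, and the interpretation of the frame gap as "number of frames skipped between kept frames" are all assumptions. Frames are represented by their indices, and `PAD` stands in for an all-zero frame.

```python
from typing import List, Optional

PAD = None  # placeholder for a zero-padded (all-zero) frame


def make_segments(
    num_frames: int, segment_len: int, frame_gap: int
) -> List[List[Optional[int]]]:
    """Split a video into fixed-length segments of frame indices.

    Assumed behaviour: keep one frame, skip `frame_gap` frames, repeat;
    then group the kept frames into segments of `segment_len`, padding
    the final segment with PAD so every segment has the same size.
    """
    if segment_len < 1 or frame_gap < 0 or num_frames < 0:
        raise ValueError("segment_len must be >= 1; frame_gap and num_frames >= 0")

    step = frame_gap + 1
    kept = list(range(0, num_frames, step))

    segments: List[List[Optional[int]]] = []
    for start in range(0, len(kept), segment_len):
        segment: List[Optional[int]] = list(kept[start:start + segment_len])
        # Pad the last segment so its length matches the others.
        segment += [PAD] * (segment_len - len(segment))
        segments.append(segment)
    return segments


# Hypothetical example: a 20-frame video, 4 frames per segment, gap of 2.
segments = make_segments(num_frames=20, segment_len=4, frame_gap=2)
print(segments)  # [[0, 3, 6, 9], [12, 15, 18, None]]
```

In a real pipeline each index would be replaced by the decoded frame resized to 224 × 224, and each `PAD` slot by an array of zeros of the same shape, so that every segment can be stacked into a tensor of identical dimensions.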
