Learning multi-modal representations by watching hundreds of surgical video lectures

Kun Yuan¹, Vinkle Srivastav², Tong Yu³, Joël L Lavanchy⁴, Jacques Marescaux⁵, Pietro Mascagni⁶, Nassir Navab⁷, Nicolas Padoy⁸

Affiliations

¹ University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; CAMP, Technische Universität München, Munich, Germany.
² University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France.
³ University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France.
⁴ IHU Strasbourg, Strasbourg, France; University Digestive Health Care Center - Clarunis, 4002 Basel, Switzerland.
⁵ IRCAD, Strasbourg, France.
⁶ IHU Strasbourg, Strasbourg, France; Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy.
⁷ CAMP, Technische Universität München, Munich, Germany.
⁸ University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France. Electronic address: npadoy@unistra.fr.

PMID: 40513506
DOI: 10.1016/j.media.2025.103644

Free article

Med Image Anal. 2025 Oct:105:103644. doi: 10.1016/j.media.2025.103644. Epub 2025 Jun 4.

Abstract

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively demonstrate the representational capability of the learned joint latent space, we introduce several vision-and-language surgical tasks and evaluate various vision-only tasks specific to surgery, e.g., surgical tool, phase, and triplet recognition. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. 
The code is available at https://github.com/CAMMA-public/SurgVLP.
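
The abstract describes a contrastive objective that aligns each video clip embedding with several corresponding text embeddings in a joint latent space. The following is a minimal sketch of that general idea, assuming a symmetric InfoNCE loss averaged over the text views. It is not taken from the released SurgVLP code; the function names, the temperature value, and the averaging over text views are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(video_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE: row i of video_emb is the positive for row i of text_emb,
    # and every other row in the batch acts as a negative.
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

def multi_text_contrastive_loss(video_emb, text_embs, temperature=0.07):
    # Assumption: each text view (for example, one transcript per speech
    # recognition system) contributes equally to the final loss.
    return float(np.mean([info_nce(video_emb, t, temperature) for t in text_embs]))

# Toy check with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
texts_aligned = [video + 0.01 * rng.normal(size=video.shape) for _ in range(2)]
texts_shuffled = [t[::-1] for t in texts_aligned]  # break the clip/text pairing
print(multi_text_contrastive_loss(video, texts_aligned))
print(multi_text_contrastive_loss(video, texts_shuffled))
```

In this toy setup the correctly paired batch should give a much lower loss than the shuffled one, which is the behavior a contrastive alignment objective is meant to reward.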

Keywords: Multi-modal representation learning; Self-supervision; Surgical video lectures; Vision-and-language.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Cited by

Text-driven adaptation of foundation models for few-shot surgical workflow analysis.
Chen T, Yuan K, Srivastav V, Navab N, Padoy N. Int J Comput Assist Radiol Surg. 2025 Jun;20(6):1175-1183. doi: 10.1007/s11548-025-03341-0. Epub 2025 Apr 17. PMID: 40244318. Free PMC article.

LinkOut - more resources

Full Text Sources
- Elsevier Science
