Learning multi-modal representations by watching hundreds of surgical video lectures
- PMID: 40513506
- DOI: 10.1016/j.media.2025.103644
Abstract
Recent advances in surgical computer vision have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, which limits their generalizability to unseen surgical procedures and downstream tasks. In this work, we propose that surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP (Surgical Vision Language Pre-training), for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective that aligns video clip embeddings with their multiple corresponding text embeddings by bringing them together within a joint latent space. To demonstrate the representational capability of the learned joint latent space, we introduce several vision-and-language surgical tasks and evaluate various vision-only tasks specific to surgery, e.g., surgical tool, phase, and triplet recognition. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis: it reduces the reliance on extensive manual annotations for downstream tasks and facilitates adaptation methods such as few-shot learning, enabling scalable and data-efficient solutions for various downstream surgical applications.
The code is available at https://github.com/CAMMA-public/SurgVLP.
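The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a generic CLIP-style symmetric InfoNCE loss, averaged over several text views of the same clips (for example, transcripts from different speech recognition systems). The function names, the temperature value, and the simple averaging over views are all illustrative assumptions; the repository linked above is the reference for the actual method.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale each embedding to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's exact loss).

    Row i of video_emb and row i of text_emb are treated as a matching pair;
    every other row in the batch serves as a negative.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature  # (N, N) cosine similarities, scaled
    diag = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


def multi_text_contrastive_loss(video_emb, text_views, temperature=0.07):
    """Average the symmetric loss over several text views of the same clips.

    Plain averaging is a hypothetical choice made for illustration only.
    """
    losses = [symmetric_info_nce(video_emb, t, temperature) for t in text_views]
    return float(np.mean(losses))
```

In this sketch, `video_emb` has shape `(N, D)` and `text_views` is a list of `(N, D)` arrays, one per transcription source. Minimizing the loss pulls each clip embedding toward all of its text embeddings and pushes it away from the text of other clips in the batch.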
Keywords: Multi-modal representation learning; Self-supervision; Surgical video lectures; Vision-and-language.
Copyright © 2025. Published by Elsevier B.V.
Conflict of interest statement
Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.