Med Image Anal. 2025 Oct:105:103644.
doi: 10.1016/j.media.2025.103644. Epub 2025 Jun 4.

Learning multi-modal representations by watching hundreds of surgical video lectures

Free article

Kun Yuan et al. Med Image Anal. 2025 Oct.

Abstract

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively demonstrate the representational capability of the learned joint latent space, we introduce several vision-and-language surgical tasks and evaluate various vision-only tasks specific to surgery, e.g., surgical tool, phase, and triplet recognition. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. 
The code is available at https://github.com/CAMMA-public/SurgVLP.
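
The contrastive alignment described in the abstract can be illustrated with a minimal sketch. This is not the SurgVLP implementation (see the repository above for that): it assumes a standard symmetric InfoNCE loss, averaged over several text views of the same clips, such as transcripts from different speech recognition systems. The function names, the temperature value of 0.07, and the choice to average over text views are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each array is assumed to be a matching pair."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    # Video-to-text: each clip should pick its own text among all texts in the batch.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: each text should pick its own clip among all clips in the batch.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)


def multi_text_contrastive_loss(video_emb, text_embs, temperature=0.07):
    """Average the symmetric loss over several text views of the same clips.

    Averaging is an illustrative assumption, not the paper's stated objective.
    """
    return float(np.mean([info_nce(video_emb, t, temperature) for t in text_embs]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))
    # Two hypothetical text views that are noisy copies of the clip embeddings.
    views = [video + 0.01 * rng.normal(size=video.shape) for _ in range(2)]
    print(multi_text_contrastive_loss(video, views))
```

In this sketch, well-aligned clip and text embeddings yield a small loss, while mismatched pairs in the batch are pushed apart within the joint latent space.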

Keywords: Multi-modal representation learning; Self-supervision; Surgical video lectures; Vision-and-language.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
