Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis

Demir K, Lojo Rodríguez MB, Weise T, Maier A, Yang SH (2024)


Publication Language: English

Publication Type: Conference contribution

Publication year: 2024

Publisher: International Speech Communication Association

Pages Range: 1465-1469

Conference Proceedings Title: Interspeech 2024

Event location: Kos Island, Greece

DOI: 10.21437/Interspeech.2024-975

Abstract

To develop intelligent speech assistants and integrate them seamlessly with intra-operative decision-support frameworks, accurate and efficient surgical phase recognition is a prerequisite. In this study, we propose a multimodal framework based on Gated Multimodal Units (GMU) and Multi-Stage Temporal Convolutional Networks (MS-TCN) to recognize surgical phases of port-catheter placement operations. Our method merges speech and image models and uses them separately in different surgical phases. Based on the evaluation of 28 operations, we report a frame-wise accuracy of 92.65 ± 3.52% and an F1-score of 92.30 ± 3.82%. Our results show approximately 10% improvement in both metrics over previous work and validate the effectiveness of integrating multimodal data for the surgical phase recognition task. We further investigate the contribution of individual data channels by comparing mono-modal models with multimodal models.
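The fusion step named in the abstract, a Gated Multimodal Unit combining speech and image features, can be sketched as below. This follows the standard GMU formulation (tanh projections per modality, a sigmoid gate over the concatenated inputs); all dimensions and weights are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gmu_fuse(x_img, x_sp, W_img, W_sp, W_z):
    """Gated Multimodal Unit: project each modality with tanh, then let a
    sigmoid gate (computed from both inputs) decide, per hidden dimension,
    how much the image vs. speech representation contributes."""
    h_img = np.tanh(W_img @ x_img)  # image-feature projection
    h_sp = np.tanh(W_sp @ x_sp)     # speech-feature projection
    gate_in = np.concatenate([x_img, x_sp])
    z = 1.0 / (1.0 + np.exp(-(W_z @ gate_in)))  # gate values in (0, 1)
    return z * h_img + (1.0 - z) * h_sp  # convex per-dimension mixture

# Toy dimensions (hypothetical; the paper does not state them here)
d_img, d_sp, d_h = 8, 6, 4
rng = np.random.default_rng(0)
W_img = rng.standard_normal((d_h, d_img))
W_sp = rng.standard_normal((d_h, d_sp))
W_z = rng.standard_normal((d_h, d_img + d_sp))

fused = gmu_fuse(rng.standard_normal(d_img), rng.standard_normal(d_sp),
                 W_img, W_sp, W_z)
print(fused.shape)  # one fused feature vector of size d_h
```

Because each fused dimension is a convex combination of two tanh outputs, every entry of the result stays strictly inside (-1, 1); in the full system such fused features would feed the MS-TCN for temporal phase segmentation.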


How to cite

APA:

Demir, K., Lojo Rodríguez, M.B., Weise, T., Maier, A., & Yang, S.H. (2024). Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis. In Interspeech 2024 (pp. 1465-1469). Kos Island, GR: International Speech Communication Association.

MLA:

Demir, Kubilay, et al. "Towards Intelligent Speech Assistants in Operating Rooms: A Multimodal Model for Surgical Workflow Analysis." Proceedings of the 25th Interspeech Conference 2024, Kos Island, International Speech Communication Association, 2024, pp. 1465-1469.
