Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised Data
Authors: Jiajie Li, Brian Quaranto, Chenhui Xu, Ishan Mishra, Ruiyang Qin, Dancheng Liu, Peter Kim, Jinjun Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that RASO achieves improvements of 2.9 mAP, 4.5 mAP, 10.6 mAP, and 7.2 mAP on four standard surgical benchmarks respectively in zero-shot settings, and surpasses state-of-the-art models in supervised surgical action recognition tasks. |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, University at Buffalo; 2 Department of Surgery, University at Buffalo; 3 Department of Computer Science and Engineering, IIT Jodhpur; 4 Department of Computer Science and Engineering, University of Notre Dame |
| Pseudocode | No | The paper describes the architecture and training process in prose and uses diagrams (e.g., Figure 2) but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | By open-sourcing our code, model, and dataset, we aim to drive further research, bridging the gap between recognition and segmentation in surgical imaging applications. |
| Open Datasets | Yes | We evaluated the RASO model on several well-established surgical datasets. The GraSP (Ayobi et al., 2024) dataset... CholecT50 (Nwoye et al., 2022)... Cholec80 (Twinanda et al., 2016)... EndoVis18 (Allan et al., 2020)... |
| Dataset Splits | Yes | For the supervised task, we finetune the model with the training split of the CholecT50 dataset. |
| Hardware Specification | Yes | We train all the models on 8 NVIDIA A6000 GPUs. We evaluate the latency on one NVIDIA A6000 GPU. |
| Software Dependencies | Yes | We utilized the large-v2 version of WhisperX for generating transcriptions, followed by data filtering with gpt-3.5-turbo-0125. For additional annotation, we employed gpt-4o. ... We initialize the image encoder using swin-large weights of the Swin-Transformer, with an input image size of 384×384. We used the text encoder of CLIP version ViT-B/16 for tag embeddings. |
| Experiment Setup | Yes | During pretraining, the weight decay was set to 0.05, with an initial learning rate of 1e-4, a minimum learning rate of 5e-7, and a learning rate decay rate of 0.9. The warmup learning rate was 5e-7, and the warmup steps were set to 3000. Pretraining was conducted for a maximum of 10 epochs with a batch size of 26 per device. For fine-tuning, the weight decay remained at 0.05, the initial learning rate was set to 5e-6, and the minimum learning rate was 0. The fine-tuning process lasted for 4 epochs, with a batch size of 26 per device. |
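The pretraining schedule quoted above (warmup from 5e-7 to 1e-4 over 3000 steps, a decay rate of 0.9, and a floor of 5e-7) can be sketched as a small learning-rate function. This is only an illustration of one plausible reading: the paper does not state the exact schedule shape, so the linear warmup and the per-epoch multiplicative decay below are assumptions.

```python
def learning_rate(step, epoch,
                  init_lr=1e-4, min_lr=5e-7,
                  warmup_lr=5e-7, warmup_steps=3000,
                  decay_rate=0.9):
    """Sketch of the reported pretraining LR schedule.

    Assumes linear warmup from warmup_lr to init_lr over warmup_steps,
    then per-epoch multiplicative decay (decay_rate ** epoch) floored
    at min_lr. The true schedule shape is not specified in the paper.
    """
    if step < warmup_steps:
        # Linear interpolation between the warmup LR and the initial LR.
        return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps
    # After warmup: decay once per epoch, never below the minimum LR.
    return max(min_lr, init_lr * decay_rate ** epoch)
```

For the fine-tuning stage the same function would be reused with `init_lr=5e-6` and `min_lr=0`, per the quoted setup.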