Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised Data
Authors: Jiajie Li, Brian Quaranto, Chenhui Xu, Ishan Mishra, Ruiyang Qin, Dancheng Liu, Peter Kim, Jinjun Xiong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that RASO achieves improvements of 2.9 mAP, 4.5 mAP, 10.6 mAP, and 7.2 mAP on four standard surgical benchmarks respectively in zero-shot settings, and surpasses state-of-the-art models in supervised surgical action recognition tasks. |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, University at Buffalo; 2 Department of Surgery, University at Buffalo; 3 Department of Computer Science and Engineering, IIT Jodhpur; 4 Department of Computer Science and Engineering, University of Notre Dame |
| Pseudocode | No | The paper describes the architecture and training process in prose and uses diagrams (e.g., Figure 2) but does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | By open-sourcing our code, model, and dataset, we aim to drive further research, bridging the gap between recognition and segmentation in surgical imaging applications. |
| Open Datasets | Yes | We evaluated the RASO model on several well-established surgical datasets. The GraSP (Ayobi et al., 2024) dataset... CholecT50 (Nwoye et al., 2022)... Cholec80 (Twinanda et al., 2016)... EndoVis18 (Allan et al., 2020)... |
| Dataset Splits | Yes | For the supervised task, we finetune the model with the training split of the CholecT50 dataset. |
| Hardware Specification | Yes | We train all the models on 8 NVIDIA A6000 GPUs. We evaluate the latency on one NVIDIA A6000 GPU. |
| Software Dependencies | Yes | We utilized the large-v2 version of WhisperX for generating transcriptions, followed by data filtering with gpt-3.5-turbo-0125. For additional annotation, we employed gpt-4o. ... We initialize the image encoder using swin-large weights of the Swin-Transformer, with an input image size of 384×384. We used the text encoder of CLIP version ViT-B/16 for tag embeddings. |
| Experiment Setup | Yes | During pretraining, the weight decay was set to 0.05, with an initial learning rate of 1e-4, a minimum learning rate of 5e-7, and a learning rate decay rate of 0.9. The warmup learning rate was 5e-7, and the warmup steps were set to 3000. Pretraining was conducted for a maximum of 10 epochs with a batch size of 26 per device. For fine-tuning, the weight decay remained at 0.05, the initial learning rate was set to 5e-6, and the minimum learning rate was 0. The fine-tuning process lasted for 4 epochs, with a batch size of 26 per device. |
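The pretraining schedule quoted above (warmup from 5e-7 to 1e-4 over 3000 steps, a decay rate of 0.9, and a floor of 5e-7) can be sketched as a small learning-rate function. This is only an illustration of one plausible reading: the paper does not state the exact schedule shape, so the linear warmup and the per-epoch multiplicative decay below are assumptions.

```python
def learning_rate(step, epoch,
                  init_lr=1e-4, min_lr=5e-7,
                  warmup_lr=5e-7, warmup_steps=3000,
                  decay_rate=0.9):
    """Sketch of the reported pretraining LR schedule.

    Assumes linear warmup from warmup_lr to init_lr over warmup_steps,
    then per-epoch multiplicative decay (decay_rate ** epoch) floored
    at min_lr. The true schedule shape is not specified in the paper.
    """
    if step < warmup_steps:
        # Linear interpolation between the warmup LR and the initial LR.
        return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps
    # After warmup: decay once per epoch, never below the minimum LR.
    return max(min_lr, init_lr * decay_rate ** epoch)
```

For the fine-tuning stage the same function would be reused with `init_lr=5e-6` and `min_lr=0`, per the quoted setup.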