ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

Authors: Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, Qixiang Ye

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted extensive experiments to evaluate ClawMachine's multimodal understanding ability. ... Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency."
Researcher Affiliation | Collaboration | Tianren Ma, School of EECE, University of Chinese Academy of Sciences; Lingxi Xie, Huawei Inc.; Yunjie Tian, School of EECE, University of Chinese Academy of Sciences; Boyu Yang, Jiutian Team, China Mobile Research Institute; Qixiang Ye, School of EECE, University of Chinese Academy of Sciences
Pseudocode | No | The paper describes the methodology in narrative text and includes architectural diagrams (Figure 2) but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/martian422/ClawMachine"
Open Datasets | Yes | "The first model only utilizes the object annotations from VG, and the RefCOCO series (including RefCOCO, RefCOCO+, RefCOCOg; RefCOCO/+/g for short). ... pre-trained on 10 times larger materials including GRIT-20M (Peng et al., 2023)."
Dataset Splits | Yes | "Table 3: Results on the visual grounding (REC) task. We report accuracy with the IoU threshold 0.5. ... We extract 221, 334, and 445 Q&A pairs from GQA's testdev-balanced set correspondingly (same as the original ratio of these three types in the testdev-balanced set), and yes : no = 1 : 1."
Hardware Specification | Yes | "The training is conducted on 8 NVIDIA A100 GPUs with 80GB memory."
Software Dependencies | No | The paper mentions 'AdamW', 'cosine annealing scheduler', 'FlashAttention-2', and 'DeepSpeed libraries with ZeRO-2' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "The initial learning rate is set to 2e-5 and 1e-5 for the two stages respectively with a warm-up ratio of 0.03. The global batch size remains constant at 256. ... The input image size is set to 224x224 with a patch size P = 14, and the maximum sequence length in the MLLM is 2048."
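The reported experiment setup can be collected into a small reproduction sketch. This is a minimal illustration, not the authors' code: it records the stated hyperparameters, derives the visual token count implied by a 224x224 input with patch size 14, and sketches a warm-up-plus-cosine learning-rate schedule. The schedule's fine points (e.g. minimum learning rate, warm-up shape) are assumptions, since the paper only names a cosine annealing scheduler with a 0.03 warm-up ratio.

```python
import math

# Hyperparameters as quoted from the paper's experiment setup.
CONFIG = {
    "stage1_lr": 2e-5,        # stage-1 initial learning rate
    "stage2_lr": 1e-5,        # stage-2 initial learning rate
    "warmup_ratio": 0.03,     # warm-up fraction of total steps
    "global_batch_size": 256,
    "image_size": 224,        # input images are 224x224
    "patch_size": 14,         # ViT patch size P = 14
    "max_seq_len": 2048,      # maximum MLLM sequence length
}

def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Patches per image for a ViT-style encoder: (H/P) * (W/P)."""
    side = image_size // patch_size
    return side * side

def lr_at_step(step: int, total_steps: int,
               base_lr: float, warmup_ratio: float) -> float:
    """Linear warm-up, then cosine annealing toward zero.

    Hypothetical schedule shape: the paper specifies only 'cosine
    annealing scheduler' and a warm-up ratio of 0.03.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these numbers, each image contributes (224/14)^2 = 256 visual tokens, which fits comfortably within the 2048-token sequence budget alongside text tokens.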