ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

Authors: Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, Qixiang Ye

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted extensive experiments to evaluate ClawMachine's multimodal understanding ability. ... Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency."
Researcher Affiliation | Collaboration | Tianren Ma, School of EECE, University of Chinese Academy of Sciences; Lingxi Xie, Huawei Inc.; Yunjie Tian, School of EECE, University of Chinese Academy of Sciences; Boyu Yang, Jiutian Team, China Mobile Research Institute; Qixiang Ye, School of EECE, University of Chinese Academy of Sciences
Pseudocode | No | The paper describes the methodology in narrative text and includes architectural diagrams (Figure 2) but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is available at https://github.com/martian422/ClawMachine"
Open Datasets | Yes | "The first model only utilizes the object annotations from VG, and the RefCOCO series (including RefCOCO, RefCOCO+, RefCOCOg; RefCOCO/+/g for short). ... pre-trained on 10 times larger materials including GRIT-20M (Peng et al., 2023)."
Dataset Splits | Yes | "Table 3: Results on the visual grounding (REC) task. We report accuracy with the IoU threshold 0.5. ... We extract 221, 334, and 445 Q&A pairs from GQA's testdev-balanced set correspondingly (same as the original ratio of these three types in the testdev-balanced set), and yes : no = 1 : 1."
Hardware Specification | Yes | "The training is conducted on 8 NVIDIA A100 GPUs with 80GB memory."
Software Dependencies | No | The paper mentions 'AdamW', 'cosine annealing scheduler', 'FlashAttention-2', and 'DeepSpeed libraries with ZeRO-2' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "The initial learning rate is set to 2e-5 and 1e-5 for the two stages respectively with a warm-up ratio of 0.03. The global batch size remains constant at 256. ... The input image size is set to 224x224 with a patch size P = 14, and the maximum sequence length in the MLLM is 2048."
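The reported experiment setup can be collected into a small reproduction sketch. This is a minimal illustration, not the authors' code: it records the stated hyperparameters, derives the visual token count implied by a 224x224 input with patch size 14, and sketches a warm-up-plus-cosine learning-rate schedule. The schedule's fine points (e.g. minimum learning rate, warm-up shape) are assumptions, since the paper only names a cosine annealing scheduler with a 0.03 warm-up ratio.

```python
import math

# Hyperparameters as quoted from the paper's experiment setup.
CONFIG = {
    "stage1_lr": 2e-5,        # stage-1 initial learning rate
    "stage2_lr": 1e-5,        # stage-2 initial learning rate
    "warmup_ratio": 0.03,     # warm-up fraction of total steps
    "global_batch_size": 256,
    "image_size": 224,        # input images are 224x224
    "patch_size": 14,         # ViT patch size P = 14
    "max_seq_len": 2048,      # maximum MLLM sequence length
}

def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Patches per image for a ViT-style encoder: (H/P) * (W/P)."""
    side = image_size // patch_size
    return side * side

def lr_at_step(step: int, total_steps: int,
               base_lr: float, warmup_ratio: float) -> float:
    """Linear warm-up, then cosine annealing toward zero.

    Hypothetical schedule shape: the paper specifies only 'cosine
    annealing scheduler' and a warm-up ratio of 0.03.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these numbers, each image contributes (224/14)^2 = 256 visual tokens, which fits comfortably within the 2048-token sequence budget alongside text tokens.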