Reconstructive Visual Instruction Tuning

Authors: Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. Compared with state-of-the-art alternatives that rely on extrinsic assistance by aggregating multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of the vision-centric supervision tailored for visual outputs. Sections 5 and 5.1 are titled 'Experiments' and 'Ablation Study', respectively.
Researcher Affiliation | Collaboration | (1) New Laboratory of Pattern Recognition, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) University of Hong Kong; (4) MEGVII Technology; (5) StepFun. The affiliations include universities (Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; University of Hong Kong) and companies (MEGVII Technology; StepFun), indicating a collaboration.
Pseudocode | No | The paper describes the methodology using natural language and mathematical equations in sections such as 'ROSS: RECONSTRUCTIVE VISUAL INSTRUCTION TUNING' and 'LATENT DIFFUSION MODELS', but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper provides a project page (https://haochen-wang409.github.io/ross). However, it does not explicitly state that the source code for the described methodology is released there, and the link is not a direct link to a code repository. Per the instructions, a high-level project overview page, rather than a specific code repository, does not qualify.
Open Datasets | Yes | The paper utilizes and cites numerous established datasets, including LLaVA-558K (Liu et al., 2023a), Cambrian-737K (Tong et al., 2024a), POPE (Li et al., 2023c), HallusionBench (Guan et al., 2024), MMVP (Tong et al., 2024b), ChartQA (Masry et al., 2022), MMBench (Liu et al., 2023b), ImageNet-1K (Deng et al., 2009), and SpatialBench (Cai et al., 2024), among others. These are all publicly available and formally cited.
Dataset Splits | No | The paper states: 'The training data is LLaVA-558K (Liu et al., 2023a) and Cambrian-737K (Tong et al., 2024a) for the pre-training stage and the instruction tuning stage, respectively.' While it mentions total sample counts for training data and refers to the 'English dev split' and 'validation split' of various evaluation benchmarks, it does not explicitly detail the training/validation/test splits of its *own* training data (LLaVA-558K and Cambrian-737K) that would be needed to reproduce the experiments for ROSS itself.
Hardware Specification | Yes | Evaluations are conducted using 8 A100 GPUs with a global batch size of 128.
Software Dependencies | No | The paper mentions 'DeepSpeed ZeRO stage 2/3' and 'Optimizer AdamW' in Table 6, but it does not provide version numbers for software libraries or packages such as Python, PyTorch, or CUDA, which would be necessary for full reproducibility. It also states: 'We obtain most of the configurations from LLaVA-v1.5 (Liu et al., 2024a)'.
Experiment Setup | Yes | Table 6 (Hyperparameters of ROSS) provides specific details: global batch size 256 (Stage I) / 128 (Stage II), batch size per GPU 16 / 4, gradient accumulation steps 2 / 4, learning rate 1e-3 / 2e-5, learning rate schedule warmup + cosine decay, warmup ratio 0.03, weight decay 0, 1 epoch, and bf16 precision.
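For readers trying to reproduce the setup, the reported training configuration (DeepSpeed, AdamW, bf16, Stage II values from Table 6) could be sketched as a DeepSpeed-style config dict. This is a hypothetical mapping onto DeepSpeed's standard JSON schema, not a configuration released by the authors; in particular, the choice of ZeRO stage 2 (rather than 3) is an assumption, since the paper only notes 'stage 2/3':

```python
# Hypothetical DeepSpeed config matching the reported Stage II setup
# (8 A100 GPUs, global batch 128, AdamW, bf16); not from released code.
deepspeed_config = {
    "train_batch_size": 128,              # global batch size (Stage II)
    "train_micro_batch_size_per_gpu": 4,  # per-GPU batch size (Stage II)
    "gradient_accumulation_steps": 4,     # accumulated steps (Stage II)
    "zero_optimization": {"stage": 2},    # paper reports ZeRO stage 2/3
    "bf16": {"enabled": True},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2e-5, "weight_decay": 0.0},
    },
}

# Sanity check: 8 GPUs x 4 per-GPU x 4 accumulation steps = 128 global.
assert (8
        * deepspeed_config["train_micro_batch_size_per_gpu"]
        * deepspeed_config["gradient_accumulation_steps"]
        == deepspeed_config["train_batch_size"])
```

The sanity check makes the internal consistency of Table 6 explicit: the global batch size factors exactly into GPUs times per-GPU batch times accumulation steps.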
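The 'warmup + cosine decay' schedule with warmup ratio 0.03 can be sketched as a small function. This is a common formulation of that schedule (linear warmup to the peak rate, then cosine decay to zero); the paper does not specify the exact implementation, so treat it as an illustrative sketch:

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup to peak_lr over warmup_ratio of training,
    then cosine decay to 0 -- a common reading of Table 6's
    'warmup + cosine decay' schedule."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with Stage II's peak learning rate of 2e-5 and 1000 total steps, the rate ramps up over the first 30 steps (3%) and then decays smoothly to zero by the final step.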