Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
Authors: Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, He-Yang Xu, Yazhou Yao, Errui Ding
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the model's zero-shot capabilities on unseen instructions and unseen vision tasks through different experimental settings. ... Training is conducted on 64 A100 GPUs, with batch size for each GPU set as 8 over 2 epochs (around 47k iterations), totaling 5,400 GPU hours. Image resolution equals 448×448. ... Table 1. Comparisons with task-specific / vision generalist baselines across four representative tasks. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Nanjing University of Science and Technology, China. 2School of Computer Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, China. 3Baidu VIS. |
| Pseudocode | No | The paper describes the methodology in prose and with a framework diagram (Figure 4) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code and dataset have been openly available on our GitHub repository. |
| Open Datasets | Yes | Code and dataset have been openly available on our GitHub repository. ... We also collect Object Detection data from the LVIS (Gupta et al., 2019) dataset, Style Transfer data from CSGO (Xing et al., 2024) and prepare data for dense image prediction tasks including Depth Estimation, Surface Normal Estimation, Pose Estimation, and Semantic Segmentation from ADE20K (Zhou et al., 2017b), Depth in the Wild (Chen et al., 2016) and randomly selected data from the next two components. Additionally, for controllable generation tasks, we incorporate Holistically-Nested Edge (HED) Boundary to Image and inverse dense image prediction tasks (e.g., Pose-to-Image, Depth-to-Image, Segmentation-to-Image). |
| Dataset Splits | Yes | For quick validation, we utilize a subset of the DECVT training data: 30% data from the Explanatory-based Vision Tasks component (50% image editing data while the remaining 50% data from more visual-related image pairs) and 20% data from the remaining portion of the Terminological-based Vision Tasks component, with each subset containing approximately 0.5 million bidirectional pairs of ⟨image, explanatory instruction, image⟩ triplets (1.5 million bidirectional pairs of triplets in total). ... We conduct Semantic Segmentation experiments on the ADE20K validation set (Zhou et al., 2017b), and Depth Estimation and Surface Normal Estimation experiments on the NYU-Depth V2 dataset (Silberman et al., 2012). |
| Hardware Specification | Yes | Training is conducted on 64 A100 GPUs, with batch size for each GPU set as 8 over 2 epochs (around 47k iterations), totaling 5,400 GPU hours. ... Training for the 7B AR-based VLM model is conducted on 8 A100 GPUs with batch size set as 128 over 2 epochs (around 46k iterations), totaling 1,340 GPU hours. |
| Software Dependencies | No | The paper mentions using a 'byte pair encoding tokenizer' and refers to models like 'Chameleon' and 'Lumina-mGPT' but does not specify software library versions (e.g., PyTorch 1.x, Python 3.x) or specific solver versions. |
| Experiment Setup | Yes | For all the experiments, we employ the AdamW (Loshchilov, 2017) optimizer with a weight decay of 0.01 and betas set to (0.9, 0.95). The learning rate is configured at 4×10⁻⁵, and the z-loss is applied with a weight of 10⁻⁵. ... Image resolution equals 448×448. |
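The hardware and setup rows above imply a few derived quantities (global batch size, approximate wall-clock time) that are useful when budgeting a reproduction. A minimal Python sketch, assuming the GPU counts, per-GPU batch sizes, and GPU-hour totals are exactly as reported (the `training_summary` helper and `optimizer_config` dict are illustrative names, not from the paper):

```python
# Sanity-check sketch of the reported training scale.
# All input figures come from the table above; everything else is derived arithmetic.

def training_summary(num_gpus, per_gpu_batch, gpu_hours):
    """Derive the global batch size and approximate wall-clock hours."""
    global_batch = num_gpus * per_gpu_batch
    wall_clock_hours = gpu_hours / num_gpus
    return global_batch, wall_clock_hours

# Main model: 64 A100s, batch size 8 per GPU, 5,400 GPU hours total.
main_batch, main_wall = training_summary(64, 8, 5400)

# 7B AR-based VLM: 8 A100s, batch size 128 reported as a single (global)
# figure, 1,340 GPU hours total -- so only wall-clock time is derived here.
vlm_wall_clock_hours = 1340 / 8

# Optimizer hyperparameters as reported (AdamW).
optimizer_config = {
    "lr": 4e-5,
    "weight_decay": 0.01,
    "betas": (0.9, 0.95),
    "z_loss_weight": 1e-5,
}

print(main_batch, main_wall, vlm_wall_clock_hours)  # 512 84.375 167.5
```

Under these assumptions, the main run corresponds to roughly 3.5 days of wall-clock time on the 64-GPU cluster, and the 7B run to roughly a week on 8 GPUs.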