Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization
Authors: Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, He-Yang Xu, Yazhou Yao, Errui Ding
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the model's zero-shot capabilities on unseen instructions and unseen vision tasks through different experimental settings. ... Training is conducted on 64 A100 GPUs, with batch size for each GPU set as 8 over 2 epochs (around 47k iterations), totaling 5,400 GPU hours. Image resolution equals 448×448. ... Table 1. Comparisons with task-specific / vision generalist baselines across four representative tasks. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Nanjing University of Science and Technology, China. 2School of Computer Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, China. 3Baidu VIS. |
| Pseudocode | No | The paper describes the methodology in prose and with a framework diagram (Figure 4) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code and dataset have been openly available on our GitHub repository. |
| Open Datasets | Yes | Code and dataset have been openly available on our GitHub repository. ... We also collect Object Detection data from the LVIS (Gupta et al., 2019) dataset, Style Transfer data from CSGO (Xing et al., 2024) and prepare data for dense image prediction tasks including Depth Estimation, Surface Normal Estimation, Pose Estimation, and Semantic Segmentation from ADE20K (Zhou et al., 2017b), Depth in the Wild (Chen et al., 2016) and randomly selected data from the next two components. Additionally, for controllable generation tasks, we incorporate Holistically-Nested Edge (HED) Boundary to Image and inverse dense image prediction tasks (e.g., Pose-to-Image, Depth-to-Image, Segmentation-to-Image). |
| Dataset Splits | Yes | For quick validation, we utilize a subset of the DECVT training data: 30% data from the Explanatory-based Vision Tasks component (50% image editing data while the remaining 50% data from more visual-related image pairs) and 20% data from the remaining portion of the Terminological-based Vision Tasks component, with each subset containing approximately 0.5 million bidirectional pairs of ⟨image, explanatory instruction, image⟩ triplets (1.5 million bidirectional pairs of triplets in total). ... We conduct Semantic Segmentation experiments on the ADE20K validation set (Zhou et al., 2017b), and Depth Estimation and Surface Normal Estimation experiments on the NYU-Depth V2 dataset (Silberman et al., 2012). |
| Hardware Specification | Yes | Training is conducted on 64 A100 GPUs, with batch size for each GPU set as 8 over 2 epochs (around 47k iterations), totaling 5,400 GPU hours. ... Training for the 7B AR-based VLM model is conducted on 8 A100 GPUs with batch size set as 128 over 2 epochs (around 46k iterations), totaling 1,340 GPU hours. |
| Software Dependencies | No | The paper mentions using a 'byte pair encoding tokenizer' and refers to models like 'Chameleon' and 'Lumina-mGPT' but does not specify software library versions (e.g., PyTorch 1.x, Python 3.x) or specific solver versions. |
| Experiment Setup | Yes | For all the experiments, we employ the AdamW (Loshchilov, 2017) optimizer with a weight decay of 0.01 and betas set to (0.9, 0.95). The learning rate is configured at 4×10⁻⁵, and the z-loss is applied with a weight of 10⁻⁵. ... Image resolution equals 448×448. |
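The hardware and setup rows above imply a few derived quantities (global batch size, approximate wall-clock time) that are useful when budgeting a reproduction. A minimal Python sketch, assuming the GPU counts, per-GPU batch sizes, and GPU-hour totals are exactly as reported (the `training_summary` helper and `optimizer_config` dict are illustrative names, not from the paper):

```python
# Sanity-check sketch of the reported training scale.
# All input figures come from the table above; everything else is derived arithmetic.

def training_summary(num_gpus, per_gpu_batch, gpu_hours):
    """Derive the global batch size and approximate wall-clock hours."""
    global_batch = num_gpus * per_gpu_batch
    wall_clock_hours = gpu_hours / num_gpus
    return global_batch, wall_clock_hours

# Main model: 64 A100s, batch size 8 per GPU, 5,400 GPU hours total.
main_batch, main_wall = training_summary(64, 8, 5400)

# 7B AR-based VLM: 8 A100s, batch size 128 reported as a single (global)
# figure, 1,340 GPU hours total -- so only wall-clock time is derived here.
vlm_wall_clock_hours = 1340 / 8

# Optimizer hyperparameters as reported (AdamW).
optimizer_config = {
    "lr": 4e-5,
    "weight_decay": 0.01,
    "betas": (0.9, 0.95),
    "z_loss_weight": 1e-5,
}

print(main_batch, main_wall, vlm_wall_clock_hours)  # 512 84.375 167.5
```

Under these assumptions, the main run corresponds to roughly 3.5 days of wall-clock time on the 64-GPU cluster, and the 7B run to roughly a week on 8 GPUs.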