Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection
Authors: Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that the visual instructions generated by LIGER are more comprehensive than those of baseline methods. To evaluate whether the generated visual instructions align with human comprehension, we curate a benchmark containing 569 long-horizon tasks along with human-annotated ground-truth expressions and logic relations. Moreover, we evaluate the method on semantic alignment, logic correctness, and illustrativeness. Results show that LIGER surpasses baseline methods by a large margin. User studies and qualitative comparisons further verify that visual instructions generated by LIGER are more illustrative. |
| Researcher Affiliation | Academia | Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang; ReLER Lab, CCAI, Zhejiang University, China |
| Pseudocode | Yes | Algorithm 1 Single Step Self-reflection (a hedged sketch of this loop follows the table) |
| Open Source Code | Yes | Code and dataset are provided in https://github.com/suoych/LIGER. |
| Open Datasets | Yes | Code and dataset are provided in https://github.com/suoych/LIGER. |
| Dataset Splits | No | To evaluate whether the generated visual instructions align with human comprehension, we curate a benchmark containing 569 long-horizon tasks along with human-annotated ground-truth expressions and logic relations. We categorize the tasks into three types: short (6-8 steps), medium (9-11 steps), and long (12 or more steps). The paper describes a training-free framework that leverages existing models and evaluates on a curated benchmark, rather than training a model with explicit data splits. See the step-count bucketing sketch after the table. |
| Hardware Specification | Yes | All experiments are conducted on a single RTX A6000 GPU. |
| Software Dependencies | No | The paper names the specific models and tools used (GPT-4o, SDXL, the Free-U plugin, DDIM, LISA-7B, LaMa), but does not provide version numbers for core software dependencies such as Python, PyTorch, or CUDA, which are needed to reproduce the environment. |
| Experiment Setup | Yes | The draft image generation uses SDXL (Podell et al., 2023) with a guidance scale of 5 along with the Free-U plugin (Si et al., 2024). The DDIM generation and inversion timesteps are set to 50. For the visual memory, the number of previous-step image feature tokens M is set to half of the sequence length N, i.e., M = N/2. A hedged configuration sketch follows the table. |
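
For readers without the paper at hand, below is a minimal Python sketch of the control flow named by Algorithm 1 (Single Step Self-reflection): draft an image for the current step, let a vision-language model check its logic and attributes, and correct it with editing tools until the check passes. All helper bodies, the `Feedback` type, and the retry budget are hypothetical stand-ins, not the authors' code; only the draft / reflect / correct structure is taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    ok: bool
    issue: str = ""


# Hypothetical stand-ins for the paper's components; in LIGER these would
# wrap SDXL (draft), GPT-4o (reflection), and LISA + LaMa (editing).
def generate_draft(step_text, prev_image):
    return f"draft({step_text})"


def check_step(image, step_text, prev_image):
    # A VLM judges whether the image matches the step's logic and attributes.
    return Feedback(ok=True)


def edit_image(image, feedback):
    # Targeted correction, e.g. segment the wrong region and inpaint it.
    return f"edited({image}: {feedback.issue})"


def single_step_self_reflection(step_text, prev_image, max_rounds=3):
    """One draft -> reflect -> correct pass for a single instruction step."""
    image = generate_draft(step_text, prev_image)
    for _ in range(max_rounds):
        feedback = check_step(image, step_text, prev_image)
        if feedback.ok:
            break
        image = edit_image(image, feedback)
    return image
```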
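The benchmark's length buckets follow directly from the quoted definition (short: 6-8 steps, medium: 9-11, long: 12 or more). A minimal sketch, assuming each task exposes its step count; the function name is an illustration:

```python
def length_bucket(num_steps: int) -> str:
    """Categorize a benchmark task by step count, per the paper's definition."""
    if 6 <= num_steps <= 8:
        return "short"
    if 9 <= num_steps <= 11:
        return "medium"
    if num_steps >= 12:
        return "long"
    raise ValueError("benchmark tasks have at least 6 steps")


assert length_bucket(7) == "short"
assert length_bucket(10) == "medium"
assert length_bucket(12) == "long"
```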
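The quoted setup maps cleanly onto the Hugging Face diffusers API. A minimal sketch: the guidance scale of 5, the DDIM scheduler, 50 steps, and FreeU come from the paper; the checkpoint ID, the FreeU coefficients, and the prompt are assumptions, and the paper's visual memory (M = N/2 previous-step feature tokens) is internal to its pipeline and not shown here.

```python
import torch
from diffusers import StableDiffusionXLPipeline, DDIMScheduler

# Checkpoint ID is an assumption; the paper only says "SDXL".
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")  # the paper reports a single RTX A6000

# DDIM with 50 timesteps, as in the quoted setup.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# FreeU coefficients are assumptions (commonly used SDXL values);
# the paper does not list them.
pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.3, b2=1.4)

image = pipe(
    "Step 3: whisk the eggs until foamy",  # hypothetical step prompt
    guidance_scale=5.0,
    num_inference_steps=50,
).images[0]
```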