Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection

Authors: Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that the visual instructions generated by LIGER are more comprehensive than those of baseline methods. To evaluate whether the generated visual instructions align with human comprehension, we curate a benchmark containing 569 long-horizon tasks along with human-annotated ground-truth expressions and logic relations. We evaluate the method on semantic alignment, logic correctness, and illustrativeness; results show that LIGER surpasses baseline methods by a large margin. User studies and qualitative comparisons further verify that visual instructions generated by LIGER are more illustrative.
Researcher Affiliation | Academia | Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang — ReLER Lab, CCAI, Zhejiang University, China
Pseudocode | Yes | Algorithm 1: Single Step Self-reflection
Open Source Code | Yes | Code and dataset are provided at https://github.com/suoych/LIGER.
Open Datasets | Yes | Code and dataset are provided at https://github.com/suoych/LIGER.
Dataset Splits | No | The curated benchmark contains 569 long-horizon tasks with human-annotated ground-truth expressions and logic relations, categorized into three types: short (6-8 steps), medium (9-11 steps), and long (12 or more steps). The paper describes a training-free framework that leverages existing models and evaluates on this benchmark, rather than training a model with explicit train/validation/test splits.
Hardware Specification | Yes | All experiments are conducted on a single RTX A6000 GPU.
Software Dependencies | No | The paper names the specific models and tools used (GPT-4o, SDXL, the FreeU plugin, DDIM, LISA-7B, LaMa) but does not provide version numbers for core software dependencies such as Python, PyTorch, or CUDA, which are necessary for reproducibility as per this review's definition.
Experiment Setup | Yes | Draft image generation uses SDXL (Podell et al., 2023) with a guidance scale of 5, together with the FreeU plugin (Si et al., 2024). The DDIM generation and inversion timesteps are set to 50. For the visual memory, the number of previous-step image feature tokens M is set to half the sequence length N, i.e., M = N/2.
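The M = N/2 visual-memory setting above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token shapes, the use of NumPy arrays, and the choice to keep the first M previous-step tokens are all assumptions, since the paper only fixes the ratio M = N/2.

```python
import numpy as np

def build_visual_memory(prev_tokens: np.ndarray, cur_tokens: np.ndarray) -> np.ndarray:
    """Prepend M = N/2 feature tokens from the previous step's image to the
    current step's N tokens. Which M tokens to keep is an assumption here
    (first M); the paper only specifies the count M = N/2."""
    n = cur_tokens.shape[0]      # sequence length N of the current step
    m = n // 2                   # M = N/2 previous-step tokens to retain
    memory = prev_tokens[:m]     # illustrative selection: first M tokens
    return np.concatenate([memory, cur_tokens], axis=0)

# Example: 64-token sequences with 16-dim features (hypothetical sizes)
prev = np.random.randn(64, 16)
cur = np.random.randn(64, 16)
fused = build_visual_memory(prev, cur)
print(fused.shape)  # (96, 16): M + N = 32 + 64 tokens
```

The fused sequence is then what a cross-step attention module would consume, so the memory cost grows by only 50% per step rather than doubling.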