Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection

Authors: Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that the visual instructions generated by LIGER are more comprehensive than those of baseline methods. To evaluate whether the generated visual instructions align with human comprehension, we curate a benchmark containing 569 long-horizon tasks along with human-annotated ground-truth expressions and logic relations. We evaluate the method on semantic alignment, logic correctness, and illustrativeness; results show that LIGER surpasses baseline methods by a large margin. User studies and qualitative comparisons further verify that visual instructions generated by LIGER are more illustrative.
Researcher Affiliation | Academia | Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang — ReLER Lab, CCAI, Zhejiang University, China
Pseudocode | Yes | Algorithm 1: Single Step Self-reflection
Open Source Code | Yes | Code and dataset are provided at https://github.com/suoych/LIGER.
Open Datasets | Yes | Code and dataset are provided at https://github.com/suoych/LIGER.
Dataset Splits | No | The curated benchmark contains 569 long-horizon tasks with human-annotated ground-truth expressions and logic relations, categorized into three types: short (6-8 steps), medium (9-11 steps), and long (12 or more steps). The paper describes a training-free framework that leverages existing models and evaluates on this benchmark, rather than training a model with explicit train/validation/test splits.
Hardware Specification | Yes | All experiments are conducted on a single RTX A6000 GPU.
Software Dependencies | No | The paper names the specific models and tools used (GPT-4o, SDXL, the FreeU plugin, DDIM, LISA-7B, LaMa) but does not provide version numbers for core software dependencies such as Python, PyTorch, or CUDA, which are necessary for reproducibility as per this review's definition.
Experiment Setup | Yes | Draft image generation uses SDXL (Podell et al., 2023) with a guidance scale of 5, together with the FreeU plugin (Si et al., 2024). The DDIM generation and inversion timesteps are set to 50. For the visual memory, the number of previous-step image feature tokens M is set to half the sequence length N, i.e., M = N/2.
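The M = N/2 visual-memory setting above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token shapes, the use of NumPy arrays, and the choice to keep the first M previous-step tokens are all assumptions, since the paper only fixes the ratio M = N/2.

```python
import numpy as np

def build_visual_memory(prev_tokens: np.ndarray, cur_tokens: np.ndarray) -> np.ndarray:
    """Prepend M = N/2 feature tokens from the previous step's image to the
    current step's N tokens. Which M tokens to keep is an assumption here
    (first M); the paper only specifies the count M = N/2."""
    n = cur_tokens.shape[0]      # sequence length N of the current step
    m = n // 2                   # M = N/2 previous-step tokens to retain
    memory = prev_tokens[:m]     # illustrative selection: first M tokens
    return np.concatenate([memory, cur_tokens], axis=0)

# Example: 64-token sequences with 16-dim features (hypothetical sizes)
prev = np.random.randn(64, 16)
cur = np.random.randn(64, 16)
fused = build_visual_memory(prev, cur)
print(fused.shape)  # (96, 16): M + N = 32 + 64 tokens
```

The fused sequence is then what a cross-step attention module would consume, so the memory cost grows by only 50% per step rather than doubling.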