Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation

Authors: Jihyo Kim, Seulbi Lee, Sangheum Hwang

ICLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):

Research Type: Experimental. "Experimental results demonstrate that our ReGuide enhances the performance of current LVLMs in both image classification and OoDD tasks."

Researcher Affiliation: Academia. Jihyo Kim, Seulbi Lee, Sangheum Hwang; Department of Data Science, Seoul National University of Science and Technology.

Pseudocode: No. The paper describes the proposed method, Reflexive Guidance, in paragraph text and illustrates it with a framework diagram (Figure 5), but does not provide a formal pseudocode or algorithm block.

Open Source Code: Yes. https://github.com/daintlab/ReGuide

Open Datasets: Yes. "We evaluate the comparison models on the CIFAR10 and ImageNet200 benchmarks proposed in OpenOOD v1.5 (Zhang et al., 2024a). ... The ImageNet200 benchmark consists of ImageNet200 (Russakovsky et al., 2015) as the ID dataset, two datasets NINCO (Bitterwolf et al., 2023) and SSB-Hard (Vaze et al., 2022) as the near-OoD datasets, and three datasets iNaturalist (Van Horn et al., 2018), Textures, and OpenImage-O (Wang et al., 2022) as the far-OoD datasets."

Dataset Splits: Yes. "Due to cost, time, and API rate limits, we use 25% subsets of the benchmarks. A detailed explanation of the benchmark datasets can be found in Appendix B.1. ... We sample 25% of each dataset, ensuring that the proportions of datasets in each benchmark are maintained. During sampling, we maintain the ratio of the number of samples for each label from the original dataset. Tables B.1.1 and B.1.2 present the number of images in each dataset for the ImageNet200 and CIFAR10 benchmarks, respectively."

Hardware Specification: Yes. "All experiments are implemented with Python 3.9 and PyTorch 1.9, using NVIDIA A100 80GB GPUs. For the InternVL2-Llama3-76B model, due to its significant computational requirements, we employ 4 A100 GPUs with the vLLM (Kwon et al., 2023) library to reduce time overhead. For the remaining models, a single A100 GPU is sufficient to run the experiments."

Software Dependencies: Yes. "All experiments are implemented with Python 3.9 and PyTorch 1.9, using NVIDIA A100 80GB GPUs. ... we employ 4 A100 GPUs with the vLLM (Kwon et al., 2023) library to reduce time overhead."

Experiment Setup: Yes. "Our prompt consists of four components: a task description, an explanation of the rejection class, guidelines, and examples for the response format. ... The OoDD task allows us to investigate how LVLMs behave when required to generate responses beyond the categories defined within the user-provided prompt. ... For this experiment, we set the number of negative class suggestions for each group N to 20."
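The class-ratio-preserving 25% subsampling described under Dataset Splits can be sketched as below. This is a minimal illustration of stratified sampling, not code from the ReGuide repository; `stratified_subset` and its arguments are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_subset(labels, fraction=0.25, seed=0):
    """Return indices of a subset that keeps `fraction` of each class.

    Sketches the paper's sampling scheme: per-label ratios of the
    original dataset are preserved in the sampled subset.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    subset = []
    for idxs in by_label.values():
        # Keep at least one sample per class so no label disappears.
        k = max(1, round(len(idxs) * fraction))
        subset.extend(rng.sample(idxs, k))
    return sorted(subset)

# Toy example: 40 samples of class 0 and 20 of class 1
# yield 10 and 5 samples respectively at fraction=0.25.
labels = [0] * 40 + [1] * 20
subset = stratified_subset(labels, fraction=0.25)
```

Applying the same routine per dataset within each benchmark would also maintain the relative proportions of the benchmark's constituent datasets, as the paper describes.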