Perception in Reflection
Authors: Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, Vishal M. Patel
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation. |
| Researcher Affiliation | Collaboration | 1Johns Hopkins University 2StepFun 3BUPT 4HUST 5Tsinghua University 6UIUC. Correspondence to: Yana Wei <EMAIL>, Vishal M. Patel <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Reflective Perception (RePer) |
| Open Source Code | Yes | Project Page: https://weiyana.github.io/Perception-in-Reflection/ |
| Open Datasets | Yes | To construct the training dataset as illustrated in Section 2.2, we begin by randomly sampling 10,000 images from the LLaVA-665K (Liu et al., 2024c) dataset. For each image, we prompt the model to generate 8 different captions sampled with temperatures ranging from 0.0 to 1.4 in increments of 0.2. |
| Dataset Splits | No | The paper describes the composition of its constructed dataset, stating: "Using the generated captions, rewards, and templates from Figure 2, we create the visual reflection dataset, containing 11,065 samples from 8,101 images. These samples are distributed as follows: 3,649 for one conversation turn, 2,621 for two turns, and 3,795 for three turns." However, it does not explicitly provide percentages or counts for training, validation, and test splits used for model training. |
| Hardware Specification | Yes | All models are trained for one epoch on 8 NVIDIA A100 GPUs with a batch size of 8 and a learning rate of 1e-6. |
| Software Dependencies | No | The paper mentions using specific models and tools like "LLaVA-1.5", "GPT-4o", "DALLE3", "LLaVA-Critic", "WordNet", and "BERT", along with their respective citations. However, it does not provide specific version numbers for underlying software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries. |
| Experiment Setup | Yes | All models are trained for one epoch on 8 NVIDIA A100 GPUs with a batch size of 8 and a learning rate of 1e-6. Only the parameters of the LLM module are fine-tuned, while the rest remain frozen. In reflective unlikelihood training (Equation (2)), rewards are normalized to [0,1] by dividing by their maximum values (F), serving as the likelihood weight (σ). The constant term α is set to 10.0. |
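The caption-sampling sweep described in the Open Datasets row (8 captions per image, temperatures from 0.0 to 1.4 in increments of 0.2) can be sketched as follows. This is a minimal illustration; `generate_caption` is a hypothetical stand-in for the model's actual decoding call, not an API from the paper's codebase.

```python
# Sketch of the caption-sampling sweep: for each image, decode one caption
# at each temperature in {0.0, 0.2, ..., 1.4} (8 values total).
# `generate_caption` is a hypothetical stand-in for the VLM decoding call.

def temperature_schedule(start=0.0, stop=1.4, step=0.2):
    """Return the list of sampling temperatures: [0.0, 0.2, ..., 1.4]."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 1) for i in range(n)]

def sample_captions(image, generate_caption):
    """Generate one caption per temperature for a single image,
    returning a {temperature: caption} mapping."""
    return {t: generate_caption(image, temperature=t)
            for t in temperature_schedule()}
```

The rounding in `temperature_schedule` avoids floating-point drift (e.g. 0.6000000000000001) so the schedule matches the stated grid exactly.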
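The reward normalization in the Experiment Setup row (dividing by the maximum reward F so the normalized value serves as the likelihood weight σ, with α = 10.0) can be sketched as below. The σ-weighted blend of likelihood and unlikelihood terms, and the placement of α as a scale on the unlikelihood penalty, are assumptions modeled on standard unlikelihood training, not the paper's exact Equation (2).

```python
import math

ALPHA = 10.0  # constant term from the setup; its placement below is an assumption

def normalize_rewards(rewards):
    """Normalize rewards to [0, 1] by dividing by their maximum value (F);
    the normalized values serve as the likelihood weights sigma."""
    max_reward = max(rewards)
    return [r / max_reward for r in rewards]

def reflective_token_loss(log_prob, sigma):
    """Illustrative per-token objective (assumed form, not the paper's Eq. (2)):
    sigma weights the likelihood term, (1 - sigma) the unlikelihood term,
    and ALPHA scales the unlikelihood penalty. log_prob = log p(token)."""
    p = math.exp(log_prob)
    likelihood = -sigma * log_prob
    unlikelihood = -(1.0 - sigma) * math.log(max(1.0 - p, 1e-8))
    return likelihood + ALPHA * unlikelihood
```

With σ = 1 (a maximum-reward caption) the unlikelihood term vanishes and the loss reduces to plain negative log-likelihood; lower-reward captions are pushed down in proportion to 1 - σ.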