Perception in Reflection
Authors: Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, Vishal M. Patel
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation. |
| Researcher Affiliation | Collaboration | 1Johns Hopkins University 2StepFun 3BUPT 4HUST 5Tsinghua University 6UIUC. Correspondence to: Yana Wei <EMAIL>, Vishal M. Patel <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Reflective Perception (RePer) |
| Open Source Code | Yes | Project Page: https://weiyana.github.io/Perception-in-Reflection/ |
| Open Datasets | Yes | To construct the training dataset as illustrated in Section 2.2, we begin by randomly sampling 10,000 images from the LLaVA-665K (Liu et al., 2024c) dataset. For each image, we prompt the model to generate 8 different captions sampled with temperatures ranging from 0.0 to 1.4 in increments of 0.2. |
| Dataset Splits | No | The paper describes the composition of its constructed dataset, stating: "Using the generated captions, rewards, and templates from Figure 2, we create the visual reflection dataset, containing 11,065 samples from 8,101 images. These samples are distributed as follows: 3,649 for one conversation turn, 2,621 for two turns, and 3,795 for three turns." However, it does not explicitly provide percentages or counts for training, validation, and test splits used for model training. |
| Hardware Specification | Yes | All models are trained for one epoch on 8 NVIDIA A100 GPUs with a batch size of 8 and a learning rate of 1e-6. |
| Software Dependencies | No | The paper mentions using specific models and tools like "LLaVA-1.5", "GPT-4o", "DALLE3", "LLaVA-Critic", "WordNet", and "BERT", along with their respective citations. However, it does not provide specific version numbers for underlying software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries. |
| Experiment Setup | Yes | All models are trained for one epoch on 8 NVIDIA A100 GPUs with a batch size of 8 and a learning rate of 1e-6. Only the parameters of the LLM module are fine-tuned, while the rest remain frozen. In reflective unlikelihood training (Equation (2)), rewards are normalized to [0,1] by dividing by their maximum values (F), serving as the likelihood weight (σ). The constant term α is set to 10.0. |
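The caption-sampling sweep described in the Open Datasets row (8 captions per image, temperatures from 0.0 to 1.4 in increments of 0.2) can be sketched as follows. This is a minimal illustration; `generate_caption` is a hypothetical stand-in for the model's actual decoding call, not an API from the paper's codebase.

```python
# Sketch of the caption-sampling sweep: for each image, decode one caption
# at each temperature in {0.0, 0.2, ..., 1.4} (8 values total).
# `generate_caption` is a hypothetical stand-in for the VLM decoding call.

def temperature_schedule(start=0.0, stop=1.4, step=0.2):
    """Return the list of sampling temperatures: [0.0, 0.2, ..., 1.4]."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 1) for i in range(n)]

def sample_captions(image, generate_caption):
    """Generate one caption per temperature for a single image,
    returning a {temperature: caption} mapping."""
    return {t: generate_caption(image, temperature=t)
            for t in temperature_schedule()}
```

The rounding in `temperature_schedule` avoids floating-point drift (e.g. 0.6000000000000001) so the schedule matches the stated grid exactly.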
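The reward normalization in the Experiment Setup row (dividing by the maximum reward F so the normalized value serves as the likelihood weight σ, with α = 10.0) can be sketched as below. The σ-weighted blend of likelihood and unlikelihood terms, and the placement of α as a scale on the unlikelihood penalty, are assumptions modeled on standard unlikelihood training, not the paper's exact Equation (2).

```python
import math

ALPHA = 10.0  # constant term from the setup; its placement below is an assumption

def normalize_rewards(rewards):
    """Normalize rewards to [0, 1] by dividing by their maximum value (F);
    the normalized values serve as the likelihood weights sigma."""
    max_reward = max(rewards)
    return [r / max_reward for r in rewards]

def reflective_token_loss(log_prob, sigma):
    """Illustrative per-token objective (assumed form, not the paper's Eq. (2)):
    sigma weights the likelihood term, (1 - sigma) the unlikelihood term,
    and ALPHA scales the unlikelihood penalty. log_prob = log p(token)."""
    p = math.exp(log_prob)
    likelihood = -sigma * log_prob
    unlikelihood = -(1.0 - sigma) * math.log(max(1.0 - p, 1e-8))
    return likelihood + ALPHA * unlikelihood
```

With σ = 1 (a maximum-reward caption) the unlikelihood term vanishes and the loss reduces to plain negative log-likelihood; lower-reward captions are pushed down in proportion to 1 - σ.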