Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration
Authors: Yuanchen Wu, Ke Yan, Shouhong Ding, Ziyin Zhou, Xiaoqiang Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate the effectiveness of SRC. Compared to post-training techniques of vision-text alignment, SRC significantly improves performance in QA scenarios (as shown in Figure 1), achieving strong generalization and reasoning capabilities. |
| Researcher Affiliation | Collaboration | 1 School of Computer Engineering and Science, Shanghai University; 2 Tencent Youtu Lab; 3 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Xiamen University. Correspondence to: Ke Yan <EMAIL>, Xiaoqiang Li <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose (Section 3. Method: Self-Rationale Calibration) and outlines its stages without presenting any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper discusses the use of existing open-source models (e.g., Qwen2-VL-72B, Qwen-2.5-1.5B, LLaMA-3.1-70B) but does not provide a specific link or explicit statement about releasing the authors' own implementation code for the SRC framework. |
| Open Datasets | Yes | We begin by collecting and sampling publicly available VQA datasets... We curated a training set of 57K samples from 11 popular datasets in LVLMs, encompassing three major categories: perception & world knowledge, chart understanding, and math & science. Detailed descriptions of each dataset and their respective contributions to the SRC training set are provided in Table 4. |
| Dataset Splits | Yes | We empirically set a 2:1:1 sampling ratio across the above three categories, and construct a set of nearly 20K samples for rationale fine-tuning. ... For the calibration process, we constructed a training dataset of 12k samples using the remaining data from the rationale fine-tuning phase, adhering to a 2:1:1 sample ratio. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions several models and frameworks like LoRA, Qwen2-VL-72B, Qwen-2.5-72B, LLaVA-1.5-7B, LLaVA-Next-8B, GPT-4o, and LLaMA-3.1-70B, but it does not specify software dependencies like programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA versions with specific version numbers. |
| Experiment Setup | Yes | During the rationale calibration stage, the LoRA ranks for LLaVA-1.5 and LLaVA-Next are set to 4 and 32, respectively, with their corresponding LoRA learning scales set to twice the rank. The learning rates for both models are set to 1e-5. In the preference fine-tuning stage, the LoRA rank for both LLaVA-1.5 and LLaVA-Next is set to 256, with a learning scale of 512 and a learning rate of 5e-7. For DPO, the regularization weight β is set to 0.1. Additionally, we incorporate an SFT loss following RPO (Liu et al., 2024d), with the loss weight for the SFT term set to 0.02. The iterative training is carried out over three iterations. |
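The preference fine-tuning objective described in the setup row (DPO with β = 0.1 plus an RPO-style SFT term weighted 0.02) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name, signature, and the assumption that per-sequence log-probabilities are precomputed are all ours.

```python
import torch
import torch.nn.functional as F

def dpo_with_sft_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      sft_nll, beta=0.1, sft_weight=0.02):
    """Sketch of DPO with an auxiliary SFT term (RPO-style).

    All *_logps arguments are per-sequence log-probabilities under the
    policy / frozen reference model; sft_nll is the negative log-likelihood
    of the chosen response. beta and sft_weight follow the values
    reported in the paper (0.1 and 0.02).
    """
    # Implicit reward margins: log-prob ratios of policy vs. reference.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO term: push the chosen margin above the rejected one.
    dpo_loss = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))
    # Add the weighted SFT (NLL) regularizer and average over the batch.
    return (dpo_loss + sft_weight * sft_nll).mean()
```

Swapping the chosen and rejected log-probabilities increases the loss, which is the sanity check one would expect of a preference objective.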