Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration
Authors: Yuanchen Wu, Ke Yan, Shouhong Ding, Ziyin Zhou, Xiaoqiang Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate the effectiveness of SRC. Compared to post-training techniques of vision-text alignment, SRC significantly improves performance in QA scenarios (as shown in Figure 1), achieving strong generalization and reasoning capabilities. |
| Researcher Affiliation | Collaboration | 1 School of Computer Engineering and Science, Shanghai University; 2 Tencent Youtu Lab; 3 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Xiamen University. Correspondence to: Ke Yan <EMAIL>, Xiaoqiang Li <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose (Section 3. Method: Self-Rationale Calibration) and outlines its stages without presenting any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper discusses the use of existing open-source models (e.g., Qwen2-VL-72B, Qwen-2.5-1.5B, LLaMA-3.1-70B) but does not provide a specific link or explicit statement about releasing the authors' own implementation code for the SRC framework. |
| Open Datasets | Yes | We begin by collecting and sampling publicly available VQA datasets... We curated a training set of 57K samples from 11 popular datasets in LVLMs, encompassing three major categories: perception & world knowledge, chart understanding, and math & science. Detailed descriptions of each dataset and their respective contributions to the SRC training set are provided in Table 4. |
| Dataset Splits | Yes | We empirically set a 2:1:1 sampling ratio across the above three categories, and construct a set of nearly 20K samples for rationale fine-tuning. ... For the calibration process, we constructed a training dataset of 12k samples using the remaining data from the rationale fine-tuning phase, adhering to a 2:1:1 sample ratio. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions several models and frameworks like LoRA, Qwen2-VL-72B, Qwen-2.5-72B, LLaVA-1.5-7B, LLaVA-Next-8B, GPT-4o, and LLaMA-3.1-70B, but it does not specify software dependencies like programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA versions with specific version numbers. |
| Experiment Setup | Yes | During the rationale calibration stage, the LoRA ranks for LLaVA-1.5 and LLaVA-Next are set to 4 and 32, respectively, with their corresponding LoRA learning scales set to twice the rank. The learning rates for both models are set to 1e-5. In the preference fine-tuning stage, the LoRA rank for both LLaVA-1.5 and LLaVA-Next is set to 256, with a learning scale of 512 and a learning rate of 5e-7. For DPO, the regularization weight β is set to 0.1. Additionally, we incorporate an SFT loss following RPO (Liu et al., 2024d), with the loss weight for the SFT term set to 0.02. The iterative training is carried out over three iterations. |
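The preference fine-tuning objective described in the setup row (DPO with β = 0.1 plus an RPO-style SFT term weighted 0.02) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name, signature, and the assumption that per-sequence log-probabilities are precomputed are all ours.

```python
import torch
import torch.nn.functional as F

def dpo_with_sft_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      sft_nll, beta=0.1, sft_weight=0.02):
    """Sketch of DPO with an auxiliary SFT term (RPO-style).

    All *_logps arguments are per-sequence log-probabilities under the
    policy / frozen reference model; sft_nll is the negative log-likelihood
    of the chosen response. beta and sft_weight follow the values
    reported in the paper (0.1 and 0.02).
    """
    # Implicit reward margins: log-prob ratios of policy vs. reference.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO term: push the chosen margin above the rejected one.
    dpo_loss = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))
    # Add the weighted SFT (NLL) regularizer and average over the batch.
    return (dpo_loss + sft_weight * sft_nll).mean()
```

Swapping the chosen and rejected log-probabilities increases the loss, which is the sanity check one would expect of a preference objective.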