CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs

Authors: Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations. On the Object HalBench dataset, CHiP outperforms DPO in hallucination reduction, achieving improvements of 52.7% and 55.5% relative points based on the base models Muffin and LLaVA, respectively.
Researcher Affiliation | Academia | Jinlan Fu (1), Shenzhen Huangfu (1,2), Hao Fei (1), Xiaoyu Shen (3), Bryan Hooi (1), Xipeng Qiu (2), See-Kiong Ng (1); (1) National University of Singapore, (2) Fudan University, (3) Digital Twin Institute, Eastern Institute of Technology, Ningbo
Pseudocode | No | The paper describes its methodology using mathematical formulations (e.g., L_DPO^r = −log σ(β log [π_θ(y_w | m, x) / π_ref(y_w | m, x)] − β log [π_θ(y_l | m, x) / π_ref(y_l | m, x)])) and textual descriptions of modules, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code | Yes | We make all our datasets and code publicly available. https://github.com/LVUGAI/CHiP
Open Datasets | Yes | We choose to use the RLHF-V-Dataset (Yu et al., 2024a;b) with 5k training samples as our training dataset. Specifically, we sampled 150 image-ground-truth description pairs from the COCO-2017 (Lin et al., 2014) validation set.
Dataset Splits | No | There are several publicly available training datasets that include preference pairs for multimodal hallucinations. Here, we choose to use the RLHF-V-Dataset (Yu et al., 2024a;b) with 5k training samples as our training dataset. Object HalBench (ObjHal) (Rohrbach et al., 2018) is a widely used benchmark for evaluating object hallucination. To improve evaluation stability, the benchmark includes 8 diverse prompts and is tested on 300 instances. MMHal-Bench (MMHal) (Sun et al., 2023) is a question-answering benchmark that covers 8 question categories and 12 object topics. HallusionBench (Guan et al., 2024) evaluates visual illusions and knowledge hallucinations, featuring 346 images and 1129 questions. AMBER (Wang et al., 2023a) was designed to be evaluated without LLM assistance.
Hardware Specification | Yes | For training time, LLaVA-1.6 took about three hours to train with CHiP on 4 H100 GPUs, while Muffin took approximately five hours.
Software Dependencies | No | The paper mentions several models and frameworks, such as CLIP, Vicuna-1.5-7B, BEiT-3, and Vicuna v1.0, but does not specify general software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | We train Muffin (13B) (Yu et al., 2023) and LLaVA-1.6 (7B) (Liu et al., 2024b) with CHiP for 3 epochs, with a learning rate of 5e-7 and a batch size of 32. For training time, LLaVA-1.6 took about three hours to train with CHiP on 4 H100 GPUs, while Muffin took approximately five hours. Hyperparameters: Since our training dataset is the RLHF-V dataset (Yu et al., 2024a), we followed Yu et al. (2024a) to set the hyperparameter β = 0.5 and followed Zeng et al. (2024) to set γ = 0.1 for token-level preference optimization. As for the weight of segment-level preference optimization, λ, we set λ = 1 and λ = 3 for the Muffin and LLaVA models, respectively (Sec. 5.3).
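The response-level DPO objective quoted in the Pseudocode row can be sketched in a few lines of Python. This is a minimal, framework-free illustration, not the paper's implementation: the function name and the scalar log-probability inputs are our own choices.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(logp_w: float, ref_logp_w: float,
             logp_l: float, ref_logp_l: float,
             beta: float = 0.5) -> float:
    """Response-level DPO loss for a single preference pair.

    Inputs are summed log-probabilities of the chosen (w) and rejected (l)
    responses under the policy (logp_*) and the frozen reference model
    (ref_logp_*), each conditioned on the image m and the prompt x.
    """
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return -math.log(sigmoid(margin))


# At a zero margin the loss equals log 2; it shrinks as the policy widens
# the chosen-vs-rejected gap relative to the reference model.
```

In practice the log-probabilities would come from batched model forward passes and the loss would be averaged over pairs; the scalar form above only exposes the arithmetic of the objective.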
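The training setup reported in the Experiment Setup row can be collected into a single configuration sketch. The dictionary layout and field names below are our own shorthand; only the values are taken from the quoted text.

```python
# Hyperparameters and resources reported for CHiP training
# (field names are illustrative, not from the paper's codebase).
CHIP_SETUP = {
    "models": {"Muffin": "13B", "LLaVA-1.6": "7B"},
    "epochs": 3,
    "learning_rate": 5e-7,
    "batch_size": 32,
    "beta": 0.5,    # DPO temperature, following Yu et al. (2024a)
    "gamma": 0.1,   # token-level preference weight, following Zeng et al. (2024)
    "lambda": {"Muffin": 1.0, "LLaVA-1.6": 3.0},  # segment-level weight
    "hardware": "4x H100 GPUs",
    "approx_train_hours": {"LLaVA-1.6": 3, "Muffin": 5},
}
```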