PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Authors: Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, Chunhua Shen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments on the open-source and widely used LLaVA-1.5 dataset. By utilizing the LLaVA-1.5 dataset, we fully replicated the LLaVA-1.5 training process and results, and used it as the baseline for our experiments. To ensure fairness, we apply perturbations directly to the LLaVA-1.5 dataset rather than incorporating additional data. We conduct experiments by generating perturbed text with GPT-4o on the 160k VQA-related samples in the dataset. Table 2 shows our final HalFscore, which includes precision, recall, and F-score, and further classifies hallucinations into object, attribute, and relation types for more detailed evaluation. Table 3: Results of our method and comparative methods on different benchmarks. |
| Researcher Affiliation | Collaboration | 1 Zhejiang University 2 WeChat Group 3 Zhejiang University of Technology |
| Pseudocode | No | The paper describes methods such as 'Graph Computation' and 'Perturbation Text Design' with structured steps and explanations (e.g., in Figure 4 and Appendix A.8 for prompts), but it does not present explicit pseudocode or any algorithm block in a code-like format. |
| Open Source Code | No | The paper states: 'To ensure the reproducibility of our results, we will also release our VCD implementation.' This is a promise of a future release, not a current release of the code for the methodology described in the paper. It also mentions using 'LLaVA-1.5 open-source data' and 'open-source model weights of RLAIF-7B', but these refer to other projects' artifacts, not the authors' own code. |
| Open Datasets | Yes | Specifically, we selected 1,000 images from the Densely Captioned Images (DCI) dataset (Urbanek et al., 2023), in which images are manually annotated and densely captioned. To ensure the reliability and credibility of the results, we conducted experiments on the open-source and widely used LLaVA-1.5 dataset. We set the random seed to 0 and randomly select 1,000 images from the MSCOCO2014 validation set as the ground truth for Object HalBench. We assess the model's general ability on three widely used benchmarks: MMBench (Liu et al., 2023c), CCBench (Liu et al., 2023c), and SEED-Image (Li et al., 2023a). |
| Dataset Splits | No | We conduct experiments by generating perturbed text using GPT-4o on the 160k VQA-related samples in the dataset. We manually selected 1,000 images from the DCI dataset to ensure that the final image data used is characterized by high quality and diversity. We set the random seed to 0 and randomly select 1,000 images from the MSCOCO2014 validation set as the ground truth for Object HalBench. The paper mentions training-data sizes (LLaVA-Pretrain 558k, LLaVA-SFT 665k) and specific image counts for evaluation sets, but it does not explicitly provide the training/validation/test splits (e.g., percentages or exact counts) for the overall dataset used to train the model. |
| Hardware Specification | No | Table 9 (comparison of training costs): baseline — average memory cost 62.3 GB, training time 264 min; PerturboLLaVA — 63.8 GB, 281 min; additional overhead ratio 2.6% (memory) and 6.4% (time). The paper mentions 'additional GPU memory' and reports memory usage in GB, but it does not specify the model or type of GPUs, CPUs, or other hardware used for the experiments. |
| Software Dependencies | No | Using the official LLaVA-1.5 open-source data and the XTuner framework, we successfully reproduced LLaVA-1.5. We use GPT-4o to generate the perturbation text. The paper mentions specific software tools such as the 'XTuner framework' and 'GPT-4o', but it does not provide version numbers for any software components, libraries, or programming languages used. |
| Experiment Setup | Yes | Table 10 (training settings for the LLaVA-1.5 reproduction): trainable modules — projector (pretrain), projector & LLM backbone (SFT); learning rate 1.0e-3 (pretrain), 2.0e-5 (SFT); cosine-annealing LR scheduler; warmup ratio 0.03; 1 training epoch; global batch size 256 (pretrain), 128 (SFT); AdamW optimizer. Table 11 (OPERA parameter settings used in the experiments): N_beams 5, scale factor 50, threshold 15, number of attention candidates 5, penalty weight 1. Table 12 (VCD parameter settings used in the experiments): N_beams 5, image noise steps T 999, VCD alpha 0.5, VCD beta 0.1. |
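The Dataset Splits row notes that the evaluation subset is fixed by seeding (seed 0, 1,000 images from the MSCOCO2014 validation set). A minimal sketch of such a seeded selection follows; the function name and the exact sampling call are assumptions for illustration, not the authors' code.

```python
import random

def select_eval_images(image_ids, n=1000, seed=0):
    """Reproducibly select a fixed evaluation subset.

    Mirrors the stated protocol (seed 0, 1,000 images from the
    MSCOCO2014 validation set); this helper is hypothetical and
    only illustrates why a fixed seed makes the subset repeatable.
    """
    rng = random.Random(seed)          # local RNG, independent of global state
    return sorted(rng.sample(image_ids, n))

# With the same seed, every run selects the identical subset.
ids = list(range(40504))  # MSCOCO2014 val contains 40,504 images
assert select_eval_images(ids) == select_eval_images(ids)
```

Seeding a local `random.Random` instance (rather than the global RNG) keeps the selection stable even if other code consumes random numbers in between.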
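The hyperparameters reported in Table 10 can be captured as a small configuration sketch, which makes the pretrain/SFT differences explicit. The key names below are illustrative, not the XTuner config schema.

```python
# Hedged sketch of the reported LLaVA-1.5 training settings (Table 10).
# Field names are assumptions; values are taken from the review above.
PRETRAIN = dict(
    trainable="projector",            # only the projector is updated
    lr=1.0e-3,
    scheduler="cosine_annealing",
    warmup_ratio=0.03,
    epochs=1,
    global_batch_size=256,
    optimizer="AdamW",
)
SFT = dict(
    trainable="projector+llm_backbone",  # projector and LLM backbone
    lr=2.0e-5,
    scheduler="cosine_annealing",
    warmup_ratio=0.03,
    epochs=1,
    global_batch_size=128,
    optimizer="AdamW",
)
```

The two stages share the scheduler, warmup, epoch count, and optimizer; only the trainable modules, learning rate, and batch size differ.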