MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models
Authors: Tian Liang, Yuetian Du, Jing Huang, Ming Kong, Luyuan Chen, Yadong Li, Siye Chen, Qiang Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that MoLE significantly reduces hallucinations, outperforming the current state-of-the-art decoding techniques across three mainstream LVLMs and two established hallucination benchmarks. Additionally, the paper includes sections like "Main Results POPE", "CHAIR", and "Ablation Study Impact of Each Expert Module", all detailing empirical evaluations and comparisons. |
| Researcher Affiliation | Collaboration | The authors are affiliated with "Zhejiang University", "Beijing Information Science and Technology University" (academic institutions), and "Ant Group" (an industry company). |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual explanations, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/Rainlt/MoLE/ |
| Open Datasets | Yes | To evaluate hallucination reduction, we employed the Polling-based Object Probing Evaluation (POPE) metric (Li et al. 2023c)... selected 100 images from the COCO dataset. The CHAIR (Rohrbach et al. 2019) (Caption Hallucination Assessment with Image Relevance) metric... randomly sampled 500 images from the MSCOCO (Lin et al. 2014) validation set. |
| Dataset Splits | Yes | The full POPE test comprises three parts, with each part having a 1:1 ratio of positive to negative samples... we selected 100 images from the COCO dataset and created 600 samples, comprising equal numbers of positive and negative samples for each part of the test. For the CHAIR evaluation, we randomly sampled 500 images from the MSCOCO (Lin et al. 2014) validation set and instructed each model to generate detailed descriptions of these images. |
| Hardware Specification | No | The paper mentions using "three state-of-the-art LVLMs: MiniGPT-4 (Zhu et al. 2023), LLaVA-1.5 (Liu et al. 2023), and Shikra (Chen et al. 2023). Each of these models utilizes Vicuna-7B (Zheng et al. 2023b) as the language decoder," but it does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper references various models and frameworks like BERT, LLaMA, Vicuna, CLIP, BLIP, MiniGPT-4, LLaVA-1.5, and Shikra, but it does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In our implementation of MoLE, the Final Expert is selected as the last layer of the model (N = 32). The Second Opinion Expert is dynamically chosen from the last three layers (L ∈ {29, 30, 31}), excluding the final layer... We set k = 5 to determine the top-k critical tokens and used α = 0.5 as the scale factor for the Second Opinion Expert. For the Prompt Retention Expert, the temperature coefficient λ was set to 100. |
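The setup row above can be sketched in code. The following NumPy snippet is a minimal illustration, not the paper's implementation: it assumes per-layer logit vectors are available, and it invents a selection rule for the Second Opinion Expert (largest distributional divergence from the final layer), since the table excerpt does not state the paper's actual criterion. The function names `softmax` and `mole_combine` and the divergence proxy are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def mole_combine(layer_logits, alpha=0.5, k=5, second_layers=(29, 30, 31)):
    """Illustrative mixture of a Final Expert (last layer) with a
    Second Opinion Expert picked from the last few layers.

    layer_logits: dict mapping layer index -> logit vector (np.ndarray).
    The selection rule below (max L1 divergence from the final-layer
    distribution) is an assumption for illustration only.
    """
    final = layer_logits[max(layer_logits)]  # Final Expert: last layer (N = 32)
    p_final = softmax(final)
    # Assumed rule: pick the candidate layer whose softmax distribution
    # differs most from the final layer's.
    second_idx = max(
        second_layers,
        key=lambda l: np.abs(softmax(layer_logits[l]) - p_final).sum(),
    )
    second = layer_logits[second_idx]
    # Restrict the second opinion to the top-k critical tokens (k = 5),
    # scaled by alpha = 0.5, mirroring the hyperparameters quoted above.
    topk = np.argsort(final)[-k:]
    combined = final.copy()
    combined[topk] += alpha * second[topk]
    return combined, second_idx
```

The Prompt Retention Expert (temperature λ = 100) is omitted here because the excerpt gives only its temperature, not how its scores enter the mixture.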