MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models
Authors: Tian Liang, Yuetian Du, Jing Huang, Ming Kong, Luyuan Chen, Yadong Li, Siye Chen, Qiang Zhu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that MoLE significantly reduces hallucinations, outperforming the current state-of-the-art decoding techniques across three mainstream LVLMs and two established hallucination benchmarks. Additionally, the paper includes sections like "Main Results POPE", "CHAIR", and "Ablation Study Impact of Each Expert Module", all detailing empirical evaluations and comparisons. |
| Researcher Affiliation | Collaboration | The authors are affiliated with "Zhejiang University", "Beijing Information Science and Technology University" (academic institutions), and "Ant Group" (an industry company). |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual explanations, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/Rainlt/MoLE/ |
| Open Datasets | Yes | To evaluate hallucination reduction, we employed the Polling-based Object Probing Evaluation (POPE) metric (Li et al. 2023c)... selected 100 images from the COCO dataset. The CHAIR (Rohrbach et al. 2019) (Caption Hallucination Assessment with Image Relevance) metric... randomly sampled 500 images from the MSCOCO (Lin et al. 2014) validation set. |
| Dataset Splits | Yes | The full POPE test comprises three parts, with each part having a 1:1 ratio of positive to negative samples... we selected 100 images from the COCO dataset and created 600 samples, comprising equal numbers of positive and negative samples for each part of the test. For the CHAIR evaluation, we randomly sampled 500 images from the MSCOCO (Lin et al. 2014) validation set and instructed each model to generate detailed descriptions of these images. |
| Hardware Specification | No | The paper mentions using "three state-of-the-art LVLMs: MiniGPT-4 (Zhu et al. 2023), LLaVA-1.5 (Liu et al. 2023), and Shikra (Chen et al. 2023). Each of these models utilizes Vicuna-7B (Zheng et al. 2023b) as the language decoder," but it does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper references various models and frameworks like BERT, LLaMA, Vicuna, CLIP, BLIP, MiniGPT-4, LLaVA-1.5, and Shikra, but it does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In our implementation of MoLE, the Final Expert is selected as the last layer of the model (N = 32). The Second Opinion Expert is dynamically chosen from the last three layers (L ∈ {29, 30, 31}), excluding the final layer... We set k = 5 to determine the top-k critical tokens and used α = 0.5 as the scale factor for the Second Opinion Expert. For the Prompt Retention Expert, the temperature coefficient λ was set to 100. |
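The setup row above can be sketched in code. The following NumPy snippet is a minimal illustration, not the paper's implementation: it assumes per-layer logit vectors are available, and it invents a selection rule for the Second Opinion Expert (largest distributional divergence from the final layer), since the table excerpt does not state the paper's actual criterion. The function names `softmax` and `mole_combine` and the divergence proxy are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def mole_combine(layer_logits, alpha=0.5, k=5, second_layers=(29, 30, 31)):
    """Illustrative mixture of a Final Expert (last layer) with a
    Second Opinion Expert picked from the last few layers.

    layer_logits: dict mapping layer index -> logit vector (np.ndarray).
    The selection rule below (max L1 divergence from the final-layer
    distribution) is an assumption for illustration only.
    """
    final = layer_logits[max(layer_logits)]  # Final Expert: last layer (N = 32)
    p_final = softmax(final)
    # Assumed rule: pick the candidate layer whose softmax distribution
    # differs most from the final layer's.
    second_idx = max(
        second_layers,
        key=lambda l: np.abs(softmax(layer_logits[l]) - p_final).sum(),
    )
    second = layer_logits[second_idx]
    # Restrict the second opinion to the top-k critical tokens (k = 5),
    # scaled by alpha = 0.5, mirroring the hyperparameters quoted above.
    topk = np.argsort(final)[-k:]
    combined = final.copy()
    combined[topk] += alpha * second[topk]
    return combined, second_idx
```

The Prompt Retention Expert (temperature λ = 100) is omitted here because the excerpt gives only its temperature, not how its scores enter the mixture.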