Attributive Reasoning for Hallucination Diagnosis of Large Language Models

Authors: Yuyan Chen, Zehao Li, Shuangjie You, Zhengyu Chen, Jingwen Chang, Yi Zhang, Weinan Dai, Qingpei Guo, Yanghua Xiao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a series of experiments and the performance on answer reliability has significant improvement, achieving 28.25% at most, which demonstrates the effectiveness of our proposed DPD and its generalization in mitigating hallucination in LLMs. We conduct extensive experiments, demonstrating that DPD has a great effect across various datasets and LLMs in mitigating hallucinations. Main Results: As demonstrated in Table 2, the introduction of DPD has led to improvements across various datasets for each LLM. We conduct ablation studies on the RelQA-Cate dataset for different LLMs, as shown in Fig. 8, Fig. 9, and Table 4, respectively.
Researcher Affiliation | Collaboration | Yuyan Chen1, Zehao Li2, Shuangjie You3, Zhengyu Chen4, Jingwen Chang1, Yi Zhang5, Weinan Dai4, Qingpei Guo6, Yanghua Xiao1* 1Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University 2School of Data Science and Engineering, East China Normal University 3Georgia Institute of Technology 4Zhejiang University 5Southern University of Science and Technology 6Ant Group {chenyuyan21@m., jwchang24@m., shawyh@}fudan.edu.cn, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the Differential Penalty Decoding (DPD) strategy using text descriptions and mathematical formulas (equations 1-7), along with a flowchart in Figure 7. It does not contain structured pseudocode or an algorithm block.
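Since the paper specifies DPD only through its equations and a flowchart, a minimal sketch of what a generic penalty-adjusted greedy decoding step looks like may help orient readers. This is an illustrative example of the penalty-decoding family only, not the authors' DPD; the function name, the `penalty` value, and the sign-aware down-weighting rule are all assumptions modeled on common repetition-penalty implementations:

```python
def penalized_greedy_step(logits, generated_ids, penalty=1.2):
    """One greedy decoding step with a repetition-style penalty on
    already-generated tokens. Illustrative only; NOT the paper's DPD,
    whose differential penalties are defined by its equations 1-7."""
    scores = list(logits)
    for tok in set(generated_ids):
        # Down-weight tokens the model has already emitted. The branch
        # keeps the penalty direction consistent for negative logits.
        if scores[tok] > 0:
            scores[tok] /= penalty
        else:
            scores[tok] *= penalty
    # Return the index of the highest penalized score (greedy pick).
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, a token whose raw logit would win greedy selection can lose it after the penalty, which is the basic mechanism any penalty-decoding strategy exploits.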
Open Source Code | Yes | Under the guidance of the framework and the benchmark, we realize a novel strategy named Differential Penalty Decoding (denoted as DPD). Repository: https://github.com/Yukyin/DPD4LLM
Open Datasets | Yes | To support this framework, we develop a new benchmark named RelQA-Cate, which includes eight categories of hallucinations for the answers generated by LLMs. ... we first design eight hallucination categories inspired by Li et al. (2024) and classify the incorrect answers generated by LLMs into these categories using ChatGPT based on the RelQA dataset (Chen et al. 2023d) to obtain a new benchmark named RelQA-Cate. ... We adopt the RelQA-Cate and TruthfulQA (Lin, Hilton, and Evans 2021) datasets.
Dataset Splits | No | The paper states: "Finally, we select 1,500 data instances for each hallucination category in RelQA, ensuring an equal distribution of 12,000 correct answers, thereby constructing a dataset of 24,000 samples, named RelQA-Cate, serving as an evaluation dataset for assessing LLMs hallucinations attribution." and "Then, we validate the effect in the test set of the same dataset." Although a test set is implied, the paper provides no specific percentages or counts for training/validation/test splits of either RelQA-Cate or TruthfulQA, nor does it point to predefined splits from external sources for the experiments it reports.
Hardware Specification | Yes | Our experiments are conducted on 8x Nvidia A100 GPUs, each with 80GB of memory, using PyTorch in Python.
Software Dependencies | No | Our experiments are conducted on 8x Nvidia A100 GPUs, each with 80GB of memory, using PyTorch in Python. The paper mentions PyTorch and Python but does not specify their version numbers, which is required for a reproducible description of ancillary software.
Experiment Setup | Yes | We set the maximum sequence length for input and output sequences to 1024 and 128 tokens, respectively. We set the temperature to 0 to generate deterministic answers. We first adopt an LLM to generate k (set as 5) diverse candidate answers by adjusting the temperature coefficient. The diversity of candidate answers for a question is evaluated with Distinct-2 (Li et al. 2015), a metric that assesses text diversity by calculating the proportion of unique bi-grams in generated text, and is required to exceed α (set as 0.8).
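The Distinct-2 metric referenced above is simple enough to state in code: the fraction of unique bi-grams among all bi-grams in the generated text. A minimal sketch follows; how the paper aggregates the metric across the k candidate answers is not specified, so concatenating the candidates here is an assumption:

```python
def distinct_2(text: str) -> float:
    """Distinct-2 (Li et al. 2015): proportion of unique bi-grams
    among all bi-grams in the whitespace-tokenized text."""
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0  # fewer than two tokens yields no bi-grams
    return len(set(bigrams)) / len(bigrams)

def candidates_diverse(candidates, alpha=0.8):
    """Check the paper's diversity threshold alpha (set as 0.8).
    Concatenating candidates before scoring is an assumption."""
    return distinct_2(" ".join(candidates)) > alpha
```

Under this reading, a candidate set that largely repeats the same phrasing would score low and fail the α = 0.8 threshold, prompting regeneration at a different temperature.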