Can Knowledge Editing Really Correct Hallucinations?

Authors: Baixiang Huang, Canyu Chen, Xiongxiao Xu, Ali Payani, Kai Shu

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics, and more than 6,000 hallucinations. Then, we assess the performance of knowledge editing methods holistically on five dimensions: Efficacy, Generalization, Portability, Locality, and Robustness. Through HalluEditBench, we provide new insights into the potential and limitations of different knowledge editing methods in correcting hallucinations, which could inspire future improvements and facilitate progress in the field of knowledge editing. |
| Researcher Affiliation | Collaboration | 1 Emory University, 2 Illinois Institute of Technology, 3 Cisco Research; EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methods in paragraph form, e.g., "Following Wang et al. (2024e), we applied rules to convert knowledge triplets into factual questions with objects as the ground-truth answers", and details the prompt design for GPT-4o in Appendix A. However, it contains no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions a project website (https://llm-editing.github.io), which is a general project overview page rather than a direct link to a source-code repository for the described methodology. It also lists model checkpoints from HuggingFace, but these are the models used, not the authors' own implementation code. |
| Open Datasets | No | "We proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics and more than 6,000 hallucinations... Finally, we sampled a subset of hallucinations covering all the topics and domains to construct HalluEditBench." The paper describes constructing its own dataset (HalluEditBench) from Wikidata but provides no concrete access information such as a link, DOI, or repository for this dataset. |
| Dataset Splits | Yes | "In the second phase, we sampled around 2,000 hallucinations for each LLM covering all the topics and domains, and then generated evaluation question-answer pairs from five facets including Efficacy, Generalization, Portability, Locality, and Robustness." |
| Hardware Specification | Yes | "We conduct the experiments on NVIDIA RTX A6000 GPUs." |
| Software Dependencies | No | The paper mentions downloading model checkpoints from HuggingFace and using GPT-4o for question generation, but it specifies no version numbers for any software libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | "The decoding temperatures are 0 to ensure the reproducibility... We adopt GPT-4o with the prompt below to generate Generalization and Locality evaluation questions... We adopt GPT-4o with the following prompt to generate evaluation questions in Portability aspect." |
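The Pseudocode row quotes the paper's rule-based conversion of knowledge triplets into factual questions with the object as the ground-truth answer. A minimal sketch of that idea is below; the relation names and question templates are illustrative assumptions, not the paper's actual rules, and `triplet_to_qa` is a hypothetical helper.

```python
# Hypothetical sketch: turn (subject, relation, object) knowledge triplets
# into factual questions whose ground-truth answer is the object.
# The relation-to-template mapping is an illustrative assumption.

TEMPLATES = {
    "place_of_birth": "Where was {subject} born?",
    "occupation": "What is the occupation of {subject}?",
    "capital_of": "What is the capital of {subject}?",
}

def triplet_to_qa(subject: str, relation: str, obj: str) -> dict:
    """Apply a template rule to convert one triplet into a QA pair."""
    template = TEMPLATES.get(relation)
    if template is None:
        raise KeyError(f"No conversion rule for relation: {relation}")
    return {"question": template.format(subject=subject), "answer": obj}

pair = triplet_to_qa("Marie Curie", "place_of_birth", "Warsaw")
print(pair)  # {'question': 'Where was Marie Curie born?', 'answer': 'Warsaw'}
```

In practice each Wikidata relation would need its own template, and malformed or template-less relations would be filtered out rather than raising.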
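The Experiment Setup row notes that decoding temperature is set to 0, which makes generation greedy and deterministic, so an efficacy-style metric can be computed by direct string matching against the ground-truth answer. The scorer below is a simplified stand-in for the paper's Efficacy metric (whose exact definition we do not reproduce here); `efficacy_score` is a hypothetical helper.

```python
import re

def efficacy_score(responses, answers):
    """Fraction of model responses containing the ground-truth answer as a
    case-insensitive whole word. A simplified proxy for Efficacy: with
    temperature-0 (greedy) decoding, responses are deterministic, so this
    score is reproducible across runs."""
    hits = 0
    for resp, ans in zip(responses, answers):
        if re.search(rf"\b{re.escape(ans)}\b", resp, flags=re.IGNORECASE):
            hits += 1
    return hits / len(responses)
```

A real evaluation would also guard against trivial substring collisions (e.g., multi-token answers or answers embedded in negations), which is why LLM-based judging is often layered on top of exact matching.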