Taming Knowledge Conflicts in Language Models

Authors: Gaotang Li, Yuzhong Chen, Hanghang Tong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across 11 datasets and 6 model architectures demonstrate that JUICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JUICE in these settings.
Researcher Affiliation | Collaboration | 1 University of Illinois Urbana-Champaign, 2 Visa Research. Correspondence to: Gaotang Li <EMAIL>, Hanghang Tong <EMAIL>.
Pseudocode | Yes | Fig. 4 illustrates the core idea, and Alg. 3 provides the detailed algorithm. JUICE operates in two stages.
Open Source Code | Yes | Our code is available at https://github.com/GaotangLi/JUICE.
Open Datasets | Yes | Our curated dataset is available at https://huggingface.co/datasets/gaotang/ParaConflict. ... Since this setup has been extensively studied, we adopt the dataset choice of a seminal work (Shi et al., 2024b) by using two context-oriented knowledge conflict benchmarks: Memo-Trap (Liu & Liu, 2023) and NQ-Swap (Longpre et al., 2021).
Dataset Splits | Yes | The sizes of the datasets are around 200 for world capital, official language, and company founder, and around 500 for athlete sport, company headquarters, and book author. ... For JUICE and JUNE, we fix K = 5 for smaller-scale models (Gemma, Phi2, Stablelm2) and K = 10 for larger-sized models (Llama2, Llama3, Olmo). We choose the scaling factors α+ and α− based on validation, where α+ is tuned from {0, 1, 2, 3, 4, 5} and α− is tuned from {0, 1, 2, 3}. For CAD, we follow their choice of setting α = 1 on the knowledge conflict dataset. For Prompt, we apply the following instructions before the standard task prompt:
Hardware Specification | No | This research used the Delta advanced computing and data resource which is supported by the National Science Foundation (award OAC 2005572) and the State of Illinois. Delta is a joint effort of the University of Illinois Urbana-Champaign and its National Center for Supercomputing Applications. This work used the Delta system at the National Center for Supercomputing Applications through allocation CIS250054 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. No specific hardware (e.g., GPU models, CPU models, memory) is mentioned.
Software Dependencies | No | The paper specifies hyperparameters for different models and methods but does not list any specific software dependencies (e.g., Python, PyTorch, CUDA) with version numbers.
Experiment Setup | Yes | The key hyperparameters of JUICE include the size of the head selection dataset D, the number of intervened heads K, and the scaling factors at inference. In practice, we fix K to be a constant number (e.g., 5) and determine the scaling factors using the validation set. We fix |D| to be 4 for all primary experiments. ... For JUICE and JUNE, we fix K = 5 for smaller-scale models (Gemma, Phi2, Stablelm2) and K = 10 for larger-sized models (Llama2, Llama3, Olmo). We choose the scaling factors α+ and α− based on validation, where α+ is tuned from {0, 1, 2, 3, 4, 5} and α− is tuned from {0, 1, 2, 3}. For CAD, we follow their choice of setting α = 1 on the knowledge conflict dataset. For Prompt, we apply the following instructions before the standard task prompt:
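The scaling-factor selection quoted above amounts to a small grid search on the validation set. The sketch below illustrates that procedure under stated assumptions: `evaluate` is a hypothetical stand-in for running the model with the intervened attention heads scaled by the candidate factors and measuring validation accuracy (here replaced by a toy surrogate, since the actual JUICE evaluation pipeline is not reproduced here).

```python
from itertools import product


def evaluate(alpha_plus: int, alpha_minus: int) -> float:
    """Toy surrogate for validation accuracy.

    In the paper's setup this would apply alpha_plus / alpha_minus to the
    K selected attention heads and score the model on the validation set;
    here we return a dummy score that peaks at an arbitrary grid point.
    """
    return -((alpha_plus - 3) ** 2 + (alpha_minus - 1) ** 2)


def select_scaling_factors():
    # alpha_plus tuned from {0, 1, 2, 3, 4, 5}, alpha_minus from {0, 1, 2, 3},
    # matching the grids quoted in the report.
    grid = product(range(6), range(4))
    return max(grid, key=lambda pair: evaluate(*pair))


best_alpha_plus, best_alpha_minus = select_scaling_factors()
print(best_alpha_plus, best_alpha_minus)
```

With 6 × 4 = 24 candidate pairs, exhaustive search is cheap, which is presumably why the authors tune both factors jointly on validation rather than fixing them a priori.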