Taming Knowledge Conflicts in Language Models

Authors: Gaotang Li, Yuzhong Chen, Hanghang Tong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across 11 datasets and 6 model architectures demonstrate that JUICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JUICE in these settings.
Researcher Affiliation | Collaboration | 1 University of Illinois Urbana-Champaign, 2 Visa Research. Correspondence to: Gaotang Li <EMAIL>, Hanghang Tong <EMAIL>.
Pseudocode | Yes | Fig. 4 illustrates the core idea, and Alg. 3 provides the detailed algorithm. JUICE operates in two stages.
Open Source Code | Yes | Our code is available at https://github.com/GaotangLi/JUICE.
Open Datasets | Yes | Our curated dataset is available at https://huggingface.co/datasets/gaotang/ParaConflict. ... Since this setup has been extensively studied, we adopt the dataset choice of a seminal work (Shi et al., 2024b) by using two context-oriented knowledge conflict benchmarks: Memo-Trap (Liu & Liu, 2023) and NQ-Swap (Longpre et al., 2021).
Dataset Splits | Yes | The sizes of the datasets are around 200 for world capital, official language, and company founder, and around 500 for athlete sport, company headquarters, and book author. ... For JUICE and JUNE, we fix K = 5 for smaller-scale models (Gemma, Phi2, Stablelm2) and K = 10 for larger-sized models (Llama2, Llama3, Olmo). We choose the scaling factors α+ and α− based on validation, where α+ is tuned from {0, 1, 2, 3, 4, 5} and α− is tuned from {0, 1, 2, 3}. For CAD, we follow their choice of setting α = 1 on the knowledge conflict dataset. For Prompt, we apply the following instructions before the standard task prompt:
Hardware Specification | No | This research used the Delta advanced computing and data resource which is supported by the National Science Foundation (award OAC 2005572) and the State of Illinois. Delta is a joint effort of the University of Illinois Urbana-Champaign and its National Center for Supercomputing Applications. This work used the Delta system at the National Center for Supercomputing Applications through allocation CIS250054 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. No specific hardware (e.g., GPU models, CPU models, memory) is mentioned.
Software Dependencies | No | The paper specifies hyperparameters for different models and methods but does not list any specific software dependencies (e.g., Python, PyTorch, CUDA) with version numbers.
Experiment Setup | Yes | The key hyperparameters of JUICE include the size of the head selection dataset D, the number of intervened heads K, and the scaling factors at inference. In practice, we fix K to be a constant number (e.g., 5) and determine the scaling factors using the validation set. We fix |D| to be 4 for all primary experiments. ... For JUICE and JUNE, we fix K = 5 for smaller-scale models (Gemma, Phi2, Stablelm2) and K = 10 for larger-sized models (Llama2, Llama3, Olmo). We choose the scaling factors α+ and α− based on validation, where α+ is tuned from {0, 1, 2, 3, 4, 5} and α− is tuned from {0, 1, 2, 3}. For CAD, we follow their choice of setting α = 1 on the knowledge conflict dataset. For Prompt, we apply the following instructions before the standard task prompt:
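The scaling-factor selection quoted above amounts to a small grid search on the validation set. The sketch below illustrates that procedure under stated assumptions: `evaluate` is a hypothetical stand-in for running the model with the intervened attention heads scaled by the candidate factors and measuring validation accuracy (here replaced by a toy surrogate, since the actual JUICE evaluation pipeline is not reproduced here).

```python
from itertools import product


def evaluate(alpha_plus: int, alpha_minus: int) -> float:
    """Toy surrogate for validation accuracy.

    In the paper's setup this would apply alpha_plus / alpha_minus to the
    K selected attention heads and score the model on the validation set;
    here we return a dummy score that peaks at an arbitrary grid point.
    """
    return -((alpha_plus - 3) ** 2 + (alpha_minus - 1) ** 2)


def select_scaling_factors():
    # alpha_plus tuned from {0, 1, 2, 3, 4, 5}, alpha_minus from {0, 1, 2, 3},
    # matching the grids quoted in the report.
    grid = product(range(6), range(4))
    return max(grid, key=lambda pair: evaluate(*pair))


best_alpha_plus, best_alpha_minus = select_scaling_factors()
print(best_alpha_plus, best_alpha_minus)
```

With 6 × 4 = 24 candidate pairs, exhaustive search is cheap, which is presumably why the authors tune both factors jointly on validation rather than fixing them a priori.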