Correcting Large Language Model Behavior via Influence Function

Authors: Han Zhang, Zhuo Zhang, Yi Zhang, Yuanzhao Zhai, Hanyang Peng, Yu Lei, Yue Yu, Hui Wang, Bin Liang, Lin Gui, Ruifeng Xu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments show that LANCET effectively and efficiently corrects inappropriate behaviors of LLMs while preserving model utility. In this section, we present extensive experiments to evaluate the effectiveness of LANCET."
Researcher Affiliation | Academia | Han Zhang1,2, Zhuo Zhang1,2, Yi Zhang2, Yuanzhao Zhai3, Hanyang Peng2, Yu Lei2, Yue Yu2, Hui Wang2, Bin Liang4, Lin Gui*5, Ruifeng Xu*1,2,6 — 1 Harbin Institute of Technology (Shenzhen), 2 Pengcheng Laboratory, 3 National University of Defense Technology, 4 The Chinese University of Hong Kong, 5 King's College London, 6 Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methods LinFAC and Influence-driven Bregman Optimization (IBO) using mathematical formulations and descriptive text, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not provide any explicit statement about releasing code, a link to a code repository, or mention of code in supplementary materials.
Open Datasets | Yes | "We consider two popular datasets: BeaverTails (Ji et al. 2024) and Anthropic-HH (Bai et al. 2022a)."
Dataset Splits | Yes | "The safe data is from the safe samples of BeaverTails or the helpful-base part of Anthropic-HH (prompt+chosen). The unsafe data is from unsafe samples of BeaverTails or harmless-base (prompt+rejected) of Anthropic-HH... We include the unseen data that comprises prompts that may induce harmful outputs to evaluate the methods' generalization capability. Table 1 summarizes the dataset details."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions using "open-released cost models (Yang et al. 2024; Ji et al. 2024)" and the "Llama3.1-8B Instruct model", but does not specify version numbers for these or any other software libraries or dependencies used in their implementation.
Experiment Setup | Yes | "We set ϵ = 1 in our experiment to correct the undesirable behavior. We follow Brown (2020) and employ the Pareto rule to select the significant influential samples D_IF+ = {z | 1/If(z) < α and If(z) > 0} and D_IF− = {z | 1/|If(z)| < α and If(z) < 0}, where α follows the Pareto distribution. To ensure a fair volume of training data, we follow (Grosse et al. 2023) and use TF-IDF and influence queries to identify the same size of contaminated data for forgetting."
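The quoted selection rule can be sketched in a few lines. This is an illustrative assumption, not code from the paper: `influence` stands in for precomputed influence scores If(z), and the two set conditions are rewritten in their equivalent threshold form (1/If(z) < α with If(z) > 0 is the same as If(z) > 1/α, which also avoids division by zero).

```python
import numpy as np

def select_influential(influence, alpha):
    """Sketch of the Pareto-rule selection of significant influential samples.

    `influence` is a hypothetical array of influence scores If(z);
    `alpha` is the Pareto-distributed cutoff from the paper.
    Returns indices of the positive set (1/If(z) < alpha, If(z) > 0)
    and the negative set (1/|If(z)| < alpha, If(z) < 0), using the
    equivalent thresholds If(z) > 1/alpha and If(z) < -1/alpha.
    """
    influence = np.asarray(influence, dtype=float)
    threshold = 1.0 / alpha
    pos = np.where(influence > threshold)[0]   # strongly positive influence
    neg = np.where(influence < -threshold)[0]  # strongly negative influence
    return pos, neg
```

With `alpha = 1.0`, for example, only samples whose influence magnitude exceeds 1 would be selected into either set.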