Correcting Large Language Model Behavior via Influence Function
Authors: Han Zhang, Zhuo Zhang, Yi Zhang, Yuanzhao Zhai, Hanyang Peng, Yu Lei, Yue Yu, Hui Wang, Bin Liang, Lin Gui, Ruifeng Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that LANCET effectively and efficiently corrects inappropriate behaviors of LLMs while preserving model utility. In this section, we present extensive experiments to evaluate the effectiveness of LANCET. |
| Researcher Affiliation | Academia | Han Zhang^{1,2}, Zhuo Zhang^{1,2}, Yi Zhang^{2}, Yuanzhao Zhai^{3}, Hanyang Peng^{2}, Yu Lei^{2}, Yue Yu^{2}, Hui Wang^{2}, Bin Liang^{4}, Lin Gui*^{5}, Ruifeng Xu*^{1,2,6} — 1 Harbin Institute of Technology (Shenzhen), 2 Pengcheng Laboratory, 3 National University of Defense Technology, 4 The Chinese University of Hong Kong, 5 King's College London, 6 Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies |
| Pseudocode | No | The paper describes the methods LinFAC and Influence-driven Bregman Optimization (IBO) using mathematical formulations and descriptive text, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing code, a link to a code repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | We consider two popular datasets: Beaver Tails (Ji et al. 2024) and Anthropic-HH (Bai et al. 2022a). |
| Dataset Splits | Yes | The safe data is from the safe samples of Beaver Tails or the helpful-base part of Anthropic-HH (prompt+chosen). The unsafe data is from unsafe samples of Beaver Tails or harmless-base (prompt+rejected) of Anthropic-HH... We include the unseen data that comprises prompts that may induce harmful outputs to evaluate the methods' generalization capability. Table 1 summarizes the dataset details. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions using "open-released cost models (Yang et al. 2024; Ji et al. 2024)" and "Llama3.1-8B Instruct model", but does not specify version numbers for these or any other software libraries or dependencies used in their implementation. |
| Experiment Setup | Yes | We set ϵ = 1 in our experiment to correct the undesirable behavior. We follow Brown (2020) and employ the Pareto rule to select the significant influential samples D_IF^+ = {z : 1/|I_f(z)| < α and I_f(z) > 0} and D_IF^- = {z : 1/|I_f(z)| < α and I_f(z) < 0}, where α follows the Pareto distribution. To ensure a fair volume of training data, we follow (Grosse et al. 2023) and use TF-IDF and influence queries to identify the same size of contaminated data for forgetting. |
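The Pareto-rule selection quoted above amounts to thresholding per-sample influence scores by magnitude and splitting by sign. The following sketch illustrates that selection step only; it is not the paper's released code (none is available), and the score array and threshold `alpha` are hypothetical inputs.

```python
import numpy as np

def select_influential(influence, alpha):
    """Illustrative Pareto-rule split of influence scores.

    Keeps samples z whose influence magnitude satisfies 1/|I_f(z)| < alpha
    (i.e. |I_f(z)| > 1/alpha), then partitions them into a positive set
    D_IF^+ and a negative set D_IF^- by the sign of I_f(z).
    Returns the indices of each set.
    """
    influence = np.asarray(influence, dtype=float)
    # Guard against division by zero: zero-influence samples are never kept.
    nonzero = influence != 0
    significant = np.zeros_like(influence, dtype=bool)
    significant[nonzero] = 1.0 / np.abs(influence[nonzero]) < alpha
    d_pos = np.where(significant & (influence > 0))[0]  # D_IF^+
    d_neg = np.where(significant & (influence < 0))[0]  # D_IF^-
    return d_pos, d_neg

# Example: with alpha = 1.0, only samples with |I_f(z)| > 1 survive.
pos, neg = select_influential([2.0, -3.0, 0.1, -0.2, 0.0], alpha=1.0)
```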