Correcting Large Language Model Behavior via Influence Function

Authors: Han Zhang, Zhuo Zhang, Yi Zhang, Yuanzhao Zhai, Hanyang Peng, Yu Lei, Yue Yu, Hui Wang, Bin Liang, Lin Gui, Ruifeng Xu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments show that LANCET effectively and efficiently corrects inappropriate behaviors of LLMs while preserving model utility. In this section, we present extensive experiments to evaluate the effectiveness of LANCET."
Researcher Affiliation | Academia | Han Zhang1,2, Zhuo Zhang1,2, Yi Zhang2, Yuanzhao Zhai3, Hanyang Peng2, Yu Lei2, Yue Yu2, Hui Wang2, Bin Liang4, Lin Gui*5, Ruifeng Xu*1,2,6 — 1 Harbin Institute of Technology (Shenzhen), 2 Pengcheng Laboratory, 3 National University of Defense Technology, 4 The Chinese University of Hong Kong, 5 King's College London, 6 Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methods LinFAC and Influence-driven Bregman Optimization (IBO) using mathematical formulations and descriptive text, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not provide any explicit statement about releasing code, a link to a code repository, or mention of code in supplementary materials.
Open Datasets | Yes | "We consider two popular datasets: BeaverTails (Ji et al. 2024) and Anthropic-HH (Bai et al. 2022a)."
Dataset Splits | Yes | "The safe data is from the safe samples of BeaverTails or the helpful-base part of Anthropic-HH (prompt+chosen). The unsafe data is from unsafe samples of BeaverTails or harmless-base (prompt+rejected) of Anthropic-HH... We include the unseen data that comprises prompts that may induce harmful outputs to evaluate the methods' generalization capability. Table 1 summarizes the dataset details."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions using "open-released cost models (Yang et al. 2024; Ji et al. 2024)" and the "Llama3.1-8B Instruct model", but does not specify version numbers for these or any other software libraries or dependencies used in their implementation.
Experiment Setup | Yes | "We set ϵ = 1 in our experiment to correct the undesirable behavior. We follow Brown (2020) and employ the Pareto rule to select the significant influential samples D_IF+ = {z | 1/If(z) < α and If(z) > 0} and D_IF− = {z | 1/|If(z)| < α and If(z) < 0}, where α follows the Pareto distribution. To ensure a fair volume of training data, we follow (Grosse et al. 2023) and use TF-IDF and influence queries to identify the same size of contaminated data for forgetting."
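The quoted selection rule can be sketched in a few lines. This is an illustrative assumption, not code from the paper: `influence` stands in for precomputed influence scores If(z), and the two set conditions are rewritten in their equivalent threshold form (1/If(z) < α with If(z) > 0 is the same as If(z) > 1/α, which also avoids division by zero).

```python
import numpy as np

def select_influential(influence, alpha):
    """Sketch of the Pareto-rule selection of significant influential samples.

    `influence` is a hypothetical array of influence scores If(z);
    `alpha` is the Pareto-distributed cutoff from the paper.
    Returns indices of the positive set (1/If(z) < alpha, If(z) > 0)
    and the negative set (1/|If(z)| < alpha, If(z) < 0), using the
    equivalent thresholds If(z) > 1/alpha and If(z) < -1/alpha.
    """
    influence = np.asarray(influence, dtype=float)
    threshold = 1.0 / alpha
    pos = np.where(influence > threshold)[0]   # strongly positive influence
    neg = np.where(influence < -threshold)[0]  # strongly negative influence
    return pos, neg
```

With `alpha = 1.0`, for example, only samples whose influence magnitude exceeds 1 would be selected into either set.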