Large Scale Knowledge Washing

Authors: Yu Wang, Ruihan Wu, Zexue He, Xiusi Chen, Julian McAuley

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of LAW in forgetting target knowledge while maximally maintaining reasoning ability. We evaluate LAW on two small-scale datasets and a newly created large-scale dataset derived from Wikipedia triplets, encompassing 332,036 facts. Experimental results reveal that LAW outperforms alternative approaches in effectively removing targeted knowledge, as evidenced by lower accuracy and QA-F1 scores on prompts derived from the washed triplets.
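The QA-F1 score mentioned above is typically the token-level F1 between a model's answer and the reference answer, as in standard extractive-QA evaluation; the sketch below is our own illustration, not the authors' exact scorer.

```python
from collections import Counter


def qa_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer.

    Lower scores on prompts built from washed triplets indicate more
    effective forgetting of the targeted knowledge.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens shared between prediction and reference (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `qa_f1("the city of Paris", "Paris")` yields 0.4 (precision 0.25, recall 1.0).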
Researcher Affiliation | Academia | ¹University of California San Diego, ²University of Illinois Urbana-Champaign
Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations, for example in Section 5 'METHODOLOGY' and Appendix A 'MATHEMATICAL DETAILS OF PRELIMINARY'. However, it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The code is open-sourced at https://github.com/wangyu-ustc/LargeScaleWashing. We provide our code as supplementary material to ensure reproducibility.
Open Datasets | Yes | The datasets used in our experiments are: (1) zsRE (Levy et al., 2017): a question-answering dataset with 19,086 facts. (2) CounterFactual (Meng et al., 2022): a dataset containing 21,929 counterfactual facts. (3) To facilitate large-scale knowledge washing, we utilize the latest Wikipedia dump, processing the relations following the guidelines provided in the repository. This results in approximately 16,000,000 triplets.
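Evaluation prompts in this line of work are usually formed by verbalizing each (subject, relation, object) triplet into a cloze-style query whose expected completion is the object entity. The sketch below is our own illustration of that idea; the template strings and relation names are hypothetical, not the paper's actual templates.

```python
# Hypothetical relation-to-template mapping; the paper's real templates may differ.
TEMPLATES = {
    "place of birth": "{subject} was born in",
    "capital": "The capital of {subject} is",
}


def triplet_to_prompt(subject: str, relation: str, obj: str) -> tuple:
    """Turn a knowledge triplet into (prompt, answer).

    The model is queried with the prompt; accuracy/QA-F1 then measure
    whether it still produces the object entity after washing.
    """
    prompt = TEMPLATES[relation].format(subject=subject)
    return prompt, obj
```

Usage: `triplet_to_prompt("France", "capital", "Paris")` returns `("The capital of France is", "Paris")`.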
Dataset Splits | No | The paper reports the total number of facts in each dataset (e.g., 'zsRE contains 19,086 factual statements in total', 'CounterFactual contains 20,877 facts in total', 'Wiki-Latest ... encompassing 332,036 facts'). However, it does not explicitly describe how these datasets were split into training, validation, or test sets, nor does it refer to specific predefined splits with citations or methodologies.
Hardware Specification | Yes | As for the implementation details, we perform all the experiments on eight A6000-48GB GPUs, while every experiment can be run separately on one GPU.
Software Dependencies | No | The paper mentions using open-sourced codebases such as MEMIT and ME-FT, and states its implementation is 'built on top of the MEMIT repository', but it does not specify version numbers for any key software components such as Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | For the baselines, we train GPT-J-6B with LoRA (Hu et al., 2021). We put the configurations as below: 1. FT: we set the learning rate to 1e-6 for GPT2 training and 1e-4 for GPT-J-6B, and set the number of epochs to 5. 2. MEMIT: this method has a hyperparameter λ... The configurations of λ in different settings are shown in Table 7. 4. FT-UL: we set the learning rate to 1e-6 for GPT2-XL and train for 1 epoch... and set the learning rate to 1e-5 for GPT-J-6B and train for 5 epochs. 5. WOH: we first train the reinforced model on the sentences formed from the triplets Ew with the learning rate set to 1e-6 for 1 epoch, then adopt objective Eq. (1) from Eldan & Russinovich (2023) to update the target model; during this second stage of training, we set the learning rate to 5e-5 and train for 1 epoch. 6. SeUL: ...train for 3 epochs with the learning rate set to 1e-6. For our method, we choose β = 1.1 β₀.
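The scattered baseline hyperparameters above can be collected into a single lookup table. This is a readability sketch of ours: the dictionary keys and structure are our own, while the learning-rate and epoch values are exactly as reported in the setup description.

```python
# Baseline hyperparameters as reported in the experiment setup.
# Structure and key names are ours; values come from the paper's description.
BASELINE_CONFIGS = {
    "FT":    {"lr": {"GPT2": 1e-6, "GPT-J-6B": 1e-4}, "epochs": 5},
    "MEMIT": {"lambda": "see Table 7"},  # lambda varies per setting
    "FT-UL": {"lr": {"GPT2-XL": 1e-6, "GPT-J-6B": 1e-5},
              "epochs": {"GPT2-XL": 1, "GPT-J-6B": 5}},
    "WOH":   {"stage1": {"lr": 1e-6, "epochs": 1},   # train reinforced model
              "stage2": {"lr": 5e-5, "epochs": 1}},  # update target model
    "SeUL":  {"lr": 1e-6, "epochs": 3},
}
```

A lookup like `BASELINE_CONFIGS["FT"]["lr"]["GPT-J-6B"]` then retrieves the learning rate used for a given baseline and model.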