Fast Exact Unlearning for In-Context Learning Data for LLMs

Authors: Andrei Ioan Muresanu, Anvith Thudi, Michael R. Zhang, Nicolas Papernot

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evaluations explored how ERASE compared to existing exact unlearning baselines. We conducted experiments across Big-Bench Instruction Induction (BBII) tasks and compared the performance of ERASE to variants of SISA (Bourtoule et al., 2021), an optimized exact unlearning algorithm for SGD-based learning.
Researcher Affiliation | Academia | 1 Department of Computer Science, University of Waterloo, Waterloo, Canada; 2 Vector Institute, Toronto, Canada; 3 Department of Computer Science, University of Toronto, Toronto, Canada. Correspondence to: Andrei Muresanu <EMAIL>.
Pseudocode | Yes | Algorithm 1: In-context Learning with ERASE. Require: a set of training examples D, the desired number of in-context examples k, and a quantization parameter ε. Ensure: examples q^(i) = [q^(i)_1, q^(i)_2, ..., q^(i)_k] for in-context learning.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described, nor does it contain an explicit statement about code release or a link to a repository.
Open Datasets | Yes | Task Selection: The 15 tasks we evaluate on are from Big Bench (Srivastava et al., 2023), released under the Apache 2.0 license.
Dataset Splits | No | The paper refers to evaluating on the "entire test set" (Section 5.3) and to choosing the learning rate with the lowest test perplexity on the intent recognition dataset (Section 5.1), but it does not explicitly specify training/validation/test splits (e.g., percentages, sample counts, or references to predefined splits) for its experimental setup.
Hardware Specification | Yes | All experiments were run on a single node containing four Nvidia A40 GPUs.
Software Dependencies | No | The paper mentions using a pipeline based on Alpa (Zheng et al., 2022) and the Flops Profiler package (Li, 2023), but does not give version numbers for these or for other key software components such as the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | We use a block size of 256 tokens and a batch size of 8. We use the Adam optimizer (Kingma & Ba, 2017) with β1 = 0.9, β2 = 0.98, weight decay of 0.01, and a learning rate of 1e-5. We also use 10 warm-up steps with a linear schedule. The full list of training parameters can be found in Table E.