Composable Interventions for Language Models
Authors: Arinbjörn Kolbeinsson, Kyle O'Brien, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Schwarz, Anurag Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, Thomas Hartvigsen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using our framework, we conduct extensive experiments and compose popular methods from three emerging intervention categories: knowledge editing, model compression, and machine unlearning. Our results over 417 different compositions uncover meaningful interactions: compression hinders editing and unlearning, composing interventions hinges on their order of application, and popular general-purpose metrics are inadequate for assessing composability. |
| Researcher Affiliation | Collaboration | Arinbjörn Kolbeinsson* (University of Virginia & Askan, EMAIL); Kyle O'Brien* (EleutherAI); Tianjin Huang* (University of Exeter); Shanghua Gao (Harvard Medical School); Shiwei Liu (University of Oxford); Jonathan Richard Schwarz (Thomson-Reuters Foundational Research); Anurag Vaidya (Harvard Medical School, Mass General Brigham); Faisal Mahmood (Harvard Medical School, Mass General Brigham); Marinka Zitnik (Harvard Medical School); Tianlong Chen (UNC Chapel Hill); Tom Hartvigsen (University of Virginia & Thomson-Reuters Foundational Research, EMAIL) |
| Pseudocode | No | The paper describes methods and metrics using textual descriptions and mathematical equations (e.g., Equation 1 and Equation 2 for composability metrics) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All of our code is available at: github.com/hartvigsen-group/composable-interventions |
| Open Datasets | Yes | We use the zsRE (Levy et al., 2017) dataset, which is a popular question-answering benchmark for knowledge editing. ... We evaluate unlearning with the Weapons of Mass Destruction Proxy (WMDP) (Li et al., 2024a)... we make the standard choice to evaluate question answering accuracy on MMLU (Hendrycks et al., 2020) using the LM Eval Harness (Gao et al., 2023). |
| Dataset Splits | Yes | All results for knowledge editing methods are averaged over 10 batches of 50 randomly-selected edits from zsRE. ... We average the performance on WMDP's cyber and bio splits, totaling 3,260 questions. |
| Hardware Specification | No | We thank the University of Virginia Research Computing team for providing access to excellent high-performance computing resources. |
| Software Dependencies | No | The paper mentions various models and methods like 'Llama3-8B (AI@Meta, 2024)', 'MEMIT (Meng et al., 2023)', 'LoRA (Hu et al., 2021)', 'SparseGPT (Frantar & Alistarh, 2023)', 'Wanda (Sun et al., 2023)', 'GPTQ (Frantar et al., 2023)', 'AWQ (Lin et al., 2023)', and refers to the 'RMU repo' but does not specify version numbers for these or other software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | All experiments in our main results (Section 4) are performed with Llama3-8B (AI@Meta, 2024)... We use the state-of-the-art MEMIT (Meng et al., 2023) model editor, which applies batches of edits simultaneously. The editing process was applied to layers 4 through 8 of the model, with a clamp normalization factor set at 4. The learning parameters adhered closely to the original implementation: v_num_grad_steps was set to 25, accompanied by a learning rate (lr) of 0.5, and using the last layer for loss calculation. Additionally, a weight decay (weight_decay) of 0.001 was employed. The KL divergence contribution to the overall loss was controlled by a KL_factor of 0.0625. Moreover, a second momentum adjustment was enabled, with an update weight of 15000, to fine-tune the optimization process. Generation used a maximum length of 40 tokens and a batch size of 50, matching the number of edits being made. Each batch of edits was repeated 10 times and the results averaged. |
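For readers attempting reproduction, the MEMIT hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. The key names below are assumptions on our part, modeled loosely on the style of MEMIT-like config files; they are not taken from the paper's released code:

```python
# Hypothetical MEMIT editing configuration, transcribed from the
# hyperparameters reported in the paper's experiment setup.
# Key names are illustrative, not the authors' actual config schema.
memit_config = {
    "layers": list(range(4, 9)),   # edit layers 4 through 8 (inclusive)
    "clamp_norm_factor": 4,        # clamp normalization factor
    "v_num_grad_steps": 25,        # gradient steps per edit
    "v_lr": 0.5,                   # learning rate
    "v_weight_decay": 0.001,       # weight decay
    "kl_factor": 0.0625,           # KL-divergence weight in the loss
    "mom2_update_weight": 15000,   # second-moment update weight
    "max_length": 40,              # max generated tokens
    "batch_size": 50,              # edits applied simultaneously per batch
    "n_repeats": 10,               # edit batches repeated, results averaged
}
```

Collecting the values this way makes it easy to spot what a reproduction would still need to pin down, such as the exact loss-layer choice and library versions, which the paper does not version.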