Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
Authors: Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming baselines. |
| Researcher Affiliation | Collaboration | ¹The Chinese University of Hong Kong, ²Microsoft Research Asia, ³Shenzhen Campus of Sun Yat-sen University. Correspondence to: Xueting Han <EMAIL>, Kam-Fai Wong <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Learning Algorithm of VAA |
| Open Source Code | Yes | The code is available at https://github.com/ChanLiang/VAA. |
| Open Datasets | Yes | Datasets. To perform model alignment, we utilize the safe samples from the alignment datasets from Rosati et al. (2024b), which are enriched versions of BeaverTails (Ji et al., 2023). We sample 2,000 instances from the alignment dataset for training, ensuring that the harmful dataset instances are distinct from those used in the fine-tuning stage. To perform alignment data grouping, we utilize Alpaca (Taori et al., 2023) as our proxy dataset to simulate harmful fine-tuning, mixed with 10% harmful data. [...] For fine-tuning, we employ four datasets: SST-2 (Socher et al., 2013), AG News (Zhang et al., 2015), GSM8K (Cobbe et al., 2021), and Alpaca Eval (Li et al., 2023). |
| Dataset Splits | Yes | We sample 2,000 instances from the alignment dataset for training, ensuring that the harmful dataset instances are distinct from those used in the fine-tuning stage. (...) To compute HS, we sample 1,000 instructions from the BeaverTails test set. For FA, the test set sizes are as follows: 872 samples for SST-2, 1,000 for AG News, 1,000 for GSM8K, and 122 for Alpaca Eval. Both metrics are evaluated on the final fine-tuned models. (...) To simulate harmful attacks during fine-tuning, we create mixed datasets by combining p% of unsafe data from BeaverTails with (100 − p)% of benign fine-tuning data, resulting in a total of n samples per dataset. Unless specified otherwise, we set p = 10 and n = 1,000 (except for Alpaca Eval, where n = 700). |
| Hardware Specification | Yes | All experiments are conducted on 4 NVIDIA A100 GPUs with 80GB memory. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and models such as Llama2-7B and Qwen2.5-7B, but does not provide specific version numbers for any software libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | Training Details. We perform full-parameter training for both the alignment and harmful fine-tuning stages. Full training during HFT is used to simulate worst-case alignment degradation, as updating all parameters may amplify harmful behaviors. For alignment, we use the AdamW optimizer (Loshchilov et al., 2017) with a learning rate of 1×10⁻⁴ and a weight decay of 0.1, while for HFT we adopt a lower learning rate of 3×10⁻⁵ to reflect the more sensitive nature of this stage. Both stages are trained for 5 epochs using a batch size of 8. [...] This efficiency is achieved through a curriculum learning strategy that gradually increases the perturbation probability from 0% to 100%, avoiding full perturbation in the early training stages and reducing unnecessary computation. |
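The mixed-dataset construction quoted under Dataset Splits (p% unsafe samples combined with (100 − p)% benign samples, n samples total) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_mixed_dataset` and the fixed seed are assumptions for reproducibility, and the samples are represented as opaque list items.

```python
import random

def build_mixed_dataset(benign, harmful, p=10, n=1000, seed=0):
    """Hypothetical sketch of the paper's attack simulation:
    mix p% unsafe samples with (100 - p)% benign samples, n total."""
    rng = random.Random(seed)           # fixed seed so the mixture is reproducible
    n_harmful = n * p // 100            # e.g. p=10, n=1000 -> 100 unsafe samples
    n_benign = n - n_harmful            # remaining 900 benign samples
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)                  # interleave unsafe and benign examples
    return mixed
```

With the paper's defaults (p = 10, n = 1,000) this yields 100 unsafe and 900 benign samples per fine-tuning dataset; for Alpaca Eval one would pass n = 700 instead.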
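The curriculum strategy in the Experiment Setup quote ramps the perturbation probability from 0% to 100% over training. The paper excerpt does not specify the ramp shape, so the linear schedule below is an assumption; `perturbation_prob` and `apply_perturbation` are hypothetical names used only for illustration.

```python
import random

def perturbation_prob(step, total_steps):
    """Assumed linear ramp of the perturbation probability from 0.0
    at the first step to 1.0 at the final step."""
    return min(1.0, step / max(1, total_steps - 1))

def apply_perturbation(step, total_steps, rng=random):
    """Gate a (costly) perturbation step by the curriculum probability,
    so early training mostly skips it and saves computation."""
    return rng.random() < perturbation_prob(step, total_steps)
```

Early in training `perturbation_prob` is near 0, so most batches skip the extra perturbation pass; by the final steps every batch is perturbed, matching the quoted "0% to 100%" schedule.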