Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
Authors: Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming baselines. |
| Researcher Affiliation | Collaboration | ¹The Chinese University of Hong Kong, ²Microsoft Research Asia, ³Shenzhen Campus of Sun Yat-sen University. Correspondence to: Xueting Han <EMAIL>, Kam-Fai Wong <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Learning Algorithm of VAA |
| Open Source Code | Yes | The code is available at https://github.com/ChanLiang/VAA. |
| Open Datasets | Yes | Datasets. To perform model alignment, we utilize the safe samples from the alignment datasets from Rosati et al. (2024b), which are enriched versions of BeaverTails (Ji et al., 2023). We sample 2,000 instances from the alignment dataset for training, ensuring that the harmful dataset instances are distinct from those used in the fine-tuning stage. To perform alignment data grouping, we utilize Alpaca (Taori et al., 2023) as our proxy dataset to simulate harmful fine-tuning, mixed with 10% harmful data. [...] For fine-tuning, we employ four datasets: SST-2 (Socher et al., 2013), AG News (Zhang et al., 2015), GSM8K (Cobbe et al., 2021), and Alpaca Eval (Li et al., 2023). |
| Dataset Splits | Yes | We sample 2,000 instances from the alignment dataset for training, ensuring that the harmful dataset instances are distinct from those used in the fine-tuning stage. (...) To compute HS, we sample 1,000 instructions from the BeaverTails test set. For FA, the test set sizes are as follows: 872 samples for SST-2, 1,000 for AG News, 1,000 for GSM8K, and 122 for Alpaca Eval. Both metrics are evaluated on the final fine-tuned models. (...) To simulate harmful attacks during fine-tuning, we create mixed datasets by combining p% of unsafe data from BeaverTails with (100 − p)% of benign fine-tuning data, resulting in a total of n samples per dataset. Unless specified otherwise, we set p = 10 and n = 1,000 (except for Alpaca Eval, where n = 700). |
| Hardware Specification | Yes | All experiments are conducted on 4 NVIDIA A100 GPUs with 80GB memory. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and models such as Llama2-7B and Qwen2.5-7B, but does not provide specific version numbers for any software libraries or frameworks used in the implementation (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | Training Details. We perform full-parameter training for both the alignment and harmful fine-tuning stages. Full training during HFT is used to simulate worst-case alignment degradation, as updating all parameters may amplify harmful behaviors. For alignment, we use the AdamW optimizer (Loshchilov et al., 2017) with a learning rate of 1×10⁻⁴ and a weight decay of 0.1, while for HFT we adopt a lower learning rate of 3×10⁻⁵ to reflect the more sensitive nature of this stage. Both stages are trained for 5 epochs using a batch size of 8. [...] This efficiency is achieved through a curriculum learning strategy that gradually increases the perturbation probability from 0% to 100%, avoiding full perturbation in the early training stages and reducing unnecessary computation. |
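The mixed-dataset construction quoted under Dataset Splits (p% unsafe samples combined with (100 − p)% benign samples, n samples total) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_mixed_dataset` and the fixed seed are assumptions for reproducibility, and the samples are represented as opaque list items.

```python
import random

def build_mixed_dataset(benign, harmful, p=10, n=1000, seed=0):
    """Hypothetical sketch of the paper's attack simulation:
    mix p% unsafe samples with (100 - p)% benign samples, n total."""
    rng = random.Random(seed)           # fixed seed so the mixture is reproducible
    n_harmful = n * p // 100            # e.g. p=10, n=1000 -> 100 unsafe samples
    n_benign = n - n_harmful            # remaining 900 benign samples
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)                  # interleave unsafe and benign examples
    return mixed
```

With the paper's defaults (p = 10, n = 1,000) this yields 100 unsafe and 900 benign samples per fine-tuning dataset; for Alpaca Eval one would pass n = 700 instead.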
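The curriculum strategy in the Experiment Setup quote ramps the perturbation probability from 0% to 100% over training. The paper excerpt does not specify the ramp shape, so the linear schedule below is an assumption; `perturbation_prob` and `apply_perturbation` are hypothetical names used only for illustration.

```python
import random

def perturbation_prob(step, total_steps):
    """Assumed linear ramp of the perturbation probability from 0.0
    at the first step to 1.0 at the final step."""
    return min(1.0, step / max(1, total_steps - 1))

def apply_perturbation(step, total_steps, rng=random):
    """Gate a (costly) perturbation step by the curriculum probability,
    so early training mostly skips it and saves computation."""
    return rng.random() < perturbation_prob(step, total_steps)
```

Early in training `perturbation_prob` is near 0, so most batches skip the extra perturbation pass; by the final steps every batch is perturbed, matching the quoted "0% to 100%" schedule.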