Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Authors: Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny, Adel Bibi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate BFPO, we develop a benchmark including comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. |
| Researcher Affiliation | Academia | ¹King Abdullah University of Science and Technology, ²University of Oxford |
| Pseudocode | Yes | Algorithm 1 BFPO Algorithm |
| Open Source Code | Yes | The training recipes can be found here: https://github.com/wx-zhang/bfpo. |
| Open Datasets | Yes | In the supervised fine-tuning stage, we follow Tunstall et al. (2023); Dai et al. (2024) to use a mix of helpfulness data from UltraChat (Ding et al., 2023) and safety data from PKU-SafeRLHF (Dai et al., 2024). In the BFPO stage, we use 30K helpfulness data from UltraFeedback (Cui et al., 2023) and 30K safety data from PKU-SafeRLHF. |
| Dataset Splits | No | The paper specifies the quantities drawn from each dataset (e.g., "30K helpfulness data from UltraFeedback", "1.5K harmful prompts") and evaluates against various benchmarks. However, it does not state how these datasets were split into training, validation, or test sets, nor does it point to predefined splits for the training datasets that would allow the data partitioning to be reproduced. |
| Hardware Specification | Yes | We use 4 Nvidia A100 GPUs for each experiment, and the training time for each experiment is around 6 hours for SFT and 6 hours for BFPO. |
| Software Dependencies | No | The paper mentions several software components and models, such as the "Adam optimizer", "PEFT training", "Mistral-7B-v0.1", "Zephyr-7b-beta", "HarmBench-Llama2-13B-Chat", and "PairRM (Jiang et al., 2023b)". However, it does not provide specific version numbers for programming languages, frameworks (e.g., PyTorch, TensorFlow), or other key libraries/solvers that would be necessary for exact reproduction. |
| Experiment Setup | Yes | In Section 3.4, for illustrative experiments on a synthetic dataset, the paper states: "We optimize the policy with the Adam optimizer for 1800 steps, with a learning rate of 0.01, batch size of 32 sampled with replacement, τ = 1, and α = 0.5." In Section 4.2 "Training Details" for the main experiments, it states: "We set τ = 0.01, α = 0.5. We implement PEFT training for all baselines, where we only unfreeze the selected layers θ, the second MLP layers in each transformer block, in the policy πθ (Zhang et al., 2024)." |
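The selective-unfreezing PEFT scheme quoted above (train only the second MLP layer of each transformer block) can be sketched as follows. This is a minimal illustration, not the paper's actual training code: the `mlp.down_proj` attribute name follows Mistral-style Hugging Face implementations and is an assumption, and `ToyModel` is a stand-in for the real policy πθ.

```python
import torch.nn as nn

class Block(nn.Module):
    """Toy transformer block with an attention stub and a two-layer MLP."""
    def __init__(self, d: int):
        super().__init__()
        self.attn = nn.Linear(d, d)  # placeholder for self-attention
        self.mlp = nn.ModuleDict({
            "up_proj": nn.Linear(d, 4 * d),    # first MLP layer
            "down_proj": nn.Linear(4 * d, d),  # second MLP layer (to unfreeze)
        })

class ToyModel(nn.Module):
    """Stand-in for the policy model: a stack of transformer blocks."""
    def __init__(self, d: int = 8, n_blocks: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d) for _ in range(n_blocks))

def unfreeze_second_mlp_layers(model: nn.Module) -> list:
    """Freeze every parameter, then re-enable gradients only for the
    second MLP layer in each transformer block. Returns the names of
    the unfrozen submodules."""
    for p in model.parameters():
        p.requires_grad = False
    unfrozen = []
    for name, module in model.named_modules():
        # Assumed naming convention; adjust the suffix for other architectures.
        if name.endswith("mlp.down_proj"):
            for p in module.parameters():
                p.requires_grad = True
            unfrozen.append(name)
    return unfrozen

model = ToyModel()
unfrozen = unfreeze_second_mlp_layers(model)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

With this setup, only the `down_proj` weights and biases remain trainable, so an optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` updates just the layers the paper selects.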