Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Authors: Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny, Adel Bibi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate BFPO, we develop a benchmark including comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. |
| Researcher Affiliation | Academia | ¹King Abdullah University of Science and Technology, ²University of Oxford |
| Pseudocode | Yes | Algorithm 1 BFPO Algorithm |
| Open Source Code | Yes | The training recipes can be found here: https://github.com/wx-zhang/bfpo. |
| Open Datasets | Yes | In the supervised fine-tuning stage, we follow Tunstall et al. (2023); Dai et al. (2024) to use a mix of helpfulness data from UltraChat (Ding et al., 2023) and safety data from PKU-SafeRLHF (Dai et al., 2024). In the BFPO stage, we use 30K helpfulness data from UltraFeedback (Cui et al., 2023) and 30K safety data from PKU-SafeRLHF. |
| Dataset Splits | No | The paper specifies the quantities drawn from each dataset (e.g., "30K helpfulness data from UltraFeedback", "1.5K harmful prompts") and evaluates against various benchmarks. However, it does not state how these datasets were split into training, validation, or test sets, nor does it point to predefined splits for the training datasets that would allow the data partitioning to be reproduced. |
| Hardware Specification | Yes | We use 4 Nvidia A100 GPUs for each experiment, and the training time for each experiment is around 6 hours for SFT and 6 hours for BFPO. |
| Software Dependencies | No | The paper mentions several software components and models, such as the "Adam optimizer", "PEFT training", "Mistral-7B-v0.1", "Zephyr-7b-beta", "HarmBench-Llama2-13B-Chat", and "PairRM (Jiang et al., 2023b)". However, it does not provide specific version numbers for programming languages, frameworks (e.g., PyTorch, TensorFlow), or other key libraries/solvers that would be necessary for exact reproduction. |
| Experiment Setup | Yes | In Section 3.4, for illustrative experiments on a synthetic dataset, the paper states: "We optimize the policy with the Adam optimizer for 1800 steps, with a learning rate of 0.01, batch size of 32 sampled with replacement, τ = 1, and α = 0.5." In Section 4.2 "Training Details" for the main experiments, it states: "We set τ = 0.01, α = 0.5. We implement PEFT training for all baselines, where we only unfreeze the selected layers θ, the second MLP layers in each transformer block, in the policy πθ (Zhang et al., 2024)." |
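The selective-unfreezing PEFT scheme quoted above (train only the second MLP layer of each transformer block) can be sketched as follows. This is a minimal illustration, not the paper's actual training code: the `mlp.down_proj` attribute name follows Mistral-style Hugging Face implementations and is an assumption, and `ToyModel` is a stand-in for the real policy πθ.

```python
import torch.nn as nn

class Block(nn.Module):
    """Toy transformer block with an attention stub and a two-layer MLP."""
    def __init__(self, d: int):
        super().__init__()
        self.attn = nn.Linear(d, d)  # placeholder for self-attention
        self.mlp = nn.ModuleDict({
            "up_proj": nn.Linear(d, 4 * d),    # first MLP layer
            "down_proj": nn.Linear(4 * d, d),  # second MLP layer (to unfreeze)
        })

class ToyModel(nn.Module):
    """Stand-in for the policy model: a stack of transformer blocks."""
    def __init__(self, d: int = 8, n_blocks: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d) for _ in range(n_blocks))

def unfreeze_second_mlp_layers(model: nn.Module) -> list:
    """Freeze every parameter, then re-enable gradients only for the
    second MLP layer in each transformer block. Returns the names of
    the unfrozen submodules."""
    for p in model.parameters():
        p.requires_grad = False
    unfrozen = []
    for name, module in model.named_modules():
        # Assumed naming convention; adjust the suffix for other architectures.
        if name.endswith("mlp.down_proj"):
            for p in module.parameters():
                p.requires_grad = True
            unfrozen.append(name)
    return unfrozen

model = ToyModel()
unfrozen = unfreeze_second_mlp_layers(model)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

With this setup, only the `down_proj` weights and biases remain trainable, so an optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` updates just the layers the paper selects.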