PROPS: Progressively Private Self-alignment of Large Language Models

Authors: Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conducted a comprehensive set of experiments to evaluate the impact of preference-level differential privacy (DP) on DPO-based alignment across various privacy settings and models (Pythia-1B, GPT2-Large, and GPT2-Medium). Our results show that in the high privacy regime (ϵ = 0.1), our method, PROPS, achieves up to 2.5x preference gain for PROPS vs RR in win-tie-loss rates and up to 3x win-tie-loss rate preference gain for PROPS vs DP-SGD based alignment on truthy-dpo-v0.1, HH-RLHF and Alpaca Eval datasets. We refer the readers to Section 4 and Section A.6 for detailed experimental results.
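The RR baseline above refers to randomized response applied to each binary preference label, the standard ε-DP mechanism for label privacy. A minimal dependency-free sketch (function names are ours, not from the paper) shows why ϵ = 0.1 is a high-privacy regime: labels are kept with probability exp(ε)/(1 + exp(ε)) ≈ 0.525, i.e. barely better than a coin flip.

```python
import math
import random

def rr_keep_prob(eps: float) -> float:
    """Probability of keeping a binary preference label under eps-DP
    randomized response: exp(eps) / (1 + exp(eps))."""
    return math.exp(eps) / (1.0 + math.exp(eps))

def randomize_preference(label: int, eps: float, rng: random.Random) -> int:
    """Flip the preferred-response indicator (0 or 1) with probability
    1 / (1 + exp(eps)); the released label satisfies eps-label-DP."""
    return label if rng.random() < rr_keep_prob(eps) else 1 - label

# At eps = 0.1 roughly 47.5% of the released preferences are flipped,
# so any alignment method consuming them must tolerate heavy label noise.
rng = random.Random(0)
noisy = [randomize_preference(1, 0.1, rng) for _ in range(10_000)]
```

This is the mechanism the win-tie-loss comparison holds fixed across methods; PROPS and RR differ in how the noisy labels are used, not in the privatization step itself.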
Researcher Affiliation Academia Noel Teku EMAIL, Department of Electrical and Computer Engineering, University of Arizona; Fengwei Tian EMAIL, Department of Electrical and Computer Engineering, University of Arizona; Payel Bhattacharjee EMAIL, Department of Electrical and Computer Engineering, University of Arizona; Souradip Chakraborty EMAIL, Department of Computer Science, University of Maryland, College Park; Amrit Singh Bedi EMAIL, Department of Computer Science, University of Central Florida; Ravi Tandon EMAIL, Department of Electrical and Computer Engineering, University of Arizona
Pseudocode Yes Algorithm 1 PROPS: PROgressively Private Self-alignment
Open Source Code Yes The code for PROPS is publicly available1. 1https://anonymous.4open.science/r/PROPS-2025
Open Datasets Yes In our experiments and validation, we have used three datasets: jondurbin/truthy-dpo-v0.1, Anthropic HH-RLHF, and Alpaca Eval. 3https://huggingface.co/datasets/psyche/anthropic-hh-rlhf. 4https://huggingface.co/datasets/reciprocate/alpaca-eval
Dataset Splits Yes truthy-dpo-v0.1: For this dataset, 15% of the data was used for SFT. A further 75% of the data was designated for DPO training. This 75% segment was divided into two halves, with three epochs of DPO run on each half. ... Win-Tie-Loss rates were calculated using the remaining 10% of the Truthy-DPO-v0.1 dataset, which consists of 100 prompts. HH-RLHF: The HH-RLHF experiment utilized an existing SFT model2 from Hugging Face that was trained for one epoch on the Anthropic-HH dataset. For DPO, 1,000 samples from the test set were used. Specifically, these samples were split into two halves, and DPO was run for three epochs on each half. ... Win-Tie-Loss results were generated using 100 samples from the same test set. Alpaca Eval: For Alpaca Eval4, 100 examples from the training dataset were used for an initial, quick SFT. Following this, 2,000 examples from the available training dataset were used for DPO training, and 100 examples from the testing dataset were used to evaluate the performance of the various DPO methods via Win-Tie-Loss rate. For the PROPS method, the DPO data segment was split into two halves, with three or four epochs of DPO run on each half.
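As a sanity check on the truthy-dpo-v0.1 proportions (15% SFT, 75% DPO divided into two equal stage halves, 10% held-out evaluation), the split sizes can be computed for a given dataset size. The helper below is purely illustrative, assuming a dataset of about 1,000 examples, which is consistent with the 10% evaluation slice of 100 prompts:

```python
def truthy_dpo_splits(n: int) -> dict:
    """Split sizes implied by the reported proportions: 15% SFT,
    75% DPO (divided into two halves for two staged runs), 10% eval."""
    sft = round(0.15 * n)
    dpo = round(0.75 * n)
    eval_ = n - sft - dpo                  # remaining ~10% for win-tie-loss eval
    stage1, stage2 = dpo // 2, dpo - dpo // 2
    return {"sft": sft, "dpo_stage1": stage1, "dpo_stage2": stage2, "eval": eval_}

# For n = 1000: 150 SFT examples, 375 per DPO stage, 100 eval prompts.
```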
Hardware Specification No No specific hardware details (like GPU models, CPU types, or memory) are mentioned in the paper. It only refers to 'models' (Pythia-1B, GPT2-Large, GPT2-Medium) and 'training procedures' without detailing the underlying hardware.
Software Dependencies No The paper mentions models like Pythia-1B, GPT2-Large, and GPT2-Medium, and references Hugging Face, but it does not specify any software dependencies (e.g., Python, PyTorch, CUDA) with their version numbers.
Experiment Setup Yes DP-SGD based alignment was trained for 1 epoch with a learning rate of 5e-5 and a batch size of 2. We set a gradient clipping threshold C = 10... RR based alignment was trained for 3 epochs with a batch size of 4 and a learning rate of 5e-5, except for Pythia-1B on the Truthy-DPO dataset, which was trained with a learning rate of 3e-5. PROPS based alignment was trained for 2 stages, with each stage using a batch size of 4. A learning rate of 5e-5 was used in all stages except for GPT2-Large on HH-RLHF and Alpaca, and Pythia-1B on Truthy-DPO, where a learning rate of 3e-5 was used for both stages. Additionally, PROPS was trained for 3 epochs, except for the second-stage training of GPT2-Medium and GPT2-Large on HH-RLHF, which used 4 epochs. (Section A.1, page 9). Also, β is a constant used to control the penalty for how much πθ diverges from πref. (Section 2, page 3).
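The role of β described above can be seen directly in the standard per-example DPO loss, −log σ(β[(log πθ(y_w|x) − log πref(y_w|x)) − (log πθ(y_l|x) − log πref(y_l|x))]). A dependency-free numerical sketch (the function name and argument order are ours, not the paper's):

```python
import math

def dpo_loss(logp_w_theta: float, logp_w_ref: float,
             logp_l_theta: float, logp_l_ref: float, beta: float) -> float:
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin
    is the chosen-vs-rejected difference of policy-vs-reference log-ratios."""
    margin = (logp_w_theta - logp_w_ref) - (logp_l_theta - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With a zero margin the loss is log 2; increasing beta sharpens the
# penalty for any given divergence of pi_theta from pi_ref.
```

A larger β thus makes the loss more sensitive to how far πθ has drifted from πref on each preference pair, which is the divergence-control behavior the setup description attributes to it.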