Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment

Authors: Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present our experimental results to answer the following questions: Does SPA improve the alignment of LLMs using only a small amount of human-labeled preference data? (Table 1, Figure 4) Does the proposed method outperform other preference labeling methods? (Table 2, Figure 3) Is SPA generalizable across various choices of seed data and types of LLMs? (Tables 3, 4, 5) What is the effect of each component in SPA? (Tables 6, 7)
Researcher Affiliation | Academia | 1 Korea Advanced Institute of Science and Technology, 2 Yonsei University
Pseudocode | Yes | Algorithm 1: SPA algorithm
Open Source Code | Yes | https://github.com/kingdy2002/SPA
Open Datasets | Yes | For the preference learning dataset, we utilized UltraFeedback (Cui et al., 2023), following previous works (Snorkel, 2024; Rosset et al., 2024). To be specific, from this dataset, we first construct the seed data, consisting of 2K samples (3.3% of 60K) with prompts, responses, and ground-truth preference labels. We refer to the ground-truth preference label provided by UltraFeedback as the gold label in Tables 1 and 5. Then, the remaining samples are divided into subsets of 8K, 20K, and 30K samples, leaving only the prompts. These subsets were used as the prompt sets for iteration stages 1, 2, and 3, respectively.
Dataset Splits | Yes | From this dataset, we first construct the seed data, consisting of 2K samples (3.3% of 60K) with prompts, responses, and ground-truth preference labels. ... Then, the remaining samples are divided into subsets of 8K, 20K, and 30K samples, leaving only the prompts. These subsets were used as the prompt sets for iteration stages 1, 2, and 3, respectively.
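The split quoted above (a 2K labeled seed set, then prompt-only subsets of 8K, 20K, and 30K for the three iteration stages) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name `split_ultrafeedback` and the `"prompt"` field are assumptions.

```python
import random

def split_ultrafeedback(samples, seed=0):
    """Split ~60K UltraFeedback samples into a labeled seed set and
    three prompt-only subsets for the iterative stages (hypothetical sketch)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)

    seed_data = shuffled[:2_000]      # 2K: prompts + responses + gold labels
    rest = shuffled[2_000:]

    sizes = [8_000, 20_000, 30_000]   # prompt sets for iterations 1, 2, 3
    prompt_sets, start = [], 0
    for n in sizes:
        subset = rest[start:start + n]
        prompt_sets.append([s["prompt"] for s in subset])  # keep prompts only
        start += n
    return seed_data, prompt_sets
```

Note that the three subsets are disjoint, so the seed data plus the prompt sets together cover the full 60K samples.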
Hardware Specification | Yes | For all experiments, we utilized 4 A6000 GPUs.
Software Dependencies | No | The paper mentions the 'AdamW optimizer' but does not provide specific version numbers for any software components.
Experiment Setup | Yes | The initial DPO training to obtain π0 was conducted for 3 epochs on the seed dataset. Training on each subsequent iteration was carried out for 1 epoch. For the hyper-parameter β of DPO, we used a fixed value of β = 0.1. The batch size was set to 32, and the learning rate was 5e-7. We employed the AdamW optimizer and a cosine learning rate scheduler with a warm-up phase corresponding to 10% of the total training steps. For the hyper-parameters α and K% of SPA, we used fixed values of α = 0.1 and K = 10. Additionally, a warm-up phase was included in the denoising stage, with denoising activated after 20% of the total training steps had been completed. Regarding the hyper-parameter λ for de-coupled noise detection, we utilized progressively reduced values of 1/2, 1/4, and 1/8 for iterations 1, 2, and 3, respectively.
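The hyper-parameters quoted in the setup row can be gathered into one sketch. The config keys and the `lr_at` warm-up/cosine schedule below are assumptions consistent with the description ("cosine learning rate scheduler with a warm-up phase corresponding to 10% of the total training steps"), not the authors' actual implementation.

```python
import math

# Hypothetical collection of the reported training hyper-parameters.
CONFIG = {
    "dpo_beta": 0.1,
    "batch_size": 32,
    "learning_rate": 5e-7,
    "seed_epochs": 3,             # initial DPO training on the seed data
    "iter_epochs": 1,             # each subsequent iteration
    "spa_alpha": 0.1,
    "spa_top_k_percent": 10,
    "denoise_warmup_frac": 0.2,   # denoising activates after 20% of steps
    "lambda_schedule": {1: 1/2, 2: 1/4, 3: 1/8},  # de-coupled noise detection
}

def lr_at(step, total_steps, base_lr=CONFIG["learning_rate"], warmup_frac=0.1):
    """Linear warm-up over the first 10% of steps, then cosine decay to 0
    (assumed schedule shape; the paper does not spell out the exact formula)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

Under this schedule the learning rate peaks at 5e-7 exactly when warm-up ends and decays smoothly to zero by the final step.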