Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment

Authors: Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present our experimental results to answer the following questions: Does SPA improve the alignment of LLMs using only a small amount of human-labeled preference data? (Table 1, Figure 4) Does the proposed method outperform other preference labeling methods? (Table 2, Figure 3) Is SPA generalizable across various choices of seed data and types of LLMs? (Tables 3, 4, 5) What is the effect of each component in SPA? (Tables 6, 7)
Researcher Affiliation | Academia | 1 Korea Advanced Institute of Science and Technology, 2 Yonsei University
Pseudocode | Yes | Algorithm 1: SPA algorithm
Open Source Code | Yes | https://github.com/kingdy2002/SPA
Open Datasets | Yes | For the preference learning dataset, we utilized UltraFeedback (Cui et al., 2023), following previous works (Snorkel, 2024; Rosset et al., 2024). To be specific, from this dataset, we first construct the seed data, consisting of 2K samples (3.3% of 60K) with prompts, responses, and ground-truth preference labels. We refer to the ground-truth preference label provided by UltraFeedback as the gold label in Tables 1 and 5. Then, the remaining samples are divided into subsets of 8K, 20K, and 30K samples, leaving only the prompts. These subsets were used as the prompt sets for iteration stages 1, 2, and 3, respectively.
Dataset Splits | Yes | From this dataset, we first construct the seed data, consisting of 2K samples (3.3% of 60K) with prompts, responses, and ground-truth preference labels. ... Then, the remaining samples are divided into subsets of 8K, 20K, and 30K samples, leaving only the prompts. These subsets were used as the prompt sets for iteration stages 1, 2, and 3, respectively.
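The split quoted above (a 2K labeled seed set, then prompt-only subsets of 8K, 20K, and 30K for the three iteration stages) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name `split_ultrafeedback` and the `"prompt"` field are assumptions.

```python
import random

def split_ultrafeedback(samples, seed=0):
    """Split ~60K UltraFeedback samples into a labeled seed set and
    three prompt-only subsets for the iterative stages (hypothetical sketch)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)

    seed_data = shuffled[:2_000]      # 2K: prompts + responses + gold labels
    rest = shuffled[2_000:]

    sizes = [8_000, 20_000, 30_000]   # prompt sets for iterations 1, 2, 3
    prompt_sets, start = [], 0
    for n in sizes:
        subset = rest[start:start + n]
        prompt_sets.append([s["prompt"] for s in subset])  # keep prompts only
        start += n
    return seed_data, prompt_sets
```

Note that the three subsets are disjoint, so the seed data plus the prompt sets together cover the full 60K samples.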
Hardware Specification | Yes | For all experiments, we utilized 4 A6000 GPUs.
Software Dependencies | No | The paper mentions the 'AdamW optimizer' but does not provide specific version numbers for any software components.
Experiment Setup | Yes | The initial DPO training to obtain π0 was conducted for 3 epochs on the seed dataset. Training on each subsequent iteration was carried out for 1 epoch. For the hyper-parameter β of DPO, we used a fixed value of β = 0.1. The batch size was set to 32, and the learning rate was 5e-7. We employed the AdamW optimizer and a cosine learning rate scheduler with a warm-up phase corresponding to 10% of the total training steps. For the hyper-parameters α and K% of SPA, we used fixed values of α = 0.1 and K = 10. Additionally, a warm-up phase was included in the denoising stage, with denoising activated after 20% of the total training steps had been completed. Regarding the hyper-parameter λ for de-coupled noise detection, we utilized progressively reduced values of 1/2, 1/4, and 1/8 for iterations 1, 2, and 3, respectively.
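The hyper-parameters quoted in the setup row can be gathered into one sketch. The config keys and the `lr_at` warm-up/cosine schedule below are assumptions consistent with the description ("cosine learning rate scheduler with a warm-up phase corresponding to 10% of the total training steps"), not the authors' actual implementation.

```python
import math

# Hypothetical collection of the reported training hyper-parameters.
CONFIG = {
    "dpo_beta": 0.1,
    "batch_size": 32,
    "learning_rate": 5e-7,
    "seed_epochs": 3,             # initial DPO training on the seed data
    "iter_epochs": 1,             # each subsequent iteration
    "spa_alpha": 0.1,
    "spa_top_k_percent": 10,
    "denoise_warmup_frac": 0.2,   # denoising activates after 20% of steps
    "lambda_schedule": {1: 1/2, 2: 1/4, 3: 1/8},  # de-coupled noise detection
}

def lr_at(step, total_steps, base_lr=CONFIG["learning_rate"], warmup_frac=0.1):
    """Linear warm-up over the first 10% of steps, then cosine decay to 0
    (assumed schedule shape; the paper does not spell out the exact formula)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

Under this schedule the learning rate peaks at 5e-7 exactly when warm-up ends and decays smoothly to zero by the final step.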