SEE-DPO: Self Entropy Enhanced Direct Preference Optimization
Authors: Shivanshu Shekhar, Shreyas Singh, Tong Zhang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that integrating human feedback with self-entropy regularization can significantly boost image diversity and specificity, achieving state-of-the-art results on key image generation metrics. ... We carry out empirical studies to demonstrate that this regularization technique encourages broader exploration of the solution space, reducing overfitting and preventing reward hacking. ... Our models, trained using the proposed objective, outperform or are comparable to baseline methods across all quality metrics, as shown in Table 2. ... We also conducted a deeper ablation study using SPO, exploring various values of β and γ. In Fig. 5, the left image shows the results of keeping β fixed at 0.1 while varying γ. |
| Researcher Affiliation | Collaboration | Shivanshu Shekhar (Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign); Shreyas Singh (Fractal AI Research); Tong Zhang (Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign) |
| Pseudocode | No | The paper includes mathematical derivations and formulas but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "For a fair comparison, we used the official implementations of D3PO, Diffusion DPO, and SPO with their default parameters." and mentions "SPO's official GitHub repository". This refers to code from other works, not the specific implementation of SEE-DPO developed in this paper. There is no explicit statement or link indicating that the authors' own code for SEE-DPO is publicly available. |
| Open Datasets | Yes | For training, we used 4,000 prompts from the Pick-a-Pic-V1 dataset Kirstain et al. (2023) for SPO and D3PO, following the dataset provided in SPO's official GitHub repository. For Diffusion-DPO, we used 800,000 prompts from the same dataset. ... Additionally, we conduct a user study similar to Liang et al. (2024). We recruit 10 participants to evaluate images generated by different models based on 300 prompts sampled from Parti Prompts and HPSv2 in a 1:2 ratio. |
| Dataset Splits | Yes | For training, we used 4,000 prompts from the Pick-a-Pic-V1 dataset Kirstain et al. (2023) for SPO and D3PO, following the dataset provided in SPO's official GitHub repository. For Diffusion-DPO, we used 800,000 prompts from the same dataset. Each model was trained using the same setup and data splits as specified in their original implementations. ... We report results on the validation_unique split of the Pick-a-Pic V1 dataset, which contains 500 prompts, as shown in Tables 1, 2, 4 and 5. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. It only mentions general training without hardware specifics. |
| Software Dependencies | No | The paper mentions using "official implementations of D3PO, Diffusion DPO, and SPO" but does not specify version numbers for these or any other software libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Experimental Setting: For a fair comparison, we used the official implementations of D3PO, Diffusion DPO, and SPO with their default parameters. We trained these models using our proposed regularized loss function, as described in the Methodology section. When applying our method, we treated only γ and β as hyperparameters while keeping all other settings at their default values to ensure a fair evaluation. During inference, we set the guidance scale to 7.5 for consistency across models. ... Hyperparameter values (Table 3): D3PO γ=5, β=0.01; Diffusion-DPO γ=3, β=4; SPO γ=3, β=0.1. All other hyperparameter values were fixed to those of the original implementations. |
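The objective described above, a DPO-style preference loss combined with a self-entropy regularizer weighted by γ, can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the function name `see_dpo_loss`, the scalar log-probability inputs, and the additive form of the entropy bonus are all assumptions made for the sketch; only the roles of β (preference-margin temperature) and γ (entropy weight) come from the table.

```python
import math


def see_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 entropy, beta=0.1, gamma=3.0):
    """Sketch of a self-entropy-regularized DPO objective (hypothetical form).

    logp_w / logp_l: policy log-probabilities of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: same quantities under the frozen reference model.
    entropy: an estimate of the policy's self-entropy (higher = more diverse).
    """
    # Standard DPO preference term: -log sigmoid(beta * margin), where the
    # margin compares policy-vs-reference log-ratios of the two samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    preference_loss = math.log(1.0 + math.exp(-margin))

    # Entropy bonus: subtracting gamma * entropy rewards broader exploration,
    # which the paper argues reduces overfitting and reward hacking.
    return preference_loss - gamma * entropy
```

With a zero margin the preference term reduces to log 2, and increasing either the margin or the entropy estimate lowers the loss, matching the intuition that the objective favors both correct rankings and diverse outputs.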