Anyprefer: An Agentic Framework for Preference Data Synthesis

Authors: Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Weitong Zhang, Ying Wei, Mohit Bansal, Huaxiu Yao

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Extensive experiments show that Anyprefer significantly improves model alignment performance across four main applications, covering 21 datasets, achieving average improvements of 18.55% in five natural language generation datasets, 3.66% in nine vision-language understanding datasets, 30.05% in three medical image analysis datasets, and 16.00% in four visuo-motor control tasks.
Researcher Affiliation | Collaboration | UNC-Chapel Hill, NTU, University of Washington, UChicago, Microsoft Research
Pseudocode | Yes | Algorithm 1: Anyprefer Framework for Preference Data Synthesis
Open Source Code | No | The paper provides no link to a source-code repository and does not state that the code for the described methodology is publicly available or included in supplementary materials. Only the generated dataset is said to be publicly available.
Open Datasets | Yes | Furthermore, we have compiled the synthesized data into a new preference dataset, Anyprefer-V1, comprising 58K high-quality preference pairs. ... To further support research and application in the community, we have also made the generated Anyprefer preference dataset publicly available for download and use by other researchers. ... To evaluate our method, we use three datasets that target different model capabilities: (1) GSM8K (Cobbe et al., 2021) ... (2) ARC-Easy/Challenge (Clark et al., 2018) ... (3) Alpaca Eval (Li et al., 2023d) ... VQA-RAD (Lau et al., 2018) ... SLAKE (Liu et al., 2021) ... IU-Xray (Demner-Fushman et al., 2016) ... We employ Simpler-Env (Li et al., 2024b) as our experiment environment and dataset.
Dataset Splits | No | The paper describes the datasets used and how some are evaluated (e.g., exact final-answer matching, win rate), and how its synthetic data is generated and used for fine-tuning. However, it does not provide explicit training/validation/test splits (e.g., percentages or sample counts) for the datasets employed in the experiments; it often references existing benchmarks or evaluation setups without detailing their data partitioning.
Hardware Specification | Yes | The entire training process is conducted on a single A100 80G GPU.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | For the training phase with preference data, after collecting each round of preference data, we use DPO to train for 3 epochs. The entire training process is conducted on a single A100 80G GPU. During training, we fine-tune the LoRA parameters for improved efficiency. Detailed training parameters can be found in Table 3. Table 3 (training hyperparameters): lora r = 128; lora alpha = 256; lora target = all; mm projector lr = 2e-5; batch size = 1; learning rate = 1e-7; model max length = 1024.
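The setup above fine-tunes with DPO on the synthesized preference pairs. As a minimal sketch of the objective DPO optimizes (not the paper's implementation), the function below computes the loss for a single preference pair from total sequence log-probabilities under the policy and a frozen reference model; the function name, argument names, and beta value are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is the total log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)) = log(1 + exp(-logits)), written stably:
    # for very negative logits, log1p(exp(-logits)) would overflow,
    # and -logits is an accurate approximation there.
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits

# When the policy prefers the chosen response more strongly than the
# reference does, the loss drops below log(2), its value at zero margin.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < math.log(2))  # True
```

Minimizing this loss pushes the policy to increase the likelihood margin of the chosen response over the rejected one, relative to the reference model; beta controls how strongly the policy is allowed to deviate from the reference.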