Anyprefer: An Agentic Framework for Preference Data Synthesis

Authors: Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Weitong Zhang, Ying Wei, Mohit Bansal, Huaxiu Yao

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Extensive experiments show that Anyprefer significantly improves model alignment performance across four main applications, covering 21 datasets, achieving average improvements of 18.55% in five natural language generation datasets, 3.66% in nine vision-language understanding datasets, 30.05% in three medical image analysis datasets, and 16.00% in four visuo-motor control tasks.
Researcher Affiliation | Collaboration | UNC-Chapel Hill, NTU, University of Washington, UChicago, Microsoft Research
Pseudocode | Yes | Algorithm 1: Anyprefer Framework for Preference Data Synthesis
Open Source Code | No | The paper provides no link to a source-code repository and does not state that the code for the described methodology is publicly available or included in supplementary materials. Only the generated dataset is said to be publicly available.
Open Datasets | Yes | Furthermore, we have compiled the synthesized data into a new preference dataset, Anyprefer-V1, comprising 58K high-quality preference pairs. ... To further support research and application in the community, we have also made the generated Anyprefer preference dataset publicly available for download and use by other researchers. ... To evaluate our method, we use three datasets that target different model capabilities: (1) GSM8K (Cobbe et al., 2021) ... (2) ARC-Easy/Challenge (Clark et al., 2018) ... (3) Alpaca Eval (Li et al., 2023d) ... VQA-RAD (Lau et al., 2018) ... SLAKE (Liu et al., 2021) ... IU-Xray (Demner-Fushman et al., 2016) ... We employ Simpler-Env (Li et al., 2024b) as our experiment environment and dataset.
Dataset Splits | No | The paper describes the datasets used and how some are evaluated (e.g., exact final-answer matching, win rate), and how its synthetic data is generated and used for fine-tuning. However, it does not provide explicit training/validation/test splits (e.g., percentages or sample counts) for the datasets employed in the experiments; it often references existing benchmarks or evaluation setups without detailing their data partitioning.
Hardware Specification | Yes | The entire training process is conducted on a single A100 80G GPU.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | For the training phase with preference data, after collecting each round of preference data, we use DPO to train for 3 epochs. The entire training process is conducted on a single A100 80G GPU. During training, we fine-tune the LoRA parameters for improved efficiency. Detailed training parameters can be found in Table 3. Table 3 (training hyperparameters): lora r = 128; lora alpha = 256; lora target = all; mm projector lr = 2e-5; batch size = 1; learning rate = 1e-7; model max length = 1024.
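The setup above fine-tunes with DPO on the synthesized preference pairs. As a minimal sketch of the objective DPO optimizes (not the paper's implementation), the function below computes the loss for a single preference pair from total sequence log-probabilities under the policy and a frozen reference model; the function name, argument names, and beta value are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is the total log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)) = log(1 + exp(-logits)), written stably:
    # for very negative logits, log1p(exp(-logits)) would overflow,
    # and -logits is an accurate approximation there.
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits

# When the policy prefers the chosen response more strongly than the
# reference does, the loss drops below log(2), its value at zero margin.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < math.log(2))  # True
```

Minimizing this loss pushes the policy to increase the likelihood margin of the chosen response over the rejected one, relative to the reference model; beta controls how strongly the policy is allowed to deviate from the reference.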